
MST-016
STATISTICAL INFERENCE
Indira Gandhi National Open University
School of Sciences

[Cover figures: (i) sampling distributions under H0 and H1 for testing H0: θ = θ0 against H1: θ = θ1 > θ0, showing the Type I error probability α and Type II error probability β; (ii) random selection of a sample from a population; (iii) the sampling distribution of a test statistic, showing the non-rejection region (1 − α), the rejection/critical regions of size α/2 in each tail, and the critical values.]
VOLUME 1
FOUNDATION OF STATISTICAL INFERENCE
BLOCK 1
Sampling Distributions 5

BLOCK 2
Properties of Good Estimator 159

Appendix 243
Curriculum and Course Design Committee
Prof. Sujatha Varma, Former Director, SOS, IGNOU, New Delhi
Prof. Rakesh Srivastava, Department of Statistics, M.S. University of Baroda, Vadodara (GUJ)
Prof. Diwakar Shukla, Department of Mathematics and Statistics, Dr. H.S. Gaur Central University, Sagar (MP)
Prof. Sanjeev Kumar, Department of Statistics, Banaras Hindu University, Varanasi (UP)
Prof. Gulshan Lal Taneja, Department of Mathematics, M.D. University, Rohtak (HR)
Prof. Shalabh, Department of Mathematics and Statistics, Indian Institute of Technology, Kanpur (UP)
Prof. Gurprit Grover, Department of Statistics, University of Delhi, New Delhi
Prof. V. K. Singh (Retd.), Department of Statistics, Banaras Hindu University, Varanasi (UP)
Prof. H. P. Singh, Department of Statistics, Vikram University, Ujjain (MP)
Prof. Rahul Roy, Mathematics and Statistics Unit, Indian Statistical Institute, New Delhi
Prof. Rajender Prasad, Division of Design of Experiments, IASRI, Pusa, New Delhi
Prof. Manish Trivedi, SOS, IGNOU
Dr. Taruna Kumari, SOS, IGNOU
Dr. Neha Garg, SOS, IGNOU
Dr. Rajesh, SOS, IGNOU
Dr. Prabhat Kumar Sangal, SOS, IGNOU
Dr. Gajraj Singh, SOS, IGNOU

Course Preparation Team
Course Editor: Prof. Ram Kishan (Units 1-9), Department of Statistics, D.A.V. (PG) College, Maa Shakambhari University, Saharanpur (UP)
Course Writer: Dr. Prabhat Kumar Sangal (Units 1-9), School of Sciences, IGNOU, New Delhi
Formatted and CRC prepared by Ms Preeti, SOS, IGNOU


Course Coordinator: Dr. Prabhat Kumar Sangal
Programme Coordinators: Dr. Neha Garg and Dr. Prabhat Kumar Sangal

Print Production
Mr. Tilak Raj
Assistant Registrar
MPDD, IGNOU, New Delhi

November, 2024
© Indira Gandhi National Open University, 2024
ISBN-978-81-266-
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission in writing from the Indira Gandhi National Open University.
Further information on the Indira Gandhi National Open University may be obtained from the University's Office at Maidan Garhi, New Delhi-110068, or visit the University's website http://www.ignou.ac.in.
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by the Director, School
of Sciences.
Printed at: Raj Printers, A-9, Sector B-2, Tronica City Loni, Ghaziabad, UP-201102
VOLUME 1: FOUNDATION OF STATISTICAL
INFERENCE
Dear learners, welcome to the course titled "Statistical Inference". In many situations, we have to extract some information from all elements or units of a large population, which, in general, may be time-consuming and expensive. Also, if the elements or items of a population are destroyed under investigation, then gathering information from all elements does not make sense, because the entire population would be destroyed. Therefore, in most cases in daily life, business and industry, information about the population is gathered through a random sample. The results of a properly taken sample enable the investigator to arrive at generalisations that are valid for the entire population. The process of generalising sample results to the population is called Statistical Inference. Statistical inference may be divided into two categories:
(i) If the population characteristics (known as parameters) are unknown and we guess the
true value of the unknown parameters on the basis of a random sample drawn from the
population, then this technique is known as “Estimation”.
(ii) If some information is available about the population characteristics and we verify whether
the information is true or not on the basis of a random sample drawn from the population,
then this technique is known as “Testing of hypothesis”.
The aim of this course is to develop learners' skills in estimation and testing of hypotheses about population characteristics such as means, proportions, variances, etc.
This theory course is worth four credits and comprises eighteen units organised into four blocks across two volumes, each volume worth two credits, as follows:
Volume 1: Foundation of Statistical Inference
Volume 2: Estimation and Parametric Tests
Volume 1 contains the first two blocks of the course.
The first block in this volume is titled “Sampling Distributions” and it contains five units that
broadly cover the various sampling distributions of commonly used statistics such as mean,
proportion, variance, difference of two means, difference of two proportions and ratio of two
variances. The most important theorem “central limit theorem” and law “law of large numbers”
of Statistics are also described with their applications. This block describes some important
sampling distributions which are widely used in statistical inference and known as exact
sampling distributions such as chi-square, t and F and how to read their tables.
The second block of this volume is titled “Properties of Good Estimator” and it contains four
units that broadly cover various properties of a good estimator. In this block, we shall discuss
and explain the concepts of unbiasedness, consistency, asymptotically normal consistency,
efficiency, most efficient estimator, mean squared error, minimum variance unbiased
estimator, sufficiency and minimal sufficient estimator.

Expected Learning Outcomes


After completing this volume, you should be able to:
 define statistical inference and basic terms used in statistical inference;
 describe the central limit theorem and law of large numbers with their applications;
 explain the concept of the sampling distribution and various sampling distributions of
mean, difference of two means, proportion, difference of two proportions, variance and
ratio of two variances;
 explain the chi-square, t and F-distributions with their probability curve, summary
measures, properties and applications and method of obtaining their tabulated values;
 discuss the various properties of a good estimator such as unbiasedness, consistency,
efficiency, mean squared error, sufficiency, etc.; and
 elaborate the concepts of asymptotically normal consistent estimator, most efficient
estimator, minimum variance unbiased estimator and minimal sufficient estimator.

If you feel like reading more than what this course contains, you may like to consult the
following books:

Suggested Further Readings


1. Abramovich, Felix (2023). Statistical Theory: A Concise Introduction, Second Edition. Chapman and Hall/CRC.
2. Aczel, Amir D. and Jayavel Sounderpandian (2008). Complete Business Statistics, Seventh Edition. McGraw-Hill/Irwin.
3. Black, Ken (2010). Business Statistics. John Wiley & Sons, Inc.
4. Goon, A.M., Gupta, M.K. and Dasgupta, B. (2013). An Outline of Statistical Theory, Vol. II. World Press, Calcutta.
5. Gupta, S.C. and Kapoor, V.K. (2020). Fundamentals of Mathematical Statistics. Sultan Chand & Sons.
6. Mood, A.M. (1974). Introduction to the Theory of Statistics. Tata McGraw-Hill Book Company, New Delhi.
7. Rohatgi, V.K. (2015). An Introduction to Probability and Statistics. Wiley-India.
8. Sahoo, Prasanna (2013). Probability and Mathematical Statistics. University of Louisville, Louisville, KY, USA.
9. Wegner, Trevor (2016). Applied Business Statistics. Juta and Company Ltd.
Your feedback pertaining to this course will help us undertake maintenance and timely revision
of the course. You may give your feedback using the following link:

Feedback Link:

https://docs.google.com/forms/d/e/1FAIpQLSdfldMvjNLVADrE49VleFxpX9paYwS-0_UVkX3crO2z3G9DrA/viewform?usp=sf_link

We hope that you would enjoy reading the self-learning material of this course. Wishing you a
happy learning experience and all the best in this endeavour!

Course Preparation Team



BLOCK 1
SAMPLING DISTRIBUTIONS
UNIT 1
Basic Concepts of Sampling Distributions 9
UNIT 2
Sampling Distributions of Sample Means 47
UNIT 3
Sampling Distributions of Sample Proportions and Variances 81
UNIT 4
Sampling Distributions Associated with Normal Populations-I 109
UNIT 5
Sampling Distributions Associated with Normal Populations-II 131

BLOCK 1: Sampling Distributions
Unit 1: Basic Concepts of Sampling Distribution

Unit 2: Sampling Distributions of Sample Means

Unit 3: Sampling Distributions of Sample Proportions and Variances

Unit 4: Sampling Distributions Associated with Normal Populations-I

Unit 5: Sampling Distributions Associated with Normal Populations-II

BLOCK 2: Properties of Good Estimator


Unit 6: Unbiasedness

Unit 7: Consistency

Unit 8: Efficiency and Mean Squared Error

Unit 9: Sufficiency and Minimal Sufficiency

BLOCK 3: Methods of Estimation


Unit 10: Method of Maximum Likelihood Estimation

Unit 11: Other Methods of Point Estimation

Unit 12: Interval Estimation for Means

Unit 13: Interval Estimation for Proportions

Unit 14: Interval Estimation for Variances

BLOCK 4: Testing of Hypothesis: Parametric Tests


Unit 15: Basic Concepts of Testing of Hypothesis

Unit 16: Tests for Means

Unit 17: Tests for Proportions

Unit 18: Tests for Variances

BLOCK 1: SAMPLING DISTRIBUTIONS
To draw inferences about the population characteristics (known as parameters) on the basis
of a sample, we require the sampling distribution of the statistic (function of sample
observations). This block of the course provides a brief discussion on the sampling distributions
of various statistics. It comprises five units. The detail on each unit is as follows:
Unit 1: Basic Concepts of Sampling Distribution makes you familiar with the basic terms required for understanding the sampling distribution. In this unit, you will study some important terms such as population and sample, parameter and statistic, simple random sampling, estimator and estimate, etc. The concept and role of sampling distributions in statistical inference are described in this unit. The most important theorem, the "central limit theorem", and the "law of large numbers" are also described with their applications.
One of the most important sample statistics which is used to draw a conclusion about the
population mean is the sample mean. In Unit 2: Sampling Distributions of Sample Means,
we present the sampling distributions of means (single mean and difference of two means)
under various situations. We describe the sampling distribution of the mean both when the population is normally distributed and when it is not. Also in this unit, the sampling distributions of the difference of two means, for independent and for paired samples, are discussed with examples.
Sometimes we deal with data collected in the form of counts, classified into two categories or groups according to an attribute. In such situations, we use the sample proportion instead of the mean. Similarly, in many practical situations we are concerned with variability, especially when a small variance in diameter is crucial for manufacturing things that must fit together, like pipes and ball bearings. Therefore, in Unit 3: Sampling Distributions of Sample Proportions and Variances, we discuss the sampling distributions of the sample proportion and the difference of two proportions, as well as the sampling distributions of the sample variance and the ratio of two sample variances.
Unit 4: Sampling Distributions Associated with Normal Populations-I and Unit 5:
Sampling Distributions Associated with Normal Populations-II make you familiar with the
chi-square, t and F-distributions with their probability curves, summary measures, properties and applications. In these units, we also discuss how to read the tabulated values of these sampling distributions.
Expected Learning Outcomes
After completing this block, you should be able to:
 define statistical inference;
 define the basic terms such as population and sample, parameter and statistic, simple
random sampling, estimator and estimate, etc. that are used in statistical inference;
 explain the concepts of the sampling distribution and standard error;
 describe the important theorem of sampling distributions “central limit theorem” and law
“law of large numbers” with their applications;
 explain the sampling distributions of mean when the population is normally distributed
and not;
 explore the sampling distributions of the difference of two means when samples are
independent and paired;
 describe the shape of the sampling distributions of sample proportion and the difference
of two sample proportions;
 explain the sampling distributions of sample variance and ratio of two sample variances;
 explain the chi-square, t and F-distributions with their probability curve, summary
measures, properties, applications and relation to other distributions;
 identify the conditions under which the chi-square, t and F-distributions can be used; and
 describe the method of obtaining the tabulated value from the t, chi-square and F-
distribution tables.

Notations and Symbols


SAQ/TQ : Self Assessment Question/Terminal Question
Fig./Figs. : Figure/Figures
X1, X2, …, Xn : A random sample of size n
X̄ : Sample mean
S² and Sp² : Sample variance and pooled sample variance
SD and SE : Standard deviation and standard error
E(X) and Var(X) : Mathematical expectation and variance of X
µ and σ² : Mean and variance of a population
Z ~ N(0, 1) : Standard normal variate
P and p : Population and sample proportions
df : Degrees of freedom
χ²(n) : Chi-square with n degrees of freedom
B(a, b) = Γ(a)Γ(b)/Γ(a + b) : Beta function
Γ(a) : Gamma function
t(n),(1−α) and t(n),α : The left- and right-tailed tabulated values of the t-statistic with n degrees of freedom
χ²(n),(1−α) and χ²(n),α : The left- and right-tailed tabulated values of the chi-square statistic with n degrees of freedom
F(n1,n2),(1−α) and F(n1,n2),α : The left- and right-tailed tabulated values of the F-statistic with (n1, n2) degrees of freedom

UNIT 1
BASIC CONCEPTS OF
SAMPLING DISTRIBUTION

Structure
1.1 Introduction
    Expected Learning Outcomes
1.2 Basic Terminology
1.3 Introduction to Sampling Distribution
1.4 Concept of Standard Error
1.5 Central Limit Theorem
1.6 Law of Large Numbers
1.7 Summary
1.8 Terminal Questions
1.9 Solutions/Answers

1.1 INTRODUCTION
In many situations, we have to extract some information from all units or items
of a group/population. But, if
• the whole population is too large to study,
• the units of the population are destructive in nature,
• there are limited resources and manpower available, etc.
then gathering the information from all units is not practically convenient, and sometimes units are destroyed under investigation. For example, as you know, many of us use Facebook, and if you are interested in knowing the average age of Facebook users, then you would have to survey every person in the world who uses Facebook. But it is not possible to survey everyone in the world.
Therefore, in most of the cases of daily life, business, industry, etc., the
information about the whole group/population is gathered through a random
sample. The results of a properly taken sample enable the investigator to
arrive at generalisations that are valid for the entire group/population. The
process of generalising sample results to the population is called Statistical
Inference.
To draw inferences about the population characteristics (known as
parameters) on the basis of a sample, we require the sampling distribution
of the statistic (function of sample observations). This unit as the title is
devoted to explaining the basic concepts required to understand the sampling
distribution as well as statistical inference.
Unit Writer: Dr. Prabhat Kumar Sangal, School of Sciences, IGNOU, New Delhi

This unit is divided into 9 sections. Section 1.1 is introductory. In Section 1.2, we define the basic terminology used in statistical inference, such as population and sample, parameter and statistic, simple random sampling, estimator and estimate, etc. The concept and role of sampling distributions in statistical inference are described in Section 1.3. In Section 1.4, you will study the concept of standard error. The most important theorem, the "central limit theorem", and the "law of large numbers" of Statistics are described with their applications in Sections 1.5 and 1.6, respectively. The unit ends with a summary of what we have discussed, in Section 1.7; the terminal questions and the solutions of the SAQs/TQs are given in Sections 1.8 and 1.9, respectively.
In the next unit, we shall discuss the sampling distributions of sample means.

Expected Learning Outcomes


After studying this unit, you should be able to:
 define statistical inference;

 define the basic terms such as population and sample, parameter and
statistic, estimator and estimate, etc. used in statistical inference;

 explain the concept of the sampling distribution and standard error;

 describe the most important theorem of Statistics “central limit theorem”;

 apply the central limit theorem in the real world; and

 explain the concept of the law of large numbers with its application.

1.2 BASIC TERMINOLOGY


Before starting the discussion on the sampling distribution/statistical inference,
you should understand the basic definitions of some of the important terms
which are very helpful to understand the fundamentals of statistical inference.
Population
In a general sense “population” means a group of people who live in a
particular geographical area or the entire group of individuals or objects
that share a common characteristic and are of interest to a researcher.
For example, the group of people who live in New Delhi, the group of teachers
working in IGNOU, students enrolled in the MSCAST programme in IGNOU,
etc.
In Statistics, a population need not consist only of people; it may be any group of elements or units under consideration by the analyst. Thus, we can define a population as
"A population is a collection/group of individuals/items/units/observations under study."
For example, the collection of laptops of a company, the group of universities that offer an M.Sc. in Applied Statistics, the learners in a counselling session, etc. are considered populations in Statistics.
In statistical inference, a population need not consist only of people or physical units; it may also be the set of measurements taken on those units. Thus, we can also define it as
“A population is a group of measurements in the quantitative or
qualitative form of the characteristic under study.”
For example, salaries of employees in a company, weights of newly born
babies in a hospital, haemoglobin levels of patients, marks of learners in
MST-016 paper, the diameters of ball bearings produced by a company, etc.
The total number of elements/items/units/observations in a population is
known as population size and is denoted by N. The characteristic under study
may be denoted by X or Y.
Sample
In general, collecting the information from all units of a large population is
time-consuming and costly. Also, if the units of a population are destroyed
under investigation, then gathering the information from all the units does not
make sense. For example, test the blood of a patient, test the quality of
crackers, etc. In such situations, a small part of the population is selected from
the population which is called a sample. Thus, we can define a sample as
“A sample is a part/fraction/subset of a population.”
For example, a syringe full of blood taken from the vein of a patient is a
sample of all blood in the patient's circulation at the moment. Similarly, a group
of 26 learners of the MSCAST is a sample of the population of learners of the
programme. The number of units selected in the sample is known as sample
size and it is denoted by n. It is extremely important to choose a sample that is
truly representative of the population so that the inferences derived from the
sample can be generalised back to the population of interest.
Sample Mean and Sample Variance
If X1, X2, ..., Xn is a random sample of size n taken from a population, then the sample mean is defined as

X̄ = (X1 + X2 + ... + Xn)/n = (1/n) Σ(i=1 to n) Xi

and the sample variance is defined as

S² = (1/(n − 1)) Σ(i=1 to n) (Xi − X̄)²

Here, we divide Σ(i=1 to n) (Xi − X̄)² by (n − 1) rather than n in our definition of the variance. The reason for taking (n − 1) in place of n will become clear in Unit 6 of this course.
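As a quick numerical check of these definitions, the following Python sketch (the data and variable names are ours, purely illustrative, not from the course text) computes X̄ and S² with the (n − 1) divisor:

```python
# Hypothetical sample observations (illustrative data only).
marks = [12.0, 15.0, 11.0, 14.0, 18.0]
n = len(marks)

# Sample mean: sum of observations divided by the sample size n.
sample_mean = sum(marks) / n

# Sample variance: sum of squared deviations divided by (n - 1), as defined above.
sample_variance = sum((x - sample_mean) ** 2 for x in marks) / (n - 1)

print(sample_mean)      # 14.0
print(sample_variance)  # 7.5
```

Dividing by n − 1 instead of n gives the unbiased estimator of the population variance, which is the point taken up in Unit 6.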
Statistical inference is a technique of drawing conclusions about the population from data (a sample). It is based on the assumption that the sample is random. We have various techniques to select a random sample. In MST-013: Survey Sampling and Design of Experiments-I, you studied various sampling methods that maintain this randomness. In statistical inference, we use simple random sampling because it is sufficient for randomness. Let us look at this sampling method.

Simple Random Sampling

Simple random sampling is the simplest and most common method of sampling used in statistical inference. In simple random sampling, the sample is drawn in such a way that each unit of the population has an equal and independent chance of being included in the sample. If a sample is drawn by this method, then it is known as a simple random sample or a random sample. The simple random sample of size n is denoted by X1, X2, …, Xn or Y1, Y2, …, Yn and the observed values of this sample are denoted by x1, x2, …, xn or y1, y2, …, yn. Some authors use x1, x2, …, xn to represent the random sample instead of X1, X2, …, Xn. But throughout the course, we use capital letters to represent the random sample, that is, X1, X2, ..., Xn or Y1, Y2, ..., Yn.

Note: Sampling is the statistical process of selecting a subset (called a "sample") of a population of interest. In other words, the procedure of drawing a sample from the population is called sampling.

In simple random sampling, elements or units are selected or drawn one by one; accordingly, it may be classified into two types:
1. Simple Random Sampling without Replacement (SRSWOR)
In simple random sampling, if the units are selected one by one in such a way that a unit drawn at a time is not replaced back into the population before the subsequent draw, the method is called SRSWOR. In this sampling, the same unit cannot appear more than once in the sample.
2. Simple Random Sampling with Replacement (SRSWR)
In simple random sampling, if the units are selected or drawn one by one in such a way that a unit drawn at a time is replaced back into the population before the subsequent draw, the method is called SRSWR. In this sampling, the same element or unit can appear more than once in the sample.
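The distinction between the two schemes can be illustrated with Python's standard library; this is a minimal sketch (the population and seed are our own illustrative choices), assuming a tiny population of four labelled units:

```python
import random

population = ["A", "B", "C", "D"]
random.seed(16)  # fixed seed only so repeated runs give the same draws

# SRSWOR: random.sample draws without replacement,
# so the same unit cannot appear twice in the sample.
srswor_sample = random.sample(population, k=2)

# SRSWR: random.choices draws with replacement,
# so the same unit may appear more than once.
srswr_sample = random.choices(population, k=2)

print(srswor_sample)  # two distinct units
print(srswr_sample)   # units may repeat
```

The key behavioural difference is that `random.sample` can never return a repeated unit, while `random.choices` can.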
Now a question may arise, if we draw a sample of a specified (given) size
from a population then how many samples are possible? To understand
the answer to this, we should know two things:
• First, whether the order of the unit in the sample matters or not, and
• Second, whether the drawn unit in the sample is replaced before the
subsequent draw or not.
Based on these, the following four cases may arise:
• Ordered sampling with replacement
• Unordered sampling with replacement
• Ordered sampling without replacement
• Unordered sampling without replacement
To see the effect of ordering and replacement on the number of samples, we consider an example in which the size of the population is very small. Suppose the four typists (A, B, C and D) in the statistics discipline of a university constitute a population. We select a sample of two typists from this population for typing a manuscript. We now discuss how many samples may be possible in the different cases.
Ordered Sampling with Replacement
In this case,
• Order matters. The order of selecting the typists in the sample is important. It means that selecting typist A first and then B (i.e. AB) differs from selecting B first and then A (i.e. BA), and each of these samples is regarded as a different possible sample that can be selected from this group of typists.
• We replace the unit. A unit drawn at a time is replaced back into the population before the subsequent draw. With replacement means that a sample selecting typist A and then A again (AA) is possible.
In this case, the total number of possible samples will be 16, which are listed below:

AA (repetition allowed)    CA (order matters)
AB                         CB (order matters)
AC                         CC (repetition allowed)
AD                         CD
BA (order matters)         DA (order matters)
BB (repetition allowed)    DB (order matters)
BC                         DC (order matters)
BD                         DD (repetition allowed)

Note: For listing all possible samples, we list them systematically. First, we list all of the possible samples with the first element of the population, i.e. A, as the first typist, then all of the possible samples with the second element, i.e. B, as the first typist, and so on. In this way, we can be sure that we have all of the possible random samples.

In this case, we can determine the total number of possible samples of any size (n) that can be selected from a population of any size (N) using the following formula:

Total number of possible samples = N^n.
The ordered sampling with replacement is also known as theoretical
sampling and it is used to develop the theories of sampling distribution. This
case is also known as simple random sampling with replacement
(SRSWR).
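As a quick check of the N^n count, this short Python sketch (ours, not the course's) enumerates all ordered samples of size 2 drawn with replacement from the typist population:

```python
from itertools import product

population = ["A", "B", "C", "D"]

# itertools.product with repeat=2 generates every ordered pair,
# allowing repetition: exactly the ordered-with-replacement case.
samples = ["".join(s) for s in product(population, repeat=2)]

print(len(samples))  # 16, i.e. N^n = 4^2
print(samples[:4])   # ['AA', 'AB', 'AC', 'AD']
```

Note that both AB and BA appear, and so do the repeated pairs AA, BB, CC and DD, matching the list above.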
Unordered Sampling with Replacement
In this case,
• Order does not matter. The order of selecting the typists in the sample is
not important. It means that if we select typist A first and then B (i.e. AB) is
the same as selecting B first and then A (i.e. BA). These samples are
considered as one and are not considered separate samples.
• We replace the unit. A unit drawn at a time is replaced back to the
population before the subsequent draw. With replacement means a
sample of selecting the typist A and then A again (AA) is possible.
In this case, the total number of possible samples will be 10. The 16 ordered arrangements are listed below; an arrangement marked "same as …" coincides with an earlier sample because order does not matter, and is not counted separately:

AA (repetition allowed)    CA (same as AC)
AB                         CB (same as BC)
AC                         CC (repetition allowed)
AD                         CD
BA (same as AB)            DA (same as AD)
BB (repetition allowed)    DB (same as BD)
BC                         DC (same as CD)
BD                         DD (repetition allowed)

The 10 distinct samples are AA, AB, AC, AD, BB, BC, BD, CC, CD and DD.

We can determine the total number of samples when order does not matter and repetition is allowed using the following formula:
Total number of possible samples = (N+n−1)Cn = (N+n−1)C(N−1).
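The same count can be verified in Python (an illustrative sketch of ours, not part of the course): `itertools.combinations_with_replacement` generates exactly the unordered-with-replacement samples, and `math.comb` evaluates the binomial coefficient in the formula.

```python
from itertools import combinations_with_replacement
from math import comb

population = ["A", "B", "C", "D"]

# Unordered samples of size 2, repetition allowed:
# AB and BA count once; AA, BB, CC, DD are included.
samples = ["".join(s) for s in combinations_with_replacement(population, 2)]

print(len(samples))        # 10
print(comb(4 + 2 - 1, 2))  # 10, matching (N+n-1)Cn with N = 4, n = 2
```

Here BA never appears (only AB does), which is exactly what "order does not matter" means.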

Ordered Sampling without Replacement


In this case,
• Order matters. The order of selecting the typists in the sample is
important. It means that if we select typist A first and then B (i.e. AB) differs
from selecting B first and then A (i.e. BA) and each of these samples is
regarded as a different possible sample that can be selected from this
group of typists.
• We do not replace the unit. A unit drawn at a time is not replaced back to
the population before the subsequent draw. This means that the same
participant can never be sampled twice. So, the samples of AA, BB, CC
and DD from the population in this example are not possible.
In this case, the total number of possible samples will be 12. The arrangements are listed below; those marked "not possible" involve repetition and are excluded:

AA (not possible)    CA (order matters)
AB                   CB (order matters)
AC                   CC (not possible)
AD                   CD
BA (order matters)   DA (order matters)
BB (not possible)    DB (order matters)
BC                   DC (order matters)
BD                   DD (not possible)

We can determine the total number of possible samples when order matters and repetition is not allowed using the following formula:
Total number of possible samples = N × (N − 1) × (N − 2) × ... × (N − n + 1).
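This product formula is what `itertools.permutations` computes by enumeration; here is a minimal check (our own sketch) for the typist population:

```python
from itertools import permutations

population = ["A", "B", "C", "D"]

# Ordered samples of size 2 without repetition:
# AB and BA are distinct, but AA, BB, CC, DD cannot occur.
samples = ["".join(s) for s in permutations(population, 2)]

print(len(samples))     # 12, i.e. N x (N - 1) = 4 x 3
print("AA" in samples)  # False: repetition is not allowed
```

Both AB and BA are present (order matters), while the four repeated pairs are absent, matching the list above.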

Unordered Sampling without Replacement


In this case,
• Order does not matter. The order of selecting the typists in the sample is
not important (order does not matter). It means that if we select typist A
first and then B (i.e. AB) is the same as selecting B first and then A (i.e.
BA). These samples are considered as one and are not considered
separate samples.
• We do not replace the unit. A unit drawn at a time is not replaced back to
the population before the subsequent draw. This means that the same
participant can never be sampled twice. So, the samples of AA, BB, CC
and DD from the population in this example are not possible.

In this case, the total number of possible samples will be 6: AB, AC, AD, BC, BD and CD. The full list of arrangements is shown below; arrangements involving repetition are not possible, and arrangements that differ only in order are counted once:

AA (not possible)    CA (same as AC)
AB                   CB (same as BC)
AC                   CC (not possible)
AD                   CD
BA (same as AB)      DA (same as AD)
BB (not possible)    DB (same as BD)
BC                   DC (same as CD)
BD                   DD (not possible)

We can determine the total number of samples when order does not matter and repetition is not allowed using the following formula:
Total number of possible samples = NCn.
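For this final case, `itertools.combinations` enumerates exactly the unordered samples without replacement, and `math.comb` gives NCn directly; a quick illustrative check (ours, not the course's):

```python
from itertools import combinations
from math import comb

population = ["A", "B", "C", "D"]

# Unordered samples of size 2 without repetition (SRSWOR):
# no repeated units, and AB/BA count as one sample.
samples = ["".join(s) for s in combinations(population, 2)]

print(len(samples))  # 6
print(samples)       # ['AB', 'AC', 'AD', 'BC', 'BD', 'CD']
print(comb(4, 2))    # 6, i.e. NCn with N = 4, n = 2
```

The six samples produced are exactly AB, AC, AD, BC, BD and CD, agreeing with the count NCn = 4C2 = 6.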

We may thus select samples in several different ways. But in practice, order does not matter because we do not care about the order in which units are selected, and we usually do not allow one individual to be chosen twice. Therefore, we often use unordered sampling without replacement, which is also called experimental sampling. This is commonly called simple random sampling without replacement (SRSWOR). However, in statistical inference, ordered simple random sampling with replacement (commonly called SRSWR) is used to develop the theory. On the other hand, in a large population, the probability of selecting the same individual twice is negligible, and it can be demonstrated that the results obtained from sampling with replacement are very close to those obtained from sampling without replacement. The main advantage of sampling with replacement is that the sample observations are independent, and this simplifies the analysis.
After understanding the various situations, let us take an example for
illustration purposes.
Example 1: There are five sales associates at a car showroom. The number
of cars they sold last week is as follows:

Sales Associate Cars Sold

Vihaan 2

Rohan 5

Ritika 4

Hassan 6

Anita 10

A researcher wants to select a sample of 2 associates. How many samples of size 2 are possible with replacement? Also, write them.

Solution: Here, we are given that

Population size = N = 5, Sample size = n = 2

We know that the number of possible samples of size n taken from a population of size N with replacement (order matters and replacement is allowed) is Nⁿ.
Block 1 Sampling Distributions

Therefore, the number of possible samples in our case is Nⁿ = 5² = 25. These 25 samples are given in Table 1.1 along with the cars sold.


Table 1.1: Possible Samples of Sales Associates with Cars Sold

Sample   Sample in Terms     Sample           Sample   Sample in Terms     Sample
Number   of Associates       Observations     Number   of Associates       Observations
                             (cars sold)                                   (cars sold)
1        (Vihaan, Vihaan)    (2, 2)           14       (Ritika, Hassan)    (4, 6)
2        (Vihaan, Rohan)     (2, 5)           15       (Ritika, Anita)     (4, 10)
3        (Vihaan, Ritika)    (2, 4)           16       (Hassan, Vihaan)    (6, 2)
4        (Vihaan, Hassan)    (2, 6)           17       (Hassan, Rohan)     (6, 5)
5        (Vihaan, Anita)     (2, 10)          18       (Hassan, Ritika)    (6, 4)
6        (Rohan, Vihaan)     (5, 2)           19       (Hassan, Hassan)    (6, 6)
7        (Rohan, Rohan)      (5, 5)           20       (Hassan, Anita)     (6, 10)
8        (Rohan, Ritika)     (5, 4)           21       (Anita, Vihaan)     (10, 2)
9        (Rohan, Hassan)     (5, 6)           22       (Anita, Rohan)      (10, 5)
10       (Rohan, Anita)      (5, 10)          23       (Anita, Ritika)     (10, 4)
11       (Ritika, Vihaan)    (4, 2)           24       (Anita, Hassan)     (10, 6)
12       (Ritika, Rohan)     (4, 5)           25       (Anita, Anita)      (10, 10)
13       (Ritika, Ritika)    (4, 4)

Note: To list all possible samples, we proceed systematically. First, we list all of the possible samples with the first element of the population (here, Vihaan) in the first position, then all of the possible samples with the second element in the first position, and so on. In this way, we can be sure that we have all of the possible random samples.
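The 25 ordered samples with replacement in Table 1.1 can likewise be generated programmatically (a sketch; the associate names and sales figures are those of Example 1):

```python
# Ordered sampling with replacement: itertools.product gives all N**n tuples.
from itertools import product

cars_sold = {"Vihaan": 2, "Rohan": 5, "Ritika": 4, "Hassan": 6, "Anita": 10}

samples = list(product(cars_sold, repeat=2))  # N**n = 5**2 = 25 samples
observations = [tuple(cars_sold[a] for a in s) for s in samples]

print(len(samples))                 # 25
print(samples[0], observations[0])  # ('Vihaan', 'Vihaan') (2, 2)
```

Unlike `combinations` in the earlier sketch, `product(..., repeat=2)` keeps both orderings and repeated pairs, which is exactly the SRSWR counting rule Nⁿ.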

Let us try the following Self Assessment Question to check your understanding of the number of possible samples.

SAQ 1
A hospital administrator wishes to estimate the mean weight of babies born in her hospital. She collects the birth records of a day and observes the weights (in pounds) of 4 babies (B1, B2, B3, B4) as 6, 8, 7 and 6 pounds. How many samples of size 2 are possible with replacement? Also, write them.

After understanding the concept of population, sample and how many samples
are possible in different cases, let us move to the next concept.
Parameter
The characteristics of a population can be described with some measures
such as population mean, variance, etc. These measures are known as
parameters of the population. Thus, we can define a parameter as:
“A parameter or population parameter is a numerical value that
summarises or measures or represents a specific characteristic of an
entire population.”

The parameters are derived from data collected from the entire population and are treated as fixed constants.
For example, suppose the course coordinator of the MST-016 course
calculates the average marks of all the learners in the MST-016 course then
the obtained average mark is a parameter because it is based on all learners
of the MST-016 course. Similarly, population variance, population coefficient of
variation, population correlation coefficient, etc. are all parameters. Population parameters are usually denoted by Greek letters; for example, the population mean and variance are represented by µ and σ², respectively. Generally, the parameters of a population are unknown and are estimated from sample data.
We know that a population of measurements such as heights, marks, etc. can be described by a distribution such as the normal, Poisson, etc., and that a distribution is fully determined by its constants: in the case of a normal distribution we need to know µ and σ², and in the case of a Poisson distribution we need to know λ. These constants are also known as parameters.
Statistic
As a parameter describes the characteristic of the population in a similar way
a statistic describes the characteristic of the sample. We can define a statistic
as:
“A sample statistic or statistic is a numerical measure that summarizes
or describes a characteristic of a sample.”
A statistic is calculated using sample data. For example, suppose the course
coordinator of the MST-016 course calculates the average marks of the
learners in the MST-016 course by selecting some learners randomly instead
of all learners then the obtained sample average mark is a statistic because it
is based on the sample of the learners of the course. Similarly, sample
variance, sample coefficient of variation, sample correlation coefficient, etc.
are all statistics (plural of statistic). A statistic is usually denoted by a Roman/Latin letter; for example, the sample mean and variance are represented by X̄ and S², respectively. Generally, a statistic is used to estimate the corresponding population parameter. We may also define a statistic as
Any quantity calculated from sample values that does not contain any
unknown parameter is known as a statistic.
For example, if X1, X2, ..., Xn is a random sample of size n taken from a population with mean µ and variance σ² (both unknown), then the sample mean

X̄ = (1/n) ∑ Xᵢ  (summed over i = 1 to n)

is a statistic, whereas (X̄ − µ)/(σ/√n) and X̄/σ are not statistics because both are functions of unknown parameters. If both µ and σ² are known, then (X̄ − µ)/(σ/√n) and X̄/σ become statistics because µ and σ² are known and are treated as constants. We use different symbols for parameters
and statistics as follows:

                       Population parameter          Sample statistic

Mean                   μ (Greek letter “mu”)         X̄ (called “X-bar”)
Standard deviation     σ (Greek letter “sigma”)      S (Latin letter “S”)
Variance               σ²                            S²
Proportion             P                             p

Estimator and Estimate

Generally, population parameters are unknown, and the whole population is too large to examine for finding the parameters due to cost and time constraints. In such situations, we take a sample from the population. Since the sample drawn from a population always contains some information about the population, we guess or estimate the value of the parameter under study based on a random sample drawn from that population.

A statistic that is used to estimate an unknown parameter is known as an estimator. An estimator is a function of the sample observations. Common estimators are the sample mean and sample variance, which are used to estimate the unknown population mean and variance, respectively. The estimator itself is a random variable because it is a function of the random sample observations. Its value varies from sample to sample due to the randomness inherent in the sampling process. A particular value of the estimator based on the observed value of the sample is known as an estimate of the parameter.

Note: Any statistic used to estimate an unknown population parameter is known as an estimator, and a particular value of the estimator is known as an estimate of the parameter. The estimated values of the sample mean and sample variance are denoted by x̄ and s², respectively.
For example, let us suppose that a pharmacologist wants to test a new systolic blood pressure (SBP) medicine. He selects a random sample of 100 participants with high SBP and prescribes the medicine for one month. After one month, he measures the SBP of the participants and calculates the mean of the SBP measurements. Suppose the mean SBP is found to be 125 mm Hg. Here the sample mean is the estimator, and its observed value of 125 mm Hg, which suggests that the wider population is likely to maintain an SBP of about 125 mm Hg while taking the medicine, is an estimate.

Consider another example: suppose we want to estimate the average height of students in a college. If we estimate the average height by selecting some students randomly from the college, then the sample average is the estimator and its particular value, say 165 cm, is the estimate of the average height of all students in the college.
In this course, we use capital letters for an estimator and small letters for its estimated value. For example, we represent the estimator sample mean by X̄ and its particular value (the estimate) by x̄. In general, an estimator is denoted by Tn = t(X1, X2, …, Xn), where n denotes the sample size and the estimator Tn is a function of the random sample X1, X2, …, Xn.
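The distinction between an estimator (a rule) and an estimate (a number) can be sketched in code. The population of heights below is hypothetical, invented purely for illustration:

```python
import random

def sample_mean(sample):
    """The estimator T_n = t(X1, ..., Xn): a rule applied to any sample."""
    return sum(sample) / len(sample)

random.seed(42)  # makes this illustration reproducible
# Hypothetical population of student heights (cm), roughly centred at 165.
population = [random.gauss(165, 6) for _ in range(1000)]

# Two different samples give two different estimates from the same estimator.
estimate_1 = sample_mean(random.sample(population, 20))
estimate_2 = sample_mean(random.sample(population, 20))
print(round(estimate_1, 1), round(estimate_2, 1))
```

The function `sample_mean` is the estimator; each printed number is one estimate, and the two estimates generally differ because the samples differ.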
Now, you check your understanding of the above discussion by answering the
following Self Assessment Question.
SAQ 2
A quality control inspector takes a random sample of 10 ball bearings from the process of manufacturing ball bearings and measures their internal diameters. The obtained results (in inches) are as follows:
29, 31, 30, 32, 30, 29, 30, 30, 29, 30
(i) Obtain the standard deviation of the variation in the diameter.
(ii) Identify the estimator and the estimate in this case.

After understanding the concept of basic terminology, you are now ready to
learn the main concept of this unit, that is, sampling distribution. Let us discuss
it in the next section.

1.3 INTRODUCTION TO SAMPLING DISTRIBUTION
As we have discussed in Section 1.1, if the population is too large or the units
or items of the population are destructive or there are limited resources such
as manpower, money, etc., then it is not possible practically to examine every
unit of the population to obtain the necessary information or characteristics/
parameters of the population. For example, suppose we want to know the average life of the electric LED bulbs manufactured by a company. The company manufactures a lot of bulbs, say 5,000 per day. Gathering information about the average life of all the bulbs does not make sense because the bulbs are destroyed under investigation. As another example, if we want to know the average income of the persons living in a big city, then collecting the information from each person consumes a great deal of time and manpower. In such situations, one can draw a sample from the population under study and utilize the sample observations to extract the necessary information about the population. The results obtained from the sample are projected in such a way that they are valid for the entire population. The sample, therefore, works like a “vehicle” for drawing valid conclusions about the population. This technique is known as statistical inference. We can define it as:
“The process of projecting the sample results for the whole population is known as statistical inference.”
For drawing inferences about the population, we analyse the sample data, that
is, we calculate the value of a sample statistic such as the sample mean,
sample proportion, sample variance, etc. Generally, if we want to draw
inferences about the population characteristic such as population mean, we
use the sample mean, about population variance, we use sample variance,
etc. For example, suppose a researcher of health science wants to know the
average cholesterol levels of the persons living in a city. For practical reasons,
he cannot reach out to each and every person in the city. So, he randomly
selected 10 persons (Sample-I) from the city and measured their cholesterol
levels. The observed values of the cholesterol levels are given in column 2 of
Table 1.2.
Table 1.2: Cholesterol Levels of 10 Individuals in Sample-I and Sample-II

Cholesterol Level (in mg/dl)


S. No.
Sample-I Sample-II
1 180 200
2 200 200
3 190 180
4 220 200
5 180 190
6 190 180
7 220 240
8 200 220
9 190 200
10 180 210
Total 1950 2020
Sample Mean 195 202

From the table, the average (mean) cholesterol level of these selected persons is 195 mg/dl. Now, if we use this sample to make an inference about the population's average cholesterol level, then we say that the sample average cholesterol level of 195 mg/dl is an estimate of the average cholesterol level of the persons living in this city (the population). However, we do not know the precision of this estimate (how accurate it is) because the actual average cholesterol level of the persons in the city is unknown.
If another random sample of 10 persons (Sample-II) is selected and their cholesterol levels are measured (shown in column 3 of Table 1.2), then the average (mean) cholesterol level of these selected persons is found to be 202 mg/dl. This differs from Sample-I because the sample contains different persons with different cholesterol levels. Thus, if we use this sample mean to estimate the population mean, we get a different estimate of the average cholesterol level for the whole population. If other samples of 10 persons are selected, it is unlikely that exactly the same mean would be observed; therefore, we may expect a different estimate of the population mean every time. Now, the following questions may arise:
• How well does a sample describe its population?
• How can we tell which sample gives the best estimate of the population
parameter?
• What is the probability of selecting a sample with specific
characteristics?
These questions can be answered once we establish the sampling
distribution of the statistic such as mean, variance, proportion, etc. The
sampling distribution of a statistic is important because it enables us to draw
conclusions about the corresponding population parameter.
For a better understanding of the concept of the sampling distribution, we consider a population of very small size so that we can easily obtain the population parameters, say, mean (µ) and standard deviation (σ), using all the population observations, and then see how the sampling distribution helps us draw conclusions about the population.
Consider a small production industry which has five employees. The monthly
salary (in thousands) of each employee is given as follows:
Table 1.3: Monthly Salaries of Employees

Monthly Salary
Employee
(in thousands)

Lavnik 25

Avishi 30

Aman 15

Tanishq 25

Harsh 10

We can calculate the mean of the monthly salary (population) as

μ = (25 + 30 + 15 + 25 + 10)/5 = 105/5 = 21
Similarly, we can obtain the standard deviation of the monthly salary (population) as

σ = √[(1/N) ∑ (Xᵢ − μ)²]
  = √[{(25 − 21)² + (30 − 21)² + (15 − 21)² + (25 − 21)² + (10 − 21)²}/5]
  = √[(16 + 81 + 36 + 16 + 121)/5] = √54 = 7.35

We can plot the graph of the population (monthly salary) to know the form of
the population as follows in Fig. 1.1.

Fig. 1.1: Monthly salary of the employees of a small industry.

From the above figure, we observe that the monthly salary of the employees
does not follow a known distribution (especially normal) and it is left-skewed.
Let us assume that we do not know the average salary of the employees. So
we decide to estimate the population mean on the basis of a sample of size
n = 2. As you have studied in Section 1.2, there are Nⁿ = 5² = 25 possible simple random samples with replacement of size 2. All possible samples of size n = 2 are given in the second column of Table 1.4. We also calculate the mean of each sample; these are given in the last column of the same table.
Table 1.4: Samples and Sample Means

Sample    Sample in Terms      Sample Observations    Sample
Number    of Employees         (monthly salary)       Mean
1 (Lavnik, Lavnik) (25, 25) 25


2 (Lavnik, Avishi) (25, 30) 27.5
3 (Lavnik, Aman) (25, 15) 20
4 (Lavnik, Tanishq) (25, 25) 25
5 (Lavnik, Harsh) (25, 10) 17.5
6 (Avishi, Lavnik) (30, 25) 27.5
7 (Avishi, Avishi) (30, 30) 30
8 (Avishi, Aman) (30, 15) 22.5
9 (Avishi, Tanishq) (30, 25) 27.5
10 (Avishi, Harsh) (30, 10) 20
11 (Aman, Lavnik) (15, 25) 20
12 (Aman, Avishi) (15, 30) 22.5
13 (Aman, Aman) (15, 15) 15
14 (Aman, Tanishq) (15, 25) 20
15 (Aman, Harsh) (15, 10) 12.5
16 (Tanishq, Lavnik) (25, 25) 25
17 (Tanishq, Avishi) (25, 30) 27.5
18 (Tanishq, Aman) (25, 15) 20
19 (Tanishq, Tanishq) (25, 25) 25
20 (Tanishq, Harsh) (25, 10) 17.5
21 (Harsh, Lavnik) (10, 25) 17.5
22 (Harsh, Avishi) (10, 30) 20
23 (Harsh, Aman) (10, 15) 12.5
24 (Harsh, Tanishq) (10, 25) 17.5
25 (Harsh, Harsh) (10, 10) 10

From the above table, you can see that the value of the sample statistic (the sample mean) varies from sample to sample, and out of the 25 samples there is no sample whose mean is equal to the population mean. It means that none of the 25 samples estimates the population mean exactly. So, what do we do now? How can we estimate it? The sampling distribution helps us in such situations. Let us now try to understand this concept.
Since the sample mean varies from sample to sample, therefore, it is treated
as a random variable. As you have studied in MST-012, a random variable has
a distribution, so the sample statistic (sample mean) has a distribution. To
obtain the distribution of a statistic (sample mean) we arrange the values of it
in ascending or descending order and calculate the frequency of each value
as shown in Table 1.5. We can also obtain the probability distribution (using
the relative frequency approach of probability) described in MST-012 of the
occurrence of each value which is given in the last column of Table 1.5.
Table 1.5: Sampling Distribution of Sample Means

S. No.    X̄       Frequency (f)    Probability (p)
1         10       1                1/25 = 0.04
2         12.5     2                2/25 = 0.08
3         15       1                1/25 = 0.04
4         17.5     4                4/25 = 0.16
5         20       6                6/25 = 0.24
6         22.5     2                2/25 = 0.08
7         25       4                4/25 = 0.16
8         27.5     4                4/25 = 0.16
9         30       1                1/25 = 0.04

Note: We can also construct the sampling distribution of the sample median instead of the mean in the same way, but the sampling distribution of the median is generally not normal.

So the arrangement of all possible values of the sample mean with their corresponding probabilities is called the sampling distribution of the mean. Thus, we can define the sampling distribution in general as follows:
“The probability distribution of all possible values of a sample statistic
that would be obtained by drawing all possible samples of the same size
from the population is called the sampling distribution of that statistic.”
To get an idea of the shape of the sampling distribution of mean, we plot the
graph (frequency bar) of the values of the sample mean taking the sample
mean on the X-axis and corresponding frequencies on the Y-axis as shown in
Fig. 1.2(b).

[Figure: (a) Population distribution; (b) Distribution of all the possible sample means (n = 2) from this population.]

Fig. 1.2: Population distribution and sampling distribution of mean for n = 2.

From Fig. 1.2, we observe that the shape of the sampling distribution differs from the shape of the population distribution.
Just as a probability distribution such as the normal, Poisson, binomial, etc. allows us to understand the summary measures (mean, standard deviation, etc.), likelihood and probabilities of different values occurring in the outcome, a sampling distribution is a probability distribution that does the same job. The sampling distribution of the sample mean itself has a mean, variance, etc. Therefore, we can calculate the mean of the sample means as

Mean of the sample means = (1/K) ∑ X̄ᵢ fᵢ , where K = ∑ fᵢ = 25

= (1/25)(10 × 1 + 12.5 × 2 + ... + 30 × 1) = 21 = μ

We can also calculate the mean of the sample means as

E(X̄) = ∑ X̄ᵢ pᵢ = 10 × (1/25) + 12.5 × (2/25) + ... + 30 × (1/25) = 21

Notice that the mean of the sampling distribution of means is equal to the population mean. This is what we wanted, although no single sample gave it to us; we obtain it as the mean of the sampling distribution of means.
Thus, we can say that to find an exact estimate of the unknown population parameter, we first find the sampling distribution of the corresponding statistic and then compute the mean of the obtained sampling distribution of that statistic.
We now calculate the standard deviation of the sampling distribution of mean as

SD(X̄) = √[(1/K) ∑ fᵢ (X̄ᵢ − X̄)²]

= √[(1/25){1 × (10 − 21)² + 2 × (12.5 − 21)² + ... + 1 × (30 − 21)²}]

= √[(1/25)(121 + 144.5 + ... + 121)] = 5.20

Therefore, SD(X̄) = 5.20.

We now compare the distribution of the population with the sampling distribution of the mean; both are plotted in Fig. 1.2.
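All of the above computations can be reproduced by brute-force enumeration (a sketch in Python, using the five salaries from Table 1.3):

```python
from itertools import product
from math import sqrt
from collections import Counter

salaries = [25, 30, 15, 25, 10]  # population of monthly salaries (in thousands)
n = 2

# All N**n = 25 ordered samples with replacement, and their means.
means = [sum(s) / n for s in product(salaries, repeat=n)]
dist = Counter(means)  # the sampling distribution of the mean (Table 1.5)

mean_of_means = sum(means) / len(means)
sd_of_means = sqrt(sum((m - mean_of_means) ** 2 for m in means) / len(means))

print(dist[20.0])             # 6 samples have mean 20, as in Table 1.5
print(mean_of_means)          # 21.0, equal to the population mean
print(round(sd_of_means, 2))  # 5.2
```

Changing `n` to 3 enumerates the 125 samples of the next subsection in the same way.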
Let us now see what happens as we increase the sample size. Suppose we decide to estimate the population mean on the basis of a sample of size n = 3. In this case, there are Nⁿ = 5³ = 125 possible simple random samples with replacement of size 3. We can list all possible samples of size n = 3 and, for each sample, calculate the sample mean, as shown in Table 1.6.
Table 1.6: Samples and Sample Means

Sample    Sample in Terms      Sample Observations    Sample
Number    of Employees         (monthly salary)       Mean
1 (Lavnik, Lavnik, Lavnik) (25, 25, 25) 25


2 (Lavnik, Lavnik, Avishi) (25, 25,30) 26.67
3 (Lavnik, Lavnik, Aman) (25, 25,15) 21.67
4 (Lavnik, Lavnik, Tanishq) (25, 25, 25) 25
5 (Lavnik, Lavnik, Harsh) (25, 25, 10) 20
6 (Lavnik, Avishi, Lavnik) (25, 30, 25) 26.67
7 (Lavnik, Avishi, Avishi) (25, 30, 30) 28.33
8 (Lavnik, Avishi, Aman) (25, 30, 15) 23.33
9 (Lavnik, Avishi, Tanishq) (25, 30, 25) 26.67
10 (Lavnik, Avishi, Harsh) (25, 30, 10) 21.67
… … … …
125 (Harsh, Harsh, Harsh) (10, 10, 10) 10

We now prepare the sampling distribution of mean as in the case when n = 2.


X̄        Frequency (f)    Probability (p)

10 1 1/125 = 0.008

11.67 3 3/125 = 0.024

13.33 3 3/125 = 0.024

15 7 7/125 = 0.056

16.67 15 15/125 = 0.12

18.33 12 12/125 = 0.096

20 15 15/125 = 0.12

21.67 24 24/125 = 0.192

23.33 15 15/125 = 0.12



25 11 11/125 = 0.088

26.67 12 12/125 = 0.096

28.33 6 6/125 = 0.048

30 1 1/125 = 0.008

We now calculate the mean of the sampling distribution of mean for n = 3 as

Mean of the sample means = (1/K) ∑ X̄ᵢ fᵢ , where K = ∑ fᵢ = 125

= (1/125)(10 × 1 + 11.67 × 3 + ... + 30 × 1) = 21 = μ

Similarly, we calculate the standard deviation of the sampling distribution of mean when n = 3 as

SD(X̄) = √[(1/K) ∑ fᵢ (X̄ᵢ − X̄)²]

= √[(1/125){1 × (10 − 21)² + 3 × (11.67 − 21)² + ... + 1 × (30 − 21)²}] = 4.24

We now plot the sampling distribution of mean when the sample size n = 3 in
Fig. 1.3.

Fig. 1.3: Sampling distribution of mean when sample size n = 3

From Fig. 1.3, you can observe that it is similar to the normal distribution.
From the above discussion, we observe some features of the sampling distribution:
• The shape of the sampling distribution of mean may be quite different from the shape of the population.
• The sample means “pile up” around the population mean and “tail off” towards the extremes. For this example, the population mean is µ = 21, and the sample means are clustered around a value of 21. It should not surprise you that the sample mean tends to approximate the population mean.
• The sampling distribution of the sample mean tends towards a bell-shaped normal probability distribution as the sample size n increases.

• The standard deviation of the sampling distribution of mean decreases as the sample size increases: for n = 2 it is 5.20 and for n = 3 it is 4.24.
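The last feature anticipates the formula SE(X̄) = σ/√n introduced in Section 1.4. As a sketch, we can check the two values above against this formula, using σ = √54 ≈ 7.35 computed earlier for the salary population:

```python
from math import sqrt

sigma = sqrt(54)  # population standard deviation of the salaries, ≈ 7.35

# The spread of the sampling distribution shrinks like sigma / sqrt(n).
for n in (2, 3, 4):
    print(n, round(sigma / sqrt(n), 2))
# n = 2 gives 5.2 and n = 3 gives 4.24, matching the values computed above.
```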
Factors that influence sampling distribution
The form of sampling distribution depends on the following factors:
• the distribution of the population,
• the statistic being considered,
• the sampling procedure employed, and
• the sample size used.
In the above example, you saw that when the population size is 5 there are only 25 and 125 samples of size n = 2 and 3, respectively. In more realistic situations, when the population is large, the number of possible samples increases dramatically. For example, when the population has 1,000 units, there will be (1000)² = 1,000,000 possible simple random samples of size 2, and it is virtually impossible to obtain every possible random sample. In such situations, we draw a single random sample from the population to draw inferences about the population parameters and use the theoretical sampling distributions which have already been developed.
It is now time for you to try the following Self Assessment Question to make
sure that you understand the sampling distribution.

SAQ 3
A Municipal Corporation Office has four typists. An officer gave the same
sample page of a manuscript to all four typists. The number of errors made by
each typist is shown in Table 1.7 as given below:
Table 1.7: Number of Errors per Typist

Typist Number of Errors

A 4

B 2

C 3

D 1

Answer the following questions.


(i) Is the distribution of population values normally distributed (bell-
shaped)?
(ii) What is the population mean and standard deviation?
(iii) How many samples of size n = 2 are possible with replacement?
(iv) List all possible samples of 2 from the population and compute their
means.
(v) Organize the means into a sampling distribution.
(vi) Does the distribution of the sample mean computed in part (v) show
some tendency towards being bell-shaped?
(vii) What is the mean of the sampling distribution? What observations can be made about the population and the sampling distribution?
(viii) Compare the dispersion (SD) in the population with that in the
distribution of the sample mean.

1.4 CONCEPT OF STANDARD ERROR


In Section 1.3, we described the sampling distribution of a statistic (the mean) and observed that the mean of the sampling distribution of mean is equal to the population mean. But in the real world, the population size is too large; when the population size is large, the number of possible samples increases dramatically, and it is virtually impossible to actually obtain every possible random sample and then observe the sampling distribution. In such situations, we draw a random sample from the population to draw inferences about the population parameters and use the concept of the theoretical sampling distribution. For example, consider the example of estimating the average cholesterol levels of the persons living in a city. Due to cost and time
average cholesterol levels of the persons living in a city. Due to cost and time
constraints, the researcher selected 10 individuals randomly from the same
city and obtained the average (mean) cholesterol level of these selected
individuals was 195 mg/dl. Can you expect that it is somewhere close to or
equal to the average cholesterol level of the whole population? I think the
answer is no. Then the question may arise:
• How can we judge that an estimate observed from a single sample is a
reliable estimate of the population parameter?

• How can know the precision (the estimate is accurate or not) of the
estimate when it is not possible to compute the exact value?

The standard error does these jobs for us. It measures how far the sample mean of the data is likely to be from the true population mean. The standard error provides a measure of how much distance is expected, on average, between a sample mean and the population mean. It gauges the accuracy of a sample mean by measuring the sample-to-sample variability of the sample means. We now formally define the standard error (SE) as:

“The standard deviation of a sampling distribution of a statistic (mean, proportion, standard deviation, etc.) is known as the standard error.”

The standard error has no special symbol of its own; it is denoted by its abbreviation SE.

Note: Standard deviation and standard error of the mean are both statistical measures of variability. The standard deviation of a sample depicts the spread of observations within the given sample regardless of the population mean, whereas the standard error of the mean measures the degree of dispersion of the sample means around the population mean.

The standard error is the standard deviation of the sampling distribution, so it serves the same two purposes for the sampling distribution as the standard deviation does for the data, which are listed as follows:
1. The standard error provides a measure of how much discrepancy is
expected from one sample to another. When the standard error is small,
then all of the sample means are close together and have similar values. If
the standard error is large, then the sample means are scattered over a
wide range and there are major differences from one sample to another.

2. The standard error measures how well an individual sample mean represents the entire population. Specifically, it gives an indication of the reasonable deviation that can be expected between a sample mean and the population mean.
The standard error is an extremely valuable measure because it specifies
precisely how well a sample mean estimates its population mean, that is, how
much error you should expect between the sample mean and population
mean. However, you do not expect a sample to provide a perfectly accurate
picture of the population. There is always some discrepancy, or error, between
a sample statistic and the corresponding population parameter. But with the
help of standard error, we are able to calculate exactly how much error to
expect.
By the definition of the standard error, “the standard deviation of a sampling distribution of a statistic is known as the standard error”. In the previous section, we calculated the standard deviation of the sampling distribution of the sample mean as follows:

SE(X̄) = √[(1/K) ∑ fᵢ (X̄ᵢ − X̄)²], where K = ∑ fᵢ

= √[(1/25){1 × (10 − 21)² + 2 × (12.5 − 21)² + ... + 1 × (30 − 21)²}]

= √[(1/25)(121 + 144.5 + ... + 121)] = 5.20

This is therefore the standard error of the mean.


The computation of the standard error using the sampling distribution is a tedious process. Therefore, there is an alternative method to compute the standard error of the mean from a single sample. We can calculate the magnitude of the standard error using the following formula:

SE(X̄) = σ/√n

That is, we can calculate the standard error of the mean using the standard deviation of the population (the amount of variability among the units of the population) and the sample size.

But in real life, the population standard deviation is generally unknown. In such situations, it is more common to estimate the standard error by substituting the sample standard deviation in place of the population standard deviation. The formula for the standard error then becomes

SE(X̄) = S/√n
where S is the standard deviation of the sample, which we can calculate using the following formula:

S = √[(1/(n − 1)) ∑ (Xᵢ − X̄)²]

These are the expressions of the standard error of the mean. We give the
expressions of standard error for other statistics such as proportion, standard
deviation, etc. in the subsequent units.
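A minimal sketch of the S/√n route, using a small hypothetical sample (the measurements below are invented for illustration):

```python
from math import sqrt

sample = [12.1, 11.9, 12.0, 12.2, 11.8]  # hypothetical measurements
n = len(sample)

xbar = sum(sample) / n
# Sample standard deviation with divisor n - 1, as in the formula above.
s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
se = s / sqrt(n)  # estimated standard error of the mean

print(round(se, 3))  # 0.071
```

The divisor n − 1 (rather than n) matches the definition of S given above; Python's `statistics.stdev` uses the same convention.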
Let us try one example relating to standard error.
Example 2: The diameter of a steel ball bearing produced by a semi-
automatic machine is known to be distributed normally with a mean of 12 cm
and a standard deviation of 0.1 cm. If we take a random sample of size 10
then find the standard error of the sample mean for estimating the population
mean of the diameter of all ball bearings produced.
Solution: Here, we are given that

μ = 12, σ = 0.1, n = 10

Since the population standard deviation (σ) is given, we can calculate the standard error of the sample mean as

SE(X̄) = σ/√n = 0.1/√10 = 0.03
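Example 2 can be checked in a couple of lines (the figures are those given in the example):

```python
from math import sqrt

sigma, n = 0.1, 10            # population SD (cm) and sample size
se = sigma / sqrt(n)          # SE of the mean for a sample of size 10
print(round(se, 2))           # 0.03 (more precisely, about 0.0316)
```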

We now discuss the applications of standard error.


Applications of Standard Error
Standard error plays a very crucial role:
1. The standard error is used to find the accuracy or precision or reliability of
the sample estimate of a population parameter. The precision of a sample
estimate can be defined as the reciprocal of the standard error of the
statistic. We can calculate the precision as
Precision = 1 / (Standard error of the estimate)

2. It is used to construct confidence intervals (CI) within which the population


parameter may be expected to lie with a certain level of confidence. The
standard error also determines the probable limits or confidence limits. We
will discuss confidence intervals in more detail in Units 12, 13 and 14.
3. The standard error is also used to test whether the difference between the
sample statistic and the population parameter is significant or is due to
sampling fluctuations. It means that the standard error is also applicable in
the testing of hypothesis. We will discuss testing of hypothesis in more
detail in the last block of this course.
Since precision is the reciprocal of the standard error, we try to decrease the standard error.
Factors that Decrease Standard Error
The formula for calculating the standard error of the sample mean is given as follows:

SE(X̄) = σ/√n

From the above formula, the magnitude of the standard error depends on two
factors:
1. The standard deviation of the population from which the sample is
selected; and
2. The size of the sample.
We examine each of these factors one at a time.
Block 1 Sampling Distributions

Effect of population standard deviation


The standard error depends directly on the population standard deviation.
Therefore, as the population standard deviation decreases, the standard error
also decreases. It means that the less the population units deviate from the
population mean, the less likely the sample means are to deviate from the
population mean. To visualise the effect of the SD of the population, we
consider different populations having SD as shown in the table given below
and we select samples of the same size 9 (n = 9) from each population. We
compute the standard error for each sample as follows:
Sample Size   Standard Deviation of      Standard Error
              the Population (σ)
     9                 1                      0.33
     9                 4                      1.33
     9                 9                      3
     9                16                      5.33
     9                25                      8.33
     9                36                     12
     9                49                     16.33
     9                64                     21.33
     9                81                     27
     9               100                     33.33
We now plot the standard error corresponding to the standard deviation of the
population as shown in Fig. 1.4.

Fig. 1.4: The standard error with the population standard deviation.

From the above figure, we observe that as the standard deviation of the
population decreases, the standard error or the average distance of the
sample mean deviating from the population mean also decreases.
Effect of sample size (n)
As we noted in the previous section, when we increase the sample size n from 2
to 3, the standard error decreases. Also, from the formula for calculating
standard error, we observe that there is an inverse relationship between the
sample size and the standard error. As we increase the sample size, the
standard error decreases. It means that if we collect more data in a sample,
our estimate of the population mean will be more accurate. To illustrate the
general relationship between standard error and sample size, we visualize the
effect of the sample size in Fig. 1.5. For that, we select samples of different
sizes from a single population with a standard deviation equal to 9 and
compute the standard error for each sample as follows:

Sample Size   Standard Deviation of      Standard Error
              the Population (σ)
     1                 9                      9
     4                 9                      4.5
     9                 9                      3
    16                 9                      2.25
    25                 9                      1.8
    36                 9                      1.5
    49                 9                      1.3
    64                 9                      1.13
    81                 9                      1
   100                 9                      0.9

We now plot the standard error corresponding to the sample size as shown in
Fig. 1.5.

Fig. 1.5: The standard error with the sample size.

From the above figure, we observe that the standard error of the mean will
approach zero with the increasing number of observations in the sample. It
happens because when the sample size becomes large then the sample
becomes more and more representative of the population, and the sample
mean approaches the actual population mean.
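Both effects follow directly from the formula SE(X̄) = σ/√n, and can be verified with a few lines of Python (a sketch reusing values from the two tables above; the helper name is ours):

```python
import math

def se(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Effect of the population SD (sample size fixed at 9): SE grows with sigma
print([round(se(s, 9), 2) for s in (1, 4, 9, 25, 100)])  # [0.33, 1.33, 3.0, 8.33, 33.33]

# Effect of the sample size (sigma fixed at 9): SE shrinks like 1/sqrt(n)
print([round(se(9, n), 2) for n in (1, 4, 25, 100)])     # [9.0, 4.5, 1.8, 0.9]
```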
Note: The standard errors discussed above were calculated under the
assumption that sampling is done either from an infinite population or from a
finite population with replacement. But, in real-life sampling problems, most
sampling plans do not permit an element to be selected twice in a given
sample (i.e. sampling with replacement). Consequently, if the population is not
large in comparison to the size of the sample and sampling is done without
replacement then we multiply the above standard errors by the correction
factor √((N − n)/(N − 1)),

where N is the population size. This correction factor is known as a finite


population correction factor.
Therefore, in this case, the standard error of the sample mean is given by
SE(X̄) = √((N − n)/(N − 1)) × σ/√n

In practice, if the sample size n is less than or equal to 10% of the population
size N, this correction factor may be ignored for all practical purposes.
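The corrected formula is easy to sketch in code. In the snippet below the helper name and the numbers (σ = 12, n = 25, two hypothetical population sizes) are illustrative choices of ours:

```python
import math

def se_mean(sigma, n, N=None):
    """SE of the sample mean. When the population size N is supplied
    (sampling without replacement from a finite population), the finite
    population correction factor sqrt((N - n)/(N - 1)) is applied."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

print(round(se_mean(12, 25), 2))           # infinite population: 2.4
print(round(se_mean(12, 25, N=100), 2))    # small population: correction shrinks SE
print(round(se_mean(12, 25, N=10000), 2))  # n <= 10% of N: correction barely matters
```

The last line illustrates the closing remark: for n ≤ 10% of N the corrected and uncorrected standard errors agree to two decimal places.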
You should try the following Self Assessment Question before studying the next
section.

SAQ 4
A juice company packages its juice in 500 mL pouches. Suppose you are in
charge of monitoring whether the pouches are being filled correctly, and you
randomly select a sample of 25 pouches from the thousands of pouches filled
during a shift. Given that the standard deviation of the juice-packaging process
is 12 mL,

(i) Calculate the standard error of the sample mean.

(ii) If you select a sample of 100 pouches, what will be the standard error?

(iii) What will be the size of the standard error of the mean as the sample
size is increased from 25 to 100?

We now introduce you to some of the most powerful data laws, principles, and
rules that are based on statistical theories. These will help you become a
better data scientist or data analyst in general. The central limit theorem and
law of large numbers are the foundation of statistical inference. Understanding
these fundamental concepts can help you analyse and interpret data more
effectively, enabling you to make confident decisions based on solid evidence.
In the coming section, we will discuss the central limit theorem and, in the next
section, the law of large numbers.

1.5 CENTRAL LIMIT THEOREM


In the previous sections, you saw that when the population size is large then
the number of possible samples increases dramatically, and it is virtually
impossible to actually obtain every possible random sample. In such situations
a question may arise: is it possible to determine the sampling distribution of the
sample mean exactly, without considering all samples? The answer
to the question is given by the central limit theorem. It gives the form of the
sampling distribution of mean under some conditions.

The first concept of the central limit theorem was given by the French
mathematician Abraham De Moivre (1667-1754) in 1733. He was the first to
give the central limit theorem, but at that time the central limit theory
proposed by De Moivre was not popular. Another well-known French
mathematician, Pierre-Simon Laplace, revived the idea in 1812. The central
limit theorem discoveries made by Laplace at that time caught the interest of
numerous academics and thinkers. The central limit theorem was extended
later in 1901 by the Russian mathematician Aleksandr Lyapunov, who proved
the theorem in general and provided mathematical evidence for its validity.
The central limit theorem is one of the most important theorems of Statistics. It
states that
The sampling distribution of the mean approaches to a normal
distribution as the size of the sample increases, regardless of the shape
of the original population distribution.
We can explain the central limit theorem as

If a random sample of size n is taken from a population with mean µ and finite
variance σ², then the sampling distribution of the sample mean tends to a
normal distribution with mean µ and variance σ²/n as the sample size becomes
large (n ≥ 30), whatever may be the form of the parent population, that is,

X̄ ~ N(μ, σ²/n) when n ≥ 30

We can apply the central limit theorem to almost all types of probability
distributions, but the population must have a finite variance. That restriction
rules out the Cauchy distribution because it has infinite variance.

We do not intend to prove this theorem here but merely show graphical
evidence of its validity in Fig. 1.6. Here, we will also try to show how large the
sample size must be for us to apply the central limit theorem.

In this figure, we try to understand the sampling distribution of the sample
mean X̄ for different populations and for varying sample sizes. We present
this figure in four parts A, B, C and D. Part ‘A’ of this figure shows four different
populations as normal, uniform, binomial and exponential.
The rest of the parts B, C and D represent the shape of the sampling
distributions of mean for the sample size n = 2, n = 5 and n = 30, respectively
drawn from the populations shown in the first row (Part-A).

From the first column of this figure, we observe that when the parent
population is normal then all sampling distributions of mean for varying sample
sizes n = 2, 5 and 30 are also normal, having the same mean but their
variances decrease as n increases.

The second column of this figure represents the uniform population. Here, we
observe that the sampling distribution of mean is symmetrical and does not
follow any standard distribution when n = 2 and tends to be normal when
n = 30.

However, the third column of this figure represents the binomial population
(discrete). Again, when n = 2, the sampling distribution of mean is symmetrical
and for n = 5 it is quite bell-shaped and tends to be normal when n = 30.

The last column of this figure represents the exponential population which is
highly skewed. Here, we also observe that when the sample size n = 2 then
the sampling distribution of mean does not follow any standard distribution but
for n = 30, the distribution of mean is symmetrical bell-shaped normal.

From Fig. 1.6, we also observe that the rate at which the distribution
approaches a normal distribution depends on the shape of the population. If
the population itself is normally distributed, the sampling distribution of mean is
also normal for any sample size n, as stated earlier. On the other hand, for

population distributions that are very different from a normal distribution, a


relatively large sample size is required to achieve a good normal
approximation for the distribution.

Fig. 1.6: Sampling distribution of sample means for various populations when n = 2,
n = 5 and n = 30.
Here, we also conclude that if we draw a random sample of large size n ≥ 30
from the population then the sampling distribution of mean can be
approximated by a normal probability distribution, whatever the form of the
parent population.
From the above discussion, we draw the following results:
1. When the population/study variable is normally distributed, then the
sampling distribution of mean will be normally distributed, for any sample
size n.
2. When the distribution of the population/study variable is not normal, a
sample size of 30 or more is needed to use a normal distribution to
approximate the distribution of the sample means.
3. If the population distribution is fairly symmetrical, the sampling distribution
of the mean is approximately normal for samples as small as 5.
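These results can also be checked by simulation. The sketch below (plain Python; the exponential population with mean 2, the seed, and the repetition count are our own choices) draws many samples of size n = 30 from a highly skewed population and confirms that the sample means cluster around μ with spread close to σ/√n:

```python
import random
import statistics

random.seed(2)

def sample_means(n, reps=10000, lam=0.5):
    """Means of `reps` random samples of size n drawn from an exponential
    population with mean 1/lam = 2 and standard deviation 1/lam = 2."""
    return [statistics.fmean(random.expovariate(lam) for _ in range(n))
            for _ in range(reps)]

means = sample_means(n=30)
print(round(statistics.fmean(means), 2))  # close to the population mean 2.0
print(round(statistics.stdev(means), 2))  # close to sigma/sqrt(n) = 2/sqrt(30) ≈ 0.365
```

Plotting a histogram of `means` would reproduce the bell shape seen in the last column of Fig. 1.6 for n = 30.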
Let us see the applications of the central limit theorem using an example.
Example 3: The average breaking strength of a certain brand of steel cable is
2500 pounds, with a standard deviation of 160 pounds. A sample of 40 cables
is randomly selected and tested. What is the sampling distribution of mean?
Also, find the probability that the average breaking strength of the sample
cables:
(i) is more than 2550 pounds;
(ii) is less than 2480 pounds;
(iii) lies between 2450 pounds and 2550 pounds.

Solution: Here, we are given that


The average breaking strength of the steel cable is 2500 pounds with a
standard deviation of 160 pounds. Since no sample has yet been taken, these
figures describe the whole population of steel cables. Therefore,

µ = 2500 pounds, σ = 160 pounds and n = 40

To find the required probability, we need the probability distribution of the
breaking strength, as you have seen in MST-012. This distribution is not
given; however, the sample size 40 is large (n > 30). Therefore, to find the
sampling distribution, we use the central limit theorem. According to the
central limit theorem, when the sample size is large (n ≥ 30), the sampling
distribution of the sample mean will follow a normal distribution with mean

E(X̄) = μ = 2500

and variance

Var(X̄) = σ²/n = (160)²/40 = 640

Therefore, the sampling distribution of the average breaking strength of the


steel cables is normally distributed with a mean of 2500 pounds and
a variance of 640 pounds².

(i) The probability that the average breaking strength of the sample cables
is more than 2550 pounds is given by

P[X̄ > 2550]  [see Fig. 1.7]

To get the value of this probability, we convert the sample mean X̄ into
a standard form. Since the sampling distribution of the average breaking
strength (X̄) of the steel cables is normally distributed with a mean of 2500
pounds and a variance of 640 pounds², we transform
the sample mean (X̄) into the standard normal Z-score as

Z = (X̄ − E(X̄))/√Var(X̄) = (X̄ − 2500)/√640 = (X̄ − 2500)/25.30

Therefore, we subtract 2500 from each term and then divide each term by
25.30 in the above inequality. Thus, the probability expression becomes

P[(X̄ − 2500)/25.30 > (2550 − 2500)/25.30] = P[Z > 1.98] = 1 − P[Z ≤ 1.98]

Fig. 1.7: Normal curve showing the shaded area to the right of X̄ = 2550, i.e. Z = 1.98.

From the standard normal table (Table IV) given in the Appendix of this
volume, we get

P[X̄ > 2550] = 1 − P[Z ≤ 1.98] = 1 − 0.9761 = 0.0239

It means that 2.39% of steel cables have an average breaking strength


of more than 2550 pounds.
(ii) We can obtain the probability that the average breaking strength of the
sample cables is less than 2480 pounds as

P[X̄ < 2480] = P[(X̄ − 2500)/25.30 < (2480 − 2500)/25.30]
            = P[Z < −0.79]  [see Fig. 1.8]
            = 0.2148  [using Table III]

Fig. 1.8: Normal curve showing the shaded area to the left of X̄ = 2480, i.e. Z = −0.79.

It means that 21.48% of the steel cables have an average breaking
strength less than 2480 pounds.

(iii) We can compute the probability that the average breaking strength of the
sample cables lies between 2450 pounds and 2550 pounds as

P[2450 < X̄ < 2550] = P[(2450 − 2500)/25.30 < (X̄ − 2500)/25.30 < (2550 − 2500)/25.30]
                   = P[−1.98 < Z < 1.98]  [see Fig. 1.9]
                   = P[Z < 1.98] − P[Z < −1.98]
                   = 0.9761 − 0.0239 = 0.9522

Fig. 1.9: Normal curve showing the shaded area between X̄ = 2450 and X̄ = 2550, i.e. −1.98 < Z < 1.98.

It means that 95.22% of the steel cables have an average breaking
strength between 2450 pounds and 2550 pounds.
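The three probabilities of Example 3 can be reproduced without tables, using the standard normal CDF built from Python's `math.erf` (a sketch; small differences from the tabled answers arise because the text rounds Z to two decimals):

```python
import math

def phi(z):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 2500, 160, 40
se = sigma / math.sqrt(n)  # about 25.30

p_more = 1 - phi((2550 - mu) / se)                         # about 0.024
p_less = phi((2480 - mu) / se)                             # about 0.215
p_between = phi((2550 - mu) / se) - phi((2450 - mu) / se)  # about 0.952
print(round(p_more, 3), round(p_less, 3), round(p_between, 3))
```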
Now, you can try the following Self Assessment Question which will make you
more user-friendly with the use of the central limit theorem.

SAQ 5
A survey of a city found that the mean number of days per month that people
suffer from migraine headaches is 12.4 days, with a standard deviation of 3
days. Find the probability that, if a random sample of 50 people who suffer from
migraine headaches is selected, the mean of the sample will be between 12
and 13 days.

1.6 LAW OF LARGE NUMBERS


We have already discussed in Section 1.2 of this unit that the population
parameters are generally unknown and for estimating parameters, we draw all
possible random samples of the same size from the population and calculate
the values of sample statistic such as sample mean, sample proportion,
sample variance, etc. for all samples and with the help of these values we
prepare the sampling distribution of that statistic which helps us to draw
inferences about the population parameters.
But in the real world, the population size is too large, and you see that when the
population size is large then the number of possible samples increases
dramatically and it is virtually impossible to actually obtain every possible
random sample and then observe the sampling distributions. In such
situations, we use the concept of theoretical sampling distribution such as the
central limit theorem and we draw a random sample from the population to
draw inferences about the population parameters. A very crucial question then
arises: “Using a random sample of finite size, can we make a reliable
inference about population parameters?” The answer is “yes”, reliable
inference about population parameters can be made by using only a finite
sample and we shall demonstrate this by “law of large numbers”. The law of
large numbers is a powerful principle that helps statisticians and analysts
make sense of data. After understanding how it works and how it applies to
real-world situations, you can make better decisions and draw more accurate
conclusions from your data.
The law of large numbers was initially known as “the Golden Theorem” or
“Bernoulli’s theorem”. It was first coined by the Swiss mathematician Jacob
Bernoulli in 1713. The theorem later became known as the “law of large
numbers”. There is also a more general version of the law of large numbers
for averages, proved more than a century later by the Russian
mathematician Pafnuty Chebyshev.

Jacob Bernoulli (1654-1705), also known as James or Jacques, was a Swiss
mathematician. He mainly contributed to analytic geometry, probability theory
and the calculus of variations. However, his most important contribution was in
the field of probability, where he derived the first version of the law of large
numbers in his work Ars Conjectandi.

This law of large numbers can be stated in words as:

“The law of large numbers states that as a sample size increases, its
mean gets closer to the average of the whole population. In other words,
as the sample size increases, the average of the observed results will
become more and more representative of the true parameter value.”

This is the same intuition behind the idea that if we collect more data, our
sample of data will be more representative of the population.

Here, try to understand the law of large numbers in action with the help of an
example.
Suppose the distribution of the weight of all the young men living in a city is
close to a normal distribution with a mean weight of 65 kg and a standard
deviation of 5 kg. To understand the law of large numbers, we calculate the
sample mean weight ( X ) for varying sample sizes n = 1, 2, 3, …. Fig. 1.10
shows the behaviour of the sample mean weight X of men chosen at random
from this city. The graph plots the values of X along the vertical axis and
the sample size along the horizontal axis as the sample size varies, n = 1, 2, 3, …
First, we start with a sample of size n = 1, that is, we select a man randomly
from the city. Suppose the selected man has a weight of 70 kg, therefore, the
line of the graph starts from this point. We now select the second man
randomly and suppose his weight is 62 kg. So for n = 2, the sample mean is
X̄ = (70 + 62)/2 = 66 kg

Fig. 1.10: Law of Large numbers

This is the second point on the graph. Now, we select the third man randomly
from the city and suppose his weight is 55 kg. Therefore, for n = 3, the sample
mean is
X̄ = (70 + 62 + 55)/3 = 62.33 kg
This is the third point on the graph.

This process will continue. From the graph, we can see that the mean of the
sample changes as we make more observations, eventually, however, the
mean of the observations gets closer and closer to the population mean of 65
kg as the sample size increases.
Hence from Fig. 1.10, we can observe that every additional data point
gathered has the potential to move the sample to the true population mean.
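The running-mean experiment of Fig. 1.10 is easy to reproduce. In the sketch below (plain Python; the seed and the number of draws are our own choices), weights are simulated from N(65, 5) as in the example, and the running mean settles near the population mean 65 as n grows:

```python
import random

random.seed(7)

# Running mean of simulated weights from N(65, 5): it fluctuates for
# small n and settles near the population mean 65 as n grows.
total = 0.0
running = []
for n in range(1, 5001):
    total += random.gauss(65, 5)
    running.append(total / n)

# Early means can be far from 65; the final mean is very close to it.
print(round(running[0], 1), round(running[9], 1), round(running[4999], 1))
```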
We use the law of large numbers in a variety of fields including statistics,
market research, finance, healthcare, insurance, engineering, etc. Here we
mention some examples of the use of the law of large numbers:
1. In market research, let us suppose you are working for a marketing
company, and you want to know what percentage of people in a certain city
prefer your brand of mobile. It is not possible for you to survey every single
person in the city, so you take a random sample of 100 people and ask
them which brand of mobile they prefer. Suppose you find that 40% of them
prefer your brand. Does that mean that 40% of the entire population prefers
your brand? Not necessarily: there is always a chance that your sample was
biased in some way.
However, if you were to take a larger sample, say, 10,000 people and found
that 45% of them prefer your brand, that would be a stronger indication that
a majority of the population really does prefer your brand. And if you were
to take an even larger sample say, 100,000 people and found that 48% of
them prefer your brand, that would be an even stronger indication that your
brand is really popular in the city.
2. In financial analysis, we can also use the law of large numbers. For
example, suppose you are a stock analyst and you want to know the
average rate of return for a certain stock. You could calculate the rate of
return for a small sample of investors and get a rough estimate, but it would
not be very accurate. However, if you take a larger sample of investors to
calculate the rate of return, you will get a more accurate estimate of the true
rate of return.
3. Similarly, the law of large numbers is also used in medicine to study the
effectiveness of treatments. If you want to know whether a certain
medication is effective for a particular condition then you could study a
small sample of patients and get a rough idea. However, if you study a
larger sample of patients, you will get a more accurate estimate of the
effectiveness of the medication.
Having seen where we can use the law of large numbers, we should note the
following important points:
• The law of large numbers does not tell that the sample mean will always
reflect the true population characteristics, especially for small samples.
• If a given sample mean deviates from the true population mean, then the
law of large numbers does not guarantee that successive samples will
move the observed average towards the population mean.
I hope you have understood the concept of the law of large numbers and how
you can apply it to real-world situations. Hence, you can make better decisions
and draw more accurate conclusions from your data.
With the help of the law of large numbers, we can also determine the minimum
sample size to get a reliable inference about the population.
This law states that for any two arbitrarily small numbers ε (ε > 0) and η
(0 < η < 1), there exists an integer n = σ²/(ε²η) such that if a random sample of
size n or larger is drawn from the population and the sample mean (X̄) is
calculated for this sample, then the sample mean (X̄) is arbitrarily close to the
population mean (µ), where σ² is the finite variance of the population. In
statistical inference, ε is called the margin of error and 1 – η is called the
confidence level.

(You have also studied the law of large numbers in Unit 18 of MST-012:
Probability and Probability Distribution, where you studied it in the context of
probability.)
Hence, to get a reliable inference about the population, we can determine the

minimum sample size with the help of the law of large numbers. For that, the
minimum sample size should be n ≥ σ²/(ε²η).

Let us take an example to understand the same.


Example 4: A hospital administrator wants to estimate the mean weight of
babies born in her hospital. How large a sample of birth records should be
taken if she wants to be 99 per cent confident that the estimate is within
0.4 pounds of the true mean? As per the available records, the population SD
is 0.5 pounds.
Solution: Here, we are given that
ε = margin of error = 0.4, confidence level = 0.99 and σ = 0.5

Also, for 99% confidence, 1 − η = 0.99 ⇒ η = 0.01

We can calculate the minimum sample size for estimating the mean weight of
babies born in her hospital as

n ≥ σ²/(ε²η) = (0.5)²/((0.4)² × 0.01) = 156.25 ≈ 157
Hence, the hospital administrator should take a random sample of at least 157
babies.
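The calculation in Example 4 can be packaged as a small helper (a sketch; the function name is ours, and the ceiling reproduces the "round up to the next whole unit" step):

```python
import math

def min_sample_size(sigma, margin, confidence):
    """Minimum n from the law-of-large-numbers bound n >= sigma^2 / (eps^2 * eta),
    where eps is the margin of error and eta = 1 - confidence."""
    eta = 1 - confidence
    return math.ceil(sigma ** 2 / (margin ** 2 * eta))

print(min_sample_size(sigma=0.5, margin=0.4, confidence=0.99))  # 157, as in Example 4
```

The same helper can be used to check your answer to SAQ 6.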
Now, try the following Self Assessment Question for your practice.

SAQ 6
A pathologist wants to estimate the mean time required to complete a certain
analysis on the basis of a sample study so that he may be 99% confident that
the estimated mean will remain within ± 2 days of the true mean. As per the
available records, the population standard deviation is 5 days. What must be
the size of the sample for this study?

We now end this unit by giving a summary of what we have covered in it.

1.7 SUMMARY
In this unit, we have covered the following points:
• The statistical procedure which is used for drawing conclusions about the
population parameter on the basis of the sample data is called “statistical
inference”.
• A population is a group of measurements in the quantitative or qualitative
form of the characteristic under study.
• A parameter or population parameter is a numerical value that summarises
or measures or represents a specific characteristic of an entire population.
• A sample statistic or a statistic is a numerical measure that summarizes or
describes a characteristic of a sample.

• Any statistic used to estimate an unknown population parameter is known
as “estimator” and the particular value of the estimator is known as
“estimate” of the parameter.
• The probability distribution of all possible values of a sample statistic that
would be obtained by drawing all possible samples of the same size from
the population is called the “sampling distribution” of that statistic.
• The standard deviation of the sampling distribution of a statistic is known
as “standard error”.
• The central limit theorem states that the sampling distribution of the
sample means tends to a normal distribution as the sample size tends to
be large (n > 30).
• The law of large numbers states that as a sample size increases, the
sample mean gets closer to the average of the whole population.

1.8 TERMINAL QUESTIONS


1. In the ball bearings question (SAQ 2), estimate the standard error of the
sample mean.
2. Describe the central limit theorem.

1.9 SOLUTIONS / ANSWERS


Self Assessment Questions (SAQs)
1. Here, we are given that
Population size = N = 4
Sample size = n = 2
Since we know that the number of possible samples of size n taken from a
population of size N with replacement (order and replacement allowed)
is Nⁿ, the number of possible samples in this case is Nⁿ = 4² = 16. These 16
samples are given in Table 1.8 along with the weights (in pounds) of the
babies.
Table 1.8: Possible Samples of Babies

Sample    Sample in Terms    Sample          Sample    Sample in Terms    Sample
Number    of Babies          Observations    Number    of Babies          Observations
                             (weights)                                    (weights)
  1        (B1, B1)           (6, 6)            9       (B3, B1)           (7, 6)
  2        (B1, B2)           (6, 8)           10       (B3, B2)           (7, 8)
  3        (B1, B3)           (6, 7)           11       (B3, B3)           (7, 7)
  4        (B1, B4)           (6, 6)           12       (B3, B4)           (7, 6)
  5        (B2, B1)           (8, 6)           13       (B4, B1)           (6, 6)
  6        (B2, B2)           (8, 8)           14       (B4, B2)           (6, 8)
  7        (B2, B3)           (8, 7)           15       (B4, B3)           (6, 7)
  8        (B2, B4)           (8, 6)           16       (B4, B4)           (6, 6)

For listing all possible samples, we list them systematically. First, we list all of
the possible samples with the first element of the population, i.e. B1, as the
first member, then all of the possible samples with the second element of the
population, i.e. B2, and so on with B3 and B4. In this way, we can be sure that
we have all of the possible random samples.

2. We know the formula of the sample variance as

S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

To calculate it, first, we have to calculate the sample mean (X̄). We
calculate the sample mean as

X̄ = (X₁ + X₂ + ... + Xₙ)/n
  = (29 + 31 + 30 + 32 + 30 + 29 + 30 + 30 + 29 + 30)/10 = 300/10 = 30

We now calculate the sample standard deviation as

S = √[(1/(n − 1)) Σ(Xᵢ − X̄)²]
  = √[((29 − 30)² + (31 − 30)² + (30 − 30)² + (32 − 30)² + (30 − 30)²
     + (29 − 30)² + (30 − 30)² + (30 − 30)² + (29 − 30)² + (30 − 30)²)/9]
  = √[(1 + 1 + 0 + 4 + 0 + 1 + 0 + 0 + 1 + 0)/9] = √(8/9) = 0.94
In this case, we are estimating the population standard deviation from
the sample standard deviation, therefore, the sample standard deviation
is the estimator for the population standard deviation and 0.94 pounds is
the estimated value of the population standard deviation.
3. First of all, we check the shape of the population. For that, we plot the
population values (number of errors) as follows:

Fig. 1.11: Distribution of the population of typing error.

The above figure shows that it is a uniform distribution instead of bell-


shaped (normal distribution).
We can calculate the population mean (average number of errors) as
μ = (4 + 2 + 3 + 1)/4 = 2.5
Similarly, we can compute the standard deviation of the population as

σ = √[(1/N) Σᵢ₌₁ᴺ (Xᵢ − μ)²]
  = √[((4 − 2.5)² + (2 − 2.5)² + (3 − 2.5)² + (1 − 2.5)²)/4] = 1.12

Here, the population size (N) is 4. Therefore, there are Nⁿ = 4² = 16


possible simple random samples with replacement of size 2. All possible
samples of size n = 2 are given below and for each sample, the sample
mean is calculated as shown in Table 1.9.
Table 1.9: Samples and Sample Means

Sample    Sample in Terms    Sample         Sample
Number    of Typists         Observation    Mean (X̄)
  1        (A, A)             (4, 4)          4.0
  2        (A, B)             (4, 2)          3.0
  3        (A, C)             (4, 3)          3.5
  4        (A, D)             (4, 1)          2.5
  5        (B, A)             (2, 4)          3.0
  6        (B, B)             (2, 2)          2.0
  7        (B, C)             (2, 3)          2.5
  8        (B, D)             (2, 1)          1.5
  9        (C, A)             (3, 4)          3.5
 10        (C, B)             (3, 2)          2.5
 11        (C, C)             (3, 3)          3.0
 12        (C, D)             (3, 1)          2.0
 13        (D, A)             (1, 4)          2.5
 14        (D, B)             (1, 2)          1.5
 15        (D, C)             (1, 3)          2.0
 16        (D, D)             (1, 1)          1.0

To obtain the sampling distribution of all sample means, we arrange


these values in ascending order and calculate the frequency of each
value as shown in Table 1.10. We can also obtain the probability
distribution using the relative frequency approach of probability in the
last column of Table 1.10.
Table 1.10: Sampling Distribution of Sample Means

 X̄      Frequency (f)    Probability (p)
1.0          1            1/16 = 0.0625
1.5          2            2/16 = 0.1250
2.0          3            3/16 = 0.1875
2.5          4            4/16 = 0.2500
3.0          3            3/16 = 0.1875
3.5          2            2/16 = 0.1250
4.0          1            1/16 = 0.0625

After finding the sampling distribution of mean, we check the shape of


the sampling distribution of mean. For that, we put the sample means on
the X-axis with the corresponding frequency on the Y-axis as shown in
Fig. 1.12.

From the figure, we observe that it is a bell-shaped (normal) distribution.



If we compare the population distribution with the sampling distribution of


mean then we observe that even though the shape of the population is
uniform the sampling distribution of the mean is normal.

Fig. 1.12: Sampling distribution of mean.

The sampling distribution of the sample means itself has a mean,
variance, etc. Therefore, we can compute the mean of this distribution as

Mean of sample means = (1/K) Σᵢ₌₁ᵏ X̄ᵢfᵢ, where K = Σᵢ₌₁ᵏ fᵢ
                     = (1/16)(1.0 × 1 + 1.5 × 2 + ... + 4.0 × 1) = 2.5 = μ

Similarly, we can calculate the standard deviation of the sampling
distribution of the mean as

SD(X̄) = √[(1/K) Σᵢ₌₁ᵏ fᵢ(x̄ᵢ − μ)²]
      = √[(1/16)(1 × (1.0 − 2.5)² + 2 × (1.5 − 2.5)² + ... + 1 × (4.0 − 2.5)²)]
      = √[(1/16)(2.25 + 2 + ... + 2.25)] = √(10/16) = 0.791

Hence, we can conclude that the mean of the sampling distribution of the
mean is the same as the population mean, whereas the dispersion of the
sampling distribution (0.791) is less than that of the population (1.12).
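These hand calculations can be verified by enumerating all 16 samples directly (a sketch in plain Python; `pstdev` divides by K = 16, matching the formula above):

```python
import itertools
import statistics

errors = [4, 2, 3, 1]          # typing errors of typists A, B, C and D
mu = statistics.fmean(errors)  # population mean = 2.5

# All N^n = 4^2 = 16 samples of size 2 with replacement, and their means
means = [statistics.fmean(s) for s in itertools.product(errors, repeat=2)]

print(statistics.fmean(means))             # 2.5: equals the population mean
print(round(statistics.pstdev(means), 3))  # 0.791: SD of the sampling distribution
```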

4. Here, we are given that


n = 25, σ = 12

Since the population standard deviation (σ) is given, we can calculate
the standard error of the sample mean as

SE(X̄) = σ/√n = 12/√25 = 2.4
From the above result, we conclude that the variation in the sample
means for samples of n = 25 is much less than the variation in individual
44
Unit 1 Basic Concepts of Sampling Distribution
pouches of juice (σ = 15).
When we select a sample of 100 pouches, then the standard error will
be

SE(X̄) = σ/√n = 12/√100 = 1.2
When we increased the sample size from 25 to 100, we observed
that the standard error was halved. Thus, we conclude that if we want
to halve the standard error, we must increase the sample size 4 times.
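The 1/√n behaviour of the standard error can be tabulated directly; a minimal sketch, assuming σ = 12 as given in the problem:

```python
from math import sqrt

sigma = 12.0   # population standard deviation from the problem

def standard_error(sigma, n):
    """Standard error of the sample mean for a sample of size n."""
    return sigma / sqrt(n)

for n in (25, 100, 400):
    print(n, standard_error(sigma, n))   # 2.4, 1.2, 0.6: quadrupling n halves SE
```

Each fourfold increase in n cuts the standard error in half, as concluded above.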
5. Here, we are given that
The mean number of days per month that people suffer from migraine
headaches is 12.4 days, with a standard deviation of 3 days.
Since no sample has been taken at this stage, this is information about
the whole group (population), therefore,
µ = 12.4 days, σ = 3 days and n = 50
To find the required probability, we need the probability distribution of
the sample mean, which is not given; however, the sample size n = 50 is
large (n > 30).
Therefore, to find the sampling distribution we use the central limit
theorem. According to the central limit theorem, the sampling distribution
of mean number of days per month that people who suffer from migraine
headaches follows a normal distribution with mean

E(X̄) = μ = 12.4 days

and variance

Var(X̄) = σ²/n = (3)²/50 = 0.18
We have to find the probability that the sample mean will lie between 12
and 13 days, which is given by

P[12 < X̄ < 13]   [see Fig. 1.13]

To get the value of this probability, we convert the variate X̄ into a
standard normal Z score by the transformation:

Z = (X̄ − E(X̄))/√Var(X̄) = (X̄ − 12.4)/√0.18 = (X̄ − 12.4)/0.42

Therefore, we subtract 12.4 from each term and then divide each term by
0.42 in the above inequality. Thus, the probability expression becomes:

P[(12 − 12.4)/0.42 < (X̄ − 12.4)/0.42 < (13 − 12.4)/0.42] = P[−0.95 < Z < 1.43]

= P[Z < 1.43] − P[Z < −0.95] = 0.9236 − 0.1711 = 0.7525

It means that 75.25% of the means of the samples will lie between 12 and
13 days.
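The table lookup can be reproduced with the standard normal CDF, Φ(z) = (1 + erf(z/√2))/2, using only the Python standard library. The small difference from 0.7525 comes from rounding the standard error to 0.42 in the hand calculation:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma, n = 12.4, 3.0, 50
se = sqrt(sigma**2 / n)                 # ≈ 0.4243 (0.42 when rounded by hand)

p = phi((13 - mu) / se) - phi((12 - mu) / se)
print(round(p, 4))                      # ≈ 0.75, close to the tabulated 0.7525
```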
6. Here, we are given
η = 0.99 ⇒ 1 − η = 0.01, ε = 2, σ = 5

By the law of large numbers, we can calculate the minimum sample size as

n ≥ σ²/(ε²(1 − η)) = 5²/(2² × 0.01) = 625

That is, n ≥ 625

Hence, at least 625 units must be drawn in a sample.
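The bound can be evaluated mechanically; a short sketch using the quantities from the problem:

```python
from math import ceil

sigma, eps, eta = 5.0, 2.0, 0.99

# Law-of-large-numbers (Chebyshev) bound: P(|X̄ − μ| < ε) ≥ η requires
# n ≥ σ² / (ε² (1 − η)).
n_min = ceil(sigma**2 / (eps**2 * (1 - eta)))
print(n_min)   # 625
```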


Terminal Questions (TQs)
1. Here, we are given that

n = 10

Since the population standard deviation (σ) is not given, we can
estimate the standard error of the sample mean using the formula
given as follows:

SE(X̄) = S/√n

To use the above formula, first, we have to calculate the sample
standard deviation. But we have already calculated it in SAQ 2 as 0.94.
Therefore, we can compute the estimate of the standard error as

SE(X̄) = S/√n = 0.94/√10 = 0.30

2. Refer to Section 1.5.

UNIT 2
SAMPLING DISTRIBUTIONS OF SAMPLE MEANS

Structure
2.1 Introduction
    Expected Learning Outcomes
2.2 Sampling Distribution of Mean
2.3 Sampling Distribution of Mean When Population is Normally Distributed
    When Population Variance is Known
    When Population Variance is Unknown
2.4 Sampling Distribution of Mean When Population is Not Normally Distributed
2.5 Sampling Distribution of Difference of Two Means
2.6 Sampling Distribution of Difference of Two Means When Samples are Independent
    When Populations are Normally Distributed
    When Populations are not Normally Distributed
2.7 Sampling Distribution of Difference of Two Means When Samples are Paired
    When Population of Differences is Normally Distributed
    When Population of Differences is not Normally Distributed
2.8 Summary
2.9 Terminal Questions
2.10 Solutions/Answers

Tools You Will Need
The following terms are considered essential background material for this Unit. If you doubt your knowledge of any of these terms, you should review the appropriate Unit or section before proceeding:
• Basic concepts of sampling distribution (Unit 1).

2.1 INTRODUCTION
In the previous unit, we discussed the concept of the sampling distribution of a
statistic. One of the most important sample statistics which is used to draw a
conclusion about the population mean is the sample mean. There are many
problems in business, manufacturing, economics, biology, sociology,
environmental sciences, etc. where it becomes necessary to draw inferences
about population parameters such as the population mean, the difference of two
population means, etc. For example,

Unit Writer- Dr. Prabhat Kumar Sangal, School of Sciences, IGNOU, New Delhi

• An economist may want to estimate the average capital income of a


particular region and test whether it will be the same as the national
average income,
• A dietitian wishes to know whether a new diet plan for weight reduction is
effective or not,
• A market analyst wants to compare the average purchasing habits of
males and females,
• A researcher is interested to know which one of the two different types of
drugs has a better effect on controlling high blood pressure, etc.
Generally, the sample mean is used to draw inferences about the population
mean and for that, we require the sampling distribution of means (single mean
and difference of two means) for providing accurate and reliable information
about the whole population using sample data. Therefore, this unit is devoted
to explaining the sampling distribution of a single mean and difference of two
means.
This unit is divided into 10 sections. Section 2.1 is introductory in nature and
presents the need for the sampling distribution of means. In Sections 2.2 to 2.4, we
describe the sampling distribution of mean when the population is normally
distributed or not. Sections 2.5 to 2.7 are devoted to describing the sampling
distribution of the difference of two means when samples are independent and
paired. The unit ends by providing a summary of what we have discussed in
this unit in Section 2.8. The terminal questions and the solutions of the SAQs/
TQs are given in Sections 2.9 and 2.10, respectively.
In the next unit, we shall discuss the sampling distributions of proportions as
well as variances.
Expected Learning Outcomes
After studying this unit, you should be able to:
• describe the need of sampling distributions of mean and difference of two
means;
• explain the sampling distribution of mean when the population is normally
distributed or not; and
• explore the sampling distributions of difference of two means when
samples are independent and paired.

2.2 SAMPLING DISTRIBUTION OF MEAN


In the inferential statistics process, we select random sample(s) from the
population, compute a statistic with the help of the sample observations,
analyse the statistic, and make predictions and draw conclusions
about the population parameter based on the statistic/sample. For drawing
inferences about the population on the basis of a sample statistic, it is
essential to know the distribution of the statistic. One of the most important
sample statistics which is used to draw inferences about the population mean
on the basis of a sample is the sample mean. Some situations are given as
follows:
• An economist may want to estimate the average capital income of a
particular region and test whether it will be the same as the national
average income or not,
• A product manager of a company may want to estimate the average life of
electric bulbs manufactured by the company,
• A pathologist may want to estimate the mean time required to complete a
certain analysis,
• A quality control inspector is interested to know whether the mean
diameter of the ball bearing has shifted significantly from the specification,
etc.
In such situations, we require the sampling distribution of mean for providing
information about the whole population using sample data and conclude as
accurately and reliably as possible.
You have already got a flavour of the sampling distribution of mean with
the help of an example in Section 1.3 of the previous unit. For constructing the
sampling distribution of mean, we draw all possible samples with replacement
of the same size from the population and calculate the sample mean for each
sample. After calculating the sample mean for each sample, we construct the
probability distribution for the values of the sample mean. This probability
distribution is known as the sampling distribution of mean. Therefore, we can
define the sampling distribution of mean as:
“The probability distribution of all possible values of the sample mean
that would be obtained by drawing all possible samples of the same size
from the population is called the sampling distribution of sample mean
or simply say sampling distribution of mean.”
Now, the question may arise in your mind “What is the shape of the sampling
distribution of mean, is it normal, exponential, binomial, Poisson, etc.?” In Unit
1, you have noticed that the shape of the sampling distribution of the mean
mainly depends on two factors:
• the form/shape/distribution of the population from which the samples are
taken (See Example 2 and SAQ 2 of Unit 1),
• the size of the sample (See Example 1 of Unit 1).
To discuss how the form/shape/distribution of the population affects the
sampling distribution of mean, we divide our discussion into two parts:
1. When population is normally distributed; and
2. When population is not normally distributed.
Let us discuss one at a time.

2.3 SAMPLING DISTRIBUTION OF MEAN WHEN


POPULATION IS NORMALLY DISTRIBUTED
In many situations, it is reasonable to assume that the population from which
we select a random sample has a normal or nearly normal distribution. For
example, people's heights, IQ levels, blood pressure, weights of newly born
babies, test scores, incomes, shoe sizes, etc. generally follow a normal
distribution. When a population follows a normal distribution, the
sampling distribution of mean is also bell-shaped for any sample size.
However, the exact shape of the sampling distribution of mean also depends on
whether the variance/standard deviation of the population is known or
unknown. Therefore, we consider both cases in the next sub-sections.
2.3.1 When Population Variance is Known
There are many cases when a population for which we have to draw
conclusions on the basis of samples follows normal distribution and the
variance of the population is known. In such situations, the sampling
distribution of mean follows normal distribution regardless of the sample size
n. To demonstrate it, let us consider an example.
A play-school teacher is interested to know the listening power of the children.
To assess this, she took a test of a group of 9 children and noted how many
digits they repeated from memory after hearing them once. The obtained
scores (out of 5) are as follows:

Child Name Avaya Ishaan Ria Zara Kavya Parth Rayaan Nyra Yash

Score 2 5 4 3 1 3 2 4 3

Here, you can easily calculate the mean and variance of the test scores
(population) as we have calculated in Unit 1. We get mean = 3 and variance =
1.33. To know whether the shape of the population (test scores) is normal or
not, we first plot the graph of the scores of the children as shown in Fig. 2.1.

Fig. 2.1: The distribution of the scores of the children in the listening task.

From the above figure, you can observe that the test scores of the children
follow an approximate bell-shaped normal distribution.
To obtain the shape of the sampling distribution of mean in this case, let us
consider all possible samples with replacement of size n = 2 from the above
population of the test scores of the children. There are N^n = 9^2 = 81 possible
simple random samples with replacement of size 2, which are given in the
second column of Table 2.1. We also calculate the mean of each sample,
which is given in the last column of the same table.
Table 2.1: Samples and Sample Means

Sample Number    Sample in Terms of Children    Sample Observations (scores)    Sample Mean

1 (Avaya, Avaya) (2, 2) 2


2 (Avaya, Ishaan) (2, 5) 3.5
3 (Avaya, Ria) (2, 4) 3
4 (Avaya, Zara) (2, 3) 2.5
5 (Avaya, Kavya) (2, 1) 1.5
6 (Avaya, Parth) (2, 3) 2.5
7 (Avaya, Rayaan) (2, 2) 2
8 (Avaya, Nyra) (2, 4) 3
9 (Avaya, Yash) (2, 3) 2.5
10 (Ishaan, Avaya) (5, 2) 3.5
… … … …
81 (Yash, Yash) (3, 3) 3

Here, we have listed only some of these possible samples to save space.
You can prepare the same as we have discussed in Unit 1. To obtain the
distribution of the sample means, we arrange the values of the sample mean
in ascending order and calculate the frequency corresponding to each value
as discussed in Unit 1. The obtained sampling distribution is shown in Table
2.2.
Table 2.2: Sampling Distribution of Means

S. No.    X̄    Frequency (f)    Probability (p)

1 1 1 1/81 = 0.012

2 1.5 4 4/81 = 0.049

3 2 10 10/81 = 0.123

4 2.5 16 16/81 = 0.198

5 3 19 19/81 = 0.235

6 3.5 16 16/81 = 0.198

7 4 10 10/81 = 0.123

8 4.5 4 4/81 = 0.049

9 5 1 1/81 = 0.012

Total 81 1

To get an idea of the shape of the sampling distribution of mean, we plot the
graph (frequency bar) of the values of the sample mean taking the sample
mean on the X-axis and corresponding frequencies on the Y-axis as shown in
Fig. 2.2(a).
From Fig. 2.2(a), you can observe the shape of the sampling distribution. You
may notice that even for a small sample size n = 2, when samples are taken
from a normally distributed population with known variance, the sampling
distribution of sample means is normally distributed.
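Table 2.2 can be reproduced by enumerating all 81 samples; a short sketch:

```python
from itertools import product
from collections import Counter

scores = [2, 5, 4, 3, 1, 3, 2, 4, 3]   # test scores of the 9 children

# All 9^2 = 81 samples of size 2 drawn with replacement, and their means
dist = Counter((a + b) / 2 for a, b in product(scores, repeat=2))

for value in sorted(dist):
    print(value, dist[value], round(dist[value] / 81, 3))
# Frequencies come out as 1, 4, 10, 16, 19, 16, 10, 4, 1, matching Table 2.2
```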

Let us see what happens as we increase the sample size. We now select all
possible samples with replacement of size n = 3 instead of 2 and then
calculate the sample mean for each sample and also prepare the graph to
observe the shape of the sampling distribution as discussed. The graph of the
values of the sample means is shown in Fig. 2.2(b).

Fig. 2.2: Sampling distribution of mean when population is normal for sample size n = 2,
3, 5, 30.

From Fig. 2.2(b), you can see that the sampling distribution of mean is also
normal. Similarly, we can prepare the sampling distribution of various sample
sizes and plot the graphs. We also prepared the graphs for n = 5 and 30 in
Fig. 2.2 (c) and (d), respectively. From Fig 2.2 (c) and 2.2(d), you can see that
the sampling distribution of mean is also normal, and the sample means are
distributed more tightly around the population mean as the sample size is
increased.

In general, when the population follows a normal distribution and the
population variance is known, the sampling distribution of mean is
also normal for any sample size, and the sample means are distributed
more tightly around the population mean as the sample size increases.

After knowing the shape of the sampling distribution of mean when the
population is normally distributed and population variance is known, you may
be interested to know the mean and variance of the sampling distribution.

Let us find the mean, variance and standard error of the sampling distribution
of mean.

In practice, only one random sample is selected, and the concept of the
sampling distribution is used to draw the inference about the population
parameters. If X1, X2, …, Xn is a random sample of size n taken from a normal
population with mean µ and known variance σ², then we can obtain the mean
and variance of the sampling distribution of mean as follows:

Mean of X̄ = E(X̄) = E[(X1 + X2 + ... + Xn)/n]   [by definition of mean]

= (1/n)[E(X1) + E(X2) + ... + E(Xn)]   [since E[X ± Y] = E[X] ± E[Y]]

Since X1, X2, …, Xn are randomly drawn from the same population, they
also follow the same distribution as the population. Therefore,

E(X1) = E(X2) = ... = E(Xn) = μ

and

Var(X1) = Var(X2) = ... = Var(Xn) = σ²

Thus,

E(X̄) = (1/n)(μ + μ + ... + μ)  [n times]  = (1/n)(nμ) = μ

that is, E(X̄) = μ

and variance

Var(X̄) = Var[(X1 + X2 + ... + Xn)/n] = (1/n²)[Var(X1) + Var(X2) + ... + Var(Xn)]

[Recall that if X and Y are independent random variables and a and b are two
constants, then Var(aX + bY) = a² Var(X) + b² Var(Y).]

= (1/n²)(σ² + σ² + ... + σ²)  [n times]  = (1/n²)(nσ²)

that is, Var(X̄) = σ²/n

Hence, we conclude that if the samples are drawn from a normal population
with mean µ and known variance σ², then the sampling distribution of mean X̄
is also normally distributed with mean µ and variance σ²/n, that is,

if Xi ~ N(μ, σ²), then X̄ ~ N(μ, σ²/n)

and the quantity

Z = (X̄ − μ)/(σ/√n) ~ N(0, 1)

follows the standard normal distribution.

[The standard deviation of the sampling distribution of a statistic (mean,
proportion, standard deviation) is known as the standard error. The standard
error of the mean measures the degree of dispersion of sample means around
the population mean.]

In other words, we can say that if we take all possible samples of a constant
size from a normal population and compute the statistic Z for each sample, then
the sampling distribution of Z would be standard normal.

The standard error of the sample mean can be obtained by the definition of
standard error as

SE(X̄) = SD(X̄) = √Var(X̄) = σ/√n
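The two results E(X̄) = μ and Var(X̄) = σ²/n can also be checked by simulation; a minimal sketch (the population N(50, 10²) and the sample size n = 25 are arbitrary choices for illustration):

```python
import random
from math import sqrt

random.seed(1)
mu, sigma, n, reps = 50.0, 10.0, 25, 40000

# Draw many samples of size n from N(mu, sigma^2) and record each sample mean.
means = [sum(random.gauss(mu, sigma) for _ in range(n)) / n
         for _ in range(reps)]

grand_mean = sum(means) / reps
sd_of_means = sqrt(sum((m - grand_mean) ** 2 for m in means) / reps)

print(round(grand_mean, 1))   # close to mu = 50
print(round(sd_of_means, 2))  # close to sigma / sqrt(n) = 10 / 5 = 2
```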

The above discussion was based on the assumption that the population was
infinite or extremely large, or that we were drawing samples with replacement.
But if the population is finite and we use sampling without replacement, then a
statistical adjustment can be made. The adjustment is called the finite
population correction (FPC) factor. Without it, the central limit theorem does
not hold under those sampling conditions, and the standard error of the mean
will be too big. Under these conditions, the sample mean is distributed
normally with mean

E(X̄) = μ

and variance

Var(X̄) = ((N − n)/(N − 1)) × (σ²/n)

The proof of this formula is beyond the scope of this course.

Therefore, the standard error of the sample mean is given by

SE(X̄) = √Var(X̄) = √[((N − n)/(N − 1)) × (σ²/n)]
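A small sketch comparing the two standard errors (the values N = 1000, n = 100 and σ = 12 are hypothetical, chosen only for illustration):

```python
from math import sqrt

N, n, sigma = 1000, 100, 12.0   # hypothetical finite population and sample

se_infinite = sigma / sqrt(n)                   # ignores the finite population
fpc = sqrt((N - n) / (N - 1))                   # finite population correction
se_finite = se_infinite * fpc

print(round(se_infinite, 3), round(se_finite, 3))   # the FPC shrinks the SE
```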
After understanding the shape of the sampling distribution when the population
is normal and the population variance is known, you may be curious to know
what the sampling distribution will be when the variance is unknown. Let us
discuss this in the next sub-section.

2.3.2 When Population Variance is Unknown


In most real-world situations, the variance of the population is rarely known,
especially when the population is too large or testing is destructive in nature —
for example, the heights of all people of a country, the sound level of
firecrackers, blood tests of human beings, the income of the people of a city,
etc. In such cases, the population variance is generally unknown.
When the population from which we draw the samples is normally distributed
and the variance of the population is unknown, the sampling distribution of
means is bell-shaped but not normally distributed. In such situations, we use
the sample variance (S²) in place of the population variance (σ²). However,
due to the discrepancy between the sample variance (S²) and the population
variance (σ²), especially when the sample variance is calculated from a very
small sample, the distribution of the statistic (X̄ − µ)/(S/√n) follows the
t-distribution instead of the standard normal. Therefore, the quantity

t = (X̄ − μ)/(S/√n) ~ t(n−1)

follows the t-distribution with n − 1 degrees of freedom.
In other words, we can say that if we take all possible samples of a constant
size from a normal population and compute the statistic t for each sample, then
the sampling distribution of the t-statistic would be the t-distribution with n − 1
degrees of freedom.
The t-distribution is similar to the standard normal distribution. Its probability
density curve is bell-shaped and symmetric about t = 0 line as the standard
normal curve, but it has a lower peak and heavier tails (more observations
near the tail) than the standard normal curve. We will describe the t-
distribution in detail in Unit 4.
Therefore, if the population is distributed normally, and the population variance
is unknown, then the t-distribution is used for making inferences about the
population mean.
If the sample size n is sufficiently large (n ≥ 30), then by the central limit theorem, the
sampling distribution of mean also follows a normal distribution with mean µ
and variance S²/n.

Therefore, we can obtain the standard error of the sample mean as follows:

SE(X̄) = SD(X̄) = √Var(X̄) = S/√n
Let us take a real-world situation, try to judge whether the shape of the
sampling distribution of mean is normal or not, and also see applications of the
sampling distribution of mean with the help of an example.
Example 1: The life of an automobile battery is known to be normally
distributed with an average life of 96 months with a standard deviation of 12
months. If a researcher selects 16 batteries randomly then
(i) What will be the sampling distribution of the average life of the batteries?
(ii) Find the mean, standard deviation and standard error of the sampling
distribution of the average life of the batteries.
(iii) What will be the probability that the sample mean of 16 randomly selected
batteries is greater than 100 months?
(iv) If it is assumed that the population standard deviation is not known and
the researcher obtained the sample standard deviation as 13 months,
then how will this affect the sampling distribution and standard error? Do
you think this will affect the probability? Justify your answer.
Solution: Here, we are given that
μ = 96, σ = 12, n = 16
Since it is given that the life of the automobile batteries follows the normal
distribution, therefore, the sampling distribution of the average life of the
batteries follows either the normal or the t-distribution depending upon whether
the standard deviation of the population is known or unknown. In this case, the
population standard deviation is known, therefore, the sampling distribution
also follows a normal distribution, even though our sample size is less than 30.
If the life of the battery was not normally distributed, we would need a larger
sample size before using a normal model for the sampling distribution.
The mean of the sampling distribution of the average life of the batteries will
be the same as the mean of the lives of the batteries. Therefore, the mean of
the sampling distribution is

E(X̄) = μ = 96

We can compute the standard deviation of the sample means as

SD(X̄) = σ/√n = 12/√16 = 12/4 = 3

Since the standard error is the standard deviation of the sampling distribution
of mean, therefore,

SE(X̄) = SD(X̄) = 3
Now, we can answer part (iii) by computing the probability that a randomly
chosen group of 16 batteries has an average life greater than 100 months, that
is,

P[X̄ > 100]

Since the average life of the batteries is normally distributed with a mean of 96
months and a standard deviation of 3 months, we use the normal
distribution and convert the sample average life to the standard normal Z-score
(as discussed in Unit 1) as follows:

Z = (X̄ − E(X̄))/SD(X̄) = (X̄ − 96)/3

Therefore, we can transform the above expression of the probability to the Z-scores
as follows:

P[X̄ > 100] = P[(X̄ − 96)/3 > (100 − 96)/3] = P[Z > 1.33]

From the standard normal table (Table IV) given in the Appendix of this
volume, we get

P[X̄ > 100] = 1 − P[Z ≤ 1.33] = 1 − 0.9082 = 0.0918

Therefore, the probability that the sample average life of the batteries will be
greater than 100 is 0.0918.
What do we conclude? This probability is low. We conclude that there would
be only a 9.18% (small) chance that the average life of the battery is greater
than 100 months if we selected a sample of 16 batteries.
Let us consider part (iv).
In this case, the population standard deviation is not known, therefore, the
distribution of the sample average life of the batteries follows the t-distribution
with n – 1 degrees of freedom instead of normal.
The mean of the sampling distribution of the average life of the batteries will
be the same as the mean life of the batteries. Therefore, the mean of the
sampling distribution is

E(X̄) = μ = 96

We can compute the standard deviation and standard error when the
population standard deviation is not known as

SD(X̄) = S/√n = 13/√16 = 3.25

SE(X̄) = SD(X̄) = 3.25

So we conclude that the standard error is slightly greater than in the previous
case, when the population standard deviation was known.

We now calculate the required probability P[X̄ > 100] in this case.

Since the average life of the batteries now follows the t-distribution with n − 1
degrees of freedom, we can convert it into the t-statistic as
follows:

t = (X̄ − E(X̄))/SD(X̄) = (X̄ − 96)/3.25

Therefore, we can transform the above expression of probability as follows:

P[X̄ > 100] = P[(X̄ − 96)/3.25 > (100 − 96)/3.25] = P[t > 1.23]

To find the value of the above expression, we will use the t-table (Table V,
given in the Appendix of this volume) because the statistic t
follows the t-distribution.
The main difference between the standard normal distribution table and the
t-table is that the body of the standard normal distribution table represents the
probability corresponding to the value of Z, whereas the body of the t-table
represents the value or point corresponding to the probability. (We discuss
more about the t-table in Unit 5.) Here, we have to find the probability beyond
the point 1.23 at 16 − 1 = 15 degrees of freedom.
First, we check whether this value lies in the row corresponding to 15 degrees
of freedom or not. If it lies in the row for df 15, then we read the corresponding
probability value (α) from the corresponding column heading. If it does not lie
in the row for 15 df, then we read the probability values (α) from the column
headings corresponding to the values just greater than and just smaller than
1.23. Thus, from the t-table, we get

P[t > 1.341] = 0.10 and P[t > 0.866] = 0.20

Therefore, we can find the required probability as

0.10 < P[X̄ > 100] = P[t > 1.23] < 0.20

Therefore, we conclude that there is a 10% to 20% chance that the average
life of the batteries is greater than 100 months if we select a sample of 16
batteries when the population variance is unknown.

Note: With the help of the t-table, we cannot calculate the exact probability.
But computer packages or software such as R, SPSS, SAS, MINITAB,
STATA, EXCEL, etc. help us to calculate it exactly. From R, using
pt(q = 1.23, df = 15, lower.tail = FALSE), we find

P[X̄ > 100] = P[t > 1.23] = 0.11882

Hence, we conclude that the probability that the average life of the batteries is
greater than 100 months (for a sample of 16 batteries) has increased when
the population variance is unknown.
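For readers without R, the same tail probability can be approximated using only the Python standard library by numerically integrating the t density with ν = 15 degrees of freedom (a sketch using the textbook density formula and a simple trapezoidal rule; the integration limit and step count are pragmatic choices):

```python
from math import gamma, sqrt, pi

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_tail(q, df, upper=60.0, steps=50000):
    """P(T > q), approximated by trapezoidal integration from q to `upper`."""
    h = (upper - q) / steps
    total = 0.5 * (t_pdf(q, df) + t_pdf(upper, df))
    for i in range(1, steps):
        total += t_pdf(q + i * h, df)
    return total * h

p = t_tail(1.23, 15)
print(round(p, 5))   # ≈ 0.1188, matching R's pt(1.23, 15, lower.tail = FALSE)
```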

After the above discussion, you can see that whether the population standard
deviation is known or unknown plays an important role in deciding the shape of
the sampling distribution of mean. Now, you can try the following Self
Assessment Question, which will make you more comfortable with the concept
of the sampling distribution of mean when the population is normal with known
or unknown variance.
SAQ 1
Suppose the length of time that a caller is placed on hold when telephoning a
customer service centre is normally distributed with a mean of 40 seconds and
a standard deviation of 5 seconds. If a researcher monitors 20 calls, then
(i) What is the sampling distribution of mean length of hold time?
(ii) Find the mean, standard deviation and standard error of the sampling
distribution.
(iii) Find the probability that the mean length of time on hold in a sample of 20
calls will be within 2 seconds of the population mean.

Having understood the shape of the sampling distribution when the population
is normal and the population variance is known or unknown, you may now be
interested to know what the sampling distribution will be when the
population is not normal. Let us discuss this in the next section.

2.4 SAMPLING DISTRIBUTION OF MEAN WHEN


POPULATION IS NOT NORMALLY
DISTRIBUTED
In the previous section, we discussed the shape of the sampling distribution of
mean when we take samples from a normally distributed population and the
population variance is known or unknown. However, in real-life applications,
it is quite common that the data do not follow a normal distribution. For
example, the distribution of income, stock market returns, road accidents, and
several natural phenomena like rainfall and earthquake magnitudes do not
follow a normal distribution. Many disciplines, including economics, biology,
sociology, and environmental science, deal with non-normal data, that is, a
population whose shape does not follow the normal distribution. In such situations,
the shape of the sampling distribution (when samples are drawn from non-normal
populations) generally does not follow a standard distribution
when the sample size is small. However, the central limit theorem helps us
identify the shape of the sampling distribution when the sample size is large
(n ≥ 30). According to it, when the sample size is large, the sampling
distribution of mean converges to a normal distribution whatever the form of the
population, i.e. normal or non-normal.
Let us consider an example, to demonstrate the sampling distribution of the
mean when the sample is taken from a non-normal population.
Suppose the department of statistics of a university has 10 faculty members.
The teaching experiences of the faculty members are given in the following
table:
Table 2.3: Teaching Experience of the Faculty Members

Teaching Experience
Faculty
(in years)

F1 5

F2 6

F3 5

F4 5

F5 5

F6 6

F7 6

F8 12

F9 12

F10 15

We first plot the graph of the population (teaching experience) to know
the form of the population, as shown in Fig. 2.3.

Fig. 2.3: Teaching experience of the faculty members.

From the above figure, we observe that the teaching experience of the faculty
members does not follow a normal distribution. It is right skewed. To know
what the shape of the sampling distribution of the mean will be, let us consider
all possible samples with replacement of size n = 2 from the above population
of teaching experience. In this case, there are N^n = 10^2 = 100 possible simple
random samples with replacement of size 2. For each sample, we calculate
the sample mean and prepare the sampling distribution, as shown in Table
2.4:
Table 2.4: Sampling Distribution of Sample Means

S. No.    X̄    Frequency (f)

1 5.0 16

2 5.5 24

3 6.0 9

4 8.5 16

5 9.0 12

6 10.0 8

7 10.5 6

8 12.0 4

9 13.5 4

10 15.0 1

Total 100

We plot the sampling distribution of mean in Fig 2.4(a). From the figure, you
can notice that for a small sample size n = 2, the shape of the sampling
distribution of mean does not follow a standard distribution (is not normal).

Fig. 2.4: Sampling distribution of mean when population is not normal for n = 2, 3, 5, 30.

Let us see what happens as we increase the sample size, whether it


converges to normal distribution or not. We also plot the sampling distribution
of mean when the sample size n = 3, 5, 30 in Fig. 2.4(b) to 2.4(d).
From Fig. 2.4, you can observe that as we increase the sample size, the
sampling distribution of mean approaches normal distribution even though the
population is not normal and according to the central limit theorem, as n
becomes 30 or more, the sampling distribution of the mean follows normal
distribution.
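The convergence shown in Fig. 2.4 can be illustrated by simulation: drawing repeated samples from the skewed population of Table 2.3 and measuring the skewness of the simulated sampling distribution of the mean (a sketch; the seed and number of repetitions are arbitrary choices):

```python
import random

random.seed(7)
experience = [5, 6, 5, 5, 5, 6, 6, 12, 12, 15]   # skewed population (Table 2.3)

def skewness(data):
    """Third standardised central moment of the data."""
    m = sum(data) / len(data)
    m2 = sum((x - m) ** 2 for x in data) / len(data)
    m3 = sum((x - m) ** 3 for x in data) / len(data)
    return m3 / m2 ** 1.5

def mean_skewness(n, reps=20000):
    """Skewness of the simulated sampling distribution of the mean for size n."""
    means = [sum(random.choices(experience, k=n)) / n for _ in range(reps)]
    return skewness(means)

s2 = mean_skewness(2)    # clearly right-skewed, like Fig. 2.4(a)
s30 = mean_skewness(30)  # much closer to 0 (symmetric), as the CLT predicts
print(round(s2, 2), round(s30, 2))
```

The skewness shrinks roughly like 1/√n, which is why n ≥ 30 is usually enough for the normal approximation.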
I think you understand the shape of the sampling distribution of mean in
different cases. We also explain various forms of the sampling distribution
under different conditions using the flow chart given in Fig. 2.5 which helps
60 you to quickly judge the shape of the sampling distribution of mean.
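The convergence shown in Fig. 2.4 is easy to reproduce numerically. The sketch below is a simulation with an assumed right-skewed (exponential) population, not the teaching-experience data itself; it measures the skewness of the sample means and shows it shrinking towards 0, the skewness of a normal distribution, as n grows:

```python
import random
import statistics

random.seed(0)

def skewness(xs):
    # Sample skewness: third central moment divided by the cubed SD.
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def sample_mean(n):
    # One sample mean from a right-skewed population (exponential, mean 8).
    return statistics.fmean(random.expovariate(1 / 8.0) for _ in range(n))

skews = {}
for n in (2, 5, 30):
    means = [sample_mean(n) for _ in range(5000)]
    skews[n] = skewness(means)  # roughly 2/sqrt(n) for an exponential parent

print({n: round(s, 2) for n, s in skews.items()})
```

The printed skewness falls steadily as n increases, which is exactly the visual message of Fig. 2.4(a) to 2.4(d).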
Unit 2 Sampling Distributions of Sample Means

Fig. 2.5: Sampling distribution of mean.


Since the sampling distribution is the base of the statistical inference,
therefore, if we have to draw inferences for a population mean on the basis of
a small sample (n < 30) then the shape of the population should be normal (for
the parametric approach). If the population is not normal or we do not know the
form of the population, then a large sample (n ≥ 30) will be required;
otherwise we have to apply the non-parametric techniques to draw the inference.

(You will study the parametric tests in Block 4 of this course and the
non-parametric tests in the course MST-021: Classical and Bayesian Inference
of the third semester.)
Note: Till now, we have discussed the sampling distribution of mean. One
question that may arise in your mind is “Can we find the sampling
distribution of the median?” The answer is yes. You can obtain the same by
following the same procedure as we have discussed for the sample mean.
Instead of calculating the sample mean, we just compute the sample median
of each possible sample. If the population from which we draw the samples is
normally/non-normally distributed with mean μ and variance σ², then the
sampling distribution of median follows a normal distribution, only for large
samples, with mean

E(X̃) = μ

and standard error

SE(X̃) = 1.253 σ/√n

where X̃ represents the sample median.

Thus, the standard error of the median is larger than that of the mean, so the
median is less efficient. Generally, the sampling distribution of median is
more complicated than that of the mean; therefore, we have only given the idea
of the sampling distribution of the median instead of describing it in detail.
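The factor 1.253 can be checked by a small simulation. The sketch below uses illustrative values (μ = 50, σ = 10, n = 30, which are assumptions and not from the text) and compares the empirical standard error of the sample median with the formula 1.253 σ/√n:

```python
import random
import statistics

random.seed(42)

mu, sigma, n, reps = 50.0, 10.0, 30, 20000

# Standard deviation of the sample median over many repeated samples.
medians = [
    statistics.median(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]
empirical_se = statistics.stdev(medians)

# Large-sample formula for the SE of the median of a normal population.
formula_se = 1.253 * sigma / n ** 0.5

print(round(empirical_se, 2), round(formula_se, 2))
```

The two numbers agree closely, and both exceed σ/√n, the standard error of the mean, confirming that the median is the less efficient of the two.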
Let us apply the above concepts in real-life applications.
Example 2: A researcher wishes to study the effects of acid rain in a particular
area. It is known that the mean pH (acidity) level of the water is 6.4. The
researcher does not know the distribution of the pH level in that region. The
researcher collected water samples from 35 lakes and observed the standard
deviation of the sample as 0.72.
(i) What is the shape of the sampling distribution of mean of the pH level of
the water?
(ii) What are the mean and standard error of the sampling distribution?
Block 1 Sampling Distributions

Solution: The researcher does not know the distribution of the pH level in
that region. However, the researcher collected samples from 35 lakes, which is
greater than 30. Since the sample size is sufficiently large, we can apply the
central limit theorem to find the shape of the sampling distribution of mean.
According to the theorem, it is approximately normally distributed.

The mean of the sampling distribution of mean will be the same as the mean
of the pH (acidity) level in that region. Therefore, the mean of the sampling
distribution of mean is

E(X̄) = μ = 6.4

Since the population standard deviation is not known, we compute the standard
error using the sample standard deviation S as

SE(X̄) = S/√n = 0.72/√35 = 0.72/5.92 = 0.12
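The calculation in Example 2 can be reproduced with a few lines of Python:

```python
import math

# Example 2: the population shape is unknown, but n = 35 (>= 30), so the
# central limit theorem makes the sampling distribution of the mean
# approximately normal.
n = 35
s = 0.72            # sample standard deviation of the pH readings

# With sigma unknown, s / sqrt(n) estimates the standard error of the mean.
se = s / math.sqrt(n)
print(round(se, 2))  # 0.12
```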

Now, you can try Self Assessment Questions for your practice.

SAQ 2
Consider Example 2 and compute the probability that the sample mean pH
(acidity) level of the water is below 6.4 for the water samples of 35 lakes
collected by the researcher. Also, conclude the results.

We hope that you understood the shape of the sampling distribution of mean
under different situations. We now study the shape of the sampling distribution
of difference of two means. Let us discuss the same in the next section.

2.5 SAMPLING DISTRIBUTION OF DIFFERENCE OF TWO MEANS
There are so many situations in business, health science, social sciences,
economics, etc. where we are interested in comparing the means of two
populations to draw the inference that there exist any differences between
their corresponding population means. Some of the situations are given as
follows:
• A market analyst wants to estimate the difference between the average
purchasing habits of males and females,
• A medicine researcher wants to test whether a new medicine is really
more effective for controlling systolic blood pressure than old medicine,
• A student of the MSCAST programme is interested in testing whether the
mean starting salary earned by data scientists is greater than that of
software engineers,
• A company director wants to know whether any difference in the
performance of the employees has occurred due to the training
programme,
• A dietician wants to test whether there is any significant difference before
and after a particular diet, etc.
In such situations, one compares two unknown population means on the
basis of samples. For that, it is necessary to know how the sample statistic
(i.e. the difference between two sample means) is related to its true but
unknown population parameter (i.e. the difference between two population
means). The relationship can be described by the sampling distribution of
difference of two means.
To study the shape of the sampling distribution of the difference of two means,
we use a similar rationale to that developed for the sampling distribution of a
single mean. Therefore, we can define the sampling distribution of difference
of two means as
“The probability distribution of all values of the difference of two sample
means that would be obtained by drawing all possible samples from both the
populations is called the sampling distribution of difference of two
sample means.”
However, when we deal with two groups/samples then the first question may
arise whether they are independent or paired/dependent. Therefore, to
describe the shape of the sampling distribution of two means, first of all, we
should understand the difference between them and how to judge whether two
samples are independent or paired/dependent. It is important because
different statistical methods are used for comparing two population means
where the groups are independent and paired.
Independent Samples
Two sets of observations are called independent when the observations are
taken independently from two different groups.
For example, suppose a cardiologist wants to test the effects of two medicines
(A and B) for controlling high blood pressure (Hypertension). If he applies
medicine A to group 1 of a certain number of patients and medicine B to group
2 of a certain number of patients with similar health conditions, then the
observed results are independent because medicines A and B are applied to
the different groups of patients.
Paired/Dependent Samples
In the case of independent samples, the observations of one sample are not
dependent on the other sample. There are many situations where independent
samples do not give the correct picture of the situation. For example, suppose
a chemical engineer is interested in comparing the efficiency of the petrol of
two companies. Fuel efficiency varies widely with the make and model of the
car. If we compare the efficiency of the petrol of two companies using two
independent samples of cars of different make and models then we may
observe large variability in the petrol efficiency, and it is very difficult to detect
the difference arising from different petrol use because the efficiency may vary
due to petrol, model and company of the cars. Therefore, in such situations, it
would make more sense to select pairs of cars of the same make, model and
driven under similar circumstances and compare the fuel efficiency of the two
cars in each pair. Such samples where the observations are collected in pairs
or observations are made on the same units at two different times are called
dependent or paired samples. Therefore,

we can define two sets of observations as paired/dependent as follows:


Two sets of observations are called paired/dependent when the
observations are collected in pairs or observations are taken on the
same subject at two different times.
Generally, such types of observations are recorded to assess the
effectiveness of a particular training, diet, treatment, medicine, etc. In such
situations, the observations are generally recorded “before and after” the
insertion of training, treatment, etc. as the case may be. For example, if we
wish to test a new fitness program for weight reduction, then the weights of
individuals before and after the fitness program will form two different samples
in which observations will be paired as per individual. Similarly, in the test of
blood sugar in the human body, the fasting sugar level before the meal and
sugar level after the meal, both are recorded for a patient as paired
observations, etc.
After understanding the difference between independent and paired samples,
we now come to the sampling distribution of difference of two means when
samples are independent or paired in the next sections.

2.6 SAMPLING DISTRIBUTION OF DIFFERENCE OF TWO MEANS WHEN SAMPLES ARE
INDEPENDENT
Suppose we study the same characteristics from two different groups such as
income of people of two cities, height of players of two basketball teams, etc.
and we consider the observations of both independent groups as Population-I
and Population-II.

Suppose Population-I has mean μ1 and variance σ1², and Population-II has
mean μ2 and variance σ2².

To obtain the sampling distribution of the difference of two sample means, we


take all possible samples of the same size say, n1 from Population-I and then
the sample mean, say, X is calculated for each sample. Similarly, all possible
samples of the same size n2 are taken from Population-II and the sample
mean, say, Y is calculated for each sample. If the populations are too large,
then we consider only several samples as we afford instead of all from each
population. Then we consider all possible differences of sample means X and
Y. The difference of these means may or may not differ from sample to
sample and is considered as a random variable. Therefore, we construct the
probability distribution of these differences as in the case of a single sample.
The probability distribution thus obtained is known as the sampling distribution
of the difference of two sample means when the samples are independent.
To study the shape of the sampling distribution of the difference of two means,
we use a similar rationale to that developed for the sampling distribution of a
single mean. In this case, the appropriate sample statistic is (X − Y) , and its
associated population parameter is (μ1 − μ2 ) . As we have discussed in the
case of a single sample, the shape of the sampling distribution of the
difference of two means also depends on the nature of the population, that is,
whether they follow a normal distribution or not. Therefore, we consider both
cases and discuss one at a time in the next Sub-sections.

2.6.1 When Populations are Normally Distributed


If the populations from which we draw the independent samples are normally
distributed, then the shape of the sampling distribution of the difference of two
means is also nearly normal but depends on whether the variances/standard
deviations of the populations are known or unknown. Therefore, we discuss all
cases as follows:
When Population Variances are Known
In this case, the sampling distribution of the difference of two means is
normally distributed whatever the sample sizes. Therefore, we can conclude
that the difference in sample means can be modelled using a normal
distribution when variances are known regardless of the size of the samples.
You may be interested to know the mean and variance of the sampling
distribution of difference of two means.
In practice, two random samples are selected independently from the
populations, and the concept of the sampling distribution of difference of two
means is used to draw the inference about the population parameters.
If X̄ and Ȳ represent the means of independent samples of sizes n1 and n2
taken from normal populations N(μ1, σ1²) and N(μ2, σ2²), then the sampling
distribution of difference of two means (X̄ − Ȳ) is normally distributed. We
can obtain the mean and variance of the sampling distribution of (X̄ − Ȳ) as
follows. Since

X ~ N(μ1, σ1²) and Y ~ N(μ2, σ2²)

the sampling distributions of X̄ and Ȳ are, as we discussed in the case of a
single sample, also normal:

X̄ ~ N(μ1, σ1²/n1) and Ȳ ~ N(μ2, σ2²/n2)

Recall that if X and Y are independent random variables, then
E(X − Y) = E(X) − E(Y) and Var(X − Y) = Var(X) + Var(Y).

Thus, we can obtain the mean of the sampling distribution of (X̄ − Ȳ) as

E(X̄ − Ȳ) = E(X̄) − E(Ȳ) = μ1 − μ2

and variance

Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) = σ1²/n1 + σ2²/n2

Therefore, the standard error of difference of two means is given by

SE(X̄ − Ȳ) = √Var(X̄ − Ȳ) = √(σ1²/n1 + σ2²/n2)

Therefore, the sampling distribution of difference of two means, when the
populations are normally distributed and the population variances are known,
is also normally distributed with mean (μ1 − μ2) and variance
(σ1²/n1 + σ2²/n2), that is,

(X̄ − Ȳ) ~ N(μ1 − μ2, σ1²/n1 + σ2²/n2)

and the quantity

Z = [(X̄ − Ȳ) − (μ1 − μ2)] / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)

follows the standard normal distribution.


Let us consider the next case.
When Population Variances are Unknown and Unequal
In real-world situations, the population variances are generally unknown. In
such situations, when the populations from which we draw independent samples
are normally distributed but any one or both variances of the populations are
unknown, the sampling distribution of the difference of two means is
bell-shaped but not normally distributed. In such situations, we use the
sample variances (S1² and S2²) in place of the population variances
(σ1² and σ2²). Due to this replacement, the shape of the sampling distribution
of the difference of two means changes slightly from the normal distribution,
and the quantity

t = [(X̄ − Ȳ) − (μ1 − μ2)] / √(S1²/n1 + S2²/n2)

follows the t-distribution with modified degrees of freedom r, which is given
as follows:

r = (S1²/n1 + S2²/n2)² / [ (1/(n1 − 1))(S1²/n1)² + (1/(n2 − 1))(S2²/n2)² ]
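Once the sample variances are known, r is straightforward to compute. A short sketch with illustrative numbers (S1² = 12, S2² = 20, n1 = 15, n2 = 12 are assumptions, not values from the text):

```python
def welch_df(s1_sq, s2_sq, n1, n2):
    # Modified degrees of freedom for unknown, unequal population variances.
    num = (s1_sq / n1 + s2_sq / n2) ** 2
    den = (s1_sq / n1) ** 2 / (n1 - 1) + (s2_sq / n2) ** 2 / (n2 - 1)
    return num / den

r = welch_df(12.0, 20.0, 15, 12)
print(round(r, 1))  # ≈ 20.4
```

Note that r always lies between min(n1 − 1, n2 − 1) and n1 + n2 − 2, and it is generally not an integer.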
It is also noted that if the population variances σ1² and σ2² are unknown and
the sample sizes n1 and n2 are large (≥ 30), then according to the central
limit theorem, the sampling distribution of (X̄ − Ȳ) is very closely normally
distributed with mean (μ1 − μ2) and variance (S1²/n1 + S2²/n2), whatever the
form of the populations.
Let us move to the next case.
When Population Variances are Unknown and Equal
When the populations from which we draw independent samples are normally
distributed but their variances are equal (σ1² = σ2² = σ²), though we do not
have their values, then the sampling distribution of the difference of two
means is again bell-shaped but not normally distributed. In this case, the
unknown but equal population variance σ1² = σ2² = σ² is estimated by the
pooled sample variance Sp² to have a better estimate of the common variance,
where

Sp² = [ Σ(Xi − X̄)² + Σ(Yi − Ȳ)² ] / (n1 + n2 − 2)

with the first sum over the n1 observations of the first sample and the second
over the n2 observations of the second. If the sample variances S1² and S2²
are known, then we can write the above expression of the pooled sample
variance as

Sp² = [ (n1 − 1)S1² + (n2 − 1)S2² ] / (n1 + n2 − 2)

and the quantity

t = [(X̄ − Ȳ) − (μ1 − μ2)] / [ Sp √(1/n1 + 1/n2) ] ~ t(n1 + n2 − 2)

follows the t-distribution with (n1 + n2 − 2) degrees of freedom.

In this case, it is also noted that if the sample sizes n1 and n2 are large
(≥ 30), then according to the central limit theorem, the sampling distribution
of (X̄ − Ȳ) is very closely normally distributed with mean (μ1 − μ2) and
variance Sp²(1/n1 + 1/n2), whatever the form of the populations.
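The pooled variance and the corresponding t statistic can be sketched as follows (all numbers are illustrative assumptions, not from the text):

```python
import math

def pooled_t(xbar, ybar, s1_sq, s2_sq, n1, n2, mu_diff=0.0):
    # Pooled sample variance: weighted average of the two sample variances.
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    # t statistic with n1 + n2 - 2 degrees of freedom.
    t = ((xbar - ybar) - mu_diff) / (math.sqrt(sp_sq) * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

t, df = pooled_t(xbar=21.0, ybar=18.5, s1_sq=16.0, s2_sq=14.0, n1=10, n2=12)
print(round(t, 2), df)  # ≈ 1.51 with 20 degrees of freedom
```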

Before describing the next case, let us understand the concept using an
example.

Example 3: A baseball recruiter is interested in finding out how the mean


height of players of two teams differs from one another. It is known that the
height of baseball players of both teams is normally distributed with means
190 cm and 188 cm with a common standard deviation of 5 cm. If 10 players
are selected from each team, can we assume that the sampling distribution for
differences of mean heights of the players of both teams is approximately
normal? Also, find

(i) The mean and standard error of the sampling distribution.



(ii) The probability that the difference between the mean heights of the
players of the samples lies between 1 cm and 5 cm.

Solution: As we have discussed, the sampling distribution of the difference of
two means depends on:

• whether the samples are independent or paired,

• the form of the populations from which the samples are drawn, and

• the sample size.

Here, the players are selected from different teams, so the samples are
independent. Here it is given that the heights of the players of both teams are
normally distributed, therefore, the sampling distribution of differences in mean
heights follows either normal or t-distribution depending on whether standard
deviations are known or unknown. Since in this case, the standard deviations
are known, therefore, it is approximately normal whatever the sample sizes.

(i) If X̄ and Ȳ represent the mean heights of the selected players of the two
teams, respectively, then we can obtain the mean of the sampling distribution
as follows:

E(X̄ − Ȳ) = μ1 − μ2 = 190 − 188 = 2

and the standard error as

SE(X̄ − Ȳ) = SD(X̄ − Ȳ) = √(σ1²/n1 + σ2²/n2) = √(25/10 + 25/10) = 2.24

(ii) Here, we want to find out the probability

P[1 < X̄ − Ȳ < 5]

Since the sampling distribution of the difference in mean heights is normally
distributed with a mean of 2 cm and a standard deviation of 2.24 cm, to obtain
this probability we convert (X̄ − Ȳ) into a standard normal Z score by the
transformation

Z = [(X̄ − Ȳ) − E(X̄ − Ȳ)] / SD(X̄ − Ȳ) = [(X̄ − Ȳ) − 2] / 2.24

Therefore, by subtracting 2 from each term and then dividing each term by
2.24 in the above inequality, we calculate the required probability as

P[(1 − 2)/2.24 < ((X̄ − Ȳ) − 2)/2.24 < (5 − 2)/2.24] = P[−0.45 < Z < 1.34]

= P[Z < 1.34] − P[Z < −0.45]

= 0.9099 − 0.3264 = 0.5835
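The probability P(1 < X̄ − Ȳ < 5) in part (ii) can be checked with Python's standard library:

```python
import math
from statistics import NormalDist

# Example 3: the difference of mean heights is normal with mean 2 and
# standard error sqrt(sigma1^2/n1 + sigma2^2/n2).
se = math.sqrt(25 / 10 + 25 / 10)      # ≈ 2.24
diff = NormalDist(mu=2.0, sigma=se)

# P(1 < difference of sample means < 5)
p = diff.cdf(5) - diff.cdf(1)
print(round(p, 4))
```

The result is ≈ 0.583, agreeing with the table-based 0.5835 up to the rounding of the z values.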


After understanding the shape of the sampling distribution of difference of
two means when the populations are normal and the samples are independent, we
now consider the case of non-normal populations.

2.6.2 When Populations are Not Normally Distributed


In the previous section, we discussed the shape of the sampling distribution of
the difference of two means when we take samples from normally distributed
populations. However, in real-life applications, particularly in business,
manufacturing, economics, biology, sociology, environmental sciences, social,
behavioural sciences, etc. we often do not know the actual shape of the
populations, or the shapes do not follow the normal distribution.
If the populations from which we draw the samples are not normally
distributed, or the shape of both or any one is/are unknown, then the
sampling distribution of the difference of two means generally does not
follow a standard (normal) distribution when the sample sizes are small.

However, the central limit theorem helps us identify the shape of the sampling
distribution when sample sizes are large. According to that,

When the size of both samples is large (n1, n2 ≥ 30) then the sampling
distribution of difference of two means converges to normal distribution
whatever the form of the populations, i.e. normal or non-normal.

In short, for the two distinct populations: if the sample sizes are small, the
shapes of the distributions are important (they should be normal); if the
sample sizes are large, the shapes of the distributions are not important
(they need not be normal).

Note: Since the sampling distribution is the basis of the statistical
inference, therefore, if we have to draw inferences for two population means
on the basis of small samples (n1 < 30 or n2 < 30) then the shape of the
populations should be normal for applying the parametric techniques. If any
one of the populations or both is/are not normal, or we do not know the form
of the populations, then large samples (n1, n2 ≥ 30) will be required;
otherwise we apply the non-parametric techniques to draw the inference.
Let us take an example.

Example 4: The course coordinator of the course MST-016 is interested in


studying the difference in final exam scores of the learners in the MST-016
course from two different study centres. It is known that the distribution of test
scores in the first study centre is left-skewed with a mean score of 65 (out of
100) and a standard deviation of 10 whereas in the second study centre, it is
normally distributed with a mean score of 60 and a standard deviation of 7. He
selects 40 learners randomly from the first study centre and 35 from the
second. Can we assume that the sampling distribution for differences in
sample means is approximately normal? Also, find
(i) The mean and standard error of the sampling distribution of the difference
of mean scores.
(ii) The probability that the mean score of the learners of the first study
centre is at least 6 more than the mean score of the learners of the second
study centre.
Solution: As we have discussed for the sampling distribution of the difference
of two means, we have to check the following points:

• Samples are independent or paired,


• Sample size, and
• Form of populations from which samples are drawn.
Here, the coordinator selected the learners from two different study centres,
therefore, the samples are independent. Here, it is given that the distribution of
test scores in the first study centre is left skewed whereas in the second study
centre, it is normally distributed. Since the first distribution is not normal,
therefore, we have to check the sample size of this group. Since the sample
size is 40 (> 30), therefore, according to the central limit theorem if sample
sizes are large (> 30), then the sampling distributions of difference of two
means is normally distributed whatever the form of populations. Thus, we may
assume that the sampling distribution of differences of scores is normal. Note
that if the sample size of the learners taken from the first study centre is less
than 30 then the sampling distribution of differences of scores may not be
normal.
(i) If X̄ and Ȳ represent the mean scores of the selected learners from the
first and second study centres, respectively, then we can obtain the mean of
the sampling distribution as follows:

E(X̄ − Ȳ) = μ1 − μ2 = 65 − 60 = 5

and the variance as

Var(X̄ − Ȳ) = σ1²/n1 + σ2²/n2 = 100/40 + 49/35 = 3.9

Therefore, the standard error is given by

SE(X̄ − Ȳ) = √Var(X̄ − Ȳ) = √3.9 = 1.97

(ii) Here, we want to find out the probability

P[X̄ ≥ Ȳ + 6] = P[X̄ − Ȳ ≥ 6]

Since the sampling distribution of the difference in mean scores is normally
distributed with a mean of 5 and a standard deviation of 1.97, to calculate
this probability we convert (X̄ − Ȳ) into a standard normal Z score by the
transformation

Z = [(X̄ − Ȳ) − E(X̄ − Ȳ)] / SD(X̄ − Ȳ) = [(X̄ − Ȳ) − 5] / 1.97

Therefore, by subtracting 5 from each term and then dividing each term by 1.97
in the above inequality, we get the required probability as

P[((X̄ − Ȳ) − 5)/1.97 ≥ (6 − 5)/1.97] = P[Z ≥ 0.51] = 1 − P[Z < 0.51]

= 1 − 0.6950 = 0.305

Thus, we conclude that there is only a 30.5% chance that the mean score of the
learners of the first study centre is at least 6 more than the mean score of
the learners of the second study centre.
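The probability in part (ii) can again be checked with Python's standard library:

```python
import math
from statistics import NormalDist

# Example 4: by the CLT, the difference of sample means is approximately
# normal with mean 5 and variance 100/40 + 49/35 = 3.9.
se = math.sqrt(100 / 40 + 49 / 35)     # sqrt(3.9) ≈ 1.97

# P(difference of sample means >= 6)
p = 1 - NormalDist(mu=5.0, sigma=se).cdf(6)
print(round(p, 3))
```

The result is ≈ 0.306, matching the 0.305 obtained from the table with z rounded to 0.51.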
Now, you can try the following Self Assessment Question before moving to the
next section.

SAQ 3
Continuing with our example of the final exam scores of the learners in
MST-016 (Example 4), what will be the sampling distribution of difference of
two means in the following cases?
(i) If the standard deviations of final exam scores of the learners in the paper
of the course MST-016 of both study centres are unknown.
(ii) If the course coordinator of MST-016 selects 20 learners randomly from
the first study centre and 10 from the second instead of 40 and 35.

We hope that you understand the shape of the sampling distribution of


difference of two means when the samples are independent and have the
curiosity to know the shape when the samples are paired. Therefore, we now
discuss the sampling distribution when the samples are dependent or paired.

2.7 SAMPLING DISTRIBUTION OF DIFFERENCE OF TWO MEANS WHEN SAMPLES ARE
PAIRED
In some experiments, the two sets of measurements X and Y are taken on the
same subjects under different conditions. As mentioned at the beginning of the
previous section, this constitutes a paired data set. Generally, such types of
observations are recorded to assess the effectiveness of a particular training,
diet, treatment, medicine, etc. In such situations, the observations are
recorded “before and after” the insertion of training, treatment, etc. as the
case may be. Some situations where paired samples may occur are given
below:
• The director of an institution wants to know whether a particular research
methodology programme has any impact in increasing the motivation level
of the teachers in research or not.
• A dietitian wishes to know whether a new diet plan for weight reduction is
effective or not.
• A quality control inspector wants to test whether a new method of
handling machines helps reduce the break-down period in comparison of
the old method or not.
In such situations, we have to collect the observations on the same units and
observations are taken at two different times. To draw the statistical inference/
comparison on the basis of the samples, we require the sampling distribution
of difference of two means when samples are paired.
Since in such cases, the populations are not independent, therefore, we
cannot use the previous approach to construct the sampling distribution of
difference of two means. Since the interest is focusing on the difference
between before and after implementing a treatment plan, it makes sense to
combine the two measurements of each pair into a single difference. For
example, in the case of assessing the effect of a diet plan to reduce the
weight, we can take weight before and after the diet and subtract the weight
after the diet from the before weight. The difference makes sense too! It is the
weight lost on the diet. Therefore, to obtain the sampling distribution of
difference of two means when the populations are paired, we calculate the
difference of each pair. This approach essentially transforms the paired
two-population data into one population, and we can use the same approach as
for the sampling distribution of a single mean, discussed in Section 2.2.
That is, we draw all possible simple random samples with
replacement of the same size from the population of differences and prepare
the sampling distribution. To know the shape of the sampling distribution of the
paired sample, we follow the different cases as discussed in the sampling
distribution of a single mean.
2.7.1 When Population of Differences is Normally
Distributed
If the population of differences is normally distributed, then the sampling
distribution of mean is also bell-shaped for any sample size. However, the
exact shape of the sampling distribution of mean also depends on whether the
variance/standard deviation of the population of differences is known or
unknown. Therefore, we consider both cases:
When Population Variance of Differences is Known
In this case, when the population of differences is normally distributed, and the
variance of the population is known then the sampling distribution of the
difference of means follows normal distribution regardless of the sample size
n.

In symbolic form, if μD and σD² denote the mean and variance of the population
of differences and X̄D denotes the mean of the differences of the samples of
size n drawn from the normal population, then X̄D will follow a normal
distribution with mean μD and variance σD²/n, as we discussed in the sampling
distribution of the single mean. Therefore, in symbolic form,

X̄D ~ N(μD, σD²/n)

Here, we use D in the suffix to indicate the difference of the population
values before and after, which differentiates it from the single mean. Also,
the quantity

Z = (X̄D − μD) / (σD/√n) ~ N(0, 1)

follows the standard normal distribution.
When Population Variance of Differences is Unknown
In real-world situations, the variance of the population of differences is not
known. In such situations, when the population of differences of paired data
is normally

distributed, and the variance of the population is unknown, then the sampling
distribution of the difference of means is bell-shaped but not normally
distributed. In such situations, we use the sample variance SD² in place of
the population variance σD². Due to this replacement, the shape of the
sampling distribution of the difference of two means changes slightly from the
normal distribution. Therefore, the quantity

t = (X̄D − μD) / (SD/√n) ~ t(n−1)

follows the t-distribution with n − 1 degrees of freedom.
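A sketch of the paired computation: take the within-pair differences and treat them as a single sample. The before/after readings below are made up for illustration:

```python
import math
import statistics

# Illustrative before/after readings for the same 7 subjects (assumed data).
before = [220, 235, 218, 240, 225, 231, 228]
after = [205, 221, 210, 222, 214, 216, 215]

# Work with the differences D_i = before_i - after_i: a single sample.
d = [b - a for b, a in zip(before, after)]
n = len(d)
d_bar = statistics.fmean(d)          # mean of the differences
s_d = statistics.stdev(d)            # sample SD of the differences

# t statistic for testing mu_D = 0, with n - 1 degrees of freedom.
t = d_bar / (s_d / math.sqrt(n))
print(round(d_bar, 2), round(t, 2), n - 1)
```

Once the differences are formed, everything reduces to the single-sample case of Section 2.2.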

After understanding the shape of the sampling distribution of difference of two


means when population of differences follows normal distribution, let us study
the shape of the sampling distribution of difference of two means when
the population of differences is not normally distributed.

2.7.2 When Population of Differences is Not Normally Distributed
In the previous section, we discussed the shape of the sampling distribution of the
difference of two means when the population of differences of paired data
follows the normal distribution. However, in real-life applications, particularly in
business, manufacturing, economics, biology, sociology, environmental
sciences, social, behavioural sciences, etc., we often do not know the actual
shape of the populations, or the shapes do not follow the normal distribution.
Therefore, if the population of differences is not normally distributed or its
shape is not known, then the sampling distribution of the difference of two
means generally does not follow a standard (normal) distribution when the
sample size is small.
However, the central limit theorem helps us identify the shape of the sampling
distribution when the sample size is large. According to that

When the sample size is large (n ≥ 30), then the sampling distribution of
difference of two means for paired data converges to the normal distribution,
whatever the form of the population, normal or non-normal.

Note: Since the sampling distribution is the basis of the statistical
inference, therefore, if we have to draw inferences for the population on the
basis of a small sample (n < 30) then the shape of the population of
differences should be normal. If it is not normal or we do not know the form
of the population, then a large sample (n ≥ 30) will be required; otherwise we
apply the non-parametric techniques to draw the inferences. We will discuss
the non-parametric techniques in the third semester.

After understanding the shape of the sampling distribution of difference of two


means in different situations, we now present a flow chart in Fig. 2.6 which
helps you to judge quickly the form of the sampling distribution of difference of
two means under different conditions.

Fig. 2.6: Sampling distribution of difference of two means.

After knowing the shapes of the sampling distribution of difference of two means, you may be interested in how to apply this sampling distribution in real-world situations.
Example 5: A researcher wants to investigate whether a new diet plan is
effective in lowering cholesterol. He randomly selected 11 cholesterol patients
and kept them under the new diet plan. The cholesterol level (in mg) of each
patient was measured before they took the new diet. After the new diet plan,
the cholesterol level of each patient was again measured. If the differences of
the cholesterol levels before and after the new diet plan have a normal
distribution with a mean of 35 mg and a standard deviation of 5 mg, then

(i) What will be the sampling distribution of the difference of two means?
(ii) Find the mean, standard deviation and standard error of the sampling
distribution.
(iii) What is the chance that the mean cholesterol level of the 11 patients is reduced by at least 32 mg?

Solution: Here, we are given that

μD = 35, σD = 5, n = 11

Since the differences of the cholesterol levels before and after the new diet plan follow a normal distribution and the population standard deviation is also known (σD = 5 mg), therefore, the sampling distribution of the difference of two means is also normal, even though the sample size is less than 30.
The mean of the sampling distribution of difference of two means is given as

E(X̄D) = μD = 35

We can compute the standard deviation and standard error as

SD(X̄D) = σD/√n = 5/√11 = 1.51

SE(X̄D) = SD(X̄D) = 1.51

Now, we can answer part (iii) by computing the probability that a randomly selected group of patients has reduced on average more than 32 mg cholesterol level using the new diet plan, that is, P[X̄D > 32].

Since the mean cholesterol level is normally distributed with a mean of 35 mg and a standard deviation of 1.51 mg, therefore, we can convert it to standard normal Z-scores as follows:

Z = (X̄D − E(X̄D))/SD(X̄D) = (X̄D − 35)/1.51

Therefore, we can transform the above expression of probability as follows:

P[X̄D > 32] = P[(X̄D − 35)/1.51 > (32 − 35)/1.51] = P[Z > −1.99] = P[Z ≤ 1.99]

From the standard normal table (Table IV) given in the appendix of this volume, we get

P[X̄D > 32] = P[Z ≤ 1.99] = 0.9769

Therefore, we conclude that there would be a 97.69% chance that the new diet plan would reduce the cholesterol level of the patients by at least 32 mg.
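We can cross-check this value numerically. A minimal sketch using only the Python standard library, where the standard normal CDF is expressed through the error function, Φ(z) = (1 + erf(z/√2))/2:

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu_D, sigma_D, n = 35, 5, 11
se = sigma_D / math.sqrt(n)          # standard error, about 1.51

z = (32 - mu_D) / se                 # about -1.99
prob = 1.0 - normal_cdf(z)           # P[X-bar_D > 32] = P[Z > -1.99]
print(round(se, 2), round(prob, 4))  # close to the table value 0.9769
```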
Now, you can try the following Self Assessment Question.

SAQ 4
Continuing with our example of a new diet plan for lowering cholesterol (Example 5), if the population standard deviation is not known and the researcher obtained a sample standard deviation of 4 mg, how will this affect the sampling distribution and standard error? Do you think this will affect the probability? Justify your answer.

With this, we end this unit. We now summarise our discussion.

2.8 SUMMARY
In this unit, we have discussed the following points:
• If the population from which we draw the samples is normally distributed and the population variance/standard deviation is known, then the sampling distribution of mean will follow a normal distribution; if it is unknown, then the sampling distribution of mean will follow a t-distribution with n – 1 degrees of freedom, regardless of the size of the samples.
• If the population from which we draw the samples is not normally distributed or its shape is unknown, then the shape of the sampling distribution generally is not specified and does not follow a standard distribution when the sample size is small. However, if the sample size is large (30 or more), then the sampling distribution of mean will be normally distributed, regardless of the shape of the population.
• Two sets of observations are called independent when the observations
are taken independently from two different groups.
• Two sets of observations are called paired when the observations are
taken on the same subject at two different times.
• The sampling distribution of difference of two means when the populations are independent and normally distributed is also normally distributed when the population variances are known; however, if they are unknown, then the sampling distribution follows the t-distribution.
• The sampling distribution of difference of two means when the populations are paired and normally distributed is also normally distributed when the population variance of the differences is known; however, if it is unknown, then the sampling distribution follows the t-distribution.

2.9 TERMINAL QUESTIONS


1. The average highest speed of cars at a particular racetrack is 160 mph, with a standard deviation of 12 mph. If samples of 25 cars are selected and their highest speeds are measured, then what is the shape of the sampling distribution of the sample means when the distribution of the highest speed of cars on the racetrack is:
(i) normal, and
(ii) right skewed.
2. A large organization's survey of male employees reveals that their
haemoglobin level is normally distributed with a mean haemoglobin level
of 14.7 gm and a standard deviation of 0.7 gm. The female employees of
the same company are also surveyed, and the results show that their
haemoglobin level is also normally distributed with a mean haemoglobin
level of 12.7 gm and a standard deviation of 0.5 gm, then
(i) Are the populations independent?
(ii) Can we presume that the sampling distribution of the difference of
average haemoglobin of the male and female employees is roughly
normal?

2.10 SOLUTIONS / ANSWERS


Self Assessment Questions (SAQs)
1. Here, we are given that
μ = 40, σ = 5, n = 20
Since the distribution of the length of time that a caller is placed on hold
when telephoning a customer service centre is normally distributed,
therefore, the sampling distribution of mean length of hold time follows
either normal or t-distribution depending upon the standard deviation of
the population is known or unknown. In this case, the population standard
deviation is known (σ = 5), therefore, the sampling distribution of sample
means is also normal, even though our sample size is less than 30. If the
length of time on hold is not normally distributed, we would need a larger
sample size before using a normal model for the sampling distribution.
The mean of the sampling distribution of mean will be the same as the
mean of the general hold time. Therefore, the mean of the sampling
distribution of mean is

E(X̄) = μ = 40

We can compute the standard deviation and standard error of the sampling distribution of mean as

SD(X̄) = σ/√n = 5/√20 = 1.12

SE(X̄) = SD(X̄) = 1.12

Now, we can answer part (iii) by computing the probability that the mean length of time on hold in a sample of 20 calls will be within 2 seconds of the population mean, that is, P[40 − 2 ≤ X̄ ≤ 40 + 2] = P[38 ≤ X̄ ≤ 42].

Since the mean time on hold is normally distributed with a mean of 40 seconds and a standard deviation of 1.12 seconds, therefore, we convert the sample mean into standard normal Z-scores as follows:

Z = (X̄ − E(X̄))/SD(X̄) = (X̄ − 40)/1.12

Therefore, we can transform the above expression of the probability as follows:

P[38 ≤ X̄ ≤ 42] = P[(38 − 40)/1.12 ≤ (X̄ − 40)/1.12 ≤ (42 − 40)/1.12] = P[−1.79 ≤ Z ≤ 1.79]

= P[Z ≤ 1.79] − P[Z ≤ −1.79]

From the standard normal tables (Tables III and IV) given in the Appendix of this volume, we get

P[38 ≤ X̄ ≤ 42] = P[Z ≤ 1.79] − P[Z ≤ −1.79] = 0.9633 − 0.0367 = 0.9266

Therefore, we conclude that there would be a 92.66% chance that the mean length of time on hold in a sample of 20 calls will be within 2 seconds of the population mean.
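The same computation can be sketched in a few lines of Python (standard library only; `erf` gives the normal CDF):

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, n = 40, 5, 20
se = sigma / math.sqrt(n)              # about 1.12

z = 2 / se                             # about 1.79
prob = normal_cdf(z) - normal_cdf(-z)  # P[38 <= X-bar <= 42]
print(round(prob, 4))                  # close to the table value 0.9266
```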
2. We have to calculate the probability that the sample mean pH level of the water is below 6.1, that is, P[X̄ < 6.1].

In Example 2, we observed that the sampling distribution of the mean pH level is normally distributed with a mean of 6.4 and a standard deviation of 0.12, therefore, we convert the mean to a standard normal Z-score as follows:

Z = (X̄ − E(X̄))/SD(X̄) = (X̄ − 6.4)/0.12

Thus, we can transform the above expression of probability as follows:

P[X̄ < 6.1] = P[(X̄ − 6.4)/0.12 < (6.1 − 6.4)/0.12] = P[Z < −2.5]

From the standard normal table (Table III) given in the Appendix of this volume, we get

P[X̄ < 6.1] = P[Z < −2.5] = 0.0062

This probability is too low. We conclude that there would be only a 0.62% chance that the mean pH level of the sample would go below 6.1 if we selected a sample of 35 lakes.
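Again, this tail probability can be verified with a short standard-library sketch:

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean_pH, se = 6.4, 0.12
z = (6.1 - mean_pH) / se   # = -2.5
prob = normal_cdf(z)       # P[X-bar < 6.1]
print(round(prob, 4))      # close to the table value 0.0062
```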
3. (i) Here, the samples are independent as discussed in Example 4. It is given that the distribution of test scores of the first study centre is left-skewed, therefore, the sampling distribution of difference of two mean scores does not follow the t-distribution because for the t-distribution both populations should be normally distributed.

Since both the sample sizes are large (> 30), therefore, according to the central limit theorem, the sampling distribution of difference of two mean scores follows the normal distribution whatever may be the form of the populations.

(ii) Since it is given that the distribution of test scores of the first study centre is left-skewed and the sample size is small (< 30), therefore, the sampling distribution of difference of two mean scores does not follow the t-distribution because for the t-distribution both populations should be normally distributed, whatever may be the sample sizes.

Since the distribution of test scores of the first study centre is left-skewed and the sample size is small (< 30), the sampling distribution of difference of two mean scores also does not follow the normal distribution because for the normal distribution either both the populations should be normal or the sample sizes should be large.

Note: For two distinct populations, if the sample sizes are small, the shapes of the distributions are important (they should be normal); if the sample sizes are large, the shapes of the distributions are not important (they need not be normal).

4. Since the differences of the cholesterol levels before and after the new diet plan follow a normal distribution and the population standard deviation is not known, therefore, the sampling distribution of the difference of two means follows the t-distribution with n – 1 degrees of freedom instead of the normal distribution.

We can compute the standard error when the population standard deviation is not known as

SE(X̄D) = SD(X̄D) = SD/√n = 4/√11 = 1.21
So, we conclude that the standard error is slightly smaller than in the
previous case.
We can compute the probability that a randomly selected group of patients has reduced on average more than 32 mg cholesterol level using the new diet plan, that is, P[X̄D > 32].

Since the mean cholesterol level follows the t-distribution instead of the normal distribution, therefore, we can convert it into a t-score as follows:

t = (X̄D − E(X̄D))/SD(X̄D) = (X̄D − 35)/1.21

Therefore, we can transform the above expression of probability as follows:

P[X̄D > 32] = P[(X̄D − 35)/1.21 > (32 − 35)/1.21] = P[t > −2.48] = P[t < 2.48]

= 1 − P[t > 2.48]

Here, we have to find the probability beyond the point 2.48 at 11 – 1 = 10 degrees of freedom. So, we look in the row corresponding to 10 degrees of freedom. Since the value 2.48 itself is not there, we read the probability values (α) from the column headings corresponding to the tabulated values just smaller and just greater than 2.48. Thus, from the t-table (Table V), we get

P[t > 2.228] = 0.025 and P[t > 2.764] = 0.01

Therefore, we can find the required probability as

1 − 0.025 < P[X̄D > 32] < 1 − 0.01

0.975 < P[X̄D > 32] < 0.99

Therefore, we conclude that there is a 97.5% to 99% chance that the new diet plan reduces the cholesterol level of the patients by at least 32 mg.

Note: Here, we cannot calculate the exact probability using the t-table.
But with the help of computer packages or software such as R, SPSS,
SAS, MINITAB, STATA, EXCEL, etc. we can calculate it exactly.
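For instance, in place of the packages listed in the note, a rough standard-library sketch can approximate the exact tail of the t-distribution by numerically integrating its density (this is only an illustration; in practice one would call, e.g., R's pt() or an equivalent routine):

```python
import math

def t_pdf(x, df):
    # Density of Student's t-distribution with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(x, df, upper=60.0, steps=20000):
    # P[t > x] by the trapezoidal rule on [x, upper]; the tail beyond
    # 'upper' is negligible for moderate df
    h = (upper - x) / steps
    total = 0.5 * (t_pdf(x, df) + t_pdf(upper, df))
    for i in range(1, steps):
        total += t_pdf(x + i * h, df)
    return total * h

tail = t_upper_tail(2.48, df=10)   # P[t_10 > 2.48], between 0.01 and 0.025
prob = 1.0 - tail                  # P[X-bar_D > 32]
print(round(prob, 4))
```

The result lands inside the 0.975 to 0.99 bracket obtained from the t-table above.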

Terminal Questions (TQs)


1 (i) Since the distribution of the highest speed of cars on the racetrack is
normally distributed and variance/standard deviation of the population is
known, therefore, the sampling distribution of means is also normally
distributed, even though our sample size is less than 30.
(ii) Since the distribution of the highest speed of cars on the racetrack is right-skewed and the sample size (25) is less than 30, the central limit theorem does not guarantee that the sampling distribution of means is normal. In this situation, its form is not known. If the sample size becomes large (≥ 30), then the sampling distribution of means becomes normal.
2(i) Here, the organization surveyed the male and female employees separately, therefore, the groups/populations are independent.
(ii) Here, it is given that the distribution of haemoglobin levels of the male as well as female employees is normally distributed, therefore, the sampling distribution of difference of two average haemoglobin levels follows either the normal or the t-distribution depending upon whether the population standard deviations of both the populations are known or unknown. Since the population variances are known, therefore, the difference of two average haemoglobin levels will follow the normal distribution for any value of the sample sizes.

UNIT 3

SAMPLING DISTRIBUTIONS OF
SAMPLE PROPORTIONS
AND VARIANCES

Structure

3.1 Introduction
    Expected Learning Outcomes
3.2 Sampling Distribution of Sample Proportion
3.3 Sampling Distribution of Difference of Two Sample Proportions
3.4 Sampling Distribution of Sample Variance
    When Population is Normally Distributed
    When Population is not Normally Distributed
3.5 Sampling Distribution of Ratio of Two Sample Variances
    When Populations are Normally Distributed
    When Populations are not Normally Distributed
3.6 Summary
3.7 Terminal Questions
3.8 Solutions/Answers
Tools You Will Need

The following terms are considered essential background material for this Unit. If you doubt your knowledge of any of these terms, you should review the appropriate Unit or section before proceeding:

• Basic Concepts of Sampling Distribution (Unit 1).

• Sampling Distributions of Means (Unit 2).

3.1 INTRODUCTION

There are so many situations in the real world when we deal with measurable data such as weight, height, distance, time, income, etc. In such situations, we use the sample mean to summarise the data, and the sampling distribution of mean is used to draw inferences about the population mean. In the previous unit, we discussed the concept of the sampling distribution of mean and difference of two sample means. However, sometimes, we deal with data collected in the form of counts, or data classified into two categories or groups according to an attribute. For example, how many people in a sample choose Thums Up as their soft drink, how many people are educated in a city, how many patients recovered from COVID-19, etc. In such situations, we use the sample proportion instead of the mean, and the sampling distribution of sample proportion is a fundamental concept in statistics that plays a pivotal role in making precise inferences about population proportion from sample data.
Unit Writer- Dr. Prabhat Kumar Sangal, School of Sciences, IGNOU, New Delhi
Similarly, in many practical situations, we are also concerned with variability. For example, keeping the variance in diameter as small as possible is crucial when manufacturing things that fit together, like pipes and ball bearings. If such products do not fit together correctly, then we discard them. In such situations, it is important to study the variability of the population, and the sampling distribution of sample variance plays a pivotal role in making precise inferences about population variance from sample data.

In this unit, we discuss the sampling distributions of proportion, difference of two proportions, variance and ratio of two variances.

This unit is divided into 8 sections. Section 3.1 is introductory in nature and describes the need for the sampling distributions of proportions as well as variances. The sampling distribution of sample proportion is described in Section 3.2 and in Section 3.3, the sampling distribution of difference of two sample proportions is explored. Sometimes, it is useful to estimate the variation of a population. For this, we use the sampling distribution of sample variance. The sampling distribution of sample variance is described in Section 3.4 whereas the sampling distribution of ratio of two sample variances is given in Section 3.5. The unit ends by providing a summary of what we have discussed in this unit in Section 3.6. The terminal questions and the solutions of the SAQs/TQs are given in Sections 3.7 and 3.8, respectively.

In the next unit, we shall discuss some standard sampling distributions such as chi-square and t-distributions.

Expected Learning Outcomes


After studying this unit, you should be able to:
• describe the need for sampling distributions of sample proportions and
variances;
• describe the shape of the sampling distribution of sample proportion in
different situations;
• explore the shape of the sampling distribution of difference of two sample
proportions;

• explain the sampling distribution of sample variance; and


• describe the sampling distribution of ratio of two sample variances.

3.2 SAMPLING DISTRIBUTION OF SAMPLE PROPORTION
In Section 2.2 of the previous unit, we have discussed the sampling distribution of sample mean or simply the sampling distribution of mean. But there are many real-world situations, from public opinion polling to quality control in manufacturing, where the data is collected in the form of counts, or the collected data is classified into two categories or groups according to a characteristic/attribute. For example, the people living in a colony may be classified into two groups (male and female) with respect to the characteristic gender, patients in a hospital may be classified into two groups as cancer and non-cancer patients, a lot of articles may be classified as defective and non-defective, etc.

Generally, such types of data are considered in terms of the proportion of


elements /individuals /units /items that possess (success) a given characteristic
or attribute. For example, the proportion of females in a country, the proportion
of cancer patients in a hospital, the proportion of defective articles in a lot, etc.

In such situations, we deal with population proportion instead of population


mean.

For the sampling distribution of sample proportion, we draw all possible samples from the population and for each sample, we calculate the sample proportion p as

p = X/n ≤ 1

where X is the number of observations/individuals/items/units in the sample which possess the particular characteristic under study and n is the total number of observations in the sample, i.e. the sample size.

For example, if there are 400 learners enrolled in the MSCAST programme and out of these 250 learners successfully complete the programme, then we can compute the proportion of passed learners as

p = Number of learners who complete the programme / Total learners enrolled in the programme = 250/400 = 0.625

When you want to know what proportion of individuals or objects in an entire


population possesses a particular characteristic, it is sometimes very time-
consuming or even impossible to collect all data. For example, you might want
to know the proportion of people who use Facebook. You cannot survey
everyone in the world. In such a situation, to draw the inference about the
population proportion on the basis of a sample, we require the sampling
distribution of sample proportion.

You have already tasted the flavour of the sampling distribution of mean with the help of examples in Unit 2. To study the shape of the sampling distribution of the sample proportion, we proceed in the same way that we have followed in the case of the sampling distribution of the sample mean. That is, we draw all possible samples of the same size from the population and calculate the sample proportion for each sample instead of the sample mean. After calculating the sample proportion for each sample, we construct the sampling distribution of sample proportions. The distribution so obtained is known as the sampling distribution of sample proportion. Therefore, we can define it as:

"The probability distribution of all possible values of the sample proportion that would be obtained by drawing all possible samples of the same size from the population is called the sampling distribution of sample proportion or simply the sampling distribution of proportion."

Now, the question may arise in your mind "What is the shape of the sampling distribution of sample proportion, is it normal, or t?" as we have obtained in the case of the sampling distribution of sample mean. But there is a main difference between mean and proportion. The mean is used when we deal with continuous variables such as height, income, etc. But in proportion, we deal with data
collected in the form of counts when the collected data is classified into two
categories or groups according to an attribute. We know that (Unit 10 of MST-
012: Probability and Probability Distribution), if our data is classified into two
categories or groups (success or failure) according to an attribute and we count
how many of them possess a given attribute, then the data follow a binomial
distribution. It means that the form of the population is binomial instead of
normal in the case of the sampling distribution of sample proportion. Let us see
what the shape of the sampling distribution of sample proportion will be when
the population is binomial. For a better understanding, we consider an example
in which the size of the population is very small.
Consider the example of our cute children of play school discussed in Unit 2 and consider only the first six children out of the group of 9 so that we can easily show all possible samples. Suppose the play-school teacher is now interested to know the proportion of children who like to dance. To assess this, she asked each child whether he/she liked to dance or not and obtained the information as follows:

Child Name Avaya Ishaan Ria Zara Kavya Parth

Like Dance Yes No Yes Yes No Yes

We can calculate the population proportion of children who like dance as

P = Number of children who like dance / Total number of children = 4/6 = 0.67
We can plot the graph of the number of children who like and dislike the dance as shown in Fig. 3.1.

Fig. 3.1: Distribution of the number of children who like and dislike dance.

From the above figure, we observe that the number of children who like to
dance does not follow a normal distribution while it follows a binomial
distribution.

Let us prepare the sampling distribution of sample proportion. To obtain this, let us consider all possible samples with replacement of size n = 2 from the above population. There are N^n = 6^2 = 36 possible simple random samples with replacement of size 2, and the proportions of children who like to dance for each sample are given in the following table.

Table 3.1: Samples and Sample Proportion

Sample Children Name Sample Sample Sample Children Name Sample Sample
Observations Proportion(p) Observations Proportion(p)

1 (Avaya, Avaya) (Yes, Yes) 1 19 (Zara, Avaya) (Yes, Yes) 1

2 (Avaya, Ishaan) (Yes, No) 0.5 20 (Zara, Ishaan) (Yes, No) 0.5

3 (Avaya, Ria) (Yes, Yes) 1 21 (Zara, Ria) (Yes, Yes) 1

4 (Avaya, Zara) (Yes, Yes) 1 22 (Zara, Zara) (Yes, Yes) 1

5 (Avaya, Kavya) (Yes, No) 0.5 23 (Zara, Kavya) (Yes, No) 0.5

6 (Avaya, Parth) (Yes, Yes) 1 24 (Zara, Parth) (Yes, Yes) 1

7 (Ishaan, Avaya) (No, Yes) 0.5 25 (Kavya, Avaya) (No, Yes) 0.5

8 (Ishaan, Ishaan) (No, No) 0 26 (Kavya, Ishaan) (No, No) 0

9 (Ishaan, Ria) (No, Yes) 0.5 27 (Kavya, Ria) (No, Yes) 0.5

10 (Ishaan, Zara) (No, Yes) 0.5 28 (Kavya, Zara) (No, Yes) 0.5

11 (Ishaan, Kavya) (No, No) 0 29 (Kavya, Kavya) (No, No) 0

12 (Ishaan, Parth) (No, Yes) 0.5 30 (Kavya, Parth) (No, Yes) 0.5

13 (Ria, Avaya) (Yes, Yes) 1 31 (Parth, Avaya) (Yes, Yes) 1

14 (Ria, Ishaan) (Yes, No) 0.5 32 (Parth, Ishaan) (Yes, No) 0.5

15 (Ria, Ria) (Yes, Yes) 1 33 (Parth, Ria) (Yes, Yes) 1

16 (Ria, Zara) (Yes, Yes) 1 34 (Parth, Zara (Yes, Yes) 1

17 (Ria, Kavya) (Yes, No) 0.5 35 (Parth, Kavya) (Yes, No) 0.5

18 (Ria, Parth) (Yes, Yes) 1 36 (Parth, Parth) (Yes, Yes) 1

Total 24

Mean 0.67

From the above table, you can see that the sample proportion varies from
sample to sample. To obtain the distribution of the sample proportions, we
arrange the values of the sample proportion in ascending order and calculate
the frequency corresponding to each value as discussed in Units 1 and 2. The
obtained sampling distribution is shown in Table 3.2.
Table 3.2: Sampling Distribution of Sample Proportion

S. Sample
Frequency Probability
No. Proportion(p)

1 0 4 4/36

2 0.5 16 16/36

3 1 16 16/36

To get an idea of the shape of the sampling distribution of sample proportion,


we plot the histogram of the values of the sample proportion taking the sample
proportion on the X-axis and corresponding frequencies on the Y-axis as shown
in Fig. 3.2(a).
From Fig. 3.2(a), you can observe that for a small sample size n = 2, the shape of the sampling distribution of proportion does not follow the normal distribution.
The sampling distribution of the sample proportions itself has a mean, variance, etc. Therefore, the mean of the sample proportions can be obtained (for the data given in Table 3.1) by the formula:

Mean of the sample proportions = p̄ = (1/K) Σ pi fi, where K = Σ fi

= (1/36) (0 × 4 + 0.5 × 16 + 1 × 16) = 0.67 = P
36
You see that the mean of the sampling distribution of sample proportions is
equal to the population proportion. Thus, we can say that to find the exact
estimate of the unknown population parameter, we first find the sampling
distribution of sample proportion and then compute the mean of the obtained
sampling distribution of proportion.
We now calculate the standard deviation of the sampling distribution of sample proportion as

SE(p) = √[(1/K) Σ fi (pi − p̄)²]

SE(p) = √[(1/36) {4 × (0 − 0.67)² + 16 × (0.5 − 0.67)² + 16 × (1 − 0.67)²}]

= √[(1/36) (1.7956 + 0.4624 + 1.7424)] = 0.3334

Fig. 3.2: Sampling distribution of sample proportion for sample size n = 2, 3, 5 and 30.

Let us see what happens as we increase the sample size, whether it converges to the normal distribution or not. We also plot the sampling distribution of sample proportion when the sample size n = 3, 5 and 30 in Fig. 3.2(b) to 3.2(d).

From Fig. 3.2, you can observe that as we increase the sample size, the sampling distribution of proportion approaches the normal distribution.
After knowing the shape of the sampling distribution of proportion, you may be
interested to know the mean and variance of the sampling distribution. Let us
find the mean, variance and standard error of the sampling distribution of
proportion.
Here, we divide the units of the population into two groups on the basis of an attribute. If the sample observations are independent, then the number of units that possess the given attribute (number of successes) follows a binomial distribution (Unit 10 of the course MST-012) with mean

E(X) = nP

and variance

Var(X) = nP(1 − P)

where P is the probability or proportion of success in the population.


Now, we can easily find the mean and variance of the sampling distribution of sample proportion by using the above expressions as

E(p) = E(X/n) = (1/n) E(X) = (1/n) nP

E(p) = P

and variance

Var(p) = Var(X/n) = (1/n²) Var(X)   [∵ Var(aX) = a² Var(X)]

= (1/n²) nP(1 − P)   [∵ Var(X) = nP(1 − P)]

Var(p) = P(1 − P)/n
Also, by the definition of standard error (the standard deviation of the sampling distribution of a statistic is known as the standard error), we can obtain the standard error of sample proportion as

SE(p) = SD(p) = √Var(p) = √[P(1 − P)/n]
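The two results E(p) = P and Var(p) = P(1 − P)/n can be verified exactly by summing over the binomial distribution of X. A sketch with arbitrarily chosen illustrative values n = 10 and P = 0.4 (any values would do):

```python
import math

n, P = 10, 0.4   # arbitrary illustrative values

# Binomial pmf of X = number of successes
pmf = [math.comb(n, k) * P**k * (1 - P)**(n - k) for k in range(n + 1)]

# Moments of the sample proportion p = X/n
E_p = sum((k / n) * pmf[k] for k in range(n + 1))
Var_p = sum((k / n - E_p) ** 2 * pmf[k] for k in range(n + 1))

# E(p) = P and Var(p) = P(1 - P)/n, matching the formulas above
print(round(E_p, 6), round(Var_p, 6))
```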
A question may arise: "Can we apply the central limit theorem for the sampling distribution of sample proportion?" The answer is "Yes" because the sample proportion can be thought of as a mean. If we label success as 1 and failure as 0, then the sample mean becomes the sample proportion as

X̄ = (1/n) Σ Xi = (1 + 0 + 0 + 1 + ... + 0)/n = X/n = p

Therefore, we can also apply the central limit theorem for large samples. The number of successes follows the binomial distribution, and the binomial distribution converges to the normal distribution when n is large and the probability of success (P) remains close to 0.5. Therefore, we require some conditions on both n and P. The central limit theorem states that the sampling distribution of the sample proportions converges to the normal distribution if

• nP > 15 and

• n(1 – P) > 15

These conditions are called normality conditions. (Some authors use the conditions nP > 5 or 10 and n(1 – P) > 5 or 10 for the normality of the binomial distribution.) Thus, we can say that if the sample size is sufficiently large, such that nP > 15 and n(1 – P) > 15, then by the central limit theorem, the sampling distribution of sample proportion p is approximately normally distributed with mean P and variance P(1 – P)/n, that is,

p ~ N(P, P(1 − P)/n)

In this case, we use the normal distribution to answer the probability questions about sample proportions, and the quantity

Z = (p − P)/√[P(1 − P)/n] ~ N(0, 1)

follows the standard normal distribution.

When any one of these normality conditions is not fulfilled, that is, if

• nP < 15 or

• n(1 – P) < 15

then the sampling distribution of the number of units that possess a given attribute follows a binomial distribution with mean nP and variance nP(1 – P), and we must use the binomial distribution to answer probability questions about sample proportions by converting the proportion to the number of items with the characteristic of interest, X (see Example 1).
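The decision between the normal approximation and the exact binomial is easy to mechanise. A hypothetical helper using the text's thresholds (the function name and `threshold` parameter are our own):

```python
def normal_approx_ok(n, P, threshold=15):
    # Normality conditions from the text: nP > 15 and n(1 - P) > 15
    return n * P > threshold and n * (1 - P) > threshold

# 250 orders with P = 0.90: both conditions hold (225 > 15, 25 > 15)
print(normal_approx_ok(250, 0.90))

# 20 orders with P = 0.90: n(1 - P) = 2 fails, so use the binomial directly
print(normal_approx_ok(20, 0.90))
```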

We hope you now understand the shape of the sampling distribution of proportion in different cases. We also summarise the various forms of the sampling distribution under different conditions using the flow chart given in Fig. 3.3, which helps you to quickly judge the shape of the sampling distribution of proportion.

Fig. 3.3: Sampling distribution of sample proportion.



After understanding the shape of the sampling distribution of sample proportion,


let us take a real-world situation and see the application of it with the help of an
example.
Example 1: An online retailer claims that 90% of all orders are shipped within
24 hours of being received. A consumer group placed 250 orders of different
sizes and at different times of the day and observed that 210 orders were
shipped within 24 hours. Then
(i) Compute the sample proportion of items shipped within 24 hours.
(ii) Is the sample size large enough to assume that the sampling distribution of
sample proportion is normally distributed? Use P = 0.90, corresponding to
the assumption that the retailer’s claim is valid.
(iii) Find the mean, standard deviation and standard error of the sampling
distribution of proportion.
(iv) Find the probability that the sample proportion computed from a sample of
size 250 will be within 5 percentage points of the true population proportion.
(v) If the consumer group placed 20 orders instead of 250 then how will this
affect the sampling distribution? Do you think this will affect the probability
of Part (iv)? Justify your answer.
Solution: Here, we are given that
P = 90% = 0.90, n = 250, X = 210

(i) We can compute the sample proportion as the number of orders (X) that
are shipped within 24 hours divided by the total number of orders (n) in the
sample. Therefore,
p = X/n = 210/250 = 0.84
(ii) The sample proportion p is approximately normally distributed if nP > 15
and n(1 − P) > 15. Therefore, we first check these conditions as follows:

nP = 250 × 0.90 = 225 > 15 and

n(1 − P) = 250 × (1 − 0.90) = 25 > 15

Since both conditions are fulfilled, therefore, we can say that the sample
size is large enough to take the normal distribution approximation of the
sampling distribution of sample proportion.
(iii) We can find the mean and standard deviation of the sampling distribution
of sample proportion as

E(p) = P = 0.90

and standard deviation

SD(p) = √(P(1 − P)/n) = √(0.90 × 0.10/250) = √0.00036 = 0.019

By the definition of standard error, we can obtain the standard error as

SE(p) = SD(p) = 0.019
Block 1 Sampling Distributions
(iv) For the sample proportion computed from a sample of size 250 to be within
5 percentage points of the true population proportion, it must lie between
0.90 − 0.05 = 0.85 and 0.90 + 0.05 = 0.95. Therefore, the required probability
will be P[0.85 < p < 0.95].

Since the sampling distribution of sample proportion is normally distributed


with a mean of 0.90 and a standard deviation of 0.019, therefore, we
convert the sample proportion to the standard normal Z-score (as
discussed in Units 1 and 2) as follows:
Z = (p − E(p))/SD(p) = (p − 0.90)/0.019

Therefore, we can transform the above expression of the probability to the


Z-scores as follows:

P[0.85 < p < 0.95] = P[(0.85 − 0.90)/0.019 < (p − 0.90)/0.019 < (0.95 − 0.90)/0.019]

= P[−2.63 < Z < 2.63]

We can write the above expression as

P[−2.63 < Z < 2.63] = P[Z < 2.63] − P[Z < −2.63]

From the standard normal tables (Tables III and IV) given in the Appendix of
this volume, we get

P[−2.63 < Z < 2.63] = P[Z < 2.63] − P[Z < −2.63] = 0.9957 − 0.0043 = 0.9914 ≈ 0.99

Hence, we can conclude that there is a 99% chance that the sample
proportion computed from a sample of size 250 will be within 5 percentage
points of the true population proportion.
(v) Since in this case the sample size n is 20, we check whether the normality
conditions, nP > 15 and n(1 − P) > 15, are fulfilled. Here,

nP = 20 × 0.90 = 18 > 15 but

n(1 − P) = 20 × (1 − 0.90) = 2 < 15

Since one of the conditions is not fulfilled, the sampling distribution of
proportion will follow the binomial distribution instead of the normal.
We now compute the required probability P [0.85 < p < 0.95] using the
binomial distribution.
To calculate the required probability, we use the binomial distribution, that
is, we have to convert 0.85 and 0.95 into the number of orders by
multiplying these with the sample size, that is, n = 20.
Therefore, we can transform the required probability into a binomial as

P[0.85 × 20 < p × 20 < 0.95 × 20] = P[17 < X < 19] = P[X = 18]

We can calculate this probability manually using a scientific calculator as


we have discussed in Unit 10 of MST-012 and with the help of the binomial

table as discussed in Block 1 of MST-014.

P[X = x] = nCx P^x (1 − P)^(n−x)

Here n = 20, P = 0.90, x = 18, therefore,

P[X = 18] = 20C18 (0.90)^18 (1 − 0.90)^(20−18) = 0.2852

Hence, we can conclude that there is only a 28.52% chance that the sample
proportion computed from a sample of size 20 will be within 5 percentage
points of the true population proportion.
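As a cross-check, both probabilities in this example can be reproduced with a few lines of standard-library Python. This is an illustrative sketch, not part of the course material; the helper `phi` is our own name for the standard normal CDF:

```python
import math

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Part (iv): normal approximation with n = 250, P = 0.90
P, n = 0.90, 250
sd = math.sqrt(P * (1 - P) / n)                      # ≈ 0.019
prob_normal = phi((0.95 - P) / sd) - phi((0.85 - P) / sd)
print(round(prob_normal, 2))                         # ≈ 0.99

# Part (v): exact binomial with n = 20 (normal conditions fail)
n2, x = 20, 18
prob_binom = math.comb(n2, x) * P**x * (1 - P)**(n2 - x)
print(round(prob_binom, 4))                          # ≈ 0.2852
```
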
Now, it is time for you to try the following Self Assessment Question to make
sure that you have learnt the sampling distribution of sample proportion.

SAQ 1
A machine produces a large number of items of which 8% are found to be
defective. If 40 items are selected randomly from the production, then
(i) What will be the sampling distribution of sample proportion?

(ii) Find the mean, standard deviation and standard error of the sampling
distribution.

(iii) Calculate the probability that less than or equal to 12% defectives are found
in the sample.

I hope you understood the sampling distribution of proportion. Let us move to
the sampling distribution of difference of two sample proportions in the next
section.

3.3 SAMPLING DISTRIBUTION OF DIFFERENCE


OF TWO SAMPLE PROPORTIONS
In Section 2.5 of Unit 2, we have discussed the sampling distribution of
difference of two sample means. In some cases, we are interested in a
comparative study of the proportions of an attribute in two different populations
or groups. For example,
• An analyst wishes to test whether the proportions of alcohol drinkers in the
two cities are the same,

• An economist is interested in comparing the proportions of literates between
two groups of people,

• A researcher wants to compare the proportion of taxpayers in two states.

• A researcher wants to estimate the decrease in the proportions of


consumption of tea after the increase in excise duty on it, etc.

In such situations, we have to draw inferences about the two unknown


population proportions, therefore, it is necessary to know how the sample
statistic (i.e. the difference between two sample proportions) is related to its true
but unknown population parameter (i.e. the difference between two population
proportions). The relationship can be described by the sampling distribution of
difference of two sample proportions.
To study the shape of the sampling distribution of the difference of two sample
proportions, we use a similar rationale to that developed for the sampling
distribution of a single proportion and for difference of two sample means.
Suppose there are two populations, say, Population-I and Population-II under
study and suppose Population-I and Population-II have population proportions
P1 and P2, respectively, according to an attribute. For describing the sampling
distribution of difference of two sample proportions, we consider all possible
samples of the same size, say, n1 taken from Population-I and for each sample,
we calculate the sample proportion p1 of success. Similarly, determine the
sample proportion p2 of success by considering all possible samples of the
same size, say, n2 from Population-II. Then we consider all possible differences
of proportions p1 and p2. The difference of these proportions may or may not
differ, so we construct the probability distribution of these differences. The
probability distribution thus obtained is called the sampling distribution of
the difference of two sample proportions. Therefore, we can define it as:

“The probability distribution of all values of the difference of two sample
proportions that have been obtained by drawing all possible samples of
same sizes from both the populations is called the sampling distribution
of difference between two sample proportions or simply sampling
distribution of difference between two proportions.”

As we have seen in the case of single proportion (described in the previous
section) if the sample size is sufficiently large, such that nP > 15 and n(1 – P) >
15 then by central limit theorem, the sampling distribution of sample proportion
p is approximately normally distributed with mean P and variance P(1 – P)/n.
Therefore, if n1 and n2 are sufficiently large, such that n1P1 > 15, n1(1 − P1) > 15,
n2P2 > 15, and n2(1 − P2) > 15, then

p1 ~ N(P1, P1(1 − P1)/n1) and p2 ~ N(P2, P2(1 − P2)/n2)

Also, by the property of normal distribution described in Unit 14 of MST-012, the
sampling distribution of the difference of two proportions also follows a normal
distribution with mean

E(p1 − p2) = E(p1) − E(p2) = P1 − P2

and variance (since the samples are independent)

Var(p1 − p2) = Var(p1) + Var(p2) = P1(1 − P1)/n1 + P2(1 − P2)/n2

That is,

p1 − p2 ~ N(P1 − P2, P1(1 − P1)/n1 + P2(1 − P2)/n2)

By the definition of the standard error, we can obtain it as follows:

SE(p1 − p2) = √Var(p1 − p2) = √(P1(1 − P1)/n1 + P2(1 − P2)/n2)
Therefore, we can say that the sampling distribution of difference of two sample

proportions converges to the normal distribution if


• n1P1 > 15, n1 (1 − P1 ) > 15, and

• n2P2 > 15, n2 (1 − P2 ) > 15

In this case, we use the normal distribution to answer probability questions


about difference of two sample proportions and the quantity

Z = [(p1 − p2) − (P1 − P2)] / √(P1(1 − P1)/n1 + P2(1 − P2)/n2) ~ N(0, 1)

follows the standard normal distribution.


However, when any one of the above conditions is not fulfilled, that is, if
• n1P1 ≤ 15 or n1 (1 − P1 ) ≤ 15 or

• n2P2 ≤ 15 or n2 (1 − P2 ) ≤ 15

then the sampling distribution of the difference of two proportions may not be
approximately normal. For smaller sample sizes, this approximation may not
hold, and the distribution could be skewed or have heavier tails and does not
follow a standard form.
We hope you now understand the shape of the sampling distribution of difference
of two proportions in different cases. The various forms of the sampling
distribution under different conditions are also summarised in the flow chart
given in Fig. 3.4, which helps you quickly judge the shape of the sampling
distribution of difference of two proportions.

Fig. 3.4: Sampling distribution of difference of two sample proportions.

Let us see an application of the sampling distribution of difference of two


sample proportions with the help of an example.
Example 2: In City A, 30% of people have blue eyes and in City B, 20% have
blue eyes. If a random sample of 200 people is taken from each population
independently, then
(i) What is the shape of the sampling distribution of difference of two
proportions?
(ii) Find the mean, standard deviation and standard error of the sampling
distribution.
(iii) Find the probability that the difference in sample proportions is less than
0.05.
Solution: Here, we are given that
P1 = 0.30, P2 = 0.20, n1 = n2 = 200
To check whether the shape of the sampling distribution of difference of two
proportions follows the normal distribution or not, we check whether the
following conditions are satisfied or not as
n1P1 = 200 × 0.30 = 60 > 15, n1(1 − P1) = 200 × (1 − 0.30) = 140 > 15,

n2P2 = 200 × 0.20 = 40 > 15, n2(1 − P2) = 200 × (1 − 0.20) = 160 > 15

Since all conditions are satisfied, therefore, we can assume that the sampling
distribution of difference of two sample proportions is approximately normally
distributed.
Thus, we can compute the mean and standard deviation of the sampling
distribution as
E(p1 − p2) = P1 − P2 = 0.30 − 0.20 = 0.10

and standard deviation

SD(p1 − p2) = √(P1(1 − P1)/n1 + P2(1 − P2)/n2)

= √(0.30 × 0.70/200 + 0.20 × 0.80/200) = √0.00185 = 0.043

By the definition of the standard error, we can compute the standard error as

SE(p1 − p2) = SD(p1 − p2) = 0.043
Now, we have to compute the probability that the difference in sample
proportions is less than 0.05. Therefore, we can represent it in symbolic form as
P [p1 − p2 < 0.05]

Since the sampling distribution of differences in sample proportions of blue-


eyed persons is normally distributed with a mean of 0.10 and standard deviation
of 0.043, therefore, to obtain the value of this probability, we convert (p1 − p2 )
into a standard normal Z-score by the following transformation:
Z = [(p1 − p2) − E(p1 − p2)]/SD(p1 − p2) = [(p1 − p2) − 0.10]/0.043

Therefore, by subtracting 0.10 from each term and then dividing each term by
0.043 in the above inequality, we get the required probability as

P[((p1 − p2) − 0.10)/0.043 < (0.05 − 0.10)/0.043] = P[Z < −1.16]

From the standard normal table (Table III), we have

P[p1 − p2 < 0.05] = P[Z < −1.16] = 0.1230

Thus, we conclude that there would be only a 12.3% chance that the difference
in the sample proportions of blue-eyed people in the two cities is less than 0.05.
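The probability in Example 2 can be reproduced with standard-library Python. This is an illustrative sketch, not part of the course material; `phi` is our own name for the standard normal CDF:

```python
import math

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

P1, P2, n1, n2 = 0.30, 0.20, 200, 200
mean = P1 - P2                                            # 0.10
sd = math.sqrt(P1 * (1 - P1) / n1 + P2 * (1 - P2) / n2)   # ≈ 0.043
prob = phi((0.05 - mean) / sd)                            # P[p1 - p2 < 0.05]
print(round(prob, 2))                                     # ≈ 0.12
```
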
Now, try to solve the following Self Assessment Question to ensure that you
understand this section properly.

SAQ 2
In Example 2, if 40 people are selected randomly from each population
independently instead of 200 people then how will this affect the sampling
distribution? Do you think this will affect the probability? Justify your answer.

I hope that you understand the shape of the sampling distributions of sample
proportion and difference of two proportions when the samples are independent,
and now you may be curious to know the shape of the sampling distribution of
sample variance. Therefore, we discuss the sampling distribution of sample
variance in the next section.

3.4 SAMPLING DISTRIBUTION OF SAMPLE


VARIANCE
In Sections 2.2 and 3.2 of Units 2 and 3, you have studied the sampling
distributions of sample mean and sample proportion, respectively. But in many
practical situations, we have to be concerned with variability. Some situations
where we must study the variability of the population are given as follows:

• In the manufacturing sector, when we manufacture products that fit


together (such as pipes and ball bearings), then it is important to keep the
variations of the diameters of the products as small as possible; otherwise,
they will not fit together properly and will have to be scrapped.

• When pharmaceuticals are manufactured, then variation in the tablet plays


a crucial role in ensuring that patients are given the recommended amount
of medicine.

• The variation in the weight of the juice packet also plays an important role
in the goodwill of the company.

In such situations, it is important to study the variability of the population. But if


the population is too large or the units or items of the population are destructive
in nature or there are limited resources such as manpower, money, etc., then it
is not possible practically to examine every unit of the population to obtain the
necessary information of variance of the population. In such situations, one can
draw a sample from the population under study and utilize sample observations
to extract the necessary information about the population variance. For drawing
the inference about the population variance, we use sample variance, and we
require the sampling distribution of sample variance for providing information
about the variance of the whole population using sample data and conclude as
accurately and reliably as possible.
To obtain the sampling distribution of sample variance, we proceed in the same
manner as in the cases of the sampling distributions of mean and proportion.
For describing the sampling distribution of the sample variance, we consider all
possible samples of the same size, say, n taken from the population and for
each sample, we calculate sample variance (S2). The values of sample variance
may vary from sample to sample, so we construct the probability distribution of
sample variances. The probability distribution thus obtained is known as the
sampling distribution of the sample variance. Therefore, we can define the
sampling distribution of sample variance as:
“The probability distribution of all values of the sample variance that would be
obtained by drawing all possible samples of the same size from the parent
population is called the sampling distribution of the sample variance or
simply the sampling distribution of the variance.”

Now, the question may arise in your mind “What is the shape of the sampling
distribution of sample variance, is it normal, or t?” as in the case of the sampling
distribution of mean. The shape of the sampling distribution of sample variance
also depends on
population.
• the form/shape/distribution of the population from which the samples are
taken, and
• the size of the sample.
To discuss how the form/shape/distribution of the population affects the
sampling distribution of sample variance, we divide our discussion into two
parts:
1. When population is normally distributed; and
2. When population is not normally distributed.
Let us discuss one at a time.

3.4.1 When Population is Normally Distributed


In many situations, it is reasonable to assume that the population from which we
select a random sample has a normal or nearly normal distribution. For
example, people's heights, IQ level, blood pressure, the weight of newly born
babies, test scores, incomes, shoe size, etc. generally follow a normal
distribution. When the population follows a normal distribution, then the
sampling distribution of sample variance does not follow a normal distribution
especially when the sample size is small (< 30). For a better understanding of
the sampling distribution of sample variance and its shape, we consider an
example in which the size of the population is very small.
Let us consider the example of our cute children of play school discussed in
Unit 2. Suppose the play-school teacher is now interested in knowing the
variability in their scores of how many digits they would repeat from memory
after hearing them once (out of 5). The obtained data are as follows:

Child Name Avaya Ishaan Ria Zara Kavya Parth Rayaan Nyra Yash

Score 2 5 4 3 1 3 2 4 3

We can calculate the population mean (average score) as

μ = (2 + 5 + 4 + 3 + 1 + 3 + 2 + 4 + 3)/9 = 3
Similarly, we can obtain the standard deviation of the population as

σ = √((1/N) Σ (Xi − μ)²) = √([(2 − 3)² + (5 − 3)² + … + (3 − 3)²]/9) = √1.33 = 1.15

To know whether the shape of the population (test scores) is normal or not, we
first plot the graph of the scores of the children as shown in Fig. 3.5.


Fig. 3.5: The distribution of the scores of the children in the listening task.

From the above figure, you can observe that the test scores of the children
follow an approximate bell-shaped normal distribution.
To obtain the shape of the sampling distribution of the sample variance when
the sample is taken from a normal population, let us consider all possible
samples with replacement of size n = 2 from the population of the test scores of
the children. There are Nn = 92 = 81 possible simple random samples with
replacement of size 2. All possible samples of size n = 2 are given in the
second column of Table 3.3. We also calculate the variance of each sample.
The obtained sample variances are given in the last column of the same table.
Table 3.3: Samples and Sample Variances

Sample Number    Sample in Terms of Children    Sample Observations (Scores)    Sample Variance

1 (Avaya, Avaya) (2, 2) 0.00

2 (Avaya, Ishaan) (2, 5) 2.25

3 (Avaya, Ria) (2, 4) 1.00

4 (Avaya, Zara) (2, 3) 0.25

5 (Avaya, Kavya) (2, 1) 0.25

6 (Avaya, Parth) (2, 3) 0.25

7 (Avaya, Rayaan) (2, 2) 0.00

8 (Avaya, Nyra) (2, 4) 1.00

9 (Avaya, Yash) (2, 3) 0.25

10 (Ishaan, Avaya) (5, 2) 2.25

… … … …

81 (Yash, Yash) (3, 3) 0.00

Here, we have listed only some of the possible samples to save space. You
can prepare the rest as we have discussed in Units 1 and 2. To obtain the
distribution of the sample variances, we arrange the values of the sample
variance in ascending order and calculate the frequency corresponding to each
value as discussed in Units 1 and 2. The obtained sampling distribution of
sample variance is shown in Table 3.4.
Table 3.4: Sampling Distribution of Sample Variance

S. No.    Sample Variance    Frequency (f)    Probability (p)

1 0 19 19/81

2 0.25 32 32/81

3 1.00 20 20/81

4 2.25 8 8/81

5 4.00 2 2/81

6 4.50 0 0/81

Total 81 1
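Table 3.4 can be verified by brute-force enumeration of all 81 ordered samples. The sketch below is only illustrative, not part of the course material; note that it uses the divisor-n form of the sample variance, as Table 3.3 does:

```python
from itertools import product
from collections import Counter

scores = [2, 5, 4, 3, 1, 3, 2, 4, 3]   # the nine children's scores

def sample_variance(sample):
    """Sample variance with divisor n, as used in Table 3.3."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

# All 9^2 = 81 ordered samples of size 2 drawn with replacement
counts = Counter(sample_variance(s) for s in product(scores, repeat=2))
for variance in sorted(counts):
    print(variance, counts[variance], f"{counts[variance]}/81")
```

Running this reproduces the non-zero rows of Table 3.4: frequencies 19, 32, 20, 8 and 2 for variances 0, 0.25, 1.00, 2.25 and 4.00.
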

To get an idea of the shape of the sampling distribution of sample variance, we


plot the graph of the values of the sample variance taking the sample variance
on the X-axis and corresponding frequencies on the Y-axis as shown in Fig.
3.6(a).

Fig. 3.6: Sampling distribution of variance when population is normal for sample size n = 2,
3, 5 and 30.

From Fig. 3.6(a), you can observe the shape of the distribution of sample
variance, and you can notice that for a small sample size n = 2, the shape of the
sampling distribution of sample variance does not follow a normal distribution;
rather, it is right-skewed.

Let us see what happens as we increase the sample size whether it converges
to normal distribution or not. We plot the sampling distribution of sample
variance when sample size n = 3, 5 and 30 in Fig. 3.6(b) to 3.6(d), respectively.
From Fig. 3.6, you can observe that as we increase the sample size the
sampling distribution approaches normal distribution.
The distribution of sample variance cannot be obtained directly. In 1875, the
German statistician Friedrich Robert Helmert applied a transformation to the
sample variance and obtained, for a normal population, a new distribution which
is now known as the chi-square distribution.
population. However, the exact shape of the sampling distribution of sample
variance also depends on whether the population mean is known or unknown.
Therefore, we consider both cases as follows:
Population Mean is Known
When we draw the samples from the normal population with known mean μ and
variance σ², then we calculate sample variance using the following formula:

S² = (1/n) Σ (Xi − μ)², where the sum runs over i = 1, …, n

To find the exact distribution of the sampling distribution of sample variance, we
make some transformations. We multiply sample variance S² by sample size n
and divide the product by the population variance (σ²) to find the exact form of
the sampling distribution of sample variance. In this situation the quantity

χ² = nS²/σ² = Σ (Xi − μ)²/σ² ~ χ²(n)

follows the chi-square distribution with n degrees of freedom. We will discuss


the chi-square distribution in detail in Unit 4.
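This result is easy to check numerically. The following Monte Carlo sketch is only illustrative (the seed and the parameters μ, σ, n are arbitrary choices of ours): for samples from N(μ, σ²) with μ known, the quantity nS²/σ² should have mean close to n and variance close to 2n, as a χ²(n) variate does.

```python
import random

random.seed(1)
mu, sigma, n, reps = 10.0, 2.0, 5, 50_000

values = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    s2 = sum((x - mu) ** 2 for x in sample) / n   # known-mean sample variance
    values.append(n * s2 / sigma ** 2)            # should behave like chi-square(n)

mean = sum(values) / reps
var = sum((v - mean) ** 2 for v in values) / reps
print(round(mean, 2), round(var, 2))   # close to n = 5 and 2n = 10
```
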

Population Mean is Unknown

In most real-world situations, the mean of the population is rarely known,
especially when the population is too large or destructive in nature. For
example, the height of all people of a country, the sound of the crackers, the
blood test of a human being, the income of the people of a city, etc. In such
cases, the population mean is generally unknown.

When the population mean is unknown, we use the sample mean X̄ in place of
the population mean (μ). However, to account for the discrepancy between the
sample mean and the population mean, we calculate sample variance using the
following formula:

S² = (1/(n − 1)) Σ (Xi − X̄)²

That is, we take (n − 1) in place of n.

To find the exact distribution of the sampling distribution of sample variance, we
make a similar transformation as discussed above; in this case, we multiply
sample variance S² by (n − 1) instead of sample size n and divide the product by
the population variance (σ²). The resulting quantity

χ² = (n − 1)S²/σ² = Σ (Xi − X̄)²/σ² ~ χ²(n−1)

follows the chi-square distribution with (n − 1) degrees of freedom.
After knowing the shape of the sampling distribution of sample variance when
the population is normally distributed and the population mean is known or
unknown, you may be interested to know the mean and variance of the
sampling distribution of sample variance.
In practice, only one random sample is selected, and the concept of the
sampling distribution is used to draw the inference about the population
parameters. If X1, X2, …, Xn is a random sample of size n taken from a normal
population with mean μ and variance σ², then the mean and variance of the
sampling distribution of sample variance also depend on whether the mean of
the population is known or not. Therefore, we consider the following cases:
When population mean is known

If the population mean is known, then S² = (1/n) Σ (Xi − μ)² and nS²/σ² ~ χ²(n),
so the mean and variance of the sampling distribution of sample variance are
given as follows:

Mean of S² = E(S²) = σ²

and variance

Var(S²) = 2σ⁴/n

Similarly, the standard error is given by

SE(S²) = SD(S²) = √Var(S²) = √(2σ⁴/n)

When population mean is unknown

If the population mean is unknown, then the mean and variance of the
sampling distribution of sample variance are given as follows:

Mean of S² = E(S²) = σ²

and variance

Var(S²) = 2σ⁴/(n − 1)
By the definition of standard error, the standard error is the standard deviation
of the sampling distribution, therefore,

SE(S²) = SD(S²) = √Var(S²) = √(2σ⁴/(n − 1))
The proofs of the above expressions are out of scope.
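Although the proofs are omitted, the unknown-mean results can be checked numerically. The rough simulation sketch below is only illustrative (the seed and the parameters μ, σ, n are arbitrary choices of ours):

```python
import random
import statistics

random.seed(7)
mu, sigma, n, reps = 50.0, 3.0, 10, 40_000

s2_values = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    s2_values.append(statistics.variance(sample))   # divisor n - 1

mean_s2 = sum(s2_values) / reps
var_s2 = sum((v - mean_s2) ** 2 for v in s2_values) / reps
print(round(mean_s2, 2))   # close to sigma^2 = 9
print(round(var_s2, 2))    # close to 2*sigma^4/(n - 1) = 18
```
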

Note: To solve the problems relating to the sampling distribution of sample


variance, we convert the variate S2 into a chi-square variate by the above
transformation. For better understanding, we explain it by taking an example
given as follows:

Example 3: A manufacturer of steel ball bearings has found that the distribution
of the diameter of the steel ball bearings follows a normal distribution with a
variance of 0.18 inches2. If the manufacturer took a random sample of 25 steel
ball bearings, then
(i) What will the sampling distribution of sample variance be?
(ii) Find the mean, standard deviation and standard error of the sampling
distribution.

(iii) What is the probability that a random sample of 25 ball bearings will result
in a sample variance of at least 0.2 inches2?

Solution: Here, we are given that

σ² = 0.18, n = 25

Since it is given that the diameter of the steel ball bearings follows a normal
distribution with a variance of 0.18 inches², the sampling distribution of
(n − 1)S²/σ² follows the chi-square distribution. Since the population mean is
unknown, it follows the chi-square distribution with n − 1 degrees of freedom.

We can compute the mean and standard deviation when the population mean is
unknown as

E(S²) = σ² = 0.18 and SD(S²) = √(2σ⁴/(n − 1)) = √(2 × 0.18²/24) = 0.052

Since the standard error is the standard deviation of the sampling distribution,
therefore,

SE(S²) = SD(S²) = 0.052
Now, we can answer part (iii) by computing the probability that the sample
variance is at least 0.2 inches² as follows:

P[S² ≥ 0.2]

To get the value of this probability, we convert the variate S² into the chi-square
variate, which follows the chi-square distribution with n − 1 = 24 degrees of
freedom, by the transformation

χ²(n−1) = (n − 1)S²/σ² = 24S²/0.18

Therefore, multiplying each term by 24 and then dividing each term by 0.18 in
the above inequality, we get the required probability as

P[24S²/0.18 ≥ 24 × 0.2/0.18] = P[χ²(24) ≥ 26.67]

(We discuss more about the chi-square table in Unit 4.)

To find the value of the above expression, we will use the chi-square table
(Table VI) given in the Appendix at the end of this volume. The chi-square table
is similar to the t-table in the sense that the body of the chi-square table
represents the critical value or point corresponding to the probability. Here, we
have to find the probability beyond the point 26.67 at 25 − 1 = 24 degrees of
freedom. So, we check whether this value lies in the row corresponding to 24
degrees of freedom or not. Since it is not there, we read the probability values
(α) from the column headings corresponding to the critical values just smaller
and just greater than 26.67. Thus, from the chi-square table, we get

P[χ²(24) ≥ 15.66] = 0.90 and P[χ²(24) ≥ 33.20] = 0.10.

Therefore, the above probability must lie between 0.10 and 0.90, that is,

0.10 < P[χ² ≥ 26.67] = P[S² ≥ 0.2] < 0.90

Therefore, we conclude that there is a 10% to 90% chance that the sample
variance is at least 0.2 inches².

Note: With the help of the chi-square table, we cannot calculate the exact
probability. But computer packages or software such as R, SPSS, SAS,
MINITAB, STATA, EXCEL, etc. help us to calculate it exactly. From R, we find it
as

P[S² ≥ 0.2] = P[χ² ≥ 26.67] = 0.32007

After reading the above discussion, you can try the following Self Assessment
Question.

SAQ 3
Consider Example 3 of steel ball bearings. If the average diameter of the steel
ball bearings is known to be 2 inches, then how will this affect the sampling
distribution and standard error? Do you think this will affect the probability?
Justify your answer.

3.4.2 When Population is not Normally Distributed


In the previous sub-section, we discussed the shape of sample distributions of
sample variance when we take samples from a normally distributed population
and the population mean is known and unknown. However, in real-life
applications, it is quite common that the data does not follow normal distribution.
For example, the distribution of stock market returns, road accidents, and
several natural phenomena like rainfall and earthquake magnitudes do not
follow the normal distribution. Many disciplines, including economics, biology,
sociology, and environmental science, deal with non-normal data. You may be
curious to know whether the shape of the sampling distribution of sample
variance is normal in such situations or not. When the samples are drawn from
non-normal populations, the sampling distribution of sample variance generally
does not follow any standard distribution when the sample size is small.
However, when the sample size becomes large, the sampling distribution of
sample variance is approximately normally distributed.

We also explain various forms of the sampling distribution under different


conditions using the flow chart given in Fig. 3.7.

Fig. 3.7: Sampling distribution of sample variance.

We hope you understand the shape of the sampling distribution of sample
variance in different situations. Let us move to the sampling distribution of ratio
of two sample variances in the next section.

3.5 SAMPLING DISTRIBUTION OF RATIO OF TWO


SAMPLE VARIANCES
In the previous section, we discussed the sampling distribution of sample
variance where we were dealing with one population. Now, one may be
interested in pursuing the comparative study of the variances of the two
populations as we have done for the means and the proportions. For example,
• A quality control engineer wants to compare the number of defective units
produced by two machines,

• A doctor wants to test whether the variance in the weight of newly born
baby girls is less than that of baby boys,

• A researcher is interested in knowing which one of two different types of
drugs has a lesser variance in controlling high blood pressure,

• An economist may wish to know whether the variability in income differs


across the two populations, etc.

In such situations, we require a comparative study of the variances of both groups. But if the populations are too large, or the units or items of the populations are destructive in nature, or there are limited resources such as manpower, money, etc., then it is not practically possible to examine every unit of the population to obtain the necessary information about the population variances. In such situations, for comparing the population variances, we require the sampling distribution of ratio of two sample variances. In the cases of two means and two proportions, we prepare the sampling distributions of difference of two sample means and of two sample proportions, respectively, but the sampling distribution of difference of two variances does not follow a standard distribution. However, it is observed that the sampling distribution of ratio of two sample variances follows a distribution which is known as the F-distribution when the populations are independent and follow normal distributions. Therefore, we consider the sampling distribution of ratio of two sample variances in place of the difference.
Block 1 Sampling Distributions
To obtain the sampling distribution of ratio of sample variances, we proceed in the same manner as in the cases of two sample means and proportions. Suppose we study the same characteristic in two different groups, such as the income of people of two cities or the height of players of two basketball teams, and we consider the observations of the two groups as Population-I and Population-II, where both populations are independent. Suppose these two populations are normally distributed, Population-I with mean μ₁ and variance σ₁² and Population-II with mean μ₂ and variance σ₂².

To obtain the sampling distribution of ratio of sample variances, we take all possible samples of the same size, say, n₁ from Population-I and then the sample variance, say, S₁², is calculated for each sample. Similarly, all possible samples of the same size n₂ are taken from Population-II and the sample variance, say, S₂², is calculated for each sample. If the populations are too large, then we consider only as many samples as we can afford instead of all of them from each population. Then we consider all possible ratios of the sample variances S₁² and S₂². The ratio of these sample variances may or may not differ from sample to sample and is considered a random variable. Therefore, we construct the probability distribution of these ratios. The probability distribution thus obtained is known as the sampling distribution of ratio of variances. We can define the sampling distribution of ratio of sample variances as:
“The probability distribution of all values of the ratio of two sample variances that would be obtained by drawing all possible samples from both the populations is called the sampling distribution of ratio of two sample variances, or simply the sampling distribution of ratio of variances.”

The shape of the sampling distribution of ratio of sample variances also depends on the form of the populations. To discuss how the form/shape/distribution of the populations affects the sampling distribution of ratio of sample variances, we divide our discussion into two parts:

1. When populations are normally distributed; and

2. When populations are not normally distributed.

Let us discuss one at a time.


3.5.1 When Populations are Normally Distributed
In many situations, it is reasonable to assume that the populations for which we
have to compare the variances are normally distributed. For example, people's
heights, IQ level, the diameter of ball bearings, blood pressure, the weight of
newly born babies, test scores, incomes, shoe size, etc. generally follow a
normal distribution. When the populations are normally distributed, then the
sampling distribution of ratio of sample variances does not follow normal
distribution especially when the sample size is small (< 30). However, it follows
a new distribution which is known as F-distribution (We will describe the
F-distribution in detail in Unit 5) under some conditions.
In this case, the appropriate sample statistic is S₁²/S₂², and its associated population parameter is σ₁²/σ₂².

The sampling distribution of ratio of variances was given by Prof. R. A. Fisher in 1924. According to Prof. R. A. Fisher, the quantity

F = (S₁²/σ₁²) / (S₂²/σ₂²) ~ F(n₁ − 1, n₂ − 1)

If σ₁² = σ₂², then

F = S₁²/S₂² ~ F(n₁ − 1, n₂ − 1)

follows the F-distribution with (n₁ − 1, n₂ − 1) degrees of freedom.


Therefore, the sampling distribution of ratio of sample variances follows the F-distribution with (n₁ − 1, n₂ − 1) degrees of freedom when the populations are independent and normally distributed. We will discuss the F-distribution in detail in Unit 5 of this volume.
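This sampling result can be checked by simulation. The sketch below is illustrative only: the sample sizes, population standard deviations and random seed are assumptions, not values from the text. It draws repeated pairs of normal samples and compares the simulated ratio with the theoretical F(n₁ − 1, n₂ − 1) distribution using SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n1, n2 = 8, 12              # illustrative small sample sizes
sigma1, sigma2 = 2.0, 3.0   # illustrative population standard deviations
reps = 100_000

# Sample variances S1^2 and S2^2 for many repeated samples
# (ddof=1 gives the divisor n - 1, as in the definition of S^2)
s1_sq = rng.normal(0.0, sigma1, size=(reps, n1)).var(axis=1, ddof=1)
s2_sq = rng.normal(0.0, sigma2, size=(reps, n2)).var(axis=1, ddof=1)

# F = (S1^2/sigma1^2) / (S2^2/sigma2^2) should follow F(n1-1, n2-1)
f_vals = (s1_sq / sigma1**2) / (s2_sq / sigma2**2)
theory = stats.f(dfn=n1 - 1, dfd=n2 - 1)

print("simulated mean:  ", round(f_vals.mean(), 3))
print("theoretical mean:", round(theory.mean(), 3))   # (n2-1)/(n2-3) = 11/9
```

With 100,000 replications the simulated mean lands very close to the theoretical mean 11/9 ≈ 1.222, and a histogram of f_vals traces the F(7, 11) density.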

3.5.2 When Populations are not Normally Distributed


In the previous sub-section, we discussed the shape of the sampling distribution of
ratio of sample variances when we take independent samples from two
normally distributed populations. However, in real-life applications, it is quite
common that the data does not follow normal distribution. In such situations, the
shape of the sampling distribution of ratio of sample variances (when samples
are drawn from non-normal populations) generally does not specify or follow a
standard distribution when the sample size is small. However, when sample
sizes become large then the sampling distribution of ratio of two variances is
approximately normally distributed.
We also explain various forms of the sampling distribution of ratio of sample
variances under different conditions using the flow chart given in Fig. 3.8.

Fig. 3.8: Sampling distribution of ratio of two sample variances.

With this, we end this unit. We now summarise our discussion.

3.6 SUMMARY
In this unit, we have discussed the following points:
• The sampling distribution of the sample proportions converges to the normal distribution with mean P and variance P(1 − P)/n if nP > 15 and n(1 − P) > 15; otherwise, it follows a binomial distribution.
• The sampling distribution of difference of two sample proportions converges to the normal distribution with mean P₁ − P₂ and variance P₁(1 − P₁)/n₁ + P₂(1 − P₂)/n₂ if

 n₁P₁ > 15, n₁(1 − P₁) > 15, and

 n₂P₂ > 15, n₂(1 − P₂) > 15

If any one of the above conditions is not fulfilled, then the sampling distribution of difference of two sample proportions does not follow the normal distribution.
• The sampling distribution of variance follows a chi-square distribution with n
and n – 1 degrees of freedom when the population follows a normal
distribution and the population mean is known or unknown, respectively. If
the population is not normally distributed, then it follows neither the chi-square nor the normal distribution for small samples.
• The sampling distribution of ratio of sample variances follows the
F-distribution with (n1 – 1, n2 – 1) degrees of freedom when the populations
follow a normal distribution. If the populations are not normally distributed, then it follows neither the F-distribution nor the normal distribution for small samples.

3.7 TERMINAL QUESTIONS


Consider SAQ 1. If a random sample of 200 is taken, then how will this affect
the sampling distribution?

3.8 SOLUTIONS / ANSWERS


Self Assessment Questions (SAQs)
1. Here, we are given that
P = 8% = 0.08, n = 40

(i) The sample proportion p is approximately normally distributed if nP > 15 and n(1 − P) > 15. Therefore,
nP = 40 × 0.08 = 3.2 < 15 and

n(1 − P) = 40 × (1 − 0.08) = 36.8 > 15

Since the conditions of normality are not fulfilled, therefore, we can


assume that the sampling distribution of sample proportion is binomial.

(ii) We can compute the mean and standard deviation of the sampling distribution of the sample proportion as

E(p) = P = 0.08

and standard deviation

SD(p) = √(P(1 − P)/n) = √(0.08 × 0.92/40) = √0.00184 = 0.0429

Also, by definition of the standard error, we have

SE(p) = SD(p) = 0.0429
(iii) The probability that the sample proportion will be less than or equal to
12% defectives is given by
P[p ≤ 0.12]

Since the sampling distribution of the sample proportion is binomial, we use the binomial distribution to calculate the required probability. Therefore, we convert 0.12 into the number of defective items by multiplying 0.12 by the sample size, that is, n = 40. We can transform the required probability into a binomial one as

P[40p ≤ 0.12 × 40] = P[X ≤ 4.8] = P[X = 0] + P[X = 1] + P[X = 2] + P[X = 3] + P[X = 4]

We can calculate this probability as we have discussed in Unit 10 of MST-012 and Block 1 of MST-014:

P[X ≤ 4] = Σₓ₌₀⁴ ⁿCₓ Pˣ(1 − P)ⁿ⁻ˣ

Here n = 40, P = 0.08 and x = 0, 1, 2, 3, 4, therefore,

P[X ≤ 4.8] = 0.7868

Hence, we can conclude that there is a 78.68% chance that the machine produces at most 12% defective items.
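As a quick numerical check of this SAQ, the standard deviation from part (ii) and the binomial probability from part (iii) can be reproduced with SciPy; a sketch:

```python
from scipy import stats

n, P = 40, 0.08

# Part (ii): SD of the sampling distribution of the sample proportion
sd_p = (P * (1 - P) / n) ** 0.5
print(round(sd_p, 4))            # 0.0429

# Part (iii): P[p <= 0.12] = P[X <= 4] for X ~ Binomial(40, 0.08)
prob = stats.binom.cdf(4, n, P)
print(round(prob, 4))            # 0.7868
```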

2. Here the sample sizes are given as

n₁ = 40, n₂ = 40

Since sample sizes have been changed so we have to check whether the
normality conditions are satisfied or not in the current scenario to know the
shape of the sampling distribution of difference of two sample proportions
as
n₁P₁ = 40 × 0.30 = 12 < 15, n₁(1 − P₁) = 40 × (1 − 0.30) = 28 > 15,

n₂P₂ = 40 × 0.20 = 8 < 15, n₂(1 − P₂) = 40 × (1 − 0.20) = 32 > 15

Since the conditions n₁P₁ = 12 < 15 and n₂P₂ = 8 < 15 are not satisfied,
therefore, we cannot assume that the sampling distribution of difference of
two sample proportions is approximately normally distributed. In this
situation, we cannot calculate the required probability because we do not
know the form of the sampling distribution of difference of two sample
proportions.

3. Here, we are given that

σ² = 0.18, n = 25, µ = 2

Since it is given that the diameter of the steel ball bearings follows a normal distribution with a mean of 2 inches and variance of 0.18 inches², the sampling distribution of nS²/σ² follows the chi-square distribution with n degrees of freedom because the population mean is known.

Now, we compute the probability that the sample variance is at least 0.2 inches², that is,

P[S² ≥ 0.2]

To get the value of this probability, we convert the variate S² into the chi-square variate, which follows the chi-square distribution with n = 25 degrees of freedom, by the transformation

χ²₍ₙ₎ = nS²/σ² = 25S²/0.18
Therefore, multiplying each term by 25 and then dividing each term by 0.18 in the above inequality, we get the required probability as

P[25S²/0.18 ≥ 25 × 0.2/0.18] = P[χ²₍₂₅₎ ≥ 27.78]
To find the value of the above expression, we will use the chi-square table (Table VI) given in the Appendix at the end of this volume, as discussed in Example 3. Here, we have to find the probability beyond the point 27.78 at 25 degrees of freedom. So, we check whether this value lies in the row corresponding to 25 degrees of freedom or not. Since it is not there, we read the probability values (α) from the column headings corresponding to the values just greater and just smaller than 27.78. Thus, from the chi-square table, we get

P[χ²₍₂₅₎ ≥ 16.47] = 0.90 and P[χ²₍₂₅₎ ≥ 34.38] = 0.10.
Therefore, the above probability must lie between 0.10 and 0.90, that is,

0.10 < P[χ²₍₂₅₎ ≥ 27.78] = P[S² ≥ 0.2] < 0.90

Thus, we conclude that there is 10% to 90% chance that the sample
variance is at least 0.2 inches2.
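The table only brackets this probability between 0.10 and 0.90; software gives the exact tail area. A sketch with SciPy (chi2.sf returns the upper-tail probability):

```python
from scipy import stats

df = 25
chi_sq_value = 25 * 0.2 / 0.18          # the point 27.78 used above

# Exact value of P[chi-square(25) >= 27.78] = P[S^2 >= 0.2]
p = stats.chi2.sf(chi_sq_value, df)
print(round(p, 4))

# Consistent with the bounds read from the table
print(0.10 < p < 0.90)                  # True
```

The exact value comes out to roughly 0.32, comfortably inside the tabulated bounds.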
Terminal Questions (TQs)
To check whether the shape of the sampling distribution of the sample proportion follows the normal distribution or not, we check whether the following conditions are satisfied. Therefore,
nP = 200 × 0.08 = 16 > 15 and

n(1 − P) = 200 × (1 − 0.08) = 184 > 15

Since both are greater than 15, therefore, we can assume that the sampling distribution of the sample proportion is approximately normally distributed with mean P and variance P(1 − P)/n.

UNIT 4
SAMPLING DISTRIBUTIONS
ASSOCIATED WITH
NORMAL POPULATIONS-I

Structure

4.1 Introduction
    Expected Learning Outcomes
4.2 Concept of Degrees of Freedom
4.3 Chi-square Distribution
    Probability Density Curve of Chi-square Distribution
    Summary Measures of Chi-square Distribution
4.4 Relation of Chi-square Distribution to Other Distributions
4.5 Properties of Chi-square Distribution
4.6 Applications of Chi-square Distribution
4.7 t-distribution
    Probability Density Curve of t-distribution
    Summary Measures of t-distribution
4.8 Relation of t-distribution to Other Distributions
4.9 Properties of t-distribution
4.10 Applications of t-distribution
4.11 Summary
4.12 Terminal Questions
4.13 Solutions/Answers

Tools You Will Need

The following terms are considered essential background material for this unit. If you doubt your knowledge of any of these terms, you should review the appropriate unit or section before proceeding:

• Sampling distributions for means and variance (Units 2 and 3).

4.1 INTRODUCTION

In Unit 1, we have discussed the fundamentals of sampling distributions and some basic definitions related to them. In Units 2 and 3, we have discussed the sampling distributions of various sample statistics such as the sample mean, sample proportion and sample variance. You have observed that the sampling distribution of the mean follows the normal or t-distribution under different conditions, whereas the sampling distributions of the variance and of the ratio of two variances follow the chi-square and F-distributions, respectively. You have studied the normal/standard normal distribution in detail in the MST-012: Probability and Probability Distributions course. Therefore, we will focus only on other important sampling distributions. In this unit, we will discuss the chi-square and t-sampling distributions.

Unit Writer: Dr. Prabhat Kumar Sangal, School of Sciences, IGNOU, New Delhi
This unit is divided into 13 sections. Section 4.1 is introductory in nature. The
chi-square, t and F distributions are described with the help of the degrees of
freedom, therefore, in Section 4.2, the concept of degrees of freedom is
described in detail. The chi-square distribution with its probability curve,
summary measures, relation to other distributions, properties and applications
are discussed in Sections 4.3 to 4.6. Similarly, the t-distribution with its
probability curve, summary measures, relation to other distributions, properties
and applications are discussed in Sections 4.7 to 4.10. The unit ends by
providing a summary of what we have discussed in this unit in Section 4.11.
The terminal questions and the solution of the SAQs/TQs are given in
Sections 4.12 and 4.13, respectively.
In the next unit, we shall discuss the F-distribution and how to obtain the
tabulated values of the chi-square, t and F-distributions.

Expected Learning Outcomes


After studying this unit, you should be able to:
 describe and calculate the degrees of freedom;
 explain the chi-square distribution with its probability curve, summary
measures, relation to other distributions, properties and applications;
 describe the t-distribution with its probability curve, summary measures,
relation to other distributions, properties and applications; and
 identify the conditions under which the chi-square and t-distributions can
be used.

4.2 CONCEPT OF DEGREES OF FREEDOM


Sometimes statistics such as the sample mean, sample proportion, sample variance, etc. may follow a particular sampling distribution (as you have seen in Units 2 and 3) such as the normal, t, chi-square, F distribution, etc. In Units 2 and 3, you have also studied that sampling distributions such as the t, chi-square and F-distributions are described by degrees of freedom. The concept of degrees of freedom is relatively unimportant for the normal distribution because its pdf, f(x) = (1/(σ√(2π))) e^(−(x − µ)²/2σ²) for a random variable X with mean µ and variance σ², does not depend on the degrees of freedom, that is, the number of
independent sample observations in a sample (sample size). However, there
are certain sampling distributions in which the shape of the curve changes with
the degrees of freedom. For such distributions, the number of degrees of
freedom is used as a parameter and if we make a mistake in determining it
from the data, then the wrong probability value will be obtained from the table
and the results will be wrongly interpreted. The chi-square, t and F
distributions are very important sampling distributions which are used in
elementary work as well as modern statistical analysis and are described with
the help of degrees of freedom. Therefore, before describing these
distributions, first, we will discuss the concept of “degrees of freedom”.

In a general way, we can define the degrees of freedom as the number of observations whose values are free to vary.

First of all, we try to understand the degrees of freedom by considering the


freedom of motion possessed by certain familiar objects. For example, a drop
of water on a surface moves freely on a two-dimensional surface, therefore, it
has two degrees of freedom. However, a train can move only in one dimension (forward or backward), therefore, it has only one degree of
freedom. Similarly, a bird can move freely in three-dimensional space,
therefore, it has three degrees of freedom.
After understanding the concept of the degrees of freedom for familiar objects,
we now consider certain examples of simple arithmetic. Suppose you are
asked to choose a pair of numbers (X, Y) at random, then you have complete
freedom of choice to each of the two numbers, therefore, you have two
degrees of freedom. Now, if you are asked to choose a pair of numbers whose
sum is 8. It is clear that you can only select a single number at random such
as 5,6, 4, etc. but the second number is fixed as soon as the first is chosen so
that the sum will be 8. If you choose the first number 5 then the second
number, you must choose 3 and so on. However, there are two variables in
the situation, but you are choosing only one independently. Therefore, the
number of degrees of freedom is reduced from two to one by the imposition of
the condition X+Y = 8. Similarly, suppose you are asked to choose a pair of
numbers such that the sum of their squares is 40. Once more, it is obvious
that only one number may be selected at random and the second is decided
as soon as the first is chosen. Therefore, the number of degrees of freedom is
reduced from two to one by the imposition of the condition X2 + Y2 = 40.
Suppose you are asked again to choose a pair of numbers (X, Y) at random
under the simultaneously two conditions X+Y = 8 and X2+Y2 = 40. If we solve
these equations simultaneously then we have only two options as X = 6, Y =
2 or X = 2, Y = 6. Therefore, we cannot choose freely any variable out of both
and in another way, we can say that there is no freedom to choose the
numbers. Thus, the number of degrees of freedom is reduced from two to zero
when two conditions are imposed on the numbers. Therefore, we can define
degrees of freedom as:
The degrees of freedom is the total number of observations minus the
number of independent constraints or restrictions imposed on the
observations.
Let us see what the interpretation of degrees of freedom in statistical inference is. In Unit 1, you studied the formula for the sample variance as

S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

Here, it is noted that the sum of squares of deviations taken from the sample
mean is divided by (n – 1) instead of sample size n. The rationale relates to
the deviations' degrees of freedom.

Firstly, it should be noted that the deviations are calculated from the sample
mean rather than the population mean. It is so because the population mean is
generally not known. When we do not know the population mean then we can
take the deviation from any number as we choose. But the best number to
choose is the sample mean because it will minimise the sum of squares of the
deviations. Therefore, we have imposed one restriction, so we lose one degree
of freedom. Also, when we take the deviation from the sample mean then it will
be smaller than the deviation from the population mean. Therefore, we divide
the sum of squared deviations by its degrees of freedom n –1 instead of n to
compensate for the downward bias. Here, we can also say that we estimate
one parameter (population mean) by sample data (sample mean) so we lose
one degree of freedom. Hence, in statistical inference, we can also define the
degrees of freedom as
The degrees of freedom is the total number of observations in the
sample minus the number of estimated parameters.
A more complicated instance of degrees of freedom arises when employing an
analysis of variance (ANOVA) approach as covered in the MST-013: Survey
Sampling and Design of Experiment-I course. We now try to understand the
concept of degrees of freedom in the context of analysis of variance as used in
MST-013.
In one-way ANOVA, we consider three quantities:
• the total sum of squares (TSS),
• the sum of squares due to treatment (SST), and
• the sum of squares due to error (SSE).
We discuss the degrees of freedom for each, one by one. The formula for computing TSS is

TSS = Σᵢ₌₁ᵏ Σⱼ₌₁ⁿⁱ (Xᵢⱼ − X̄)²

It is the sum of the squares of the deviations of all n data points (X₁₁, X₁₂, …) from the grand mean. Here, we also impose one restriction, that is, we take the deviation from the grand mean instead of any number, or, in other words, we use an estimate of the population mean, so we lose one degree of freedom. Therefore, the degrees of freedom for TSS is n − 1 instead of n.
We now come to the second quantity, that is, the sum of squares due to treatment (SST). The formula for computing SST is

SST = Σᵢ₌₁ᵏ (X̄ᵢ − X̄)²

It is the sum of squares of the deviations of all sample means from the grand mean. Here, we can consider the k sample means as k independent data points and the grand mean as the mean of all these sample means, so again there is one restriction of taking deviations from the mean of the sample means instead of the population mean, which again reduces the degrees of freedom by one. Therefore, the degrees of freedom for SST is k − 1 instead of k.
We now consider the sum of squares due to error (SSE). The formula for computing SSE is

SSE = Σᵢ₌₁ᵏ Σⱼ₌₁ⁿⁱ (Xᵢⱼ − X̄ᵢ)²

It is the sum of squares of the deviations of all data points (n₁ + n₂ + … + nₖ = n) from the k sample means. Since each group has

one restriction, that the deviation is taken from its own sample mean, each group has nᵢ − 1 degrees of freedom. There are k groups, so the total degrees of freedom for SSE is (n₁ − 1) + (n₂ − 1) + … + (nₖ − 1) = n − k.
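This df bookkeeping can be verified on a small made-up data set (the numbers below are invented for illustration): TSS uses the grand mean (df = n − 1) and SSE uses each group's own mean (df = n − k).

```python
import numpy as np

# Hypothetical data: k = 3 groups with 3, 2 and 4 observations (n = 9)
groups = [np.array([2.0, 4.0, 6.0]),
          np.array([1.0, 3.0]),
          np.array([5.0, 7.0, 9.0, 3.0])]

n = sum(len(g) for g in groups)
k = len(groups)
grand_mean = np.concatenate(groups).mean()

# TSS: deviations of every observation from the grand mean
tss = sum(((g - grand_mean) ** 2).sum() for g in groups)

# SSE: deviations of every observation from its own group mean
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)

print("df(TSS) = n - 1 =", n - 1)   # 8
print("df(SSE) = n - k =", n - k)   # 6
print("TSS >= SSE:", tss >= sse)    # True: the group means minimise within-group SS
```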
Let us discuss the degrees of freedom with the help of an example.
Example 1: Let us suppose you received a summary statement of your credit
card from your bank as follows:
Transaction Date   Description                          Amount
05/11/2023         INSTL 1/12 iPHONE X POWER           5800.00
12/11/2023         SM NORTH DEPT STORE                 8000.00
15/11/2023         SM NORTH TRAVEL CLUB QUEZON CI      3200.00
20/11/2023         SM NORTH HYPER MARKET QUEZON CIT    6500.00
25/11/2023         INSTL 2/6 ELECTRO WORLD             5000.00

Suppose you want to calculate the sum of squares of deviations (SSD) of the
amount paid about a point so that it will be minimum.
(i) What number should you choose to minimise the SSD?
(ii) Calculate the SSD with the chosen number.
(iii) What is the degrees of freedom for the calculated SSD? Also, calculate
the sample standard deviation (SD).
Solution: As we know, the sum of squares of deviations (SSD) is minimum when we take deviations from the mean; therefore, the mean of the statement amounts minimises the SSD. We now calculate the SSD, that is, Σᵢ₌₁ⁿ (Xᵢ − X̄)², as follows:
Calculation for Σᵢ₌₁ⁿ (Xᵢ − X̄)²

      X     Deviation from Mean (X − X̄)     Deviation Square (X − X̄)²
   5800                100                           10000
   8000               2300                         5290000
   3200              −2500                         6250000
   6500                800                          640000
   5000               −700                          490000
 ΣXᵢ = 28500                          Σ(Xᵢ − X̄)² = 12680000

For calculating Σᵢ₌₁ⁿ (Xᵢ − X̄)², first of all, we have to calculate the mean X̄, therefore,

X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ = 28500/5 = 5700

Therefore, we can calculate the SSD of the data as

Σᵢ₌₁ⁿ (Xᵢ − X̄)² = 12680000

Since we are using the sample mean instead of any number, the degrees of freedom for the SSD is 5 − 1 = 4.
We can calculate the sample SD as

Sample SD = √(Σᵢ₌₁ⁿ (Xᵢ − X̄)² / df) = √(12680000/4) = 1780.449
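Example 1 can be reproduced in a few lines of Python (the amounts are the five values from the worked table above):

```python
import numpy as np

amounts = np.array([5800.0, 8000.0, 3200.0, 6500.0, 5000.0])

mean = amounts.mean()                  # the minimiser of the SSD
ssd = ((amounts - mean) ** 2).sum()
df = len(amounts) - 1                  # one df lost to the sample mean
sample_sd = np.sqrt(ssd / df)

print(mean)                    # 5700.0
print(ssd)                     # 12680000.0
print(df)                      # 4
print(round(sample_sd, 3))     # 1780.449
```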

I think you have understood the concept of the degrees of freedom and how to
calculate it. Therefore, try the following Self Assessment Question to assess
your understanding of the concept of degrees of freedom.

SAQ 1
A professor of statistics selected some students of the MSCAST programme
from three study centres (SCs) of IGNOU and noted their marks in the MST-
016 paper. The obtained marks (out of 100) are given as follows:

SC I SC II SC III
70 58 69

94 75 73

67 51 62

82 69 68

87 52

92

(i) If you can take the deviations of these data from three numbers as you
select, and you want to minimize the sum of the squared deviations
(SSD), what numbers would you select?
(ii) What is the minimised SSD?
(iii) How many degrees of freedom are connected to this SSD?
(iv) Compute the mean squared deviation (MSD) by dividing the SSD by its
degrees of freedom.

After understanding the concept of degrees of freedom, let us take up the chi-square distribution in the next section.

4.3 CHI-SQUARE DISTRIBUTION


A chi-square distribution is a family of continuous probability distributions. It is widely used in hypothesis testing, such as the goodness of fit test and the test of independence of attributes. However, very few real-world situations follow a chi-square distribution. Therefore, it is not as useful for describing real-world data as other more widely known distributions such as the normal, exponential, Poisson, etc. The chi-square is denoted by the symbol χ², the square of the Greek letter chi, pronounced as “ki”.

The main contribution to the derivation of the chi-square distribution was made by three statisticians. First, it was discovered by the German statistician Friedrich Robert Helmert (1843-1917), who is also considered the founder of the mathematical and physical theories of modern geodesy, in 1875. He described the chi-square distribution as the distribution of the sample variance for a normal population. Therefore, the chi-square distribution is also known as the "Helmert distribution". Second, it was

independently rediscovered by the English mathematician Karl Pearson in the context of goodness of fit in 1900. Afterwards, the idea of a family of chi-square distributions was given by Ronald A. Fisher in the 1920s.
In Unit 3, we studied the sampling distribution of the sample variance S². You have also studied that when we multiply the sample variance S² by (n − 1) and then divide the product by σ², the quantity (n − 1)S²/σ² follows the chi-square distribution with (n − 1) degrees of freedom, that is,

χ² = (n − 1)S²/σ² ~ χ²₍ₙ₋₁₎

It follows the chi-square distribution with (n − 1) degrees of freedom instead of n because it is based on S², which has (n − 1) degrees of freedom.

(Karl Pearson (1857-1936) is the father of modern statistics. He founded the first statistics department in the world at University College London.)

Let us take an example to understand how to calculate the chi-square statistic and its degrees of freedom.
Example 2: A battery company has developed a new laptop battery and
believes that the battery lasts, on average, 12 hours on a single charge with a
standard deviation of 3 hours. The manufacturing division of the company
performs a quality control test on it. They randomly select 10 batteries, and it is noted that the mean and standard deviation of the selected batteries are 11 hours and 4 hours, respectively. What would be the chi-square statistic
represented by this test and the degrees of freedom of the statistic?
Solution: It is given that

The mean of the laptop battery/population (µ) = 12 hours

The standard deviation of the population (σ) = 3 hours

The number of sample observations (n) = 10

The mean of the sample (X̄) = 11 hours

The standard deviation of the sample (S) = 4 hours

We can compute the chi-square statistic as follows:

χ² = (n − 1)S²/σ² = (9 × 4²)/3² = 16

Since the chi-square statistic is based on S², which has (10 − 1) = 9 degrees of freedom, it is distributed with (n − 1) = 9 degrees of freedom.
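The arithmetic of Example 2 is simple enough to check directly:

```python
# Values from Example 2 (battery quality-control test)
n = 10          # number of batteries sampled
s = 4.0         # sample standard deviation (hours)
sigma = 3.0     # claimed population standard deviation (hours)

chi_sq = (n - 1) * s**2 / sigma**2
df = n - 1

print(chi_sq, "with", df, "degrees of freedom")   # 16.0 with 9 degrees of freedom
```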

After understanding how to calculate the chi-square statistic, we now define the chi-square distribution as:

If X is a continuous random variable, then X follows a chi-square distribution with n degrees of freedom if and only if it has the following probability density function (pdf):

f(x) = (1 / (2^(n/2) Γ(n/2))) e^(−x/2) x^((n/2) − 1); 0 < x < ∞

Symbolically, we denote that X follows a chi-square distribution with n degrees of freedom as X ~ χ²₍ₙ₎.

Since the chi-square distribution is a sampling distribution, to distinguish it from well-known distributions such as the normal, exponential, binomial, etc., we may use the symbol χ² instead of X and define it as follows:
If a random sample X₁, X₂, …, Xₙ of size n is drawn from a normal population having mean µ and variance σ², then the probability density function of the chi-square distribution with n df is given by

f(χ²) = (1 / (2^(n/2) Γ(n/2))) e^(−χ²/2) (χ²)^((n/2) − 1); 0 < χ² < ∞

where Γ(n/2) denotes the gamma function. The chi-square distribution has only one parameter: a positive integer n that specifies the number of degrees of
freedom (It is also known as the shape parameter of the distribution because
the shape of the distribution depends on n as shown in Fig. 4.1). It means that
a chi-square distribution is determined by its degrees of freedom. There is a
different chi-square distribution for every value of df n. Therefore, it is a family
of continuous probability distributions.
Let us take an example to understand how to find degrees of freedom when
the pdf of a chi-square distribution is given.
Example 3: The probability density function (pdf) of a chi-square distribution is given by

f(χ²) = (1/96) e^(−χ²/2) (χ²)³; 0 < χ² < ∞
What are the degrees of freedom of this chi-square distribution?
Solution: We know that the pdf of a chi-square distribution with n df is given by

f(χ²) = (1 / (2^(n/2) Γ(n/2))) e^(−χ²/2) (χ²)^((n/2) − 1); 0 < χ² < ∞

We now compare the given pdf

f(χ²) = (1/96) e^(−χ²/2) (χ²)³; 0 < χ² < ∞

with the standard form of the pdf of the chi-square distribution with n degrees of freedom; then we observe that

n/2 − 1 = 3

Therefore,

n/2 = 4 ⇒ n = 8

After understanding the pdf of the chi-square distribution, we now study the shape of the probability curve of the chi-square distribution and the effect of the degrees of freedom on its shape.

4.3.1 Probability Density Curve of Chi-square Distribution
The shape of a chi-square distribution depends on the parameter n, that is,
degrees of freedom. Therefore, the exact shape of the chi-square distribution
varies according to its degrees of freedom. The chi-square distribution is
asymmetrical and positively skewed. The shape of the chi-square distribution
for n = 1 and n = 2 resembles an inverted letter J: the curve starts out high and
then subsequently declines as shown in Fig. 4.1(a). However, for n > 2, the
shape of the chi-square distribution is as shown in Fig 4.1(b) which first
increases and attains the maximum value and after that starts to decrease. It
has a longer tail to the right than a standard normal curve, so it is
asymmetrical and positively skewed (as shown in Fig 4.1(b)). We plot the
probability curve of the chi-square distribution at different degrees of freedom
as n = 3, 5, 10 and 20 as shown in Fig. 4.1(b).
Fig. 4.1: Chi-square probability density curves for (a) n = 1 and 2 (b) n = 3, 5, 10 and 20.
From Fig. 4.1(b), you can see that as we increase the degrees of freedom, the curve tends to a normal curve.
4.3.2 Summary Measures of Chi-square Distribution
In the previous sub-section, the probability curve of the chi-square distribution
is discussed with some of its properties. Now, in this sub-section, we will
discuss some summary measures of the chi-square distribution.
The mean of the chi-square distribution with n degrees of freedom is its
degrees of freedom, that is, n. Hence,
Mean = n
Since the chi-square distribution is asymmetrical and right-skewed, therefore,
the mean is greater than the median and mode. The mode of the chi-square
distribution is given as
Mode = n – 2 when n > 2.
The variance of the chi-square distribution is 2n, that is,
Variance = 2n
Since the mean and variance of the chi-square distribution are n and 2n,
therefore, we can say that the variance of the distribution is double of its
mean.
Block 1 Sampling Distributions
The MSCAST programme is applied in nature; therefore, we do not give the proofs of the mean and variance of the chi-square distribution. Interested readers can derive these summary measures as discussed in MST-012.
Let us take some simple examples based on pdf and summary measures.
Example 4: For the probability density function (pdf) of the chi-square
distribution given in Example 3, find the mean and variance of this distribution.
Solution: In Example 3, we obtained the degrees of freedom of the given chi-
square distribution as n = 8. We know that the mean and variance of the chi-
square distribution with n degrees of freedom are
Mean = n and Variance = 2n
Therefore, for n = 8, we have
Mean = 8 and Variance = 16
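These values can be cross-checked numerically. The sketch below (assuming NumPy/SciPy are available) confirms that the normalising constant 2^{n/2} Γ(n/2) of the chi-square pdf equals 96 for n = 8, matching the pdf of Example 3, and that the chi-square distribution with 8 df has mean 8 and variance 16:

```python
import math

from scipy.stats import chi2

# Normalising constant of the chi-square pdf with n df: 2**(n/2) * Gamma(n/2).
n = 8
constant = 2 ** (n / 2) * math.gamma(n / 2)   # 2**4 * 3! = 16 * 6 = 96

# Mean and variance of the chi-square distribution with n = 8 df.
mean, var = chi2.mean(df=n), chi2.var(df=n)
print(constant, mean, var)   # 96.0 8.0 16.0
```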
Try the following Self Assessment Question to make sure that you have
understood the chi-square distribution.
SAQ 2
The pdf of a chi-square distribution is given as follows:

f(χ²) = (1/2) e^{−χ²/2} ;  0 < χ² < ∞
Obtain the degrees of freedom of the chi-square distribution. Also, find its
mean and variance.

4.4 RELATION OF CHI-SQUARE DISTRIBUTION TO OTHER DISTRIBUTIONS
The chi-square distribution is connected to a number of other well-known distributions. Some of them are discussed as follows:

• The chi-square distribution with n degrees of freedom is a special case of the gamma distribution: the gamma distribution with shape parameter n/2 and scale parameter 1/2 is the chi-square distribution with n degrees of freedom.

• The chi-square distribution with 2 degrees of freedom is the exponential distribution with scale parameter 2. We can show this as follows. The pdf of the chi-square distribution is given as

f(x) = [1/(2^{n/2} Γ(n/2))] e^{−x/2} x^{(n/2)−1} ;  0 < x < ∞

Putting n = 2, we get

f(x) = (1/2) e^{−x/2} ;  0 < x < ∞

which is the pdf of the exponential distribution with parameter 2.
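This equivalence can be checked numerically. A quick sketch (assuming SciPy is installed) compares the chi-square pdf with 2 df against the exponential pdf with scale parameter 2 on a grid of points:

```python
import numpy as np
from scipy.stats import chi2, expon

# Grid over the positive axis.
x = np.linspace(0.01, 20, 200)

# chi-square with 2 df vs exponential with scale 2, i.e. (1/2) * exp(-x/2).
pdf_chi2 = chi2.pdf(x, df=2)
pdf_exp = expon.pdf(x, scale=2)

print(np.allclose(pdf_chi2, pdf_exp))   # True
```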
• If Z has the standard normal distribution, then X = Z² has a chi-square distribution with 1 degree of freedom; that is, if X ~ N(μ, σ²), then

Z = (X − μ)/σ ~ N(0, 1) and Z² = [(X − μ)/σ]² ~ χ²(1)

In general, the sum of squares of n standard normal variates follows a chi-square distribution with n df; that is, if Xi ~ N(μi, σi²), then Zi = (Xi − μi)/σi ~ N(0, 1) and therefore

Σᵢ₌₁ⁿ Zi² = Σᵢ₌₁ⁿ [(Xi − μi)/σi]² ~ χ²(n)
Now, you can try the following Self Assessment Question.
SAQ 3
A stock market expert collected the share price of a Pharma company in 11
days which are given as follows:
70, 76, 85, 96, 102, 105, 100, 95, 88, 75, 72
It is assumed that the share price is normally distributed. The stock market
expert first standardises the price and then squares the prices. What will be
the distribution of the share price after the transformation? Also, find the mean
and variance of that distribution.

4.5 PROPERTIES OF CHI-SQUARE DISTRIBUTION
After studying the chi-square distribution and its probability density function in
the previous sections, we now look at the important properties of the chi-
square distribution which are listed as follows:
1. Since the chi-square distribution is the sum of squares of standard normal random variables, it cannot be negative; therefore, it begins at zero and continues to infinity.
2. The chi-square distribution has only one parameter n, that is, the degrees
of freedom and the shape of the distribution depends on it. Therefore,
there is a different chi-square distribution for each value of df.
3. The chi-square distribution is positively skewed and asymmetrical.
4. The chi-square distribution is a uni-modal distribution, that is, it has a single mode (the peak of the graph occurs at the mode) at n − 2, which exists for n > 2.
5. The mean of the chi-square distribution is its degrees of freedom, that is,
mean = n.
6. The variance of the chi-square distribution is twice its degrees of freedom,
that is, variance = 2n and also double of its mean.
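Properties 5 and 6, together with the Z² relation of Section 4.4, can be illustrated by simulation. The sketch below (NumPy assumed) sums the squares of n standard normal draws many times and checks that the sample mean and variance are close to n and 2n:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                   # degrees of freedom

# 200,000 replicates of the sum of squares of n standard normal variates.
z = rng.standard_normal((200_000, n))
chi_sq = (z ** 2).sum(axis=1)           # each value ~ chi-square with n df

print(chi_sq.mean())                    # close to n = 5
print(chi_sq.var())                     # close to 2n = 10
```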
After knowing the properties of the chi-square distribution, you may have the
curiosity to know the real-life situations where this distribution can be applied.
Let us discuss the applications of the distribution.

4.6 APPLICATIONS OF CHI-SQUARE DISTRIBUTION
The chi-square distribution has a number of applications. Some of them are
listed below:
1. The chi-square distribution is used to test whether the assumed value of a
population variance of the normal population is true or not. It is also used
to construct the confidence interval for population variance.
2. It is used to judge whether there is a discrepancy between theoretical and
experimental observations, that is, to test the goodness of fit. With the
help of the chi-square distribution, we test whether the random variable
under study follows a specified distribution such as binomial, Poisson,
normal or any other distribution when the data are in categorical form.
3. It is used to test whether two attributes are independent or not, for example, whether the IQ level of students is independent of the economic condition of their parents, or whether the hair colour and eye colour of a person are independent.
We will discuss the first application (listed above) of the chi-square distribution in detail in Units 14 and 18 of this course; the remaining two applications will be discussed in detail in the course MST-021.
Now, try to write down the applications of chi-square distribution by answering
the following Self Assessment Question.
SAQ 4
List the applications of the chi-square distribution.
After understanding the chi-square distribution, we now come to the next
exact/standard sampling distribution, that is, the t-distribution.

4.7 t-DISTRIBUTION
The t-distribution is a continuous distribution which is very similar to the standard normal distribution. It was introduced by William Sealy Gossett (1876-1937) in 1908. Gossett was the Chief Brewer at the Guinness Brewery in Dublin, Ireland, and was dedicated to applying the scientific method to beer production. He needed a procedure for statistically analysing small batches of barley, and in 1908 he discovered the t-distribution for this purpose. The Guinness Brewery did not allow its workers to publish findings under their own names so that competitors would not learn about their methods. Therefore, Gossett published his findings under the pen name "Student". As a result, the distribution also became known as Student's t-distribution.
As we discussed in Unit 2, when the population is normal and the population
standard deviation (σ) is not known, then we may use the sample standard
deviation (S) in place of the population standard deviation (σ). But, due to the discrepancy between the sample SD (S) and the population SD (σ), especially when the sample standard deviation (S) is calculated from a very small sample, the distribution of the statistic (X̄ − μ)/(S/√n) follows the t-distribution instead of the standard normal. Since it is associated with the sample standard deviation, which has (n − 1) degrees of freedom, the statistic

t = (X̄ − μ)/(S/√n)

follows the t-distribution with (n − 1) degrees of freedom.
Let us take an example to understand how to calculate the t-statistic and its degrees of freedom.

Example 5: A mobile company claims that, on average, the people of India change their mobile phones after 2 years. To test the claim of the company, a student of the MSCAST programme collects such information about 16 randomly chosen mobile users and observes that the people change their mobile phones after 2.2 years with a standard deviation of 0.5 years. What would be the t-statistic represented by this test and what are its degrees of freedom?

Solution: It is given that

The mean of the population (μ) = 2 years
The sample mean (X̄) = 2.2 years
The standard deviation of the sample (S) = 0.5 years
The number of sample observations (n) = 16

We calculate the t-statistic as follows:

t = (X̄ − μ)/(S/√n) = (2.2 − 2)/(0.5/√16) = (0.2 × 4)/0.5 = 1.6

Since the t-statistic is based on S², which has 16 − 1 = 15 degrees of freedom, it is distributed with 15 degrees of freedom.
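The same computation can be written in a few lines of code. A minimal sketch (SciPy assumed), which also reports the two-sided p-value of the observed statistic under the t-distribution with 15 df:

```python
from scipy import stats

x_bar, mu, s, n = 2.2, 2.0, 0.5, 16     # sample mean, claimed mean, sample SD, size

t_stat = (x_bar - mu) / (s / n ** 0.5)  # (2.2 - 2) / (0.5 / 4) = 1.6
df = n - 1                              # 15 degrees of freedom

# Two-sided p-value from the t-distribution with n - 1 df.
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(round(t_stat, 1), df, round(p_value, 3))
```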
After understanding the t-statistic, we now learn more about the t-distribution
such as its pdf and shape.
The t-distribution is a family of continuous probability distributions. Since it is
very similar to the standard normal distribution, therefore, the Student's
t-distribution plays a role in a number of widely used statistical analyses. We
will discuss its application in Section 4.10.
If X is a continuous random variable, then X follows a t-distribution with n
degrees of freedom if and only if it has the following probability density
function (pdf):
f(x) = 1 / [√n B(1/2, n/2) (1 + x²/n)^{(n+1)/2}] ;  −∞ < x < ∞
Symbolically, we denote that X follows a t-distribution with n degrees of
freedom as X ~ t (n) .
Since the t-distribution is a sampling distribution, to distinguish it from the well-known distributions such as normal, exponential, binomial, etc., we use the symbol t instead of X and define it as follows:
If a random sample X1, X2, …, Xn of size n is drawn from a normal population
having mean μ and variance σ2 then the probability density function of the
t-distribution with n df is given by
f(t) = 1 / [√n B(1/2, n/2) (1 + t²/n)^{(n+1)/2}] ;  −∞ < t < ∞

where B(1/2, n/2) is the beta function,

B(1/2, n/2) = Γ(1/2) Γ(n/2) / Γ((n + 1)/2)    [since B(a, b) = Γ(a) Γ(b) / Γ(a + b)]

and Γ(·) denotes the gamma function. As the chi-square distribution has
only one parameter (degrees of freedom), similarly, the t-distribution also has
only one parameter: a positive integer n that specifies the number of degrees
of freedom (It is also known as the shape parameter of the distribution
because the shape of the distribution depends on n as shown in Fig. 4.2).
Therefore, a t-distribution is determined by its degrees of freedom. For each
value of n, there is a different t-distribution. Therefore, it is a family of
continuous probability distributions.
Let us take an example.
Example 6: The probability density function (pdf) of a t-distribution is given
by
f(t) = 3 / [8 (1 + t²/4)^{5/2}] ;  −∞ < t < ∞
What are the degrees of freedom of this t-distribution?
Solution: We know that the pdf of a t-distribution with n df is given by

f(t) = 1 / [√n B(1/2, n/2) (1 + t²/n)^{(n+1)/2}] ;  −∞ < t < ∞

Comparing the given pdf with this standard form, we observe that

(n + 1)/2 = 5/2

Therefore,

n = 4
After understanding the pdf of the t-distribution, let us look at its shape, see how its probability curve differs from that of the standard normal distribution, and observe the effect of the degrees of freedom on its shape.
4.7.1 Probability Density Curve of t-distribution
The shape of the t-distribution depends on the degrees of freedom. In Fig. 4.2,
we have plotted the probability density curves of the t-distribution for n = 1, 5,
10, and 15 degrees of freedom along with the probability density curve of
standard normal distribution. By looking at the figure, you can observe that the
probability density curve of the t-distribution is bell-shaped and symmetric
about t = 0 line as the standard normal curve but it has a lower peak and
heavier tails (more observations near the tail, which means that it gives a
higher probability to the tails than the standard normal distribution) than the
standard normal curve. As the degrees of freedom increase, the curve pulls in
tighter around zero, the tails become thinner and the peak becomes taller. In
other words, we can also say that as the degrees of freedom increase, the t-
distribution will come closer to the standard normal distribution. Therefore,
the standard normal distribution can be used in place of the t-distribution with
large sample sizes.
Fig. 4.2: Probability curves for t-distribution for n = 1, 5, 10 and 15 along with the standard normal curve.
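The convergence of the t-distribution to the standard normal can also be seen numerically. A short sketch (SciPy assumed) measures the largest gap between the two pdfs as the degrees of freedom grow:

```python
import numpy as np
from scipy.stats import norm, t

x = np.linspace(-4, 4, 401)
gaps = []
for df in (1, 5, 10, 30, 100):
    # Largest pointwise difference between the t pdf and the standard normal pdf.
    gap = float(np.max(np.abs(t.pdf(x, df) - norm.pdf(x))))
    gaps.append(gap)
    print(df, round(gap, 4))   # the gap shrinks steadily as df increases
```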
After understanding the shape of the t-distribution, we now come to its summary measures.
4.7.2 Summary Measures of t-distribution
Since the probability density function of the t-distribution is symmetrical about
the t = 0 line, therefore, mean and all moments of odd order about the origin
are zero. Also, for symmetrical distribution the mean = median = mode (as in
the case of normal distribution), therefore, for the t-distribution
Mean = Median = Mode = 0. 123
The mean of the Student's t-distribution is 0 for degrees of freedom n greater than 1, and for n = 1 it is undefined, that is,

Mean = 0 for n > 1, and undefined for n = 1

Similarly, the variance of the t-distribution depends on the degrees of freedom n: it is n/(n − 2) for n greater than 2, and for n less than or equal to 2 the variance is undefined, that is,

Variance = n/(n − 2) for n > 2, and undefined for n ≤ 2
But keeping the applied nature of the programme in view, we are not focusing on the proof of each measure. Interested readers can derive these summary measures as discussed in MST-012.
After describing the t-distribution and its probability density curve and
summary measures, let us take an example.
Example 7: Consider the probability density function (pdf) of a t-distribution
given in Example 6 and find its mean and variance.
Solution: In Example 6, we obtained the degrees of freedom of the given t-distribution as n = 4. We know that the mean and variance of the t-distribution with n degrees of freedom are

Mean = 0 and Variance = n/(n − 2)

Therefore, for n = 4, we have

Mean = 0 and Variance = 2.
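As a numerical cross-check (SciPy assumed), the t-distribution with 4 df indeed has mean 0 and variance 4/(4 − 2) = 2:

```python
from scipy.stats import t

# First two moments of the t-distribution with 4 degrees of freedom.
mean, var = t.stats(df=4, moments="mv")
print(float(mean), float(var))   # 0.0 2.0
```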
Now, try the following Self Assessment Questions for your practice.
SAQ 5
Obtain the degrees of freedom of a t-distribution whose pdf is given below:

f(t) = 1 / [√5 B(1/2, 5/2) (1 + t²/5)³] ;  −∞ < t < ∞
Also, find the mean and variance of the above distribution.
As you have seen, the probability density curve of the t-distribution is closely related to the standard normal curve. Let us now see the relationship of the t-distribution with other well-known distributions.
4.8 RELATION OF t-DISTRIBUTION TO OTHER DISTRIBUTIONS
The t-distribution is connected to a number of other well-known distributions.
Some of the main relationships are given as follows:
• As the number of degrees of freedom of the t-distribution increases, the t-distribution approaches the standard normal distribution with mean 0 and variance 1.
• The t-distribution with degrees of freedom n equal to 1 is the standard
Cauchy distribution. The standard Cauchy distribution has an undefined
mean and variance.
• If a variable Z follows a standard normal distribution and χ² independently follows a chi-square distribution with n degrees of freedom, then t = Z/√(χ²/n) has a Student's t-distribution with n degrees of freedom.
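This last relationship can be verified by simulation. The sketch below (NumPy/SciPy assumed) builds t = Z/√(χ²/n) from independent draws and compares a few empirical quantiles with the theoretical t quantiles:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
n, size = 7, 200_000                    # degrees of freedom, number of replicates

z = rng.standard_normal(size)           # Z ~ N(0, 1)
chi_sq = rng.chisquare(n, size)         # chi-square with n df, independent of Z
t_sample = z / np.sqrt(chi_sq / n)      # should follow the t-distribution with n df

# Compare empirical quantiles with theoretical t quantiles.
for q in (0.05, 0.50, 0.95):
    print(q, round(float(np.quantile(t_sample, q)), 3), round(float(t.ppf(q, n)), 3))
```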
Now, you can try a Self Assessment Question before going to the next section.
SAQ 6
Show that the t-distribution with 1 degree of freedom is the standard Cauchy distribution.

4.9 PROPERTIES OF t-DISTRIBUTION

In the previous sections, we have discussed the t-distribution briefly. We shall now discuss some of its important properties. The t-distribution has the following properties:
1. The t-distribution ranges from −∞ to +∞ (−∞ < t < +∞).

2. The t-distribution has only one parameter n, that is, the degrees of freedom, and the shape of the distribution depends on it. Therefore, there is a different t-distribution for each value of df.
3. The mean of the t-distribution is zero, that is, mean = 0.

4. The variance of the t-distribution depends on the degrees of freedom and is n/(n − 2), which exists for n > 2.

5. The t-distribution is a uni-modal distribution, that is, it has a single mode


(the peak of the graph occurs at the mode) at 0.
6. The probability density curve of the t-distribution is similar to the standard
normal distribution (bell-shaped) and is symmetric about the t = 0 line but
it has a lower peak and heavier tails than the standard normal curve.
7. For large values of n (i.e., increased sample size), the t-distribution tends to the standard normal distribution.
After discussing the properties of the t-distribution, let us look at the real-life
situations where this distribution is used.

4.10 APPLICATIONS OF t-DISTRIBUTION
The t-distribution has a wide range of applications in statistics; some of these are listed below:
1. If the population is distributed normally, and the standard deviation is
unknown, then the t-distribution is used for making inferences about the
population mean, that is, to test the hypothesis about the population mean
and also in the construction of the confidence interval for the population
mean when variance is unknown.
2. It is used to test/assess the statistical significance of the difference
between two population means of the two dependent or independent
populations whose variances are unknown. Also, it is used for the
construction of the confidence intervals for the difference between two
population means when variances are unknown.

3. The t-distribution is used to test the hypothesis that the population correlation coefficient is zero.

4. It is also used in regression analysis to test hypotheses about the regression coefficients, as well as to construct confidence intervals for them.
We will discuss the first two applications (listed above) of the t-distribution in detail in Unit 12 and Unit 16 of this course; the remaining two applications will be discussed in detail in the course MST-017.
Now, it is time to write down the main applications of the t-distribution by
answering the following Self Assessment Question.

SAQ 7
Write any three applications of the t-distribution.
We now end this unit by giving a summary of what we have covered in it.
4.11 SUMMARY
A brief summary of what we have covered in this unit is given as follows:
• The degree of freedom is the total number of observations minus the
number of independent constraints or restrictions imposed on the
observations (or minus the number of estimated parameters).
• The probability density function of the chi-square distribution with n df is
given by
f(χ²) = [1/(2^{n/2} Γ(n/2))] e^{−χ²/2} (χ²)^{(n/2)−1} ;  0 < χ² < ∞
• The shape of the chi-square distribution for n = 1 and n = 2 is an inverted J: the curve starts out high and then declines steadily. However, for n > 2, the chi-square curve first increases, attains its maximum value and after that starts to decrease.

• The mean and variance of the chi-square distribution with n degrees of


freedom are n and 2n, respectively.
• The probability density curve of the t-distribution is bell-shaped and symmetric about the t = 0 line, like the standard normal curve, but it has a lower peak and heavier tails.
• The probability density function of the t-distribution with n df is given by

f(t) = 1 / [√n B(1/2, n/2) (1 + t²/n)^{(n+1)/2}] ;  −∞ < t < ∞
• The mean and variance of the t-distribution with n degrees of freedom are 0 and n/(n − 2), respectively.

4.12 TERMINAL QUESTIONS
1. Write the pdf of the chi-square distribution for 5 degrees of freedom.
2. Write the pdf of the t-distribution for 3 degrees of freedom.
3. The life of light bulbs manufactured by company A is known to be
normally distributed. The CEO of the company claims that the average
lifetime of the light bulbs is 300 days. A researcher randomly selected 25
bulbs for testing the lifetime and observed the average lifetime of the
sampled bulbs is 290 days with a standard deviation of 50 days.
Calculate the value of the t-statistic.

4.13 SOLUTIONS / ANSWERS
Self Assessment Questions (SAQs)
1. As we know, the sum of squares of deviations (SSD) is minimum when taken about the mean; therefore, we should select the three numbers as the corresponding group means.
Thus, we can calculate the mean marks of the students of each SC as

X̄1 = (1/n) Σ Xi = (70 + 94 + 67 + 82 + 87 + 92)/6 = 82

X̄2 = (58 + 75 + 51 + 69 + 52)/5 = 61

X̄3 = (69 + 73 + 62 + 68)/4 = 68
We now calculate the SSD, that is, Σᵢ₌₁ⁿ (Xi − X̄)². The calculation is shown in the following table:

SC    Mean    Deviation    Deviation²
70 82 –12 144
94 82 12 144
67 82 –15 225
82 82 0 0
87 82 5 25
92 82 10 100
58 61 –3 9
75 61 14 196
51 61 –10 100
69 61 8 64
52 61 –9 81
69 68 1 1
73 68 5 25
62 68 –6 36
68 68 0 0
1069 1069 0 1150
Therefore, we can calculate the SSD as

SSD = 1150

Since three sample means have been estimated from the 15 observations, the degrees of freedom for the SSD are 15 − 3 = 12.
We can calculate the mean squared deviation (MSD) as

MSD = 1150/12 = 95.83
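The arithmetic above can be verified in a few lines (NumPy assumed), grouping the marks by study centre and pooling the squared deviations about each group mean:

```python
import numpy as np

groups = [
    [70, 94, 67, 82, 87, 92],   # SC 1, mean 82
    [58, 75, 51, 69, 52],       # SC 2, mean 61
    [69, 73, 62, 68],           # SC 3, mean 68
]

# Pooled sum of squared deviations about the group means.
ssd = sum(float(((np.array(g) - np.mean(g)) ** 2).sum()) for g in groups)

# 15 observations minus 3 estimated means.
df = sum(len(g) for g in groups) - len(groups)

print(ssd, df, round(ssd / df, 2))   # 1150.0 12 95.83
```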
2. We know that the probability density function of the chi-square distribution with n df is given by

f(χ²) = [1/(2^{n/2} Γ(n/2))] e^{−χ²/2} (χ²)^{(n/2)−1} ;  0 < χ² < ∞

Here, we are given that

f(χ²) = (1/2) e^{−χ²/2} ;  0 < χ² < ∞

We can rearrange the given pdf as

f(χ²) = [1/(2^{2/2} Γ(2/2))] e^{−χ²/2} (χ²)^{(2/2)−1}    [since 2^{2/2} Γ(2/2) = 2 × 1 = 2]

After comparing it with the standard form, we get n = 2. So, the degrees of freedom of the chi-square distribution are 2. Since the mean and variance of the chi-square distribution with n df are n and 2n, the mean and variance of the given distribution are 2 and 4, respectively.

3. As we know the square of the standard normal distribution is a chi-


square distribution with 1 degree of freedom, therefore, the distribution of
the share price after the transformation follows the chi-square
distribution with 1 degree of freedom.
We know that the mean and variance of chi-square distribution with n
degrees of freedom are
Mean = n and Variance = 2n
For n = 1, we have,
Mean = 1 and Variance = 2
4. Refer to Section 4.6.
5. We know that the probability density function of the t-distribution with n df is given by

f(t) = 1 / [√n B(1/2, n/2) (1 + t²/n)^{(n+1)/2}] ;  −∞ < t < ∞
Here, we are given that

f(t) = 1 / [√5 B(1/2, 5/2) (1 + t²/5)³] ;  −∞ < t < ∞

which can be written as

f(t) = 1 / [√5 B(1/2, 5/2) (1 + t²/5)^{(5+1)/2}] ;  −∞ < t < ∞

After comparing the given pdf with the standard form, we get n = 5. So, the degrees of freedom of the given t-distribution are 5. Hence, its mean is 0 and its variance is n/(n − 2) = 5/3.
6. We know that the probability density function of the t-distribution with n df is given by

f(t) = 1 / [√n B(1/2, n/2) (1 + t²/n)^{(n+1)/2}] ;  −∞ < t < ∞

After putting n = 1, we get

f(t) = 1 / [B(1/2, 1/2) (1 + t²)]

Since B(1/2, 1/2) = Γ(1/2) Γ(1/2) / Γ(1) = √π × √π = π, we get

f(t) = 1 / [π (1 + t²)] ;  −∞ < t < ∞

which is the pdf of the standard Cauchy distribution. Hence, the Cauchy distribution is a particular case of the t-distribution for n = 1.
7. Refer to Section 4.10.
Terminal Questions (TQs)
1. We know that the probability density function of the chi-square distribution with n df is given by

f(χ²) = [1/(2^{n/2} Γ(n/2))] e^{−χ²/2} (χ²)^{(n/2)−1} ;  0 < χ² < ∞

We have to find the pdf of the chi-square distribution when n = 5, so by putting n = 5 in the above expression, we get
f(χ²) = [1/(2^{5/2} Γ(5/2))] e^{−χ²/2} (χ²)^{(5/2)−1} ;  0 < χ² < ∞

= [1/(2^{5/2} × (3/2) × (1/2) × Γ(1/2))] e^{−χ²/2} (χ²)^{3/2}    [Γ(5/2) = (3/2)(1/2)Γ(1/2)]

= [1/(3√(2π))] e^{−χ²/2} (χ²)^{3/2} ;  0 < χ² < ∞    [Γ(1/2) = √π]
2. We know that the probability density function of the t-distribution with n df is given by

f(t) = 1 / [√n B(1/2, n/2) (1 + t²/n)^{(n+1)/2}] ;  −∞ < t < ∞

We have to find the pdf of the t-distribution when n = 3, so by putting n = 3 in the above expression, we get

f(t) = 1 / [√3 B(1/2, 3/2) (1 + t²/3)²] ;  −∞ < t < ∞

Since B(1/2, 3/2) = Γ(1/2) Γ(3/2) / Γ(2) = √π × (1/2)√π / 1 = π/2    [Γ(1/2) = √π and Γ(2) = 1]

we get

f(t) = 2 / [√3 π (1 + t²/3)²] ;  −∞ < t < ∞
3. Here, we are given that

μ = 300, n = 25, X̄ = 290 and S = 50

The value of the t-statistic can be calculated by the formula given below:

t = (X̄ − μ)/(S/√n)

Therefore, we have

t = (290 − 300)/(50/√25) = −10/10 = −1
UNIT 5
SAMPLING DISTRIBUTIONS ASSOCIATED WITH NORMAL POPULATIONS-II
Structure
5.1 Introduction
    Expected Learning Outcomes
5.2 F-distribution
    Probability Density Curve of F-distribution
    Summary Measures of F-distribution
5.3 Relation of F-distribution to Other Distributions
5.4 Properties of F-distribution
5.5 Applications of F-distribution
5.6 Tabulated Values of t-distribution
5.7 Tabulated Values of Chi-square Distribution
5.8 Tabulated Values of F-distribution
5.9 Summary
5.10 Terminal Questions
5.11 Solutions/Answers
5.1 INTRODUCTION

[Tools You Will Need: The following terms are considered essential background material for this unit. If you doubt your knowledge of any of these terms, you should review the appropriate unit or section before proceeding: Sampling Distributions for Means and Variance (Units 2 and 3).]

In Unit 4, we discussed the chi-square and t-distributions in detail with their properties and applications. The F-distribution also has numerous real-world applications, especially when discussing variance analysis and hypothesis testing of the variances of two normally distributed populations. For example, in finance, it is used to check whether the variances of stock returns are equal across two or more stocks. In engineering, it is used to test the effectiveness of different manufacturing processes by comparing the variances of the outcomes. Additionally, the F-distribution is used in biostatistics to compare the variances of health outcomes across different treatments or interventions. In this unit, we discuss the F-distribution in detail and explain the method of reading the tabulated values of the t, chi-square and F-distribution tables.
This unit is divided into 11 sections. Section 5.1 is introductory. In Sections 5.2
to 5.5, we discuss the F-distribution with its probability density curve, summary
measures, relation to other distributions, properties and applications. As the
standard normal distribution has a standard normal table (Z-table), in a similar
way, the chi-square, t and F-distributions also have their tables. Therefore, Sections 5.6 to 5.8 are devoted to how to read tabulated values of the t, chi-square and F-distributions, respectively. The unit ends by providing a summary of what we have discussed in Section 5.9. The terminal questions and the solutions of the SAQs/TQs are given in Sections 5.10 and 5.11, respectively.

Unit Writer: Dr. Prabhat Kumar Sangal, School of Sciences, IGNOU, New Delhi
In the next unit, we shall discuss the estimation of the unknown parameters.

Expected Learning Outcomes
After studying this unit, you should be able to:
 explain the F-distribution with its probability curve, summary measures,
relations to other distributions, properties and applications; and
 describe the method of obtaining the tabulated values from the t, chi-square and F-distribution tables.
5.2 F-DISTRIBUTION
The F-distribution is also a continuous probability distribution, like the chi-square or t-distribution. It is a sampling distribution that commonly occurs in statistics when we discuss variances. As the chi-square distribution is not very useful in describing real-world data, similarly, the F-distribution is not much used to describe real-world data. However, it is used for estimation and hypothesis testing related to the variances of two normal populations.

The F-distribution was first introduced by the British statistician Ronald A. Fisher in 1928, so it is sometimes called Fisher's F-distribution. [Sir Ronald Aylmer Fisher (1890-1962) was a British statistician, mathematician, biologist, geneticist and academic; for his work in statistics, he has been known as the father of modern statistical sciences.] Later on, George Waddel Snedecor tabulated the F-distribution and used the letter F in Fisher's honour. [George Waddel Snedecor (1881-1974) was an American mathematician and statistician who contributed to the foundations of analysis of variance, data analysis, experimental design, and statistical methodology.] The distribution is therefore also known as Snedecor's F-distribution or the Fisher-Snedecor distribution. Fisher defined the F-distribution when he was interested in comparing the variances of two normally distributed populations, and he derived it as the ratio of two independent chi-square variates, each divided by its respective degrees of freedom, that is,

F = [χ²(n1−1)/(n1 − 1)] / [χ²(n2−1)/(n2 − 1)]

As we know, the chi-square statistic is χ² = (n − 1)S²/σ²; therefore,

F = {[(n1 − 1)S1²/σ1²]/(n1 − 1)} / {[(n2 − 1)S2²/σ2²]/(n2 − 1)}

F = (S1²/σ1²) / (S2²/σ2²) ~ F(n1−1, n2−1)

If the variances of both populations are equal, i.e. σ1² = σ2², then we can write the F-statistic as follows:

F = S1²/S2² ~ F(n1−1, n2−1)
Unit 5 Sampling Distributions Associated with Normal Populations-II
Thus, the sampling distribution of the ratio of sample variances follows the
F-distribution with (n1 –1, n2 – 1) degrees of freedom. The F-distribution is also
a family of distribution, and it has different shapes for each combination of
these degrees of freedom. Let us do an example to learn how to calculate the
F-statistic.
Example 1: Suppose a student of the MSCAST programme from the Jammu & Kashmir region wants to test whether the variance of the weight of apples produced by two different orchards in Kashmir is the same. He collects a random sample of 25 apples from Orchard I and 15 apples from Orchard II and obtains sample variances of 62 grams² and 45 grams², respectively. Compute the value of the F-statistic and also find the degrees of freedom associated with it.
Solution: Here, we are given that

n1 = 25, n2 = 15, S1² = 62, S2² = 45

We can calculate the value of the F-statistic as

F = S1²/S2² = 62/45 = 1.38

Since the F-statistic is associated with S1² and S2², its degrees of freedom depend on those of S1² and S2². Since S1² has n1 − 1 = 25 − 1 = 24 degrees of freedom and S2² has n2 − 1 = 15 − 1 = 14 degrees of freedom, the F-statistic has (24, 14) degrees of freedom.
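The computation of Example 1 in code form (SciPy assumed), together with the upper-tail probability of the observed ratio under the hypothesis of equal variances:

```python
from scipy.stats import f

s1_sq, s2_sq = 62, 45          # sample variances
n1, n2 = 25, 15                # sample sizes

F = s1_sq / s2_sq              # observed variance ratio
df1, df2 = n1 - 1, n2 - 1      # (24, 14) degrees of freedom

# Probability of a ratio at least this large when the two variances are equal.
p_upper = f.sf(F, df1, df2)
print(round(F, 2), (df1, df2), round(p_upper, 3))
```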
After understanding the F-statistic, let us discuss the probability density function (pdf) of the F-distribution. As you have seen, it has two degrees of freedom, one for the numerator (n1) and the other for the denominator (n2); therefore, the pdf of the F-distribution is more complex than those of the chi-square and t-distributions. We can define the F-distribution as follows:
A continuous random variable X follows an F-distribution with (n1, n2 ) degrees
of freedom if and only if it has the following probability density function:
f(x) = [(n1/n2)^(n1/2) · x^(n1/2 − 1)] / [B(n1/2, n2/2) · (1 + (n1/n2)x)^((n1 + n2)/2)]; 0 < x < ∞
Symbolically, we denote that X follows an F-distribution with (n1, n2) degrees of freedom as X ~ F(n1, n2).

Since the F-distribution is a sampling distribution, to distinguish it from the well-known distributions such as the normal, exponential, binomial, etc., we may use the symbol F instead of X and define it as follows:
Block 1 Sampling Distributions

If a random sample X1, X2, ..., Xn1 of size n1 is taken from a normal population with mean μ1 and variance σ1², and another independent random sample Y1, Y2, ..., Yn2 of size n2 is taken from another normal population with mean μ2 and variance σ2², then the probability density function of the F-distribution with (n1, n2) degrees of freedom is given by
f(F) = [(n1/n2)^(n1/2) · F^(n1/2 − 1)] / [B(n1/2, n2/2) · (1 + (n1/n2)F)^((n1 + n2)/2)]; 0 < F < ∞
The distribution is bounded on the left by zero and has no upper limit, that is,
extending indefinitely to the right, which reflects the fact that variances are
always non-negative. The F-distribution has two parameters that specify the
number of degrees of freedom. It means that an F-distribution is determined
by its degrees of freedom and there is a different F-distribution for each pair of
degrees of freedom. Therefore, it is also a family of continuous probability
distributions.
Let us take an example to understand how to find degrees of freedom when
the pdf of an F-distribution is given.
Example 2: If a random variable X follows an F-distribution whose pdf is given
by
f(x) = 1/(1 + x)²; 0 < x < ∞

then obtain the degrees of freedom of this distribution.
Solution: If random variable X follows the F-distribution with (n1, n2) degrees
of freedom, then the probability density function of X is given as
f(x) = [(n1/n2)^(n1/2) · x^(n1/2 − 1)] / [B(n1/2, n2/2) · (1 + (n1/n2)x)^((n1 + n2)/2)]; 0 < x < ∞
We now try to convert the given pdf in the form of the standard form of the
F-distribution so that we can compare and find the degrees of freedom.
Therefore,
f(x) = [(2/2)^(2/2) · x^(2/2 − 1)] / [B(2/2, 2/2) · (1 + (2/2)x)^((2 + 2)/2)] = 1/(1 + x)²; 0 < x < ∞   [since B(1, 1) = 1]
By comparing the above form with the standard form, we get degrees of
freedom n1 = 2 and n2 = 2.
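As a quick numerical cross-check (a stdlib-only sketch; `f_pdf` is a helper written here from the pdf given above, not a library function), the density with n1 = n2 = 2 collapses to 1/(1 + x)², as found in Example 2:

```python
import math

def f_pdf(x, n1, n2):
    """pdf of the F-distribution with (n1, n2) df, from the formula above."""
    beta = math.gamma(n1 / 2) * math.gamma(n2 / 2) / math.gamma((n1 + n2) / 2)
    return ((n1 / n2) ** (n1 / 2) * x ** (n1 / 2 - 1)
            / (beta * (1 + (n1 / n2) * x) ** ((n1 + n2) / 2)))

# For n1 = n2 = 2, B(1, 1) = 1 and the pdf reduces to 1/(1 + x)^2:
max_err = max(abs(f_pdf(x, 2, 2) - 1 / (1 + x) ** 2)
              for x in (0.25, 0.5, 1.0, 2.0, 5.0))
print(max_err)  # essentially zero
```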
After understanding the form of the pdf of the F-distribution, you may be interested to know the shape of the F-distribution and the impact of both degrees of freedom on it. Let us discuss the probability density curve of the F-distribution in the next section.
5.2.1 Probability Density Curve of F-distribution


The F-distribution is a positively skewed distribution that has a minimum value
of 0, but no maximum value. The shape of the F-distribution depends on two
parameters: degrees of freedom for the numerator (n1) and degrees of
freedom for the denominator (n2). The shape of the F-distribution for n1 = 1, 2
and n2 = 1, 2 is inverse J alphabet and the curve starts out high and then
subsequently declines as shown in Fig. 5.1(a). However, for n1 and n2 > 2, the
shape of the F-distribution is as shown in Fig 5.1(b and c). The probability
density curve of the F-distribution reaches a peak (not far to the right of 0), and
then gradually approaches the horizontal axis with the larger value of F. The
F-distribution approaches the horizontal axis but never touches it.

Fig. 5.1: Probability density curves of F-distribution for various degrees of freedom.

In Fig. 5.1 (b), we plot different probability curves, keeping n1 fixed at n1 = 5 and increasing n2 from 5 to 10 to 20 to 50. By increasing the second parameter n2 from 5 to 50, the mean of the distribution (shown by the vertical line) decreases, and the probability curve shifts from the tail to the centre of the distribution. Similarly, in Fig. 5.1 (c), we plot different probability curves, keeping n2 fixed at n2 = 5 and increasing n1 from 5 to 10 to 20 to 50. By increasing the first parameter n1 from 5 to 50, the mean of the distribution (shown by the vertical line) does not change, but the probability curve shifts from the tail to the centre of the distribution. After looking at the probability curves of the F-distribution, we can observe that the probability curve of the F-distribution is a unimodal curve for n1, n2 > 2.
After understanding the probability curve of the F-distribution along with some of its properties, let us discuss some of its summary measures in the next sub-section.

5.2.2 Summary Measures of F-distribution


The mean of the F-distribution depends only on the degrees of freedom of the denominator, that is, n2, and is given as follows:

Mean = n2/(n2 − 2) for n2 > 2.

Since the F-distribution is asymmetrical and right-skewed, the mean is greater than the median and the mode.
The variance of the F-distribution depends on both (n1, n2) dfs and is given by

Variance = 2n2²(n1 + n2 − 2) / [n1(n2 − 2)²(n2 − 4)] for n2 > 4.

This programme is applied in nature; therefore, we do not give the proofs of the mean and variance of the F-distribution. Anyone interested can derive these summary measures as discussed in MST-012.
Let us take some simple examples based on the probability density function
and summary measures.
Example 3: For the probability density function of the distribution given in
Example 2, find the mean and variance of this distribution.
Solution: In Example 2, we obtained the degrees of freedom of the given
F-distribution as n1 = 2, and n2 = 2. We know that the mean and variance of
the F-distribution with (n1, n2) degrees of freedom are:
Mean = n2/(n2 − 2) for n2 > 2 and

Variance = 2n2²(n1 + n2 − 2) / [n1(n2 − 2)²(n2 − 4)] for n2 > 4.
Since both mean and variance of the F-distribution exist for n2 > 2 and n2 > 4,
respectively, therefore, for n1 = 2, and n2 = 2, the mean and variance of the F-
distribution do not exist.
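The existence conditions used in Example 3 are easy to encode (a stdlib-only sketch; the helper names and the extra illustrative pair (5, 10) are ours, not from the unit):

```python
def f_mean(n1, n2):
    """Mean of F(n1, n2); exists only for n2 > 2."""
    return n2 / (n2 - 2) if n2 > 2 else None

def f_var(n1, n2):
    """Variance of F(n1, n2); exists only for n2 > 4."""
    if n2 <= 4:
        return None
    return 2 * n2 ** 2 * (n1 + n2 - 2) / (n1 * (n2 - 2) ** 2 * (n2 - 4))

print(f_mean(2, 2), f_var(2, 2))   # None None  (Example 3: neither exists)
print(f_mean(5, 10))               # 1.25
```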

Now, try to answer the following Self Assessment Question to see how much you have learnt about the F-distribution.
SAQ 1
If a random variable X follows the F-distribution whose pdf is given by
f(x) = 3 / [8 x^(1/2) (1 + x/4)^(5/2)]; 0 < x < ∞
then obtain the degrees of freedom of this distribution. Also, find its mean and
variance.

After understanding the F-distribution with its pdf, probability density curve and
summary measures, let us see the relationship of the F-distribution with other
well-known distributions.

5.3 RELATION OF F-DISTRIBUTION TO OTHER DISTRIBUTIONS
The F-distribution is closely related to t and chi-square distributions. Some of
the relationships are discussed as follows:
• If a variable t follows the t-distribution with n df, then the square of t
follows the F-distribution with (1, n) df i.e. if t ~ t(n) then t2 ~ F(1, n).
• As the denominator degrees of freedom (n2) of the F-distribution tend to infinity, n1F approaches a chi-square distribution with n1 degrees of freedom.
• If a random variable X follows the F-distribution with (n1, n2) degrees of
freedom then 1/X also follows the F-distribution with (n2, n1) degrees of
freedom.
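The first relation can be checked numerically from the densities alone (a stdlib-only sketch; both pdfs are coded from their standard formulas, and df n = 4 is just an illustrative choice). If t ~ t(n), a change of variables shows that the density of t² at x is t_pdf(√x)/√x, which should equal the F(1, n) density:

```python
import math

def t_pdf(t, n):
    """pdf of Student's t-distribution with n df."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + t * t / n) ** (-(n + 1) / 2)

def f_pdf(x, n1, n2):
    """pdf of the F-distribution with (n1, n2) df."""
    beta = math.gamma(n1 / 2) * math.gamma(n2 / 2) / math.gamma((n1 + n2) / 2)
    return ((n1 / n2) ** (n1 / 2) * x ** (n1 / 2 - 1)
            / (beta * (1 + (n1 / n2) * x) ** ((n1 + n2) / 2)))

# Density of t^2 at x, using the symmetry of the t-distribution:
n = 4
max_err = max(abs(f_pdf(x, 1, n) - t_pdf(math.sqrt(x), n) / math.sqrt(x))
              for x in (0.5, 1.0, 2.0, 4.0))
print(max_err)  # essentially zero
```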
Now, try to answer the following Self Assessment Question.

SAQ 2
If a statistic t follows Student’s t-distribution with 4 df, then what will be the
distribution of the square of t? Also, write the pdf of that distribution.

After introducing the F-distribution, one may be interested in knowing the properties of this distribution. We now discuss some of the important properties of the F-distribution in the next section.

5.4 PROPERTIES OF F-DISTRIBUTION


The important properties of the F-distribution are as follows:
1. The value of the F-statistic is always positive or zero because it is the ratio of two variances, which, being based on squared deviations, cannot assume negative values. Its value lies between 0 and ∞.
2. The F-distribution has two parameters n1 and n2, that is, the degrees of
freedom for numerator and denominator and the shape of the distribution
depends on them. Therefore, there is a different F-distribution for each
pair of degrees of freedom.
3. The F-distribution is positively skewed. The curve is more positively skewed when n2 is smaller than n1.
4. The F-distribution is a uni-modal distribution, that is, it has a single mode.
5. The square of the t-statistic with n df follows the F-distribution with 1 and n
degrees of freedom.
6. The mean of the F-distribution with (n1, n2) df is n2/(n2 − 2) for n2 > 2.

7. The variance of the F-distribution with (n1, n2) df is 2n2²(n1 + n2 − 2) / [n1(n2 − 2)²(n2 − 4)] for n2 > 4.
8. If we interchange the degrees of freedom n1 and n2, then there exists an extremely useful relation:

F(n1,n2),(1−α) = 1/F(n2,n1),α

This is called the Reciprocal Property of the F-distribution.
9. As the degrees of freedom for the numerator and for the denominator get
larger, then the curve approximates the normal curve.
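Property 8 is equivalent to a cdf identity: if X ~ F(n1, n2), then 1/X ~ F(n2, n1), so P(X ≤ x) = 1 − P(F(n2, n1) ≤ 1/x). The sketch below checks this by crude numerical integration (stdlib only; the df pair (8, 12), the point x = 2 and the midpoint rule are illustrative choices):

```python
import math

def f_pdf(x, n1, n2):
    """pdf of the F-distribution with (n1, n2) df."""
    beta = math.gamma(n1 / 2) * math.gamma(n2 / 2) / math.gamma((n1 + n2) / 2)
    return ((n1 / n2) ** (n1 / 2) * x ** (n1 / 2 - 1)
            / (beta * (1 + (n1 / n2) * x) ** ((n1 + n2) / 2)))

def f_cdf(x, n1, n2, steps=20000):
    """P(F(n1, n2) <= x) by the midpoint rule on (0, x)."""
    h = x / steps
    return sum(f_pdf((i + 0.5) * h, n1, n2) * h for i in range(steps))

lhs = f_cdf(2.0, 8, 12)       # P(F(8, 12) <= 2)
rhs = 1 - f_cdf(0.5, 12, 8)   # 1 - P(F(12, 8) <= 1/2)
print(abs(lhs - rhs))         # tiny (numerical integration error only)
```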
You can try the following Self Assessment Question to see how much you have learnt about the properties of the F-distribution.

SAQ 3
Write any five properties of the F-distribution.

5.5 APPLICATIONS OF F-DISTRIBUTION


After discussing the main properties of the F-distribution in the previous
section, we are now discussing some of the important applications of the
F-distribution. The F-distribution has the following applications:
1. The F-distribution is used to test the hypothesis about the variances of
two normal populations. Also, it is used for the construction of the
confidence intervals for the ratio of two population variances.
2. The F-distribution is also used in regression analysis, particularly in the
test for the overall significance of a regression model or to compare the
variances of the residuals for two or more models.
3. The t-distribution is used to test/assess the statistical significance of the difference between two population means, but if we have to test the significance of more than two means, then we use the F-distribution. That is, the F-distribution is used in ANOVA as well as in the design of experiments.
You will study the first application listed above in Units 14 and 18 of this
course and the second application in the course MST-017: Applied Regression
Analysis. The third application you have already studied in MST-003: Survey
Sampling and Design of Experiments-I.
Now, try the following Self Assessment Question.

SAQ 4
Write four applications of the F-distribution.

After understanding the standard sampling distributions such as the t, chi-square and F-distributions in detail, next, you are going to learn how to read their
tables. As we have seen in Unit 14 of the course MST-012: Probability and
Probability Distributions, we can calculate the area or probability of a random
variable which follows a normal distribution with the help of a standard normal
table. Similarly, we can also use the t, chi-square, and F-distribution tables.
The main difference between the standard normal distribution table and the t,
chi-square, and F-distribution tables is that the body of the standard normal
distribution table represents the probability whereas the body of the t, chi-
square, and F-distribution tables represents the critical value or point beyond
which the area/probability of the distribution is α. Therefore, these tables are
generally used to find out the value of the variable for which area in the tail of
the distribution is given. The tabulated values are also known as critical
values of the standard sampling distributions. They are also used in
constructing confidence intervals and testing of hypotheses which you will
study in the next units of this course. We discuss how to read tabulated values
of these distributions one at a time.

5.6 TABULATED VALUES OF t-DISTRIBUTION


As you have seen the t-distribution is described with the help of degrees of
freedom (n) and for each degree of freedom there is a different t-distribution.
To include each t-distribution, the statisticians arrange the t-table as given in The t-distribution table is
Table V in the Appendix. The t-table contains the tabulated values (it is also also known as Student’s t-
known as critical values, especially in testing of hypothesis) of t- variable for table, t-table, t-score table,

different degrees of freedom (n) such that the area under the probability curve t-value table, or t-test table.

of the t-distribution to its right tail (upper tail) is equal to α (α is also known as
level of significance in testing of hypothesis) as shown in Fig. 5.2 (a). In the t-
table, the first column of the left-hand side represents the degrees of freedom
(n) while the column heading represents the upper (right-hand side) tail
area/probability (α) of the probability curve of the t-distribution. The body
contains the value of the t-statistic for each particular value of n and α which
represents the critical value or point beyond which the area/probability of the t-
distribution is α. The area/probability (α) represents the proportions of the
t-distribution contained in the right tail.

How to Use the t-distribution Table

To read the t-distribution table, you only need to know three values:

• degrees of freedom (n)

• area/probability, that is, α (level of significance) of the t-statistic (common


choices are 0.01, 0.05, and 0.10)

• the tail of the t-statistic on which α lies (one tail or two tails).
Since the area/probability, that is, alpha (level of significance) of the t-statistic may lie on the right tail, the left tail or both tails, the tabulated value of the t-distribution is called the right tail value, left tail value or both tails values accordingly. We now discuss how to read the tabulated values in each case one at a time.

Fig. 5.2: Representation of tabulated value(s) of t-distribution.

Right tail value


The t-table contains the tabulated values (it is also known as critical values,
especially in testing of hypothesis) of the t-statistic for different degrees of
freedom (n) such that the area under the probability curve of the t-distribution
to its right tail (upper tail) is equal to α. To read the right tail tabulated value,
we follow the following steps:
Step 1: We start with the first column of the t-table, that is degrees of
freedom and downward headed ‘n’ until the required degree of
freedom is reached.
Step 2: After that, we proceed right to the column headed α up to the
required α is reached.
Step 3: We get the required right tail tabulated value in the cell of the table at
the intersection of required degrees of freedom n and α.
We represent the right tail tabulated value of the t-statistic for n degrees of freedom and for right tail area/probability (α) as t(n), α. It is the point such that the probability that the t-statistic would be greater than t(n), α is α.
Suppose we want to read the tabulated value of the t-statistic for which the
area/probability on the right tail is 0.05 and the degrees of freedom is 6. To
see the tabulated value, we start with the first column of the t-table, that is,
degrees of freedom and downward headed ‘n’ until entry 6 is reached and
then proceed right to the column headed α = 0.05. Find the cell in the table at
the intersection of degrees of freedom n = 6 and α =0.05 level. This is the
t-distribution value. For your convenience, we give a part of the t-table in Table
5.1.
Thus, we get the required right tail value of the t-statistic as t(n), α = t(6), 0.05 = 1.943.
The value of the t-statistic equal to 1.943 means, the probability that the
t-statistic would exceed (greater than) 1.943 is 0.05 as shown in Fig. 5.3.
Table 5.1: Part of t-table

α= 0.10 0.05 0.025 0.01 0.005


n (df) =1 3.078 6.314 12.706 31.821 63.657
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
4 1.533 2.132 2.776 3.747 4.604
5 1.476 2.015 2.571 3.365 4.032
6 1.440 1.943 2.447 3.143 3.707
7 1.415 1.895 2.365 2.998 3.499

Left tail value


Since the t-distribution is symmetrical at the t = 0 line, therefore, the left tail
tabulated values will be equal to the right tail tabulated values in magnitude
but opposite in sign (as in the case of normal distribution). Therefore, in the
t-table, only right tail values are given. To read the left tail tabulated value, we
follow the following steps:
Step 1: First of all, we read the right tail value for the same degrees of
freedom (n) and the same area/probability α (level of significance) as
discussed.
Step 2: After that, we assign the negative sign to the right tail tabulated value
read in Step 1.
We represent the left tail tabulated value of the t-statistic for n degrees of freedom and α level as t(n),(1−α). Since the t-distribution is symmetrical about the t = 0 line, we can write it as t(n),(1−α) = −t(n), α. The left tail tabulated value is the point such that the probability that the t-statistic would be greater than t(n),(1−α) = −t(n), α is 1 − α and less than it is α, as shown in Fig. 5.2(b).

Suppose in the above example, if we want to find the value of the t-statistic
such that the left area is 0.05 for 6 df then due to the symmetry of the t-
distribution the value of the t-statistic will be – t(n), α = – t(6), 0.05 = −1.943. The
tabulated value is shown in Fig. 5.4.
Both tails values
After learning how to read tabulated values for right and left tails, we now
discuss how to read it for two tails. For two-tails, there are two tabulated
values. To read both tails tabulated value, we follow the following steps:
Step 1: First of all, we halve the total area on both tails. If it is α, then the half area, i.e. α/2, lies in each tail as shown in Fig. 5.2(c).

Step 2: We then read the right-tail and left-tail tabulated values for the same
degrees of freedom (n) and α/2 (area/probability lie on right and left
tails) instead of α as discussed in the case of right and left tails
tabulated values.
We represent both tails tabulated values of the t-statistic for n degrees of
freedom and α level as ± t(n), α/2.
For example, if we want to find out the values of the t-statistic for which the
area on both tails is 0.05 and the degrees of freedom is 6. Since the total area
on both tails is 0.05, therefore, the area on the right tail as well as on the left
tail will be 0.05/2 = 0.025. Thus, we start with the first column of the t-table and
downward headed n until entry 6 is reached. Then proceed right to the column
headed α = 0.025. We get t(n), α/2 = t(6), 0.025 = 2.447. Since the t-distribution is
symmetrical at the t = 0 line, therefore, the left tail value will be the same as
the right tail value but in a negative sign. Therefore, – t(n), α/2 = – t(6), 0.025 = –
2.447. So the required values of the t-statistic are ± t(n), α/2 = ± t(6), 0.025 = ± 2.447.
We also show the tabulated values in Fig. 5.5.

Let us take an example.

Example 4: Find the tabulated value of the t-statistic in each case for which
the degrees of freedom (n) and the area (level of significance) are given as
follows:

(i) n = 10 and α = 0.01 (right tail)

(ii) n = 8 and α = 0.05 (left tail)

(iii) n = 14 and α = 0.10 (both tails)

Solution:

(i) Here, we want to read the tabulated value of the t-statistic for

n = 10 and α = 0.01 (right tail)

Therefore, we start from the first column of the t-table (Table V) given in
the Appendix at the end of this volume and downward headed n until
entry 10 is reached. Then proceed right to the column headed α = 0.01.
So we get the required tabulated value of the t-statistic as t(n), α = t(10), 0.01
= 2.764.

(ii) Here, we are given that

n = 8 and α = 0.05 (left tail)

First of all, we read the right tail value by proceeding the same way as
part (i) t(n), α = t(8), 0.05 = 1.860. Since the t-distribution is symmetrical at the
t = 0 line, therefore, the left tail value is – t(n), α = – t(8), 0.05 = – 1.860.

(iii) Here, we want to read the value of the t-statistic for two tails and it is
given that

n = 14 and α = 0.10

Since the total area on both tails is 0.10, therefore, the area on the right tail as well as on the left tail will be 0.10/2 = 0.05. Thus, we start with the first column of the t-table and downward headed n until entry 14 is reached. Then proceed right to the column headed α = 0.05. We get t(n), α/2 = t(14), 0.05 = 1.761. Since the t-distribution is symmetrical at the t = 0 line, the left tail value will be the same as the right tail value but with a negative sign. Therefore, − t(n), α/2 = − t(14), 0.05 = − 1.761. So the required values of the t-statistic are ± t(n), α/2 = ± t(14), 0.05 = ± 1.761.
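The three tabulated values read in Example 4 can be verified by checking that each cuts off the stated tail area (a stdlib-only sketch; the t pdf is coded from its standard formula and the integral uses a crude midpoint rule):

```python
import math

def t_pdf(t, n):
    """pdf of Student's t-distribution with n df."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + t * t / n) ** (-(n + 1) / 2)

def upper_tail(crit, n, steps=20000):
    """P(T > crit) = 0.5 - integral of the pdf over (0, crit)."""
    h = crit / steps
    return 0.5 - sum(t_pdf((i + 0.5) * h, n) * h for i in range(steps))

a1 = upper_tail(2.764, 10)  # part (i): right tail area, about 0.01
a2 = upper_tail(1.860, 8)   # part (ii): by symmetry, P(T < -1.860), about 0.05
a3 = upper_tail(1.761, 14)  # part (iii): area in each tail, about 0.05
```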

After understanding how to read the tabulated (critical) values for different
cases, we now discuss some more facts about the t-table. If we closely inspect
the t-table then we observe:
• The t-table contains only the right-tailed tabulated values.
• As the degrees of freedom increase, the tabulated value decreases. The
reason is that the tails of the t-distribution shift towards the centre as the
degrees of freedom increase as shown in Fig. 5.6.
• The t-distribution table does not include entries for every possible degree of freedom. For example, the table lists continuously up to df = 30 and after that t values for df = 40, 60, 120, and does not list entries for degrees of freedom values between these.
Since the t-table does not include entries for every possible degree of freedom, a question may arise: how do we read the tabulated values for a df which is not included in the t-table? Don't worry; we can read such values either using software or by the interpolation method, which is discussed as follows:
Method of Finding the Values of t-statistic for Degrees of Freedom which
are not Listed in the Table
The t-table (Table V) given in the Appendix does not list values for every
possible degree of freedom. Therefore, it becomes necessary to know how to
find values of the t-statistic for degrees of freedom not listed in the table. Let
us discuss the process of finding the values which are not listed in the table
with the help of an example.
Suppose we want to find out the tabulated value of the t-statistic for 34
degrees of freedom which is not listed in the table such that the area on the
right side is equal to 0.05.
For that, first of all, we read the tabulated values of the t-statistic that are just
greater and just smaller than the degrees of freedom for our interest. Thus,
from the t-table, we get the values of t-statistic for 40 and 30 degrees of
freedom and area α = 0.05 as
t(40), 0.05 = 1.684 and t(30), 0.05 = 1.697

Note that the larger the degrees of freedom, the smaller the tabulated value of
the t-statistic.

We now calculate how much the t-value changes for each degree of freedom
between these two tabulated values. Here, there is a difference of 10(40 – 30)
degrees of freedom and a t-value change of 0.013(1.697 – 1.684).

Thus, we can obtain the change in the t-value corresponding to a unit change
in degree of freedom as

0.013/10 = 0.0013
Since we have to obtain the value for 34 degrees of freedom, this is either 4
more than 30 or 6 less than 40. Therefore, we can interpolate from either
value. To get from 30 to 34 degrees of freedom there is a difference of 4 (34 –
30). So we multiply this difference by the amount by which the t-value changes
per degree of freedom, i.e. 0.0013. This results in

4 × 0.0013 = 0.0052
Since the larger the degrees of freedom, the smaller the tabulated value of the t-statistic, we subtract this value 0.0052 from t(30), 0.05 = 1.697 to get the required value. Thus,

t(34), 0.05 = 1.697 − 0.0052 = 1.6918

Now, if we interpolate from 40 degrees of freedom, then the difference 6 (40 − 34) is multiplied by 0.0013 and added to 1.684, giving

t(34), 0.05 = 1.684 + 6 × 0.0013 = 1.6918

Thus, we get the same value.
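The interpolation above is easily mechanised; starting from either tabulated row gives the same answer (a stdlib-only sketch using the table values quoted above):

```python
# Interpolating t(34), 0.05 from the tabulated rows df = 30 and df = 40
t30, t40 = 1.697, 1.684            # table values for alpha = 0.05
per_df = (t30 - t40) / (40 - 30)   # change per degree of freedom: 0.0013

from_30 = t30 - (34 - 30) * per_df  # subtract, since t-values shrink with df
from_40 = t40 + (40 - 34) * per_df  # equivalently, work up from df = 40

print(round(from_30, 4), round(from_40, 4))  # 1.6918 1.6918
```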

Now, you can try the following Self Assessment Question.

SAQ 5
Find the tabulated values of the t-statistic for which the area and the degrees
of freedom are given as follows:
(i) n = 9 and α = 0.01 (right tail)

(ii) n = 15 and α = 0.05 (left tail)

(iii) n = 13 and α = 0.05 (both tails)

5.7 TABULATED VALUES OF CHI-SQUARE DISTRIBUTION
Similar to the t-table, the chi-square table is given in the Appendix as Table VI. (The chi-square distribution table is also known as the chi-square table, chi-square score table, chi-square value table, or chi-square test table.) The chi-square table contains the tabulated values (also known as critical values, especially in the testing of hypothesis) of the chi-square statistic for different degrees of freedom (n) such that the area under the probability curve of the chi-square distribution to its right tail (upper tail) is equal to α (α is also known as the level of significance in testing of hypothesis) as shown in Fig. 5.7.

In the chi-square table, the column headings indicate the area on the upper
portion (right tail) of the probability curve of the chi-square distribution and the
first column on the left-hand side indicates the values of degrees of freedom
(n). The body contains the value of the chi-square statistic for each particular
value of n and α which represents the critical value or point beyond which the
area/probability of the chi-square distribution is α. The area/probability (α)
represents the proportions of the chi-square distribution contained in
the right tail.
How to Use the Chi-square Table


Since the area/probability, that is, α (level of significance) of the chi-square statistic may lie on the right tail, the left tail or both tails, the tabulated value of the chi-square distribution is called the right tail value, left tail value or both tails values accordingly. We now discuss how to read these tabulated values one at a time.
Right tail value
The chi-square table contains the tabulated values (it is also known as critical
values, especially in testing of hypothesis) of the chi-square variable for
different degrees of freedom (n) such that the area under the probability curve
of the chi-square distribution to its right tail (upper tail) is equal to α.

Fig. 5.7: Representation of Tabulated value(s) of chi-square distribution.

The procedure to read the right tail tabulated value of the chi-square statistic is
almost similar to reading the t-table, we follow the following steps:
Step 1: We start with the first column of the chi-square table, that is degrees
of freedom and downward headed ‘n’ until the required degree of
freedom is reached.
Step 2: After that, we proceed right to the column headed α up to the
required α is reached.
Step 3: We get the required right tail tabulated value in the cell of the table at
the intersection of required degrees of freedom n and α.
We represent the tabulated value of the chi-square statistic for n degrees of freedom and for right tail area/probability as χ²(n), α.

Suppose we want to find out the tabulated value of the chi-square statistic for
which the area/probability on the right tail is 0.01 and the degrees of freedom
is 4. To read the tabulated value, we start with the first column of the chi-
square table, that is degrees of freedom and downward headed ‘n’ until entry 4
is reached and then proceed right to the column headed α = 0.01. Then we
find the cell in the table at the intersection of degrees of freedom n = 4 and α =
0.01 level. For your convenience, we give a part of the chi-square table as
shown in Table 5.2.
Table 5.2: The Part of Chi-square Table

α= 0.995 0.99 0.975 0.95 0.90 0.10 0.05 0.025 0.01 0.005
n(df)=1 --- --- --- --- 0.02 2.71 3.84 5.02 6.63 7.88
2 0.01 0.02 0.05 0.10 0.21 4.61 5.99 7.38 9.21 10.60
3 0.07 0.11 0.22 0.35 0.58 6.25 7.81 9.35 11.34 12.84
4 0.21 0.30 0.48 0.71 1.06 7.78 9.49 11.14 13.28 14.86
5 0.41 0.55 0.83 1.15 1.61 9.24 11.07 12.83 15.09 16.75
6 0.68 0.87 1.24 1.64 2.20 10.64 12.59 14.45 16.81 18.55

Thus, we get the required right tail value of the chi-square statistic as χ²(n), α = χ²(4), 0.01 = 13.28. The value of the chi-square statistic equal to 13.28 means that the probability that the chi-square statistic would exceed (be greater than) 13.28 is 0.01, as shown in Fig. 5.8.
Left tail value
Since the t-distribution is symmetrical at the t = 0 line, therefore, the left tail
values are the same in magnitude and opposite in sign as the right tail value
and the t-table contains only tabulated values for α = 0.10, 0.05, 0.025, 0.01
and 0.005. But the chi-square is not symmetrical, therefore, to find out the left
tail values using the chi-square table, the tabulated values for α = 0.995, 0.99,
0.975, 0.90 are also given. For example, α = 0.95 means there exists 0.95 probability/area on the right side and the rest 1 − 0.95 = 0.05 on the left side of the curve.
Hence, to read the left tail value from the chi-square table we have to see the
value at 1 – α instead of α. To read the left tail tabulated value of the chi-
square statistic, we follow the following steps:
Step 1: We start with the first column of the chi-square table, that is degrees
of freedom and downward headed ‘n’ until the required degree of
freedom is reached.
Step 2: After that we proceed right to the column headed α up to 1 – α is
reached.
Step 3: We get the required left tail tabulated value in the cell of the table at
the intersection of required degrees of freedom n and 1 – α.
We represent the left tail tabulated/critical value of the chi-square statistic as χ²(n),(1−α), which is the point such that the probability that the chi-square statistic would be less than χ²(n),(1−α) is α and greater than χ²(n),(1−α) is 1 − α, as shown in Fig. 5.7(b).

Suppose we want to read the tabulated value for which the left area is 0.05
and the degree of freedom is 4. To read this tabulated value, we start with the
first column of the chi-square table, that is degrees of freedom and downward
headed ‘n’ until entry 4 is reached and then proceed right to the column
headed α to 1 – 0.05 = 0.95 instead of 0.05. Then we find the cell in the table
at the intersection of degrees of freedom n = 4 and α = 0.95 level. Thus, we get the required left tail value of the chi-square statistic as χ²(n),(1−α) = χ²(4), 0.95 = 0.71. The tabulated value of the chi-square statistic equal to 0.71 means that the probability that the chi-square statistic would be less than 0.71 is 0.05 and greater than 0.71 is 0.95, as shown in Fig. 5.9.
Let us learn how to read both/two tails tabulated values from the chi-square
table.
Both tails values
The procedure to read the two tails tabulated values of the chi-square statistic is almost similar to that for the t-table; we follow the following steps:
Step 1: First of all, we halve the total area on both tails. If it is α, then the half area, i.e. α/2, lies in each tail as shown in Fig. 5.7(c).
Step 2: We then read the right tail and left tail tabulated values for the same
degrees of freedom (n) and α/2 (area/probability lie on right and left
tails) instead of α as discussed in the case of right and left tails
tabulated values.
In this case, we represent the right tail tabulated value as χ²(n), α/2 and the left tail value as χ²(n),(1−α/2).

For example, suppose we want to read the values of the chi-square statistic for
which the total area on both tails is 0.05 and the degrees of freedom are 6. First
of all, we halve the total area/probability: 0.05/2 = 0.025. After that, we read
the right-tail and left-tail values in such a way that the area on each tail
remains α/2 = 0.025, as discussed in the right-tail and left-tail cases. We get
χ²(n),α/2 = χ²(6),0.025 = 14.45 and χ²(n),(1−α/2) = χ²(6),(1−0.025) = χ²(6),0.975
= 1.24. These values are shown in Fig. 5.10.
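The two-tail reading can be cross-checked the same way; a sketch (again assuming SciPy) for df = 6 and total two-tail area 0.05:

```python
from scipy.stats import chi2

df, alpha = 6, 0.05
upper = chi2.ppf(1 - alpha / 2, df)  # right-tail value: area alpha/2 above it
lower = chi2.ppf(alpha / 2, df)      # left-tail value: area alpha/2 below it
print(round(upper, 2), round(lower, 2))  # 14.45 1.24, as in the table
```
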
After understanding how to read tabulated (critical) values for different cases,
we now discuss some more facts about the chi-square table. If we closely
inspect the chi-square table, then we observe:
• The table contains the tabulated values for the right tail as well as the left
tail.
• As the degrees of freedom increase, the tabulated value increases. The
reason is that as we increase the degrees of freedom, the probability
curve of the chi-square distribution shifts towards the right side as
shown in Fig. 5.11.
• The chi-square table does not include entries for every possible degree of
freedom. For example, the table lists values continuously up to df = 30 and
after that gives the chi-square values only for some larger degrees of
freedom; it does not list entries for the degrees of freedom between these.

Method of Finding the Values of Chi-square Statistic for Degrees of
Freedom which are not Listed in the Table
We obtain the values of the chi-square statistic for degrees of freedom which
are not listed in the table in a similar manner as discussed in the case of the
t-table. This is explained in part (iv) of Example 5.

Now, let us do one example based on the above discussion.



Example 5: Find the values of the chi-square statistic for which the degrees
of freedom (n) and area are given as follows:
(i) n = 2 and α = 0.05 (right tail)
(ii) n = 10 and α = 0.01 (left tail)
(iii) n = 8 and α = 0.05 (both tails)
(iv) n = 64 and α = 0.01 (right tail)
Solution:
(i) Here, we want to read the tabulated value of the chi-square statistic for
n = 2 and α = 0.05 (right tail)
Thus, we start from the first column of the chi-square table given in the
Appendix and move down the column headed n until entry 2 is reached.
Then proceed right to the column headed α = 0.05. We get the required
tabulated value of the chi-square statistic as χ²(n),α = χ²(2),0.05 = 5.99.

(ii) Here, we want to find the value of the chi-square statistic for
n = 10 and α = 0.01 (left tail)
To read the left-tail tabulated value, we start with the first column of the
chi-square table, that is, degrees of freedom, and move down the column
headed ‘n’ until entry 10 is reached, and then proceed right to the column
headed 1 – 0.01 = 0.99 instead of 0.01. Then we find the cell of the table
at the intersection of degrees of freedom n = 10 and α = 0.99. Thus, we
get the required left-tail value of the chi-square statistic as
χ²(n),(1−α) = χ²(10),0.99 = 2.56.

(iii) Here, we want to find the tabulated value of the chi-square statistic for
two-tail area α = 0.05 and n = 8.
Since the total area on both tails is 0.05, the half area 0.05/2 = 0.025
lies on each tail. In this case, the chi-square statistic has two values,
one on the right tail as χ²(n),α/2 = χ²(8),0.025 and one on the left tail
as χ²(n),(1−α/2) = χ²(8),0.975. So, by proceeding the same way as above,
we get the required values of the chi-square statistic as
χ²(8),0.025 = 17.53 and χ²(8),0.975 = 2.18.
(iv) Here, we want to find the tabulated value of the chi-square statistic for
the right tail and for n = 64 and α = 0.01.
Since the chi-square table does not have the tabulated value for 64
degrees of freedom, we need to interpolate it. For this, we find the
tabulated values of the chi-square statistic for the listed degrees of
freedom just greater and just less than 64, with α = 0.01. Thus, we have

χ²(70),0.01 = 100.42 and χ²(60),0.01 = 88.38

There is a difference of 10 degrees of freedom between these two and a
difference of 12.04 (100.42 − 88.38) in the chi-square values. Thus, each
degree of freedom produces an approximate change in the value of the
chi-square statistic of

12.04/10 = 1.204

To get the value of the chi-square statistic for 64 degrees of freedom, we
multiply 1.204 by 4 (64 − 60) and get

1.204 × 4 = 4.816

Since the larger the degrees of freedom, the larger the tabulated value of
the chi-square statistic, adding 4.816 to 88.38 gives the required value as

χ²(64),0.01 = 88.38 + 4.816 = 93.196
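The interpolation above is ordinary linear interpolation between the two nearest listed degrees of freedom. A sketch of the same arithmetic follows (the exact quantile from SciPy is computed only as a check; it differs slightly because the chi-square quantile is not exactly linear in the degrees of freedom):

```python
from scipy.stats import chi2

def interpolate(df, df_lo, val_lo, df_hi, val_hi):
    """Linearly interpolate a tabulated value between two listed df."""
    change_per_df = (val_hi - val_lo) / (df_hi - df_lo)
    return val_lo + change_per_df * (df - df_lo)

approx = interpolate(64, 60, 88.38, 70, 100.42)  # table values for df 60, 70
exact = chi2.ppf(1 - 0.01, df=64)                # exact upper 1% point
print(round(approx, 3))  # 2.944-style arithmetic gives 93.196, as above
```
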

After understanding how to read the tabulated values of the chi-square
distribution, you can now assess yourself by solving the following Self
Assessment Question.

SAQ 6
Find the tabulated value of the chi-square statistic for which the degrees of
freedom and area are given as
(i) n = 11 and α = 0.01 (right tail)
(ii) n = 19 and α = 0.10 (left tail)

5.8 TABULATED VALUES OF F-DISTRIBUTION


As you have seen, the F-distribution is described with the help of two degrees
of freedom, one for the numerator (n1) and the other for the denominator (n2).
Therefore, for each combination of degrees of freedom, the F-statistic has a
different tabulated value, so statisticians organise the F-table somewhat
differently than the tables for the other distributions. (The F-distribution
table is also known as the F-table, F-score table, F-value table, or F-test
table.) Statisticians prepare a separate F-table associated with each α (the
area in the right tail of the distribution), and in each of these tables the
F-values are given for various combinations of degrees of freedom. Table VII in
the Appendix contains the tabulated (critical) values of the F-statistic for
various degrees of freedom such that the area under the probability curve of
the F-distribution to its right (upper) tail is equal to α = 0.10, 0.05, 0.025, 0.01
and 0.005. A part of the table is shown in Table 5.3.
In the F-table, the first row indicates the values of degrees of freedom for the
numerator (n1) and the first column on the left-hand side indicates the values
of degrees of freedom for the denominator (n2). The body
of the table contains the value of the F-statistic for each particular pair of
(n1, n2) and α which represents the critical value or point beyond which the
area/probability of the F-distribution is α. The area/probability (α) represents
the proportions of the F-distribution contained in the right tail.

How to Use the F-table


The area/probability, that is, α (level of significance) of the F-statistic may lie
on the right tail, left tail or both tails so the tabulated value of the F-distribution
is called right tail value, left tail value or both tail values. We now discuss how
to read these tabulated values one at a time.

Right tail value


The F-table contains the tabulated values (it is also known as critical values,
especially in testing of hypothesis) of the F-statistic for different degrees of
freedom (n1, n2) such that the area under the probability curve of the
F-distribution to its right tail (upper tail) is equal to α, therefore, for right tail
value, we can read the table as such. To read the right tail tabulated value of
the F-statistic, we follow the following steps:
Step 1: First of all, we select the table of required α because there is a
separate F-table for each α.
Step 2: After selecting the required F-table, we start with the first row of the
selected F-table, that is, degrees of freedom for the numerator (n1)
and move right until the required degree of freedom is reached.
Step 3: After that, we proceed downward headed ‘denominator degrees of
freedom (n2)’ until the required degree of freedom is reached.
Step 4: We get the required right tail tabulated value in the cell of the table at
the intersection of the required degrees of freedom n1 and n2.
We represent the tabulated value of the F-statistic for (n1, n2) degrees of
freedom and right-tail area/probability α as F(n1,n2),α, as shown in
Fig. 5.12(a).

Fig. 5.12: Representation of tabulated value(s) of the F-distribution.

Suppose we want to find out the tabulated value of the F-statistic for which the
area/probability on the right tail is 0.05 and the degrees of freedom for the
numerator is 8 and for the denominator is 12. To read the tabulated value, we
first select the F-table corresponding to α = 0.05. Then we start with the first
row of the selected F-table, that is, degrees of freedom for the numerator, and
proceed right until the required n1 = 8 is reached, and then move down the
column headed ‘denominator degrees of freedom (n2)’ until the required
n2 = 12 is reached. We get F(n1,n2),α = F(8,12),0.05 = 2.85, as shown in
Fig. 5.13. For your convenience, we give a part of the F-table as follows:
Table 5.3: Part of F-table for α = 0.05

Degrees of
freedom for Degrees of freedom for numerator (n1)
denominator
(n2) 1 2 3 4 5 6 7 8 9 10 11 12

10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.94 2.91
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.82 2.79
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.72 2.69
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.63 2.60
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.57 2.53
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.51 2.48
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.46 2.42

Note 1: The value of the F-statistic for (n1, n2) degrees of freedom is not the
same as for (n2, n1), that is,
F(n1,n2) ≠ F(n2,n1)
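Assuming SciPy again, both the right-tail reading and Note 1 can be checked with `f.ppf` (a right-tail area of α corresponds to the 1 − α quantile):

```python
from scipy.stats import f

right_tail = f.ppf(1 - 0.05, dfn=8, dfd=12)  # area 0.05 in the right tail
print(round(right_tail, 2))                  # 2.85, as read from Table 5.3

# Note 1: interchanging the degrees of freedom gives a different value.
swapped = f.ppf(1 - 0.05, dfn=12, dfd=8)
print(round(right_tail, 2) == round(swapped, 2))  # False
```
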

Left tail value


The F-distribution is not symmetrical like the t-distribution, therefore, the left-tail
value for which the left-tail area/probability is α means there exists 1 – α
area/probability on the right side of that value. Hence, to read the left-tail value
from the F-table, we have to look at 1 – α instead of α and then use the
relationship of the F-distribution

F(n1,n2),(1−α) = 1 / F(n2,n1),α

It means that we can find the left-tail tabulated value of the F-statistic using the
right-tail values by interchanging the degrees of freedom. We represent the left
tail tabulated/critical value of the F-table as F(n1,n2 ),(1−α ) . To read the left tail
tabulated value of the F-statistic, we follow the following steps:
Step 1: First of all, we select the table of required α because there is a
separate F-table for each α.
Step 2: After selecting the required F-table, we first read the tabulated value
for the reverse degrees of freedom (n2, n1), that is, F(n2 ,n1 ), α as
discussed above.
Step 3: After that, we use the relationship of the F-distribution to find the left-
tail tabulated value:

F(n1,n2),(1−α) = 1 / F(n2,n1),α

Suppose we want to find out the tabulated value of the F-statistic for which the
area/probability on the left tail is 0.01, the degrees of freedom for the
numerator (n1) is 6 and for the denominator (n2) is 10. To read the tabulated
value, we first select the F-table corresponding to α = 0.01. After that, we first
read the tabulated value for the reversed degrees of freedom, that is,
F(n2,n1),α = F(10,6),0.01, as discussed above. We get F(10,6),0.01 = 7.87.
Then we use the relationship to calculate the required left-tail value as

F(6,10),(1−0.01) = F(6,10),0.99 = 1 / F(10,6),0.01 = 1/7.87 = 0.127

This is also shown in Fig. 5.14.
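The reciprocal relation itself can be verified numerically; a sketch (assuming SciPy) compares the left-tail quantile computed directly with 1/F(10,6),0.01:

```python
from scipy.stats import f

alpha = 0.01
direct = f.ppf(alpha, dfn=6, dfd=10)                 # left-tail value of F(6, 10)
via_reciprocal = 1 / f.ppf(1 - alpha, dfn=10, dfd=6)  # 1 / F(10,6),0.01
print(round(direct, 3), round(via_reciprocal, 3))     # both about 0.127
```
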


Let us now learn how to read the two-tail tabulated values from the F-table.
Both tails values
To read the tabulated values of the F-statistic when the total area on both tails
is α, we follow the same procedure as in the case of the t-table and chi-square
table. We follow the following steps:
Step 1: First of all, we halve the total area on both tails. If it is α, then the half
area, i.e. α/2, lies in each tail as shown in Fig. 5.12(c).
Step 2: We then read the right-tail and left-tail tabulated values for the same
degrees of freedom (n1, n2) and α/2 (the area/probability lying on
each tail) instead of α, as discussed in the cases of the right-tail and
left-tail tabulated values.
In this case, we represent the right-tail tabulated value as F(n1,n2),α/2 and the
left-tail value as F(n1,n2),(1−α/2).

For example, suppose we want to read the values of the F-statistic for which
the total area on both tails is 0.05 and the degrees of freedom are (4, 12).
First of all, we halve the total area/probability: 0.05/2 = 0.025. After that, we
read the right-tail and left-tail values in such a way that the area on each tail
remains α/2 = 0.025, as discussed in the right-tail and left-tail cases. We get
F(n1,n2),α/2 = F(4,12),0.025 = 4.12 and F(n1,n2),(1−α/2) = F(4,12),0.975
= 1 / F(12,4),0.025 = 1/8.75 = 0.11.

These tabulated values are also shown in Fig. 5.15.


After understanding how to read tabulated (critical) values for different cases,
we now discuss some more facts about the F-table. If we closely inspect the
F-table, then we observe:
• As the degrees of freedom either for the numerator or denominator
increase, the tabulated value of the F-statistic decreases. The reason is
that as we increase the degrees of freedom the tails of the F-distribution
shift towards the left side as shown in Fig. 5.16.
• The value of the F-statistic for (n1, n2) degrees of freedom is not the
same as for (n2, n1), that is,
F(n1,n2) ≠ F(n2,n1)

• The F-table does not include entries for every pair of possible degrees of
freedom. For example, the table lists values continuously up to
n1 = n2 = 30 and after that only for df = 40, 60 and 120; it does not list
any entries for degrees of freedom between these.
Method of Finding the Values of F-Statistic for Degrees of Freedom
which are not Listed in the Table
We obtain the values of the F-statistic for degrees of freedom which are not
listed in the table in a similar manner as discussed in the case of the t-table.
This is explained in part (iv) of Example 6.
Now, let us try to read the tabulated values for F-statistic by taking an
example.
Example 6: Find the values of the F-statistic in each part for which the
degrees of freedom (n1, n2) and the area (α ) are as follows:
(i) n1 = 15, n2 = 8 and α = 0.01 (right tail)
(ii) n1 = 7, n2 = 4 and α = 0.05 (left tail)
(iii) n1 = 5, n2 = 10 and α = 0.10 (both tails)
(iv) n1 = 10, n2 = 32 and α = 0.01 (right tail)
Solution:
(i) Here, we want to find the tabulated value of the F-statistic for which
n1 = 15, n2 = 8 and the area/probability on the right tail is α = 0.01.
Thus, we first select the F-table for α = 0.01 and then start with the first
row of the selected F-table, that is, degrees of freedom for the
numerator, and proceed right until we reach n1 = 15 and then move
down the column headed ‘denominator degrees of freedom (n2)’ until
we reach n2 = 8. We get the required tabulated value of the F-statistic
as F(n1,n2),α = F(15,8),0.01 = 5.52.

(ii) Here, we want to find the tabulated value of the F-statistic for which
n1 = 7, n2 = 4 and the area/probability on the left tail is α = 0.05.
Since we have to read the tabulated value for the left tail, we use the
relation

F(n1,n2),(1−α) = 1 / F(n2,n1),α

We first select the F-table corresponding to α = 0.05. After that, we
read the tabulated value by reversing the degrees of freedom, that is,
for (4, 7). We get F(n2,n1),α = F(4,7),0.05 = 4.12. Then we use the
relationship to calculate the required left-tail value as

F(7,4),(1−0.05) = F(7,4),0.95 = 1 / F(4,7),0.05 = 1/4.12 = 0.243

(iii) Here, the total area on both tails is 0.10, therefore, the half area
0.10/2 = 0.05 lies on each tail. In this case, the F-statistic has two
values, one on the right tail as F(n1,n2),α/2 = F(5,10),0.05 and one on
the left tail as F(n1,n2),(1−α/2) = F(5,10),(1−0.05) = F(5,10),0.95. So,
by proceeding the same way as above, we get the required tabulated
values of the F-statistic as F(5,10),0.05 = 3.33 and

F(5,10),0.95 = 1 / F(10,5),0.05 = 1/4.74 = 0.21

(iv) Here, we want to find the tabulated value of the F-statistic for
n1 = 10, n2 = 32 and α = 0.01 (right tail).
Since the F-table for α = 0.01 does not have the tabulated value
corresponding to degrees of freedom n1 = 10, n2 = 32, we need to
interpolate it.
For this, we find the tabulated values of the F-statistic for the listed
denominator degrees of freedom just greater and just less than 32.
Thus, we have

F(10,30),0.01 = 2.98 and F(10,40),0.01 = 2.80

There is a difference of 10 degrees of freedom between these two and a
difference of 0.18 (2.98 − 2.80) in the F-values. Thus, each degree of
freedom produces an approximate change in the value of the F-statistic of

0.18/10 = 0.018

To get 32 degrees of freedom, multiplying 0.018 by 2 (32 − 30), we get

0.018 × 2 = 0.036

Since the larger the degrees of freedom, the smaller the tabulated value
of the F-statistic, subtracting this from 2.98 gives the required value as

F(10,32),0.01 = 2.98 − 0.036 = 2.944
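As in the chi-square case, this is linear interpolation; a short sketch (SciPy's exact quantile is computed only for comparison):

```python
from scipy.stats import f

lo_df, lo_val = 30, 2.98   # F(10,30),0.01 from the table
hi_df, hi_val = 40, 2.80   # F(10,40),0.01 from the table
approx = lo_val + (hi_val - lo_val) / (hi_df - lo_df) * (32 - lo_df)
exact = f.ppf(1 - 0.01, dfn=10, dfd=32)  # exact upper 1% point
print(round(approx, 3))  # 2.944, as obtained above
```
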
Now, try to answer the following Self Assessment Question.

SAQ 7
Find the tabulated value of the F-statistic in the following cases:
(i) n1 = 8, n2 = 10 and α = 0.05 (right tail)
(ii) n1 = 4, n2 = 12 and α = 0.01 (left tail)

Let us end with a brief look at what we have covered in this unit.

5.9 SUMMARY
In this unit, we have covered the following points:
• The probability density function of the F-distribution with (n1, n2) degrees
of freedom is given by

f(F) = [(n1/n2)^(n1/2) F^(n1/2 − 1)] / [B(n1/2, n2/2) (1 + (n1/n2)F)^((n1+n2)/2)]; 0 < F < ∞

• The F-distribution is a positively skewed distribution that has a minimum
value of 0 but no maximum value. For n1 = 1, 2 and n2 = 1, 2, the shape
of the F-distribution is an inverse-J: the curve starts out high and then
declines steadily. However, for n1 and n2 > 2, the curve of the
F-distribution first increases, attains its maximum value and after that
starts to decrease.

• The mean and variance of the F-distribution with (n1, n2) df are given as
follows:

Mean = n2/(n2 − 2) for n2 > 2.

Variance = [2n2²(n1 + n2 − 2)] / [n1(n2 − 2)²(n2 − 4)] for n2 > 4.

• If we interchange the degrees of freedom n1 and n2, then there exists a
very useful relation:

F(n1,n2),(1−α) = 1 / F(n2,n1),α

• The applications of the F-distribution.


• The method of reading the tabulated values from the t, chi-square and F
tables.
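The moment formulas recalled above can be spot-checked against SciPy's built-in moments; a sketch for (n1, n2) = (10, 6), an arbitrary choice with n2 > 4 so that both moments exist:

```python
from scipy.stats import f

n1, n2 = 10, 6
mean_formula = n2 / (n2 - 2)
var_formula = 2 * n2**2 * (n1 + n2 - 2) / (n1 * (n2 - 2)**2 * (n2 - 4))
mean_scipy, var_scipy = f.stats(dfn=n1, dfd=n2, moments='mv')
print(mean_formula, var_formula)  # 1.5 3.15, matching SciPy's moments
```
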

5.10 TERMINAL QUESTIONS


Write down the pdf of the F-distribution in each of the following cases:

(i) (4, 2) degrees of freedom

(ii) (2, 4) degrees of freedom

and show that the pdf of the F-distribution is not the same when the degrees
of freedom are reversed.

5.11 SOLUTIONS / ANSWERS


Self Assessment Questions (SAQs)

1. If a random variable X follows the F-distribution with (n1, n2) degrees of
freedom, then the probability density function of X is given as

f(F) = [(n1/n2)^(n1/2) F^(n1/2 − 1)] / [B(n1/2, n2/2) (1 + (n1/n2)F)^((n1+n2)/2)]; 0 < F < ∞

We now try to convert the given pdf into the standard form of the
F-distribution so that we can compare and find the degrees of freedom.
Since the coefficient of x in (1 + x/4)^(5/2) is 1/4, this suggests that
n1 = 1 and n2 = 4, because the coefficient of x in (1 + (n1/n2)x) is
n1/n2. We try to write the given pdf in the form of the F-distribution as
follows:

f(x) = [(1/4)^(1/2) x^(1/2 − 1)] / [B(1/2, 2) (1 + x/4)^((1+4)/2)]; 0 < x < ∞

where

B(1/2, 2) = Γ(1/2)Γ(2) / Γ(5/2) = √π · 1 / [(3/2)(1/2)√π] = 4/3

By comparing the above form with the standard form, we get n1 = 1 and
n2 = 4.
We know that the mean and variance of the F-distribution with (n1, n2)
degrees of freedom are:

Mean = n2/(n2 − 2) for n2 > 2 and
Variance = [2n2²(n1 + n2 − 2)] / [n1(n2 − 2)²(n2 − 4)] for n2 > 4.

Therefore, for n1 = 1 and n2 = 4 degrees of freedom,

Mean = n2/(n2 − 2) = 4/(4 − 2) = 2

Since the variance of the F-distribution exists only for n2 > 4, for
n2 = 4 it does not exist.
2. As we know, if a variable t follows the t-distribution with n df, then the
square of t follows the F-distribution with (1, n) df. Therefore, for n = 4,
t² follows the F-distribution with (1, 4) df. The pdf of the F-distribution
is given by

f(F) = [(n1/n2)^(n1/2) F^(n1/2 − 1)] / [B(n1/2, n2/2) (1 + (n1/n2)F)^((n1+n2)/2)]; 0 < F < ∞

Therefore, for (1, 4) df,

f(F) = [(1/4)^(1/2) F^(1/2 − 1)] / [B(1/2, 2) (1 + F/4)^(5/2)]
= 3F^(−1/2) / [8(1 + F/4)^(5/2)]   [since B(1/2, 2) = 4/3]

3. Refer to Section 5.4.


4. Refer to Section 5.5.
5. (i) Here, we want to find the value of the t-statistic for the right tail
corresponding to n = 9 and α = 0.01.
Therefore, we start with the first column of the t-table (Table V)
given in the Appendix and move down the column headed n until
entry 9 is reached. Then proceed right to the column headed
α = 0.01. We get the required value of the t-statistic as
t(n),α = t(9),0.01 = 2.821.
(ii) In this part, we have to read the tabulated value for the left tail
corresponding to
n = 15 and α = 0.05
First of all, we read the right tail value by proceeding the same way
as in part (i) as t(n), α = t(15), 0.05 = 1.753. Since the t-distribution is
symmetrical at the t = 0 line, therefore, the left tail value will be
– t(n), α = – t(15), 0.05 = – 1.753.
(iii) Here, we want to read the tabulated values of the t-statistic for the
two-tail corresponding to
n = 13 and α = 0.05
Since the total area on both tails is 0.05, therefore, the area on the
right tail as well as on the left tail will be 0.05/2 = 0.025. Thus, we
start with the first column of the t-table and downward headed n
until entry 13 is reached. Then proceed right to the column headed
0.025. We get t(n), α/2 = t(13), 0.025 = 2.160. Since the t-distribution is
symmetrical at the t = 0 line, therefore, the left tail value will be the
same as the right tail value but in a negative sign. Therefore,
– t(n), α/2 = – t(13), 0.025 =– 2.160. So the required values of the
t-statistic are ± t(n), α/2 = ± t(13), 0.025 = ± 2.160.
6. (i) Here, we want to find the value of the chi-square statistic for
n = 11 and α = 0.01 (right tail)
We start with the first column of the chi-square table given in the
Appendix and move down the column headed n until entry 11 is
reached. Then proceed right to the column headed α = 0.01. We
get the required value of the chi-square statistic as
χ²(n),α = χ²(11),0.01 = 24.72.

(ii) Here, we want to find the value of the chi-square statistic for
n = 19 and α = 0.10 (left tail)
To read the left-tail tabulated value, we start with the first column
of the chi-square table, that is, degrees of freedom, and move
down the column headed ‘n’ until entry 19 is reached, and then
proceed right to the column headed 1 – 0.10 = 0.90 instead of
0.10. Then we find the cell of the table at the intersection of
degrees of freedom n = 19 and α = 0.90. Thus, we get the required
left-tail value of the chi-square statistic as
χ²(n),(1−α) = χ²(19),0.90 = 11.65.

7. (i) Here, we want to find the value of the F-statistic for
n1 = 8, n2 = 10 and α = 0.05 (right tail).
Thus, we first select the F-table for α = 0.05 and then start with
the first row of the selected F-table, that is, degrees of freedom for
the numerator, and proceed right until we reach n1 = 8 and then
move down the column headed ‘denominator degrees of freedom
(n2)’ until we reach n2 = 10. We get the required tabulated value
of the F-statistic as F(n1,n2),α = F(8,10),0.05 = 3.07.

(ii) Here, we want to find the tabulated value of the F-statistic for which
n1 = 4, n2 = 12 and the area/probability on the left tail is α = 0.01.
Since we have to read the tabulated value for the left tail, we use
the relation

F(n1,n2),(1−α) = 1 / F(n2,n1),α

We first select the F-table corresponding to α = 0.01. After that, we
read the tabulated value by reversing the degrees of freedom, that
is, for (12, 4). We get F(n2,n1),α = F(12,4),0.01 = 14.37. Then we
use the relationship to calculate the required left-tail value as

F(4,12),(1−0.01) = F(4,12),0.99 = 1 / F(12,4),0.01 = 1/14.37 = 0.07

Terminal Questions (TQs)

For n1 = 4 and n2 = 2, the pdf of the F-distribution is given by

f(F) = [(4/2)^(4/2) F^(4/2 − 1)] / [B(2, 1) (1 + 2F)^((4+2)/2)]; 0 < F < ∞

= 4F / [(1/2)(1 + 2F)³] = 8F / (1 + 2F)³   [since B(2, 1) = 1/2]

Similarly, for n1 = 2 and n2 = 4, the pdf of the F-distribution is given by

f(F) = [(2/4)^(2/2) F^(2/2 − 1)] / [B(1, 2) (1 + F/2)^((2+4)/2)]; 0 < F < ∞

= (1/2) / [(1/2)(1 + F/2)³] = 1 / (1 + F/2)³   [since B(1, 2) = 1/2]

Hence, the pdf of the F-distribution is not the same when the degrees of
freedom are reversed, that is,

F(n1,n2) ≠ F(n2,n1)
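The same conclusion can be spot-checked numerically (assuming SciPy; `f.pdf` evaluates the density) by comparing the two densities at a point, say F = 2, where 8F/(1 + 2F)³ = 16/125 = 0.128 while 1/(1 + F/2)³ = 1/8 = 0.125:

```python
from scipy.stats import f

x = 2.0
pdf_4_2 = f.pdf(x, dfn=4, dfd=2)  # 8x/(1 + 2x)^3 at x = 2 gives 0.128
pdf_2_4 = f.pdf(x, dfn=2, dfd=4)  # 1/(1 + x/2)^3 at x = 2 gives 0.125
print(pdf_4_2, pdf_2_4, pdf_4_2 != pdf_2_4)
```
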