Introduction To Data Analytics: Sampling Distributions
DATA ANALYTICS
Class #9
Sampling Distributions
Dr. Sreeja S R
Assistant Professor
Indian Institute of Information Technology
IIIT Sri City
IIITS: IDA - M2021 1
IN THIS PRESENTATION…
• Basic concept of sampling distribution
• t-distribution
• F distribution
Introduction
Statistical inference usually proceeds through the following steps:
• Data collection: collect a sample from the population.
• Statistics: compute a statistic from the sample.
• Statistical inference: from the statistic, make statements about the values of the population parameters.
• For example, infer the population mean from the sample mean.
• Population: A population consists of the totality of the observations with which we are
concerned.
• Random variable: A random variable is a function that associates a real number with each
element in the sample space.
• Statistic: Any function of the random variables constituting a random sample is called a
statistic.
Normal (Gaussian) distribution: A probability distribution whose curve is bell-shaped. Two
parameters describe a normal distribution: the mean and the standard deviation. The mean is
the average value and also the point where the curve peaks, so values near it are the most
likely to be observed. The standard deviation measures how spread out the values are. As the
standard deviation increases, the normal distribution curve gets wider and flatter.
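The effect of the standard deviation on the curve can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the mean of 0 and the three standard deviations are arbitrary values chosen only for illustration.

```python
from statistics import NormalDist

mu = 0.0                      # assumed mean, for illustration only
sigmas = (0.5, 1.0, 2.0)      # assumed standard deviations

peaks = []
for sigma in sigmas:
    dist = NormalDist(mu, sigma)
    peak = dist.pdf(mu)                               # height of the curve at the mean
    within_one = dist.cdf(mu + 1) - dist.cdf(mu - 1)  # P(mu - 1 < X < mu + 1)
    peaks.append(peak)
    print(f"sigma = {sigma}: peak height = {peak:.4f}, P(|X - mu| < 1) = {within_one:.4f}")
```

A larger standard deviation gives a lower peak and less probability close to the mean, which is what "the curve gets wider" means.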
1. Population parameters are fixed numbers whose values are usually unknown.
2. Sample statistics are known values for any given sample, but they vary from sample to
sample, even when the samples are taken from the same population.
• In fact, any two independently drawn samples are unlikely to produce identical
values of a sample statistic.
• In other words, the variability of sample statistics is always present and must be
accounted for in any inferential procedure.
• This variability is called sampling variation.
Note:
A sample statistic is a random variable and, like any other random variable, it has a
probability distribution.
• Using the values of $\bar{X}$ and $S^2$ computed from different random samples of a population, we make
inferences about the parameters $\mu$ and $\sigma^2$ (of the population).
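Sampling variation is easy to see in a small simulation. The sketch below draws five independent samples from one normal population and prints the sample mean and sample standard deviation of each. The population parameters (`MU`, `SIGMA`), the sample size, and the seed are assumptions made purely for illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)                 # fixed seed so the run is repeatable

MU, SIGMA = 50.0, 10.0         # hypothetical population parameters
n = 25                         # hypothetical sample size

results = []
for _ in range(5):
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    results.append((mean(sample), stdev(sample)))

for x_bar, s in results:
    print(f"sample mean = {x_bar:.2f}, sample standard deviation = {s:.2f}")
```

The parameters stay fixed throughout, yet each sample gives a different value of the statistics.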
Example 7.2: Consider the following small population consisting of N=6 patients
who recently underwent total hip replacement. Three months after surgery they rated
their pain-free function on a scale of 0 to 100 (0=severely limited and painful
functioning to 100=completely pain free functioning). The data are shown below and
ordered from smallest to largest.
Pain-Free Function Ratings in a Small Population of N=6 Patients:
25, 50, 80, 85, 90, 100
Suppose we did not have the population data and instead we were estimating the mean functioning
score in the population based on a sample of n=4. The table below shows all possible samples of size
n=4 from the population of N=6. The rightmost column shows the sample mean based on the 4
observations contained in that sample.
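The full list of samples and their means can be regenerated from the six ratings. The sketch below assumes sampling without replacement, with order ignored, which gives 15 possible samples of size 4.

```python
from itertools import combinations

# Pain-free function ratings for the population of N = 6 patients.
population = [25, 50, 80, 85, 90, 100]
n = 4

# Every possible sample of size n, without replacement, order ignored.
samples = list(combinations(population, n))
sample_means = [sum(s) / n for s in samples]

for s, m in zip(samples, sample_means):
    print(s, m)

population_mean = sum(population) / len(population)
mean_of_sample_means = sum(sample_means) / len(sample_means)
print("number of samples:", len(samples))
print("population mean:", population_mean)
print("mean of the sample means:", mean_of_sample_means)
```

The individual sample means differ from one another, but their average equals the population mean.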
This can be established further with the famous central limit theorem, which is
stated below.
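If $\bar{X}$ is the mean of a random sample of size $n$ taken from a population having the
mean $\mu$ and the finite variance $\sigma^2$, then
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
is a random variable whose distribution function approaches that of the standard normal
distribution as $n \to \infty$.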
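The theorem can be illustrated by simulation. The sketch below draws repeated samples from an exponential population (mean 1, variance 1, strongly skewed), standardizes each sample mean, and checks that the standardized values behave roughly like a standard normal variable. The population, sample size, number of replications, and seed are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(1)

MU, SIGMA = 1.0, 1.0           # mean and standard deviation of an exponential(1) population
n, replications = 50, 5000     # hypothetical sample size and number of repeated samples

z_values = []
for _ in range(replications):
    sample = [random.expovariate(1.0) for _ in range(n)]
    x_bar = sum(sample) / n
    z_values.append((x_bar - MU) / (SIGMA / sqrt(n)))

# For a standard normal variable, about 95% of values fall within +/- 1.96.
within_196 = sum(abs(z) <= 1.96 for z in z_values) / replications
print("mean of Z:", mean(z_values))
print("standard deviation of Z:", pstdev(z_values))
print("fraction with |Z| <= 1.96:", within_196)
```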
• Probability questions about a sample mean can be addressed with the Central Limit Theorem, as long as
the sample size is sufficiently large.
• In this case n = 40, so the sample mean is likely to be approximately normally distributed, and we can
compute the probability that the mean HDL exceeds 60 by using the standard normal distribution table.
• The population mean is 54. What is the probability that the sample mean will be greater than 60?
Solution:
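$\bar{x} = 60$, $\mu = 54$, $\sigma = 17$, $n = 40$.
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{60 - 54}{17 / \sqrt{40}} \approx 2.23$
Therefore, the probability that the mean HDL in these 40 patients will exceed 60 is approximately 1.3%.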
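The same calculation can be reproduced without a table. The sketch below uses the values from the solution and computes the upper-tail probability of the standard normal distribution with `math.erfc`. A table lookup may differ slightly in the last digit because z is rounded before the lookup.

```python
from math import erfc, sqrt

x_bar, mu, sigma, n = 60, 54, 17, 40

standard_error = sigma / sqrt(n)        # standard deviation of the sample mean
z = (x_bar - mu) / standard_error       # standardized value of the sample mean

# Upper-tail probability P(Z > z) for a standard normal variable.
p_exceed = 0.5 * erfc(z / sqrt(2))

print(f"standard error = {standard_error:.3f}")
print(f"z = {z:.2f}")
print(f"P(sample mean > 60) = {p_exceed:.4f}")
```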
One very important application of the Central Limit Theorem is the determination of
reasonable values of the population mean and variance.
[Figure: panels labeled n = 1, n = small to moderate, n = large]
4. In such a situation, the only measure of the standard deviation available may be the sample
standard deviation $S$.
5. It is natural then to substitute $S$ for $\sigma$. The problem is that the resulting statistic is not
normally distributed!
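A short simulation shows why the substitution matters. The sketch below repeatedly samples from a normal population and standardizes the sample mean twice, once with the true standard deviation and once with the sample standard deviation. It then counts how often each statistic falls outside ±1.96. The sample size of 5, the population parameters, and the seed are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import stdev

random.seed(2)

MU, SIGMA = 0.0, 1.0            # hypothetical normal population
n, replications = 5, 20000      # small sample size, where the effect is most visible

exceed_with_sigma = 0
exceed_with_s = 0
for _ in range(replications):
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    x_bar = sum(sample) / n
    z = (x_bar - MU) / (SIGMA / sqrt(n))          # uses the true sigma
    t = (x_bar - MU) / (stdev(sample) / sqrt(n))  # uses the sample standard deviation
    exceed_with_sigma += abs(z) > 1.96
    exceed_with_s += abs(t) > 1.96

frac_z = exceed_with_sigma / replications
frac_t = exceed_with_s / replications
print("fraction beyond 1.96 using sigma:", frac_z)
print("fraction beyond 1.96 using s:    ", frac_t)
```

With the true standard deviation the fraction stays near 5%, as the normal distribution predicts. With the sample standard deviation it is clearly larger, because the statistic has heavier tails.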
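If $\bar{X}$ is the mean of a random sample of size $n$ taken from a normal population having
the mean $\mu$ and the variance $\sigma^2$, then
$T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$
is a random variable having the $t$ distribution with the parameter $\nu = n - 1$ degrees of freedom.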
Solution:
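$\bar{x} = 10.63$, $\mu = 12.40$, $s = 2.48$, $n = 20$.
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{10.63 - 12.40}{2.48 / \sqrt{20}} = -3.19$
is a value of a random variable having the $t$ distribution with the parameter $\nu = n - 1 = 19$ degrees of
freedom. From the t-distribution table, $t_{0.005} = 2.861$ for $\nu = 19$. Since $-3.19 < -2.861$, the
probability of obtaining a value this small is less than 0.005.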
Since the probability is very small, we conclude that the data refute the manufacturer’s
claim. In all likelihood, the mean blowing time of his fuses with a 20% overload is less than
12.40 minutes.
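The arithmetic of this solution can be checked with a few lines. The sketch below computes the t statistic from the values in the solution and compares it with the tabulated critical value for 19 degrees of freedom. The critical value 2.861 is copied from a standard t table rather than computed.

```python
from math import sqrt

x_bar, mu, s, n = 10.63, 12.40, 2.48, 20

t = (x_bar - mu) / (s / sqrt(n))   # t statistic for the sample mean
df = n - 1                         # degrees of freedom

T_CRITICAL_0005 = 2.861            # t_{0.005} for 19 degrees of freedom, from a t table

print(f"t = {t:.2f} with {df} degrees of freedom")
print("beyond the 0.005 critical value:", t < -T_CRITICAL_0005)
```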
The F Distribution
• The F distribution finds enormous applications in comparing sample variances.
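If $S_1^2$ and $S_2^2$ are the variances of independent random samples of size $n_1$ and $n_2$, respectively,
taken from two normal populations having the same variance, then
$F = \frac{S_1^2}{S_2^2}$
is a random variable having the F distribution with the parameters $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$.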
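Therefore, if we assume that we have a sample of size $n_1$ from a normal population with variance
$\sigma_1^2$ and an independent sample of size $n_2$ from another normal population with variance $\sigma_2^2$, then the
statistic
$F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} = \frac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2}$
has the F distribution with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ degrees of freedom.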
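Let the chi-square variables $\chi_1^2$, with $\nu_1$ degrees of freedom, and $\chi_2^2$, with $\nu_2$ degrees of
freedom, be independent. Then
$F = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2}$
has the F distribution with $\nu_1$ and $\nu_2$ degrees of freedom.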
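The ratio of two independent chi-square variables, each divided by its degrees of freedom, can be simulated directly. The sketch below builds each chi-square variable as a sum of squared standard normal draws and compares the average of the simulated ratios with the known mean of an F distribution, $\nu_2 / (\nu_2 - 2)$ for $\nu_2 > 2$. The degrees of freedom, number of replications, and seed are assumptions for illustration.

```python
import random

random.seed(3)

def chi_square(df):
    """Draw one chi-square value as a sum of df squared standard normal draws."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))

nu1, nu2 = 5, 10          # hypothetical degrees of freedom
replications = 20000

f_values = [
    (chi_square(nu1) / nu1) / (chi_square(nu2) / nu2)
    for _ in range(replications)
]

mean_f = sum(f_values) / replications
theoretical_mean = nu2 / (nu2 - 2)   # mean of F(nu1, nu2) when nu2 > 2

print("simulated mean of F:", mean_f)
print("theoretical mean:   ", theoretical_mean)
```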