Introduction To Data Analytics: Sampling Distributions

This document discusses sampling distributions and the central limit theorem. It defines key concepts like population, sample, random variable, and statistics. It explains that a sample statistic has a probability distribution called a sampling distribution. The central limit theorem states that as sample size increases, the sampling distribution of the sample mean will approach a normal distribution, regardless of the shape of the population distribution. Examples are provided to illustrate sampling distributions and how the central limit theorem can be applied.

INTRODUCTION TO

DATA ANALYTICS

Class #9
Sampling Distributions

Dr. Sreeja S R
Assistant Professor
Indian Institute of Information Technology
IIIT Sri City
IIITS: IDA - M2021 1
IN THIS PRESENTATION…
•  Basic concept of sampling distribution

• Usage of sampling distributions

• Issue with sampling distributions

• Central limit theorem

• Application of Central limit theorem

• Major sampling distributions

• $\chi^2$ distribution

• t-distribution

• F distribution
Introduction
In a statistical inference task, we usually follow these steps:

• Data collection
• Collect a sample from the population.

• Statistics
• Compute a statistic from the sample.

• Statistical inference
• From the statistic, infer the values of population parameters.
• For example, infer the population mean from the sample mean.

Basic terminologies
Some basic terms closely associated with the above-mentioned tasks are reproduced below.

• Population: A population consists of the totality of the observations with which we are
concerned.

• Sample: A sample is a subset of a population.

• Random variable: A random variable is a function that associates a real number with each
element in the sample space.

• Statistic: Any function of the random variables constituting a random sample is called a
statistic.

• Statistical inference: An analysis concerned with generalization and prediction from a sample to the population.


Basic terminologies
Probability distribution: A function that gives the probabilities of the possible outcomes of an
experiment.

Normal (Gaussian) distribution: A probability distribution with a symmetric, bell-shaped curve.
It is described by two parameters, the mean and the standard deviation. The mean is the
average value, and it is also where the curve peaks, so values near the mean are the most
likely to be observed. The standard deviation measures how spread out the values are. As the
standard deviation increases, the normal curve gets wider.

Statistical Inference
There are two facts that are key to statistical inference.

1. Population parameters are fixed numbers whose values are usually unknown.
2. Sample statistics are known values for any given sample, but they vary from sample to
sample, even when the samples are taken from the same population.
• In fact, it is unlikely that any two samples drawn independently will produce identical
values of a sample statistic.
• In other words, the variability of sample statistics is always present and must be
accounted for in any inferential procedure.
• This variability is called sampling variation.

Note:
A sample statistic is a random variable and, like any other random variable, it has a
probability distribution.
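
Sampling variation can be seen directly in a small simulation. The sketch below assumes a hypothetical normal population (mean 50, standard deviation 10, values chosen only for illustration) and draws five independent samples of size 25. The sample means differ from one another even though every sample comes from the same population.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population parameters, chosen only for illustration.
POP_MEAN, POP_SD, SAMPLE_SIZE = 50.0, 10.0, 25

def draw_sample_mean():
    """Draw one random sample and return its mean (a sample statistic)."""
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    return statistics.mean(sample)

sample_means = [draw_sample_mean() for _ in range(5)]
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: mean = {m:.2f}")
```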

Sampling Distribution
• More precisely, sampling distributions are probability distributions used to describe
the variability of sample statistics.

Definition 7.1: Sampling distribution

The sampling distribution of a statistic is the probability distribution of that
statistic.

• The probability distribution of the sample mean (hereafter denoted $\bar{X}$) is called
the sampling distribution of the mean (also referred to as the distribution of the sample
mean).

• Likewise, we speak of the sampling distribution of the variance (denoted $S^2$).

• Using the values of $\bar{X}$ and $S^2$ for different random samples of a population, we make
inferences about the parameters $\mu$ and $\sigma^2$ (of the population).

Sampling Distribution
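Example 7.1:
Consider five identical balls numbered and weighing 1, 2, 3, 4 and 5. Consider an experiment consisting of drawing two
balls, replacing the first before drawing the second, and then computing the mean of the values of the two balls.
The following table lists all possible samples and their means.

Sample   Mean     Sample   Mean     Sample   Mean
[1,1]    1.0      [2,4]    3.0      [4,2]    3.0
[1,2]    1.5      [2,5]    3.5      [4,3]    3.5
[1,3]    2.0      [3,1]    2.0      [4,4]    4.0
[1,4]    2.5      [3,2]    2.5      [4,5]    4.5
[1,5]    3.0      [3,3]    3.0      [5,1]    3.0
[2,1]    1.5      [3,4]    3.5      [5,2]    3.5
[2,2]    2.0      [3,5]    4.0      [5,3]    4.0
[2,3]    2.5      [4,1]    2.5      [5,4]    4.5
                                    [5,5]    5.0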

Sampling Distribution
[Figure: sampling distribution of means; the horizontal axis shows the sample mean from 1.0 to 5.0 in steps of 0.5]
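
The sampling distribution in Example 7.1 can be tabulated by brute force. The sketch below assumes the balls carry the values 1 to 5 (consistent with sample means ranging from 1.0 to 5.0) and that each of the 25 ordered draws with replacement is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

balls = [1, 2, 3, 4, 5]  # assumed ball values

# All ordered samples of size 2 drawn with replacement (5 * 5 = 25).
samples = list(product(balls, repeat=2))

# Count how often each sample mean occurs.
counts = Counter(Fraction(a + b, 2) for a, b in samples)

# Probability of each mean = frequency / number of samples.
sampling_dist = {
    float(m): Fraction(c, len(samples)) for m, c in sorted(counts.items())
}

for sample_mean, prob in sampling_dist.items():
    print(f"mean = {sample_mean:.1f}  probability = {prob}")
```

The probabilities rise from 1/25 at the extremes (1.0 and 5.0) to 5/25 at the central value 3.0, so the distribution is symmetric about the population mean.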

Issues with Sampling Distribution
1. In practical situations with a large population, it is infeasible to enumerate all
possible samples, and hence to obtain the exact probability distribution of a sample statistic.

2. The sampling distribution of a statistic depends on

• the size of the population

• the size of the samples and

• the method of choosing the samples.

Theorem on Sampling Distribution
Famous theorem in Statistics

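Theorem 7.1: Sampling distribution of mean and variance

The sampling distribution of the mean $\bar{X}$ of a random sample of size $n$ drawn from a population
with mean $\mu$ and variance $\sigma^2$ will have mean $\mu_{\bar{X}} = \mu$ and variance $\sigma^2_{\bar{X}} = \sigma^2 / n$.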

Example 7.2: Consider the following small population consisting of N=6 patients
who recently underwent total hip replacement. Three months after surgery they rated
their pain-free function on a scale of 0 to 100 (0=severely limited and painful
functioning to 100=completely pain free functioning). The data are shown below and
ordered from smallest to largest.
Pain-Free Function Ratings in a Small Population of N=6 Patients:
25, 50, 80, 85, 90, 100

Example 7.2: 25, 50, 80, 85, 90, 100. For the population, $\mu = 430/6 \approx 71.7$.

Suppose we did not have the population data and instead were estimating the mean functioning
score in the population based on a sample of n=4. The table below shows all possible samples of size
n=4 from the population of N=6. The rightmost column shows the sample mean based on the 4
observations contained in that sample.

Sample   Observations in the Sample (n=4)   Mean
1        25   50   80   85                  60.0
2        25   50   80   90                  61.3
3        25   50   80   100                 63.8
4        25   50   85   90                  62.5
5        25   50   85   100                 65.0
6        25   50   90   100                 66.3
7        25   80   85   90                  70.0
8        25   80   85   100                 72.5
9        25   80   90   100                 73.8
10       25   85   90   100                 75.0
11       50   80   85   90                  76.3
12       50   80   85   100                 78.8
13       50   80   90   100                 80.0
14       50   85   90   100                 81.3
15       80   85   90   100                 88.8

From the table, the mean of the 15 sample means is $\mu_{\bar{X}} \approx 71.7$, the same as the population mean.
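
Example 7.2 can be checked by enumerating the samples. The sketch below confirms that the mean of the 15 sample means equals the population mean. Because these samples are drawn without replacement from a small finite population, the variance of the sample means is compared against $\sigma^2/n$ multiplied by the finite population correction $(N-n)/(N-1)$. That correction is a standard result not stated in the slides and is included here as an assumption about how the example should be read.

```python
from itertools import combinations
from statistics import mean, pvariance

population = [25, 50, 80, 85, 90, 100]
N, n = len(population), 4

mu = mean(population)           # population mean
sigma2 = pvariance(population)  # population variance (divisor N)

# All C(6, 4) = 15 samples drawn without replacement.
sample_means = [mean(s) for s in combinations(population, n)]

mean_of_means = mean(sample_means)
var_of_means = pvariance(sample_means)

# sigma^2 / n with the finite population correction (assumption, see lead-in).
expected_var = (sigma2 / n) * (N - n) / (N - 1)

print(f"mu = {mu:.2f}, mean of sample means = {mean_of_means:.2f}")
print(f"variance of sample means = {var_of_means:.2f}, expected = {expected_var:.2f}")
```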

Central Limit Theorem
• Theorem 7.1 is a remarkable result. In fact, it can also be verified that if we sample
from a population with unknown distribution, the sampling distribution of $\bar{X}$ will still be
approximately normal with mean $\mu$ and variance $\sigma^2/n$, provided that the sample size is large.

This can be established with the famous "central limit theorem", which is
stated below.

Theorem 7.3: Central Limit Theorem

If $\bar{X}$ is the mean of a random sample of size $n$ taken from a population having the
mean $\mu$ and the finite variance $\sigma^2$, then

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

is a random variable whose distribution function approaches that of the standard
normal distribution as $n \to \infty$.
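
The theorem can be illustrated by simulation. The sketch below assumes an exponential population with rate 1 (so the mean and standard deviation are both 1), which is strongly skewed, and measures how often the standardized sample mean falls within one unit of zero. For a standard normal variable that probability is about 0.6827, so the simulated fraction should move toward that value as the sample size grows. The population, sample sizes and trial count are illustrative choices, not part of the lecture.

```python
import math
import random
import statistics

random.seed(7)

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
MU, SIGMA = 1.0, 1.0

def standardized_mean(n):
    """Return Z = (x_bar - mu) / (sigma / sqrt(n)) for one sample of size n."""
    x_bar = statistics.fmean(random.expovariate(1.0) for _ in range(n))
    return (x_bar - MU) / (SIGMA / math.sqrt(n))

def fraction_within_one(n, trials=5_000):
    """Fraction of standardized sample means falling in [-1, 1]."""
    zs = [standardized_mean(n) for _ in range(trials)]
    return sum(-1.0 <= z <= 1.0 for z in zs) / trials

# P(-1 <= Z <= 1) for a standard normal variable.
target = statistics.NormalDist().cdf(1) - statistics.NormalDist().cdf(-1)

results = {n: fraction_within_one(n) for n in (1, 5, 30, 200)}
for n, frac in results.items():
    print(f"n = {n:3d}: fraction within one unit = {frac:.3f} (normal: {target:.3f})")
```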

Central Limit Theorem
The CLT states that the sampling distribution of the sample mean approaches a normal
distribution as the sample size gets larger, no matter what the shape of the population
distribution (provided its variance is finite). As a rule of thumb, the approximation is
usually good for sample sizes of 30 or more.

Why is it so important to have a normal distribution?

A normal distribution is described entirely by its mean and standard deviation, which are
easy to calculate. If we know the mean and standard deviation of a normal distribution, we
can compute any probability associated with it.

Example for central limit theorem:
Different classes of lipid transport carriers (lipoproteins) can be separated (fractionated) based on their density
and where they layer out when spun in a centrifuge. High density lipoprotein cholesterol (HDL) is
sometimes referred to as the "good cholesterol," because higher concentrations of HDL in blood are
associated with a lower risk of coronary heart disease. In contrast, high concentrations of 
low density lipoprotein cholesterol (LDL) are associated with an increased risk of coronary heart
disease. The illustration on the right outlines how total cholesterol levels are classified in terms of risk,
and how the levels of LDL and HDL fractions provide additional information regarding risk.

Example for central limit theorem:
Data from the Framingham Heart Study found that subjects over age 50 had a mean HDL of 54 and a
standard deviation of 17. Suppose a physician has 40 patients over age 50 and wants to determine the
probability that the mean HDL cholesterol for this sample of 40 patients is 60 mg/dl or more (i.e., low risk).

• Probability questions about a sample mean can be addressed with the Central Limit Theorem, as long as
the sample size is sufficiently large.
• In this case n=40, so the sample mean is likely to be approximately normally distributed, and we can
compute the probability that the sample mean exceeds 60 by using the standard normal distribution table.
• The population mean is 54, but the question is: what is the probability that the sample mean will be >60?

Solution:
$\bar{x} = 60$, $\mu = 54$, $\sigma = 17$, $n = 40$.

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{60 - 54}{17 / \sqrt{40}} \approx 2.23$$

P(Z > 2.23) = 1 - 0.9871 = 0.0129.

Therefore, the probability that the mean HDL in these 40 patients will exceed 60 is about 1.3%.
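
The HDL calculation can be reproduced without a table. The sketch below uses Python's standard-library `statistics.NormalDist` for the standard normal CDF; the variable names are illustrative. Because it keeps the unrounded z value, the result (about 0.013) may differ in the last digit from a table lookup, which rounds z to two decimals.

```python
import math
from statistics import NormalDist

mu, sigma, n = 54.0, 17.0, 40   # population mean, population SD, sample size
x_bar = 60.0                    # threshold for the sample mean

standard_error = sigma / math.sqrt(n)
z = (x_bar - mu) / standard_error

# Upper-tail probability P(Z > z) under the standard normal distribution.
p_upper = 1.0 - NormalDist().cdf(z)

print(f"z = {z:.3f}, P(sample mean > {x_bar:.0f}) = {p_upper:.4f}")
```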

Applicability of Central Limit Theorem
• The normal approximation of $\bar{X}$ will generally be good if $n \geq 30$.
• The sample size $n = 30$ is, hence, a guideline for the central limit theorem.
• The normality of the distribution of $\bar{X}$ becomes more accurate as $n$ grows larger.

[Figure: distribution of $\bar{X}$ for n = 1, n = small to moderate, and n = large]

One very important application of the Central Limit Theorem is the determination of
reasonable values of the population mean and variance.

STANDARD SAMPLING DISTRIBUTIONS

• Apart from the normal distribution, there are some other, quite different sampling
distributions that are referred to extensively in the study of statistical inference.

• $t$: Describes the sampling distribution of the mean when $\sigma$ is unknown.
• $\chi^2$: Describes the distribution of the sample variance.
• F: Describes the distribution of the ratio of two sample variances.

The 𝒕 Distribution
1. To know the sampling distribution of the mean, we make use of the Central Limit Theorem
with $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$.

2. The Central Limit Theorem requires the value of $\sigma$ to be known a priori.

3. However, in many situations, knowledge of $\sigma$ is certainly no more reasonable than
knowledge of the population mean $\mu$.

4. In such situations, the only measure of the standard deviation available may be the sample
standard deviation $S$.

5. It is natural then to substitute $S$ for $\sigma$. The problem is that the resulting statistic is not
normally distributed!

6. The $t$ distribution alleviates this problem. This distribution is called Student's $t$ or simply $t$.


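The 𝒕 Distribution

Definition 7.4: $t$ distribution

If $\bar{X}$ is the mean of a random sample of size $n$ taken from a normal population having
the mean $\mu$ and variance $\sigma^2$, then

$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$$

is a random variable having the $t$ distribution with the parameter $\nu = n - 1$ (degrees of freedom).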

Example for t-distribution:
A manufacturer of fuses claims that with a 20% overload, the fuses will blow in 12.40
minutes on the average. To test this claim, a sample of 20 of the fuses was subjected to a
20% overload, and the time it took them to blow had a mean of 10.63 minutes and a standard
deviation of 2.48 minutes. If it can be assumed that the data constitute a random sample from a
normal population, do they tend to support or refute the manufacturer's claim?

Solution:
$\bar{x} = 10.63$, $\mu = 12.40$, $s = 2.48$, $n = 20$.

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{10.63 - 12.40}{2.48 / \sqrt{20}} \approx -3.19$$

is a value of a random variable having the $t$ distribution with the parameter $\nu = n - 1 = 19$ degrees of
freedom. From the t-distribution table, $t_{0.005} = 2.861$ for $\nu = 19$, so the probability of obtaining
a value of $-3.19$ or less is below 0.005.
Since this probability is very small, we conclude that the data refute the manufacturer's
claim. In all likelihood, the mean blowing time of the fuses with a 20% overload is less than
12.40 minutes.
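
The fuse example can be checked numerically. The sketch below computes the t statistic and then approximates the lower-tail probability by integrating the Student's t density with Simpson's rule. The numerical integration is a choice made here to keep the example dependency-free, not the method used in the lecture, and the helper names `t_pdf` and `t_lower_tail` are hypothetical.

```python
import math

def t_pdf(x, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coef * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_lower_tail(t, nu, steps=10_000):
    """P(T <= t) for t <= 0: Simpson's rule on [t, 0], then use symmetry."""
    h = (0.0 - t) / steps  # steps must be even for Simpson's rule
    total = t_pdf(t, nu) + t_pdf(0.0, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, nu)
    area_t_to_zero = total * h / 3
    return 0.5 - area_t_to_zero

x_bar, mu0, s, n = 10.63, 12.40, 2.48, 20
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_lower = t_lower_tail(t_stat, n - 1)

print(f"t = {t_stat:.2f}, P(T <= t) = {p_lower:.4f}")
```

The computed tail probability falls below 0.005, which agrees with the table-based conclusion that the data refute the claim.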

THE $\chi^2$ DISTRIBUTION
• A common use of the $\chi^2$ distribution is to describe the distribution of the sample
variance.
• It is concerned with the sampling distribution of the sample variance for random
samples from normal populations.

Definition 7.5: $\chi^2$ distribution

If $S^2$ is the variance of a random sample of size $n$ taken from a normal population
having the variance $\sigma^2$, then

$$\chi^2 = \frac{(n - 1) S^2}{\sigma^2}$$

is a random variable having the chi-square distribution with the parameter $\nu = n - 1$.
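The $F$ Distribution
• The $F$ distribution finds enormous application in comparing sample variances.

Definition 7.5: $F$ distribution

If $S_1^2$ and $S_2^2$ are the variances of independent random samples of size $n_1$ and $n_2$, respectively,
taken from two normal populations having the same variance, then

$$F = \frac{S_1^2}{S_2^2}$$

is a random variable having the F distribution with the parameters $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$.

Therefore, if we assume that we have a sample of size $n_1$ from a normal population with variance
$\sigma_1^2$ and an independent sample of size $n_2$ from another normal population with variance $\sigma_2^2$, then the
statistic

$$F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2}$$

has the F distribution with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ degrees of freedom.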

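Representation of $\chi^2$ random variable

Definition 7.6: $\chi^2$ random variable

Let $Z_1, Z_2, \ldots, Z_\nu$ be independent standard normal random variables. Then

$$\chi^2 = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2$$

has a chi-square distribution with $\nu$ degrees of freedom.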
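
This representation translates directly into a sampler. The sketch below builds chi-square draws as sums of squared standard normal values and checks two known facts for 10 degrees of freedom: the mean equals the degrees of freedom, and 18.307 is the tabulated upper 5% point. The function name `chi_square_draw` and the trial count are illustrative.

```python
import random
import statistics

random.seed(3)

def chi_square_draw(nu):
    """One draw built as the sum of nu squared standard normal values."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))

NU, TRIALS = 10, 20_000
draws = [chi_square_draw(NU) for _ in range(TRIALS)]

sim_mean = statistics.fmean(draws)
# 18.307 is the tabulated upper 5% point of chi-square with 10 degrees of freedom.
upper_tail = sum(d > 18.307 for d in draws) / TRIALS

print(f"simulated mean = {sim_mean:.2f} (theory {NU})")
print(f"P(chi-square > 18.307) = {upper_tail:.3f} (table 0.05)")
```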

Representation of 𝒕 random variable

Definition 7.7: $t$ random variable

Let the standard normal $Z$ and $\chi^2$ with $\nu$ degrees of freedom be independent. Then

$$T = \frac{Z}{\sqrt{\chi^2 / \nu}}$$

has a t distribution with $\nu$ degrees of freedom.
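
The same construction gives a sampler for $t$. The sketch below divides a standard normal draw by the square root of an independent chi-square draw over its degrees of freedom, then checks the two-sided tail against the tabulated value $t_{0.025} = 2.093$ for 19 degrees of freedom. Function names and the trial count are illustrative.

```python
import math
import random

random.seed(5)

def chi_square_draw(nu):
    """Sum of nu squared standard normal values."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))

def t_draw(nu):
    """One draw of T = Z / sqrt(chi2 / nu), with Z and chi2 independent."""
    z = random.gauss(0.0, 1.0)
    return z / math.sqrt(chi_square_draw(nu) / nu)

NU, TRIALS = 19, 20_000
draws = [t_draw(NU) for _ in range(TRIALS)]

# 2.093 is the tabulated value of t_0.025 for 19 degrees of freedom,
# so about 5% of draws should fall outside [-2.093, 2.093].
two_sided = sum(abs(t) > 2.093 for t in draws) / TRIALS

print(f"P(|T| > 2.093) = {two_sided:.3f} (table 0.05)")
```

For comparison, a standard normal variable exceeds 2.093 in absolute value with probability of roughly 0.036, which shows the heavier tails of the $t$ distribution.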

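Representation of F random variable

Definition 7.8: $F$ random variable

Let the chi-square variables $\chi_1^2$, with $\nu_1$ degrees of freedom, and $\chi_2^2$, with $\nu_2$ degrees of
freedom, be independent. Then

$$F = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2}$$

has an F distribution with $(\nu_1, \nu_2)$ degrees of freedom.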
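
An F sampler follows from the ratio of two independent chi-square draws, each divided by its degrees of freedom. The sketch below uses illustrative degrees of freedom (5 and 12) and compares the simulated mean with the known mean of the F distribution, $\nu_2 / (\nu_2 - 2)$, which holds when $\nu_2 > 2$.

```python
import random
import statistics

random.seed(9)

def chi_square_draw(nu):
    """Sum of nu squared standard normal values."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))

def f_draw(nu1, nu2):
    """One draw of F = (chi2_1 / nu1) / (chi2_2 / nu2) from independent parts."""
    return (chi_square_draw(nu1) / nu1) / (chi_square_draw(nu2) / nu2)

NU1, NU2, TRIALS = 5, 12, 20_000
draws = [f_draw(NU1, NU2) for _ in range(TRIALS)]

sim_mean = statistics.fmean(draws)
theory_mean = NU2 / (NU2 - 2)  # mean of F, valid when nu2 > 2

print(f"simulated mean = {sim_mean:.3f} (theory {theory_mean:.3f})")
```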

REFERENCE

Detailed material related to this lecture can be found in

Probability and Statistics for Engineers and Scientists (8th Ed.) by
Ronald E. Walpole, Sharon L. Myers, Keying Ye (Pearson), 2013.

Any question?

You may post your question(s) at the “Discussion Forum” maintained in
the course Web page!

QUESTIONS OF THE DAY…

1. What are the degrees of freedom in the following cases?
Case 1: A single number.
Case 2: A list of n numbers.
Case 3: A table of data with m rows and n columns.
Case 4: A data cube with dimension m×n×p.

QUESTIONS OF THE DAY…

2. In the following, two normal sampling distributions are shown
with parameters n, μ and σ (all symbols bear their usual
meanings).

[Figure: two normal curves, labeled $n_1, \mu_1, \sigma_1$ and $n_2, \mu_2, \sigma_2$]

What are the relations among the parameters in the two?

QUESTIONS OF THE DAY…

3. Suppose $\bar{X}$ and $S$ denote the sample mean and standard
deviation of a sample. Assume that the population follows a
normal distribution with population mean $\mu$ and standard
deviation $\sigma$. Write down the expressions for the z and t values
with degree of freedom n.
