
Sp25 Module 06 Sampling

Module 6 of ISOM2500 Business Statistics focuses on estimation and sampling distributions, covering topics such as sampling methods, the central limit theorem, and how to infer population parameters from sample statistics. It emphasizes the importance of representative sampling and introduces the concept of sampling distributions, particularly for the sample mean, and the conditions under which they approximate normality. The module also discusses the implications of sampling distributions for estimating population parameters and the behavior of various statistics.

Uploaded by

smstky

Module 6: Estimation and Sampling Distribution
ISOM2500 BUSINESS STATISTICS
(L1 & L2, Spring 2025)
Jason HO
Contents
Samples and surveys
Sampling distribution of the sample mean
Central limit theorem
Sampling distribution of any statistic

2
Journey in this course
Descriptive statistics
• Module 2: Graphical Tools, Numerical Tools
Building blocks of theory of statistics
• Module 3: Probability
• Modules 4 & 5: Random Variables (discrete or continuous; jointly distributed)
Inferential statistics (from now on)
• Module 6: Estimation (Sampling Distribution)
• Module 7: Confidence Interval
• Module 8: Hypothesis Testing
• Modules 9 & 10: Simple Linear Regression

3
Inferential statistics: a general set-up with 4 components
1. Population
2. Parameter(s): to describe a characteristic of a population in answering some questions in reality
3. Sample
4. Statistic(s): to estimate the parameter
Find and compute the statistic(s) from the sample, then infer the population using the statistic(s)

4
Focus of this module
[Figure: the same 4-component diagram as on Slide 4 (population, parameter(s), sample, statistic(s))]

5
A RV/distribution as a Probability Model
One of the major reasons for introducing some common RVs in Module 4 is that some of them are commonly assumed/used as probability models (with unknown parameter(s)) for the population from which real data are observed, e.g.,
1. a Bernoulli RV for binary data (i.e., a categorical variable with 2 levels or, equivalently, a population which consists of only 2 nominal values)
2. a normal RV for most continuous data
6
SAMPLES AND SURVEYS

7
Samples and surveys
A survey gathers information about a subgroup of entities (i.e., a sample) that belongs to a much larger group of entities (i.e., the population), providing the necessary ingredients – THE DATA – for parameter estimation

8
Example 1: use of surveys in daily lives
• When an election is approaching, there are nonstop reports and news about the latest opinion poll
• The foreman of a warehouse will not accept a shipment of electronic components unless virtually all the components in the shipment operate correctly
• A retailer wants to know the market share of a brand before deciding to stock the items on its shelves
• Managers in the human resources department determine the salary for new employees based on wages paid around the country

9
Data quality: garbage in, garbage out
Samples that distort the population (e.g., one that systematically omits a portion of the population) are said to have sampling bias
A sample that presents a “good” snapshot of the population (i.e., showing/preserving systematic patterns of the population) is said to be representative
Intuitively, to assure a representative sample, we must pick entities/“members” of the population at random (but this alone is NOT sufficient), by simple random sampling, systematic sampling, stratified sampling, cluster sampling and many others
10
Simple random sampling – gold standard
A sampling procedure in which every sample of size n from the population is equally likely produces a simple random sample (SRS)
Methods that give every entity in the population an equal chance to be in the sample do NOT necessarily produce a representative sample
• A company has an equal # of employees with and without university education
• Flip a coin; if it lands heads, select 100 employees who are U grads; otherwise, if it lands tails, select 100 who had no university education
• Every sample contains employees of the same education level – hardly representative!
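The coin-flip scheme can be compared with an SRS in a small simulation. This is only an illustrative sketch: the company size (500 graduates and 500 non-graduates), the function names and the seed are assumptions made here, not part of the slide.

```python
import random

# Assumed company: 500 university graduates ("U") and 500 non-graduates ("N").
population = ["U"] * 500 + ["N"] * 500


def simple_random_sample(pop, n, rng):
    """Draw an SRS: every subset of size n is equally likely."""
    return rng.sample(pop, n)


def coin_flip_sample(pop, n, rng):
    """Flip a fair coin, then take all n employees from one education group.

    Each employee still has the same chance (n / len(pop)) of being selected,
    but a sample mixing both groups can never occur.
    """
    group = "U" if rng.random() < 0.5 else "N"
    members = [e for e in pop if e == group]
    return rng.sample(members, n)


def share_of_graduates(sample):
    """Proportion of university graduates in a sample."""
    return sample.count("U") / len(sample)


rng = random.Random(2500)  # arbitrary seed, for reproducibility only
srs_shares = [share_of_graduates(simple_random_sample(population, 100, rng))
              for _ in range(1000)]
coin_shares = [share_of_graduates(coin_flip_sample(population, 100, rng))
               for _ in range(1000)]

print("SRS: graduate share ranges from", min(srs_shares), "to", max(srs_shares))
print("Coin flip: graduate shares observed:", sorted(set(coin_shares)))
```

Under the SRS the graduate share stays near the population value of 0.5, while the coin-flip scheme only ever produces 0 or 1.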
11
Sample statistic/measure as a statistic
A parameter is an unknown value, i.e., a number we don’t know
To estimate a population parameter (denoted by $\theta$), an intuitive way of selecting a statistic is to use its sample counterpart, e.g.,
• use the sample mean ($\bar{X}$) to estimate a population mean ($\mu$)
• use the sample SD ($s$) to estimate a population SD ($\sigma$)
Parameter in population ($\theta$) → Statistic from data ($\hat{\theta}$): the observed value of $\hat{\theta}$ is an estimate, or point estimate, of $\theta$
12
Sampling variation/sample-to-sample variation
Sampling a small portion from a large population results in sampling variation/sample-to-sample variation
Every time, only n values ($x_1, x_2, \ldots, x_n$) are observed in a sample
Next time, when we select another sample from the population, we will most likely get n completely different values
[Figure: SRSs of size n. This time: n particular values; next time: another n values; ...... The n values differ from time to time (never the same), and a lot more values or information in the population is missing!]
13
Any statistic is a RV
[Figure: each SRS of size n is linked to a sample value of a certain statistic. This time: n particular values give 1.5; next time: another n values give 2; ......; another SRS gives 100]
The probability of each value depends on the probability of getting the corresponding SRS
All possible values of the statistic come with their associated probabilities
14
A point estimate ≠ value of $\theta$
In reality, we ONLY see 1 sample of size n & get 1 value of the statistic, say, 1.5
No way that a point estimate 1.5 equals $\theta$! Are they close?
Necessary to understand how the statistic behaves as a RV (i.e., the probability distribution of a statistic)
15
SAMPLING DISTRIBUTION OF THE SAMPLE MEAN

16
Population: an (underlying) probability model/distribution X or f(x)
Most statistical methods are developed by often, if not always, assuming an underlying probability distribution X or f(x) (or p(x)) for the population of interest:
• The population comprises infinitely many values (as categories or numbers)
• The red smooth curve for the histogram of all these values from a population of continuous values mimics the probability distribution f(x) of a RV, say, X
[Figure: histogram of the infinitely many population values with a red smooth curve, the underlying probability model: a probability distribution f(x) or a random variable X]
17
Data: iid samples from X
Assume that the data arise as a representative sample of size n from the population (with an underlying probability distribution f(x))
• Each data value is an independent realization from the underlying probability model X (i.e., like a random draw with replacement from all values/figures which constitute the histogram in Slide 17)
• The data are modeled as RVs $X_1, X_2, \ldots, X_n$, which are independent and identically distributed (iid) samples/draws from X or f(x), written as $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} f(x)$

18
Sample mean and its sampling distribution
(Ch.4: existing RVs → sum up/average → new RV)
From now on, let’s confine our statistic of interest to the sample mean (i.e., the parameter of interest is a population mean $\mu$), defined by
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n}$
• As all $X_i$’s are RVs, the sample mean $\bar{X}$ is a RV, with its probability distribution especially called the sampling distribution
The sampling distribution of the sample mean is the distribution of the sample mean computed from a sample of size n. In theory, it can be obtained from ALL possible samples of size n from the population through repeated sampling (illustrated in the next Slide)
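Repeated sampling can be imitated on a computer with a finite number of repetitions M. The sketch below is illustrative only: the Uniform(0, 1) population, the sample size n = 25, the seed and the function names are assumptions chosen here, not taken from the slides.

```python
import math
import random
import statistics


def approximate_sampling_distribution(draw, n, M, rng):
    """Return M sample means, each computed from a fresh iid sample of size n."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(M)]


def draw_uniform(rng):
    # Assumed population: Uniform(0, 1), with mean 0.5 and variance 1/12.
    return rng.random()


rng = random.Random(6)  # arbitrary seed, for reproducibility only
n, M = 25, 10_000
sample_means = approximate_sampling_distribution(draw_uniform, n, M, rng)

# The M values vary from sample to sample; their spread is much smaller
# than the spread of the population itself.
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
population_sd = math.sqrt(1 / 12)

print(f"average of the {M} sample means: {mean_of_means:.4f}")
print(f"SD of the sample means:          {sd_of_means:.4f}")
print(f"population SD / sqrt(n):         {population_sd / math.sqrt(n):.4f}")
```

A histogram of `sample_means` would serve as the approximate sampling distribution of the sample mean for this assumed population.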
19
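In theory:
[Figure: ALL possible samples of size n from the population, each giving a value of the sample mean (existing RVs → sum up/average → new RV)]
20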
In practice:
[Figure: repeated sampling with # of repetitions M (say, M = 10,000); the histogram of the M sample means approximates the sampling distribution (existing RVs → sum up/average → new RV)]
21
Example 2: sampling distribution
Rice Virtual Lab in Statistics:
https://onlinestatbook.com/stat_sim/sampling_dist/index.html

22
Example 2 (cont’d)
[Figure: the simulation applet, with 3 numbered panels]
1. Population under study = distribution of RVs being averaged
2. Samples from the population (set M) = values of RVs being averaged
3. Approximate sampling distribution
Sample size = # of summed/averaged RVs/values
As the sample size increases, the distribution of $\bar{X}$ becomes closer to “bell” shape and narrower
23
Example 3: sampling distribution of the sample mean

24
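Normality of the sample mean from normal population
When the population is normal, $X \sim N(\mu, \sigma^2)$, the sample mean is always normally distributed for all sample sizes (n = 1,2,3,…): $\bar{X} \sim N(\mu, \sigma^2/n)$
[Figure: sampling distributions of $\bar{X}$ from a normal population; annotation: same]
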
25
CENTRAL LIMIT THEOREM

26
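Central limit theorem (for other populations)
For a random sample of size n from a population with mean $\mu$ and variance $\sigma^2$ (both finite), the sample mean $\bar{X}$ is approximately normal when n is large (≥30):
$\bar{X} \overset{\text{approx}}{\sim} N(\mu, \sigma^2/n)$
Irrespective of the population and n, $E(\bar{X}) = \mu$ and $\mathrm{Var}(\bar{X}) = \sigma^2/n$
The sample sum: $\sum_{i=1}^{n} X_i \overset{\text{approx}}{\sim} N(n\mu, n\sigma^2)$, with $E\left(\sum_{i=1}^{n} X_i\right) = n\mu$ and $\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = n\sigma^2$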
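A short derivation, added here as a sketch, of why the mean and variance of the sample mean hold irrespective of the population and n. It assumes only that the draws, written $X_1, \ldots, X_n$, are iid with finite mean $\mu$ and variance $\sigma^2$; the variance step uses independence.

```latex
\begin{align*}
E(\bar{X}) &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
            = \frac{1}{n}\sum_{i=1}^{n} E(X_i)
            = \frac{1}{n}\cdot n\mu = \mu, \\
\mathrm{Var}(\bar{X}) &= \frac{1}{n^2}\,\mathrm{Var}\!\left(\sum_{i=1}^{n} X_i\right)
            = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i)
            = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}.
\end{align*}
```

The same two steps without the factor $1/n$ give mean $n\mu$ and variance $n\sigma^2$ for the sample sum. Only the normal shape is an approximation that needs a large n.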
27
Standardized sample means for non-normal populations
The standardized sample mean: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \overset{\text{approx}}{\sim} N(0, 1)$ when n ≥30

28
Importance & takeaway from CLT (when n is large, ≥30)
1. Justifies that the best way to estimate a population mean $\mu$ is to use the sample mean $\bar{X}$
─ Centers around $\mu$
─ Smaller variation as n increases
─ Approximately bell-shaped
2. Provides a relatively easy way to approximately compute probabilities of averages (or sums) of RVs
3. Explains the fact that many real data are distributed like a bell-shaped curve
29
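Example 4: estimating a population mean
All weight losses: the population, with mean $\mu$
The sample mean weight loss $\bar{X}$ estimates $\mu$
By CLT, $\bar{X} \overset{\text{approx}}{\sim} N(\mu, \sigma^2/n)$ when n is large (≥30)
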
30
Example 5: CLT
A recent report stated that the day-care cost per week in a region is $109. Suppose this figure is taken as the mean cost per week and that the standard deviation is known to be $20
Find the probability that a sample of 50 day-care centers would show a mean cost of $105 or less per week
Weekly cost in any day-care center in the region is the RV under consideration, but the required probability is related to the costs of NOT only 1 center but 50 of them
Treat the weekly costs in all day-care centers as the population, denoted by X; then the mean of a sample of 50 costs from X is of interest
31
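Example 5 (cont’d)
The population X, with a distribution unknown to us, may not be normal
As the sample size n = 50 is large (≥30), the CLT applies to give the mean cost per week of 50 day-care centers: $\bar{X} \overset{\text{approx}}{\sim} N(109, 20^2/50)$
The required probability is
$P(\bar{X} \le 105) \approx P\left(Z \le \frac{105 - 109}{20/\sqrt{50}}\right) = P(Z \le -1.41) \approx 0.079$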

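The CLT calculation for Example 5 can be checked numerically. This is a minimal sketch using only the figures stated in the example (mean 109, SD 20, n = 50, cutoff 105); the helper `normal_cdf` and the variable names are introduced here for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu, sigma, n = 109.0, 20.0, 50          # figures from Example 5
standard_error = sigma / math.sqrt(n)   # SD of the sample mean
z = (105.0 - mu) / standard_error       # standardized cutoff
probability = normal_cdf(z)             # P(sample mean <= 105), by CLT

print(f"standard error = {standard_error:.3f}")
print(f"z = {z:.2f}")
print(f"P(sample mean <= 105) is about {probability:.3f}")
```

The standardized cutoff is about -1.41 and the probability is about 0.079, so a sample mean of $105 or less would be fairly unusual if the report's figures are right.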
32
The sample proportion from a binary population is the sample mean
(An analogy: $\bar{X}$ estimates $\mu$; $\hat{p}$ estimates $p$)
Consider a binary population, from which there are only 2 kinds of values/outcomes (e.g., myopia or not, sibling(s) or not)
• Parameter of interest: proportion p of 1 outcome (say, ‘success’)
Such a population is modelled by X, a Bernoulli RV with probability of success p (Recall: $E(X) = p$, so p is the population mean)
Based on a SRS of size n from the binary population, the sample proportion, defined by
$\hat{p} = \frac{\text{number of successes out of } n}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \bar{X}$,
is the sample mean, used to estimate the population proportion p
33


Sampling distribution of the sample proportion
(An analogy: set $\mu = p$ and $\sigma^2 = p(1-p)$ in the results for $\bar{X}$)
An exception to the CLT result at Slide 27:
For a SRS of size n from a binary population with proportion/mean p for success, the sampling distribution of the sample proportion $\hat{p}$
• Has mean p
• Has variance $p(1-p)/n$
• Is approximately normal when np, n(1-p) ≥ 10
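The mean and variance of the sample proportion follow from treating each observation as a Bernoulli RV. The steps below are a sketch, assuming each $X_i$ takes the value 1 (success) with probability $p$ and 0 otherwise, and that the sample proportion $\hat{p}$ is the average of the $X_i$'s.

```latex
\begin{align*}
E(X_i) &= 1\cdot p + 0\cdot(1-p) = p, \\
\mathrm{Var}(X_i) &= E(X_i^2) - [E(X_i)]^2 = p - p^2 = p(1-p), \\
E(\hat{p}) &= E(\bar{X}) = p, \qquad
\mathrm{Var}(\hat{p}) = \mathrm{Var}(\bar{X}) = \frac{p(1-p)}{n}.
\end{align*}
```

The last line is the general result for a sample mean (mean $\mu$, variance $\sigma^2/n$) with $\mu = p$ and $\sigma^2 = p(1-p)$.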

34
Example 6: sample proportion
• Coke bottles are filled by a machine so that the contents X have a normal distribution with mean 298ml and SD 3ml
What is the proportion of bottles with less than 295ml?
• Let X be the content (in ml) of any coke bottle, then $X \sim N(298, 3^2)$
• The required proportion of bottles is equivalent to $P(X < 295) = P\left(Z < \frac{295 - 298}{3}\right) = P(Z < -1) = 0.1587$
• What if we have a carton of 100 bottles of coke?
What is the chance of >15% of bottles with less than 295ml?

35
Example 6 (cont’d)
What is the chance of >15% of bottles with less than 295ml in a carton of 100 bottles?
Understand the last question:
1. Among all bottles of coke produced by the machine, the proportion of bottles with <295ml is 15.87%. Call bottles with <295ml “success”
2. With a carton of 100 bottles of coke, it means that we have sampled/selected 100 bottles from this binary population. We want to study, among these 100 sampled bottles, the proportion of bottles with <295ml (i.e., “success”)
Such a proportion is the sample proportion $\hat{p}$ from a sample of size 100
[Figure: binary population of bottles: <295ml (success) vs ≥295ml]
36
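Example 6 (cont’d 2)
What is the chance of >15% of bottles with less than 295ml in a carton of 100 bottles?
• This proportion is a RV as it depends on which 100 bottles we have selected (i.e., sampling variation)
• In particular, we want to know the probability that this RV is greater than 15%
• The required probability is given by $P(\hat{p} > 0.15)$
• By CLT, since np = 100 x 0.1587 > 10, and n(1-p) > 10,
$\hat{p} \overset{\text{approx}}{\sim} N\left(0.1587, \frac{0.1587 \times 0.8413}{100}\right)$, so $P(\hat{p} > 0.15) \approx P\left(Z > \frac{0.15 - 0.1587}{\sqrt{0.1587 \times 0.8413/100}}\right) = P(Z > -0.24) \approx 0.59$
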
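Both steps of Example 6 can be reproduced numerically. The sketch below uses only the figures in the example (mean 298ml, SD 3ml, cutoff 295ml, a carton of 100 bottles, threshold 15%); `normal_cdf` and the variable names are helpers introduced here.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Step 1: proportion of ALL bottles with less than 295 ml, X ~ N(298, 3^2).
p = normal_cdf((295 - 298) / 3)          # P(Z < -1)

# Step 2: sample proportion in a carton of n = 100 bottles.
n = 100
conditions_ok = n * p >= 10 and n * (1 - p) >= 10   # np, n(1-p) >= 10
standard_error = math.sqrt(p * (1 - p) / n)          # SD of the sample proportion
z = (0.15 - p) / standard_error
prob_more_than_15pct = 1 - normal_cdf(z)              # P(sample proportion > 0.15)

print(f"p = {p:.4f}")
print(f"normal approximation conditions met: {conditions_ok}")
print(f"z = {z:.2f}")
print(f"P(sample proportion > 0.15) is about {prob_more_than_15pct:.2f}")
```

The population proportion comes out at about 0.1587 and the chance that more than 15% of the carton is under-filled is about 0.59.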
37
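Student’s t statistic: replacing an unknown $\sigma$ with s
When the population SD $\sigma$ is unknown (being more realistic), the standardized sample mean (Slide 28) with $\sigma$ replaced by s is
$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$
1. when n ≥ 30, $T \overset{\text{approx}}{\sim} N(0, 1)$
2. when n < 30 and the population is normal, $T \sim t_{n-1}$ (Student’s t distribution with n − 1 degrees of freedom)
─ 3 conditions to be satisfied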
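Computing the statistic itself only needs the sample mean, the sample SD and n. This is a minimal sketch: the four data values and the value 3.0 used for the population mean are made up for illustration, and `t_statistic` is a name introduced here.

```python
import math
import statistics


def t_statistic(data, mu):
    """Standardized sample mean with the unknown population SD replaced by s.

    s is the sample SD (divisor n - 1), as computed by statistics.stdev.
    """
    n = len(data)
    sample_mean = statistics.fmean(data)
    s = statistics.stdev(data)
    return (sample_mean - mu) / (s / math.sqrt(n))


sample = [2.0, 4.0, 4.0, 6.0]   # hypothetical data, n = 4
t = t_statistic(sample, mu=3.0)  # hypothetical population mean

print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"sample SD s = {statistics.stdev(sample):.4f}")
print(f"t statistic = {t:.4f}")
```

With n = 4 < 30, the value would be compared with a t distribution only if the population can be assumed normal.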
38
SAMPLING DISTRIBUTION OF ANY STATISTIC

39
Are other statistics approximately normal?
Every statistic has a sampling distribution but, for statistics other than the sample mean (e.g., sample variance, sample median and so on), the distribution may be very different from being bell-shaped
Construction of an approximate sampling distribution for ANY statistic (valid for all sample sizes and all populations; see Slide 22):
1. take repeated samples of the same size from the population
2. construct a histogram for the computed values of the statistic over many (a large M) repeated samples
The next Example looks at the sample maximum
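The two-step construction works for any function of the sample. The sketch below applies it to the sample median; the exponential population with mean 1, the sample size 15, M = 5,000, the seed and the function names are all assumptions made for illustration.

```python
import random
import statistics
from collections import Counter


def repeated_sampling(statistic, draw_sample, M, rng):
    """Step 1: take M repeated samples and compute the statistic on each."""
    return [statistic(draw_sample(rng)) for _ in range(M)]


def draw_exponential_sample(rng, n=15):
    # Assumed population: exponential with mean 1 (right-skewed).
    return [rng.expovariate(1.0) for _ in range(n)]


def text_histogram(values, bin_width=0.1, scale=25):
    """Step 2: a crude text histogram (one '#' per `scale` values in a bin)."""
    counts = Counter(int(v / bin_width) for v in values)
    for b in sorted(counts):
        print(f"{b * bin_width:4.1f} | " + "#" * (counts[b] // scale))


rng = random.Random(40)  # arbitrary seed, for reproducibility only
medians = repeated_sampling(statistics.median, draw_exponential_sample,
                            M=5000, rng=rng)
text_histogram(medians)
print("average of the sample medians:", round(statistics.fmean(medians), 3))
```

Passing `max`, `statistics.variance` or any other function in place of `statistics.median` gives the approximate sampling distribution of that statistic instead.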
40
Example 7: winning the lottery by betting on birthdays
Consider a lottery game of selecting 5 #s from integers 1 to 39:
• Grand prize won if all 5 #s match
• Bet on 5 #s from the birthdays (day of the month) of 5 family members
• No chance to win if the highest number drawn is 32 to 39
What is the probability of such a scenario?
The statistic of interest is the sample maximum: H = highest of 5 #’s randomly selected without replacement from 1 to 39
• e.g., if the 5 selected #s are 3, 12, 22, 26, 28, then H = 28; or, if the 5 selected #s are 3, 12, 22, 26, 37, then H = 37
41
Example 7 (cont’d)
[Figure: values of H for 1,560 (= M) games, with the values for which it is possible to win marked and their chances summed]
The approximate sampling distribution of H is far from being normal
Highest # over 31 occurred in 72% (≈1-28.46%) of the games
• No grand prize 72% of the time once the bet is made!
Unsurprisingly, the most common value of H is 39 (in 13.5% of games)
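For this particular statistic the sampling distribution can also be worked out exactly by counting, which gives a check on the 72% and 13.5% observed in the 1,560 games. The sketch below assumes all sets of 5 numbers from 1 to 39 are equally likely; the function name, the simulation size and the seed are choices made here for illustration.

```python
import math
import random


def prob_highest_at_most(h, k=5, N=39):
    """P(H <= h): all k drawn numbers fall within 1..h."""
    return math.comb(h, k) / math.comb(N, k)


# Exact probabilities under equally likely draws.
p_possible_to_win = prob_highest_at_most(31)        # H <= 31
p_no_chance = 1 - p_possible_to_win                 # H is 32 to 39
p_h_equals_39 = prob_highest_at_most(39) - prob_highest_at_most(38)

# Repeated sampling, as an approximate alternative to exact counting.
rng = random.Random(7)  # arbitrary seed, for reproducibility only
M = 20_000
highest = [max(rng.sample(range(1, 40), 5)) for _ in range(M)]
simulated_no_chance = sum(h >= 32 for h in highest) / M

print(f"exact P(H >= 32)     = {p_no_chance:.4f}")
print(f"simulated P(H >= 32) = {simulated_no_chance:.4f}")
print(f"exact P(H = 39)      = {p_h_equals_39:.4f}")
```

The exact values are about 70.5% for H of 32 or more and 5/39, about 12.8%, for H = 39, close to the figures observed in the games.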
42
Summary: estimation process
Parameter of interest → Statistic
• Mean of a (numerical) population X → Sample mean (to be continued in the next slide)
• Parameter other than the mean → Statistic other than the sample mean: the sample counterpart
Refer to Slides 33-34 for a binary population
43
Summary: Statistic → Sampling distribution of the statistic
• Sample mean: [Flowchart: the branches depend on whether X is normal, whether $\sigma$ is known and whether the sample size n ≥ 30, leading to Slide 25, Slide 27 or Slide 38]
• Statistic other than the sample mean: an approximate sampling distribution constructed via repeated sampling (Slide 40)
44
Takeaway
Sampling variation; repeated sampling
Any sample statistic is a RV
Use of sample counterpart as parameter estimate
Sampling distribution of the sample mean
Central Limit Theorem
Sampling distribution of the sample proportion
Sampling distribution of other sample statistics

45
