
Sampling and Sampling Distributions

Dr. Kuldeep Lamba


IMI, New Delhi

Authors: Levine, Szabat, Stephan and Viswanathan


Chapter GS, Slide 1
CASE: Identifying Fraudulent Transactions
The Risk Management Analytics team of ICICI Bank needs to
review and report the fraud loss rate of credit card transactions
every quarter, owing to the increase in disputed credit card
charges and in the fraud loss rate. To start with, the team has to
estimate the loss rate for Q3 2019. Management thinks
the fraud loss rate has gone up by 2 percent, which is
substantially high compared with the industry standard.

Fraudulent transactions could be of the following types:

- ‘Mail Order Fraud’
- ‘Counterfeit Fraud’
- ‘Lost or Stolen’
- ‘Internet Fraud’
- ‘Others’

The team needs to report the overall fraud rate and the fraud rate in each fraud
category in two days. The team cannot compromise on the quality of the
report!
Chapter GS, Slide 2
Fraudulent Transactions

- The number of transactions received per month is ~100 million.

- Based on customers’ complaints, the fraud analytics team flags the fraud categories mentioned earlier and stores them in a variable named “Fraud_Types”.

DISCUSS: HOW SHOULD THE TEAM PROCEED?



Chapter GS, Slide 3
Fraudulent Transactions
- Step 1: Should they work with the POPULATION or a SAMPLE?

- Step 2: Should they calculate the mean fraud loss rate from the population or from a sample?

- Step 3: If they work with a SAMPLE, there should be no ‘Selection Bias’.

- Step 4: What sampling technique should be used?

- Step 5: What should be the right sample size?



Chapter GS, Slide 4
Population and Sample

Population: Make generalizations about the characteristics of a population...

Sample: ...on the basis of observations of a sample, a part of a population.



Chapter GS, Slide 5
Sample: Unbiased, Biased

[Figure: two panels labelled "Unbiased Sample" and "Biased Sample", each showing the population DELHI divided into CONG/BJP and AAP.]

Unbiased sample: a representative sample drawn at random from the entire population.
Chapter GS, Slide 6
Population and Sample

An element/record is the entity on which data are collected.

A population is a collection of all the elements of interest.

A sample is a subset of the population.

The sampled population is the population from which the sample is drawn.
Chapter GS, Slide 7
Sample Selection

A random sample (also called a probability sample) of size ‘n’ from a population of size ‘N’ (finite or infinite) is a sample selected such that

- Every unit/element of the population has the same probability of being included in the sample.
- Each unit is selected independently of the others.

Four Random Sampling Methods:

- Simple Random Sample
- Stratified Random Sample
- Cluster (or Area) Sample
- Systematic Random Sample
Chapter GS, Slide 8
Random Sampling

- In random sampling all the items in the population have a chance of being chosen in the sample.

Random Sampling methods:

- Simple Random Sampling
- Systematic Sampling
- Stratified Sampling
- Cluster Sampling
Chapter GS, Slide 9
Random Sample Selection
Replacing each sampled element before selecting subsequent elements is called sampling with replacement.

Sampling without replacement is the procedure used most often.

For a finite population, a ‘Random Number Table’ (of five digits) can be used to generate the sample.

In large sampling projects, computer-generated random numbers are often used to automate the sample selection process.
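The difference between the two procedures can be sketched in Python. This is a minimal illustration rather than part of the original slides: the frame of 100 numbered records, the seed and the sample size of 30 are assumptions chosen for the example, and only the standard `random` module is used.

```python
import random

# Hypothetical population frame: record IDs 1..100 (an assumption for this sketch).
population = list(range(1, 101))
rng = random.Random(42)  # fixed seed so the sketch is repeatable

# With replacement: each drawn ID goes back into the frame, so IDs may repeat.
with_replacement = rng.choices(population, k=30)

# Without replacement: an ID can be selected at most once.
without_replacement = rng.sample(population, k=30)

print(len(set(with_replacement)), "distinct IDs when sampling with replacement")
print(len(set(without_replacement)), "distinct IDs when sampling without replacement")
```

Sampling without replacement always yields 30 distinct IDs here, whereas sampling with replacement may yield fewer.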
Chapter GS, Slide 10
Selection of a Random Sample
(finite population)

Suppose one would like to select a simple random sample of 30 metropolitan areas to do an in-depth study of the standard of living. A population of 100 metropolitan areas in the US and Canada and their overall ratings are given. SELECT a RANDOM SAMPLE of 30 metro cities.

- Using a Random Number Table
- Using Excel or statistical software



Chapter GS, Slide 11
Selection of a Random Sample
using Random Number Table
- Mark/number the population frame from 1 to 100.

- We need 30 random drawings.

  - In the Random Number Table, proceed along either a row or a column, reading 3-digit numbers.
  - Ignore any number > 100 (outside the range of the population frame).

- Assume we start by rows. Start with the first row, move to the second, and continue until we get 30 random numbers.

- Assume we get the following random numbers:

  155, 906, 929, 830, 850, 895, 105, 667, 641, 034, and so on.

- Ignore the first 9 numbers, since each is greater than 100; 034 selects the unit numbered 34.

- Continue till you get 30 random numbers.

- Then select the cities with the corresponding serial numbers.
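The table-reading steps above can be mimicked in code. The sketch below is an illustration under stated assumptions: a pseudo-random generator stands in for the printed table, `draw_from_number_table` is a hypothetical helper name, and repeated numbers are skipped on the assumption that sampling is without replacement (the slide does not say).

```python
import random

def draw_from_number_table(frame_size, sample_size, rng):
    """Mimic reading 3-digit entries from a random number table."""
    selected = []
    while len(selected) < sample_size:
        number = rng.randint(0, 999)          # one 3-digit entry, 000 to 999
        if number < 1 or number > frame_size:
            continue                          # outside the population frame: ignore
        if number in selected:
            continue                          # assumption: skip repeats
        selected.append(number)
    return selected

# The ten entries quoted on the slide: only 034 falls inside the frame 1..100.
table_row = [155, 906, 929, 830, 850, 895, 105, 667, 641, 34]
usable = [x for x in table_row if 1 <= x <= 100]
print(usable)  # [34]

rng = random.Random(2019)  # arbitrary seed for repeatability
serial_numbers = draw_from_number_table(frame_size=100, sample_size=30, rng=rng)
print(sorted(serial_numbers))
```

Because roughly nine out of ten 3-digit entries exceed 100, most entries are discarded, which matches the slide's example where the first nine numbers are ignored.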


Chapter GS, Slide 12
Selection of a Random Sample using
Excel
=INDEX(array, row_num, col_num)

- array is the population of interest.

- row_num can be generated either by using =RANDBETWEEN(1,N), N being the number of rows in the population (it can be determined with the =COUNTA function),

- or by using the =RANK function in Excel. To use =RANK, we first assign a random number between 0 and 1 to each member of the population using the =RAND() function and then rank each of these random numbers with respect to the entire population of random numbers.

- col_num is the column number within the selected array. If the selected population has just 1 column, col_num will always be 1; for arrays with multiple columns, pick the column number that contains the data of interest.
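For readers working outside Excel, the =RAND() / =RANK / =INDEX route can be approximated in Python. This is a rough analogue, not the spreadsheet itself: the `city_…` labels, the seed and the sample size of 30 are placeholders assumed for the sketch.

```python
import random

rng = random.Random(7)  # arbitrary seed for repeatability

# Hypothetical one-column population of 100 records (labels are placeholders).
population = [f"city_{i}" for i in range(1, 101)]

# Analogue of =RAND(): one random number in [0, 1) per member.
keys = [rng.random() for _ in population]

# Analogue of =RANK(): rank of each key among all keys (1 = largest key).
order = sorted(range(len(population)), key=lambda i: keys[i], reverse=True)
ranks = [0] * len(population)
for rank, i in enumerate(order, start=1):
    ranks[i] = rank

# Analogue of =INDEX(array, row_num, 1): use the ranks of the first 30 rows
# as row numbers into the population.
row_nums = ranks[:30]
sample = [population[r - 1] for r in row_nums]
print(sample[:5])
```

Since the ranks form a random permutation of 1..N, the first 30 of them never repeat, so this route samples without replacement. Row numbers drawn independently with =RANDBETWEEN(1,N) can repeat.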
Chapter GS, Slide 13
Stratified Random Sampling
The population is first divided into groups of elements called strata. Each element in the population belongs to one and only one stratum. A simple random sample is taken from each stratum.

Best results are obtained when the elements within each stratum are as much alike as possible (i.e. a homogeneous group).

Population: Total Fraudulent Transactions

- Stratum 1: Mail Order Fraud
- Stratum 2: Counterfeit Fraud
- Stratum 3: Lost/Stolen
- Stratum 4: Internet Fraud
- Stratum 5: Others

Elements in each stratum are homogeneous in nature. Draw a simple random sample of size n from each stratum, and then estimate the fraud loss rate.
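A stratified draw along these lines might look as follows in Python. The sketch is illustrative only: the 10,000 synthetic transactions, the helper name `stratified_sample` and the per-stratum size of 50 are assumptions, and the fraud labels are generated at random rather than taken from any real data.

```python
import random
from collections import defaultdict

FRAUD_TYPES = ["Mail Order Fraud", "Counterfeit Fraud", "Lost or Stolen",
               "Internet Fraud", "Others"]

rng = random.Random(1)  # arbitrary seed for repeatability

# Hypothetical flagged transactions: (transaction_id, Fraud_Types value).
transactions = [(i, rng.choice(FRAUD_TYPES)) for i in range(10_000)]

def stratified_sample(records, n_per_stratum, rng):
    """Group records by fraud type, then draw a simple random sample per group."""
    strata = defaultdict(list)
    for record in records:
        strata[record[1]].append(record)      # each record joins exactly one stratum
    return {name: rng.sample(members, min(n_per_stratum, len(members)))
            for name, members in strata.items()}

sample = stratified_sample(transactions, n_per_stratum=50, rng=rng)
for name, members in sample.items():
    print(name, len(members))
```

Drawing the same n from every stratum follows the slide; allocating n in proportion to stratum size is a common alternative that this sketch does not cover.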
Chapter GS, Slide 14
Cluster Sampling
The population is first divided into separate groups of elements called clusters. Ideally, each cluster is a representative small-scale version of the population (i.e. a heterogeneous group).

A simple random sample of the clusters is then taken.

Population: Total Fraudulent Transactions

- Cluster 1 to Cluster 5, each holding 20% of the population and each containing Mail Order Fraud, Counterfeit Fraud, Lost/Stolen, Internet Fraud and Other types

A primary application is area sampling, where clusters are city blocks or other well-defined areas.
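Cluster sampling can be sketched in the same spirit. Everything specific here is assumed for illustration: 10,000 synthetic transaction IDs, five equal clusters of 20% each (as in the slide's figure), and a draw of two clusters.

```python
import random

rng = random.Random(3)  # arbitrary seed for repeatability

# Hypothetical: 10,000 transaction IDs, shuffled so each cluster is a mixed group.
ids = list(range(10_000))
rng.shuffle(ids)

# Five clusters of 2,000 IDs each (20% of the population per cluster).
clusters = {f"Cluster {k + 1}": ids[k * 2000:(k + 1) * 2000] for k in range(5)}

# Simple random sample of the clusters themselves, not of individual records.
chosen = rng.sample(sorted(clusters), k=2)

# Every record in a chosen cluster enters the sample.
sample = [tid for name in chosen for tid in clusters[name]]
print(chosen, len(sample))
```

The unit of randomization is the cluster: once a cluster is chosen, all of its records are used.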
Chapter GS, Slide 15
Systematic Sampling
If the population size is N and the sample size is n, then N/n = k, where k is rounded to an integer. We pick a number l at random between 1 and k. We then pick the l-th, (l+k)-th, (l+2k)-th, … elements, and so on till we get a sample of size n. This works when the elements in the population are in random order.

In the metro cities case, to select a sample of 30 from a population of 900 cities, we find 900/30 = 30. Then we pick a number between 1 and 30, say 15. We select the 15th city, then pick every 30th city in the list after it, and continue till we have a sample of 30 metro cities.

In the ICICI Bank case, to select a sample of 1 million transactions out of 100 million transactions, we find 100 million / 1 million = 100. We then pick a number between 1 and 100, say 30. We select the 30th record, then pick every 100th record after it from the record of transactions, and continue till we have selected 1 million transactions.
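The selection rule above translates almost directly into code. In this sketch `systematic_sample` is a hypothetical helper, the interval k is taken as the integer part of N/n (an assumption that keeps every position inside the list), and the 900 numbered cities are placeholders.

```python
import random

def systematic_sample(population, n, rng):
    """Pick a random start l in 1..k, then every k-th element after it."""
    k = len(population) // n                  # sampling interval (integer part of N/n)
    start = rng.randint(1, k)                 # random start l, 1 <= l <= k
    # 1-based positions l, l + k, l + 2k, ... until n elements are picked.
    return [population[start - 1 + j * k] for j in range(n)]

rng = random.Random(5)                        # arbitrary seed for repeatability
cities = list(range(1, 901))                  # hypothetical frame of 900 cities
sample = systematic_sample(cities, 30, rng)
print(sample[:5], "...", sample[-1])
```

The same function would apply to the bank case with N = 100 million and n = 1 million, giving k = 100.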
Chapter GS, Slide 16
Non-random Sampling
- Non-random sampling methods do not give units in the population a known chance of being selected in the sample. The selection procedure is partially subjective.

Non-Random Sampling methods:

- Convenience Sampling
- Judgment Sampling
- Quota Sampling
- Shopping Mall Intercept Sampling
- Snowball Sampling



Chapter GS, Slide 17
Sampling Distributions
- A sampling distribution is a distribution of all of the possible values of a sample statistic for a given sample size selected from a population.

- For example, suppose you sample 50 students from your college and compute their mean GPA. If you obtained many different samples of size 50, you would typically compute a different mean for each sample. We are interested in the distribution of all potential mean GPAs we might calculate for any sample of 50 students.



Chapter GS, Slide 18
Developing a Sampling Distribution

- Assume there is a population …
- Population size N = 4 (individuals A, B, C, D)
- Random variable, X, is the age of individuals
- Values of X: 18, 20, 22, 24 (years)



Chapter GS, Slide 19
Developing a Sampling Distribution
(continued)

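Summary Measures for the Population Distribution:

$\mu = \frac{\sum X_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$

$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = 2.236$

[Figure: Uniform Distribution; P(x) against x, with bars of equal height at x = 18 (A), 20 (B), 22 (C) and 24 (D).]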



Chapter GS, Slide 20
Developing a Sampling Distribution
(continued)

Now consider all possible samples of size n = 2

16 possible samples (sampling with replacement):

              2nd Observation
1st Obs    18      20      22      24
18         18,18   18,20   18,22   18,24
20         20,18   20,20   20,22   20,24
22         22,18   22,20   22,22   22,24
24         24,18   24,20   24,22   24,24

16 Sample Means:

              2nd Observation
1st Obs    18    20    22    24
18         18    19    20    21
20         19    20    21    22
22         20    21    22    23
24         21    22    23    24
Chapter GS, Slide 21
Developing a Sampling Distribution
(continued)

Sampling Distribution of All Sample Means

16 Sample Means:

              2nd Observation
1st Obs    18    20    22    24
18         18    19    20    21
20         19    20    21    22
22         20    21    22    23
24         21    22    23    24

[Figure: Sample Means Distribution; bar chart of $P(\bar{X})$ against $\bar{X}$ = 18, 19, 20, 21, 22, 23, 24 (no longer uniform).]
Chapter GS, Slide 22
Developing A Sampling Distribution
(continued)

Summary Measures of this Sampling Distribution:

$\mu_{\bar{X}} = \frac{18 + 19 + 19 + \cdots + 24}{16} = 21$

$\sigma_{\bar{X}} = \sqrt{\frac{(18 - 21)^2 + (19 - 21)^2 + \cdots + (24 - 21)^2}{16}} = 1.58$

Note: Here we divide by 16 because there are 16 different samples of size 2.
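The whole sampling distribution for this small population can be enumerated by brute force. The following Python sketch reproduces the 16 samples, their means and the two summary measures; it uses only the four ages given in the slides, and the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import sqrt

ages = [18, 20, 22, 24]                       # population values from the slides

# All ordered samples of size 2 drawn with replacement: 4 * 4 = 16 of them.
samples = list(product(ages, repeat=2))
means = [sum(s) / 2 for s in samples]

# Sampling distribution: probability of each possible sample mean.
distribution = {m: Fraction(c, len(means))
                for m, c in sorted(Counter(means).items())}
for m, prob in distribution.items():
    print(m, prob)

# Summary measures, dividing by 16 because there are 16 samples.
mu_xbar = sum(means) / len(means)
sigma_xbar = sqrt(sum((m - mu_xbar) ** 2 for m in means) / len(means))
print(mu_xbar, round(sigma_xbar, 2))          # 21.0 1.58

# Population standard deviation, for comparison with sigma / sqrt(n).
sigma = sqrt(sum((x - 21) ** 2 for x in ages) / len(ages))
print(round(sigma / sqrt(2), 2))              # 1.58
```

The last line shows that the enumerated value agrees with σ divided by the square root of the sample size.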



Chapter GS, Slide 23
Comparing the Population Distribution
to the Sample Means Distribution

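Population (N = 4): $\mu = 21$, $\sigma = 2.236$

Sample Means Distribution (n = 2): $\mu_{\bar{X}} = 21$, $\sigma_{\bar{X}} = 1.58$

[Figure: left panel, P(X) against X = 18 (A), 20 (B), 22 (C), 24 (D); right panel, $P(\bar{X})$ against $\bar{X}$ = 18, 19, 20, 21, 22, 23, 24.]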
Chapter GS, Slide 24
Sample Mean Sampling Distribution:
Standard Error of the Mean
- Different samples of the same size from the same population will yield different sample means
- A measure of the variability in the mean from sample to sample is given by the Standard Error of the Mean (this assumes that sampling is with replacement, or that sampling is without replacement from an infinite population):

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

- Note that the standard error of the mean decreases as the sample size increases
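A small simulation can make the shrinking standard error visible. The sketch below is a hedged illustration: the normal population with μ = 368 and σ = 15, the sample sizes, the seed and the 2,000 repetitions are all assumptions chosen for the example.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

mu, sigma = 368, 15          # illustrative population parameters (assumed)
rng = random.Random(0)       # arbitrary seed for repeatability

simulated = {}
for n in (4, 25, 100):
    # Draw 2,000 samples of size n and keep each sample mean.
    means = [fmean([rng.gauss(mu, sigma) for _ in range(n)])
             for _ in range(2000)]
    simulated[n] = pstdev(means)             # spread of the simulated means
    print(n, standard_error(sigma, n), round(simulated[n], 2))
```

The formula gives 7.5, 3.0 and 1.5 for n = 4, 25 and 100, and the simulated spreads should land close to those values.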
Chapter GS, Slide 25
Sample Mean Sampling Distribution:
If the Population is Normal

- If a population is normal with mean μ and standard deviation σ, the sampling distribution of $\bar{X}$ is also normally distributed with

$\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$



Chapter GS, Slide 26
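Sampling Distribution of $\bar{X}$
- The standard deviation of $\bar{X}$ for a finite population (without replacement) is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$
- The standard deviation of $\bar{X}$ for a finite population (with replacement) is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
- A finite population is treated as being infinite if $n/N \le 0.05$
- $\sqrt{\frac{N - n}{N - 1}}$ is the finite population multiplier (fpm)
- $\sigma_{\bar{X}}$ is referred to as the standard error of the mean.
- For infinite populations, fpm may be ignored as it becomes approximately 1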
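The effect of the finite population multiplier can be checked numerically. This sketch assumes the standard form of the multiplier, sqrt((N - n)/(N - 1)); the helper name `standard_error` and its optional `N` argument are hypothetical, and the four ages come from the earlier slides.

```python
from itertools import combinations
from math import sqrt

def standard_error(sigma, n, N=None):
    """sigma / sqrt(n), multiplied by the fpm when a population size N is given."""
    se = sigma / sqrt(n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))         # finite population multiplier
    return se

ages = [18, 20, 22, 24]
sigma = sqrt(sum((x - 21) ** 2 for x in ages) / len(ages))

# Brute-force check: the 6 samples of size 2 drawn without replacement.
means = [sum(s) / 2 for s in combinations(ages, 2)]
brute = sqrt(sum((m - 21) ** 2 for m in means) / len(means))
print(round(standard_error(sigma, 2, N=4), 3), round(brute, 3))   # 1.291 1.291

# Bank case: 1 million records out of 100 million, so n/N = 0.01.
fpm_bank = sqrt((100_000_000 - 1_000_000) / (100_000_000 - 1))
print(round(fpm_bank, 4))
```

With n/N = 0.01 the multiplier is already very close to 1, which is why it is usually ignored for large populations.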



Chapter GS, Slide 27
Z-value for Sampling Distribution
of the Mean
- Z-value for the sampling distribution of $\bar{X}$:

$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

where: $\bar{X}$ = sample mean
μ = population mean
σ = population standard deviation
n = sample size



Chapter GS, Slide 28
Sampling Distribution Properties

$\mu_{\bar{x}} = \mu$ (i.e. $\bar{x}$ is unbiased)

[Figure: Normal Population Distribution centred at μ, above a Normal Sampling Distribution centred at $\mu_{\bar{x}}$ (has the same mean).]
Chapter GS, Slide 29
Sampling Distribution Properties
(continued)

As n increases, $\sigma_{\bar{x}}$ decreases

[Figure: sampling distributions of $\bar{x}$ for a larger sample size and a smaller sample size, both centred at μ.]
Chapter GS, Slide 30
Determining An Interval Including A
Fixed Proportion of the Sample Means

Find a symmetrically distributed interval around µ that will include 95% of the sample means when µ = 368, σ = 15, and n = 25.
- Since the interval contains 95% of the sample means, 5% of the sample means will be outside the interval.
- Since the interval is symmetric, 2.5% will be above the upper limit and 2.5% will be below the lower limit.
- From the standardized normal table, the Z score with 2.5% (0.0250) below it is -1.96 and the Z score with 2.5% (0.0250) above it is 1.96.



Chapter GS, Slide 31
Determining An Interval Including A
Fixed Proportion of the Sample Means

- Calculating the lower limit of the interval

$\bar{X}_L = \mu + Z \frac{\sigma}{\sqrt{n}} = 368 + (-1.96) \frac{15}{\sqrt{25}} = 362.12$

- Calculating the upper limit of the interval

$\bar{X}_U = \mu + Z \frac{\sigma}{\sqrt{n}} = 368 + (1.96) \frac{15}{\sqrt{25}} = 373.88$

- 95% of all sample means of sample size 25 are between 362.12 and 373.88
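The same limits can be reproduced without a printed table. This sketch uses `statistics.NormalDist` from the Python standard library to look up the Z score; the variable names are illustrative, and the inputs are the slide's values.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 368, 15, 25

# Z score with 2.5% of the standard normal distribution above it (about 1.96).
z = NormalDist().inv_cdf(0.975)

se = sigma / sqrt(n)                          # standard error of the mean = 3.0
lower = mu - z * se
upper = mu + z * se
print(round(z, 2), round(lower, 2), round(upper, 2))   # 1.96 362.12 373.88
```

Changing 0.975 to, say, 0.995 would give the interval that contains 99% of the sample means.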



Chapter GS, Slide 32
Sample Mean Sampling Distribution:
If the Population is not Normal

- We can apply the Central Limit Theorem:
  - Even if the population is not normal,
  - …sample means from the population will be approximately normal as long as the sample size is large enough.

Properties of the sampling distribution:

$\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Chapter GS, Slide 33
Central Limit Theorem

As the sample size gets large enough (n↑)… the sampling distribution of the sample mean becomes almost normal regardless of the shape of the population.

[Figure: sampling distribution of $\bar{x}$.]
Chapter GS, Slide 34
Sample Mean Sampling Distribution:
If the Population is not Normal
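Sampling distribution properties:

- Central Tendency: $\mu_{\bar{x}} = \mu$
- Variation: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

[Figure: Population Distribution with mean μ, above the Sampling Distribution (becomes normal as n increases) for a larger and a smaller sample size, centred at $\mu_{\bar{x}}$.]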
Chapter GS, Slide 35
How Large is Large Enough?

- For most distributions, n > 30 will give a sampling distribution that is nearly normal
- For fairly symmetric distributions, n > 15 is usually sufficient
- For a normal population distribution, the sampling distribution of the mean is always normally distributed
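One way to see the rule of thumb at work is to simulate sample means from a clearly non-normal population and watch the skewness fade. The sketch below is an illustration under assumptions: an exponential population with mean 1 stands in for "not normal", skewness is used as a rough measure of departure from normality, and the seed and 5,000 repetitions are arbitrary.

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    """Population skewness: mean of the cubed standardized values."""
    m, s = fmean(values), pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

rng = random.Random(11)      # arbitrary seed for repeatability

def sample_means(n, reps=5000):
    """Means of `reps` samples of size n from an exponential population."""
    return [fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

# A normal distribution has skewness 0; the exponential population has skewness 2.
skew_by_n = {n: skewness(sample_means(n)) for n in (2, 30)}
for n, skew in skew_by_n.items():
    print(n, round(skew, 2))
```

The skewness of the sample means drops sharply between n = 2 and n = 30, although it does not vanish entirely for such a skewed population.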



Chapter GS, Slide 36
Example

- Suppose a population has mean μ = 8 and standard deviation σ = 3. Suppose a random sample of size n = 36 is selected.

- What is the probability that the sample mean is between 7.8 and 8.2?



Chapter GS, Slide 37
Example
(continued)

Solution:
- Even if the population is not normally distributed, the central limit theorem can be used (n > 30)
- … so the sampling distribution of $\bar{x}$ is approximately normal
- … with mean $\mu_{\bar{x}} = 8$
- … and standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{36}} = 0.5$
Chapter GS, Slide 38
Example
(continued)
Solution (continued):

$P(7.8 < \bar{X} < 8.2) = P\left(\frac{7.8 - 8}{3 / \sqrt{36}} < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < \frac{8.2 - 8}{3 / \sqrt{36}}\right) = P(-0.4 < Z < 0.4) = 0.6554 - 0.3446 = 0.3108$

[Figure: Population Distribution (shape unknown, μ = 8) → Sample → Sampling Distribution ($\mu_{\bar{X}} = 8$, area between 7.8 and 8.2) → Standardize → Standard Normal Distribution ($\mu_Z = 0$, area between -0.4 and 0.4).]
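The probability in this example can be checked with the Python standard library. The sketch treats the sampling distribution as normal with mean 8 and standard error 0.5, as the central limit theorem suggests; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 8, 3, 36

# Sampling distribution of the mean: normal with mean mu and s.d. sigma / sqrt(n).
sampling_dist = NormalDist(mu, sigma / sqrt(n))

prob = sampling_dist.cdf(8.2) - sampling_dist.cdf(7.8)
print(round(prob, 4))        # 0.3108

# Equivalent calculation on the standardized scale, P(-0.4 < Z < 0.4).
z_prob = NormalDist().cdf(0.4) - NormalDist().cdf(-0.4)
print(round(z_prob, 4))      # 0.3108
```

Both routes give the same answer because standardizing only rescales the horizontal axis.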


Chapter GS, Slide 39
Population Proportions

p = the proportion of the population having some characteristic
- Sample proportion $\hat{p}$ provides an estimate of p:

$\hat{p} = \frac{x}{n} = \frac{\text{number of items in the sample having the characteristic of interest}}{\text{sample size}}$

- $0 \le \hat{p} \le 1$
- $\hat{p}$ is approximately distributed as a normal distribution when n is large (assuming sampling with replacement from a finite population or without replacement from an infinite population)



Chapter GS, Slide 40
Sampling Distribution of p

- Approximated by a normal distribution if:

$np \ge 5$ and $n(1 - p) \ge 5$

where

$\mu_{\hat{p}} = p$ and $\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$

(where p = population proportion)

[Figure: Sampling Distribution; $P(\hat{p})$ against $\hat{p}$ from 0 to 1.]
Chapter GS, Slide 41
Z-Value for Proportions
Standardize $\hat{p}$ to a Z value with the formula:

$Z = \frac{\hat{p} - p}{\sigma_{\hat{p}}} = \frac{\hat{p} - p}{\sqrt{\frac{p(1 - p)}{n}}}$



Chapter GS, Slide 42
Example

- If the true proportion of voters who support Proposition A is p = 0.4, what is the probability that a sample of size 200 yields a sample proportion between 0.40 and 0.45?

- i.e.: if p = 0.4 and n = 200, what is $P(0.40 \le \hat{p} \le 0.45)$?



Chapter GS, Slide 43
Example
(continued)
- if p = 0.4 and n = 200, what is $P(0.40 \le \hat{p} \le 0.45)$?

Find $\sigma_{\hat{p}}$:

$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.4(1 - 0.4)}{200}} = 0.03464$

Convert to standardized normal:

$P(0.40 \le \hat{p} \le 0.45) = P\left(\frac{0.40 - 0.40}{0.03464} \le Z \le \frac{0.45 - 0.40}{0.03464}\right) = P(0 \le Z \le 1.44)$



Chapter GS, Slide 44
Example
(continued)
- if p = 0.4 and n = 200, what is $P(0.40 \le \hat{p} \le 0.45)$?

Utilize the cumulative normal table:

P(0 ≤ Z ≤ 1.44) = 0.9251 – 0.5000 = 0.4251

[Figure: Sampling Distribution with the area between $\hat{p}$ = 0.40 and 0.45 → Standardize → Standardized Normal Distribution with the area 0.4251 between Z = 0 and 1.44.]
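The proportion example can be reproduced step by step in Python. This is a sketch with illustrative variable names; it checks the np ≥ 5 and n(1 - p) ≥ 5 conditions first and then rounds Z to two decimals to match what a printed table would give.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.4, 200

# Conditions for the normal approximation: both counts should be at least 5.
conditions_ok = n * p >= 5 and n * (1 - p) >= 5

se = sqrt(p * (1 - p) / n)                    # standard error of p-hat
z_upper = (0.45 - p) / se                     # upper limit on the Z scale
print(conditions_ok, round(se, 5), round(z_upper, 2))   # True 0.03464 1.44

# Table-style answer, with Z rounded to 1.44 as on the slide.
prob_table = NormalDist().cdf(1.44) - 0.5
print(round(prob_table, 4))                   # 0.4251

# Same probability without rounding Z first; it differs slightly.
prob = NormalDist().cdf(z_upper) - 0.5
```

The unrounded Z gives a probability a few ten-thousandths above 0.4251, a reminder that table lookups carry a small rounding error.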
Chapter GS, Slide 45
