
WEEK 7.1 - Discrete Probability Distribution (Poisson)

The document discusses various sampling methods used in business statistics, including simple random sampling, systematic sampling, stratified sampling, cluster sampling, multistage sampling, quota sampling, convenience sampling, and panel sampling. Each method is described in terms of its advantages and disadvantages, as well as its applicability to different research scenarios. The document emphasizes the importance of selecting representative samples to make accurate statistical inferences about a population.

Uploaded by

grace musa

BUS 6225B:

BUSINESS STATISTICS

LECTURE SIX
SAMPLING METHODS AND
SAMPLING DISTRIBUTION
LECTURER: DR GRACE MUSA
CLASS: MBA 2025
Sample and population
 A population: A complete listing or a collection of all
the elements of interest in a statistical study. The main
measures of population characteristics are the population
parameters (such as the mean μ, the standard deviation σ, and proportions).
 A sample is a subset of the population.
 Good or bad samples.
 Representative or non-representative samples. A
researcher hopes to obtain a sample that represents
the population, at least in the variables of interest for
the issue being examined.
 Probabilistic samples are samples selected using the
principles of probability. This may allow a researcher
to determine the sampling distribution of a sample
statistic. If so, the researcher can determine the
probability of any given sampling error and make
statistical inferences about population characteristics.
Why sample?

 Time of researcher and those being surveyed.
 Cost to group or agency commissioning the survey.
 Confidentiality, anonymity, and other ethical issues.
 Non-interference with population. A large sample
could alter the nature of the population, e.g. opinion
surveys.
 Do not destroy population, e.g. crash test only a
small sample of automobiles.
 Cooperation of respondents – individuals, firms,
administrative agencies.
 Partial data is all that is available, e.g. fossils and
historical records, climate change.
SAMPLING METHODS

SIMPLE RANDOM SAMPLING


 Applicable when population is small, homogeneous & readily
available
 All subsets of the frame of a given size are given an equal probability.
Each element of the frame thus has an equal probability of selection.
 It provides for greatest number of possible samples. This is done
by assigning a number to each unit in the sampling frame.
 A table of random numbers or a lottery system is used to determine
which units are to be selected.
SIMPLE RANDOM SAMPLING……..
 Estimates are easy to calculate.
 Simple random sampling is always an equal probability of selection
(EPS) design, but not all EPS designs are simple random sampling.

 Disadvantages
 If sampling frame large, this method impracticable.
 Minority subgroups of interest in population may not be
present in sample in sufficient numbers for study.
REPLACEMENT OF SELECTED UNITS
 Sampling schemes may be without replacement ('WOR' - no
element can be selected more than once in the same sample) or
with replacement ('WR' - an element may appear multiple times in
the one sample).
 For example, if we catch fish, measure them, and immediately
return them to the water before continuing with the sample, this is
a WR design, because we might end up catching and measuring
the same fish more than once. However, if we do not return the
fish to the water (e.g. if we eat the fish), this becomes a WOR
design.
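The difference between the two schemes can be sketched in a few lines of Python. This is only an illustration: the pond of ten labelled fish and the function names are assumptions made for the example, not part of the lecture.

```python
import random

def sample_wor(population, n, seed=None):
    """Without replacement (WOR): no element can appear more than once."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def sample_wr(population, n, seed=None):
    """With replacement (WR): the same element may be drawn repeatedly."""
    rng = random.Random(seed)
    return rng.choices(population, k=n)

# Hypothetical pond of 10 labelled fish.
fish = [f"fish_{i}" for i in range(1, 11)]

wor = sample_wor(fish, 5, seed=1)
wr = sample_wr(fish, 5, seed=1)
print("WOR:", wor)   # five distinct fish
print("WR: ", wr)    # five draws, repeats are possible
```

Note that a WR sample can be larger than the population, whereas a WOR sample cannot.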
SYSTEMATIC SAMPLING
 Systematic sampling relies on arranging the target population
according to some ordering scheme and then selecting elements
at regular intervals through that ordered list.
 Systematic sampling involves a random start and then proceeds
with the selection of every kth element from then onwards. In
this case, k=(population size/sample size).
 It is important that the starting point is not automatically the first
in the list, but is instead randomly chosen from within the first
to the kth element in the list.
 A simple example would be to select every 10th name from the
telephone directory (an 'every 10th' sample, also referred to as
'sampling with a skip of 10').
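The procedure above (a random start within the first k elements, then every kth element) might be sketched as follows. The function name and the frame of 100 numbered units are assumptions for illustration; k is taken as the integer part of population size / sample size.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select n elements at a fixed interval k after a random start."""
    k = len(frame) // n                 # sampling interval (skip)
    if k < 1:
        raise ValueError("sample size exceeds the size of the frame")
    rng = random.Random(seed)
    start = rng.randrange(k)            # random start among the first k elements
    return [frame[start + i * k] for i in range(n)]

# Hypothetical ordered frame of 100 units, numbered 1..100.
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, seed=3)
print(sample)   # ten units, each 10 positions apart
```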
SYSTEMATIC SAMPLING……
 ADVANTAGES:
 Sample easy to select
 Suitable sampling frame can be identified easily
 Sample evenly spread over entire reference population
 DISADVANTAGES:
 Sample may be biased if hidden periodicity in population coincides
with that of selection.
 Difficult to assess precision of estimate from one survey.
STRATIFIED SAMPLING
Where population embraces a number of distinct categories, the
frame can be organized into separate "strata." Each stratum is then
sampled as an independent sub-population, out of which individual
elements can be randomly selected.
 Every unit in a stratum has same chance of being selected.
 Using same sampling fraction for all strata ensures proportionate
representation in the sample.
 Adequate representation of minority subgroups of interest can be
ensured by stratification & varying sampling fraction between
strata as required.
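As a rough sketch of proportionate stratified sampling, the code below applies the same sampling fraction to every stratum and draws a simple random sample within each. The strata (firms grouped by size) and the 10% fraction are hypothetical values chosen for the example.

```python
import random

def proportionate_stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample from each stratum using one sampling fraction."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        n_h = round(len(units) * fraction)   # rounded sample size for this stratum
        sample[name] = rng.sample(units, n_h)
    return sample

# Hypothetical frame organised into three strata.
strata = {
    "small_firms": [f"S{i}" for i in range(60)],
    "medium_firms": [f"M{i}" for i in range(30)],
    "large_firms": [f"L{i}" for i in range(10)],
}

chosen = proportionate_stratified_sample(strata, 0.10, seed=7)
for name, units in chosen.items():
    print(name, len(units), units)
```

To over-represent a minority subgroup, one would pass a different fraction for that stratum instead of a single common one.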
STRATIFIED SAMPLING……
 Finally, since each stratum is treated as an independent
population, different sampling approaches can be applied to
different strata.

 Drawbacks to using stratified sampling.


 First, sampling frame of entire population has to be prepared
separately for each stratum
 Second, when examining multiple criteria, stratifying variables
may be related to some, but not to others, further complicating
the design, and potentially reducing the utility of the strata.
 Finally, in some cases (such as designs with a large number of
strata, or those with a specified minimum sample size per
group), stratified sampling can potentially require a larger
sample than would other methods
CLUSTER SAMPLING
 Cluster sampling is an example of 'two-stage sampling' .
 First stage a sample of areas is chosen;
 Second stage a sample of respondents within those areas is selected.
 Population divided into clusters of homogeneous units, usually based on
geographical contiguity.
 Sampling units are groups rather than individuals.
 A sample of such clusters is then selected.
 All units from the selected clusters are studied.
Advantages :
 Cuts down on the cost of preparing a sampling frame.
 This can reduce travel and other administrative costs.
 Disadvantages: sampling error is higher than for a simple random sample of the same size.
CLUSTER SAMPLING…….

Two types of cluster sampling methods.


One-stage sampling. All of the elements within selected clusters
are included in the sample.
Two-stage sampling. A subset of elements within selected clusters
is randomly selected for inclusion in the sample.
Difference Between Strata and Clusters
 Although strata and clusters are both non-overlapping subsets of
the population, they differ in several ways.
 All strata are represented in the sample; but only a subset of
clusters are in the sample.
 With stratified sampling, the best survey results occur when
elements within strata are internally homogeneous. However,
with cluster sampling, the best results occur when elements
within clusters are internally heterogeneous
MULTISTAGE SAMPLING
 Complex form of cluster sampling in which two or more levels of
units are embedded one in the other.
 First stage: a random sample of districts is chosen in all
counties.
 Second stage: a random sample of divisions is chosen within the
selected districts. The third-stage units will then be houses.
 All ultimate units (houses, for instance) selected at the last step are
surveyed.
MULTISTAGE SAMPLING……..

 This technique is essentially the process of taking random
samples of preceding random samples.
 Not as effective as true random sampling, but probably solves
more of the problems inherent to random sampling.
 An effective strategy because it banks on multiple
randomizations. As such, extremely useful.
 Multistage sampling is used frequently when a complete list of all
members of the population does not exist or is inappropriate.
 Moreover, by avoiding the use of all sample units in all selected
clusters, multistage sampling avoids the large, and perhaps
unnecessary, costs associated with traditional cluster sampling.
QUOTA SAMPLING
 The population is first segmented into mutually exclusive sub-
groups, just as in stratified sampling.
 Then judgment used to select subjects or units from each segment
based on a specified proportion.
 For example, an interviewer may be told to sample 200 females
and 300 males between the age of 45 and 60.
 It is this second step which makes the technique one of non-
probability sampling.
 In quota sampling the selection of the sample is non-random.
 For example, interviewers might be tempted to interview those who
look most helpful. The problem is that these samples may be
biased because not everyone gets a chance of selection. This
non-random element is its greatest weakness, and quota versus
probability sampling has been a matter of controversy for many years.
CONVENIENCE SAMPLING

 Sometimes known as grab or opportunity sampling or accidental or
haphazard sampling.
 A type of nonprobability sampling which involves the sample being drawn from
that part of the population which is close to hand. That is, readily available and
convenient.
 The researcher using such a sample cannot scientifically make generalizations
about the total population from this sample because it would not be representative
enough.
 For example, if the interviewer was to conduct a survey at a shopping center
early in the morning on a given day, the people that he/she could interview would
be limited to those present there at that time, and their views would not represent
the views of other members of society in that area that would be captured if the
survey was conducted at different times of day and several times per week.
 This type of sampling is most useful for pilot testing.
 In social science research, snowball sampling is a similar technique, where
existing study subjects are used to recruit more subjects into the sample.
PANEL SAMPLING
 Method of first selecting a group of participants through a random
sampling method and then asking that group for the same
information again several times over a period of time.
 Therefore, each participant is given same survey or interview at
two or more time points; each period of data collection called a
"wave".
 This sampling methodology often chosen for large scale or nation-
wide studies in order to gauge changes in the population with
regard to any number of variables from chronic illness to job stress
to weekly food expenditures.
 Panel sampling can also be used to inform researchers about
within-person health changes due to age or help explain changes in
continuous dependent variables such as spousal interaction.
 There have been several proposed methods of analyzing panel
sample data, including growth curves.
Obtaining a Random sample

Population elements are A, B, C, D. N = 4, n = 2.

 Sampling without replacement:
 If order matters, the 1st element selected could be any one of the 4
elements and this leaves 3, so there are 4 x 3 = 12 possible samples,
each equally likely:
AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, DC.
This is the number of permutations:
$$ {}^{N}P_{n} = \frac{N!}{(N-n)!} = \frac{4!}{(4-2)!} = 12 $$
 If the order of selection does not matter (i.e. we are interested
only in which elements are selected), then this reduces to 6
combinations. If {AB} is AB or BA, etc., then the equally likely
random samples are {AB}, {AC}, {AD}, {BC}, {BD}, {CD}. This
is the number of combinations:
$$ {}^{N}C_{n} = \frac{N!}{n!(N-n)!} = \frac{4!}{2!(4-2)!} = 6 $$
 After knowing the possible samples, that is the
sample space, one would then use various procedures to select
the 2 elements required.
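The two counts can be checked by listing the sample space directly. The sketch below uses Python's standard `itertools` and `math` modules; the variable names are illustrative only.

```python
from itertools import combinations, permutations
from math import comb, perm

population = ["A", "B", "C", "D"]   # N = 4
n = 2

# Ordered samples without replacement (order matters).
ordered = ["".join(s) for s in permutations(population, n)]
# Unordered samples without replacement (order does not matter).
unordered = ["".join(s) for s in combinations(population, n)]

print(len(ordered), ordered)       # 12 ordered samples
print(len(unordered), unordered)   # 6 unordered samples

# The listed counts agree with N!/(N-n)! and N!/(n!(N-n)!).
assert len(ordered) == perm(len(population), n)
assert len(unordered) == comb(len(population), n)
```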
Sample Space

 A firm can buy stationery and office supplies from
any of 8 companies. If the firm decides to use
3 suppliers in a given year and wants to avoid
accusations of bias, the sample of 3 should be
selected at random from among the eight.
 How many different samples of size 3 can the
firm choose from?
$$ {}^{N}C_{n} = \frac{N!}{n!(N-n)!} = \frac{8!}{3!(8-3)!} = 56 $$
Sample Space Cont...

 List them: Let the firms be denoted by
A, B, C, D, E, F, G, H.
ABC ACF AEG BCG BEH CEF DEH
ABD ACG AEH BCH BFG CEG DFG
ABE ACH AFG BDE BFH CEH DFH
ABF ADE AFH BDF BGH CFG DGH
ABG ADF AGH BDG CDE CFH EFG
ABH ADG BCD BDH CDF CGH EFH
ACD ADH BCE BEF CDG DEF EGH
ACE AEF BCF BEG CDH DEG FGH
Each of the above 56 samples of size 3 has an equal
chance of 1/56 of being selected
Use of Random Numbers

Assign the numbers 1, 2, 3, …, 8 to the suppliers.
Decide where to start on a random number
table (row and column).
Since we only need a single digit, assign
the first random number to the sample
as long as it is one of the numbers above;
otherwise move to the next digit down
the column.
Generating RN from Excel

Formula                    Description (Result)
=RAND()                    A random number between 0 and 1 (varies)
=RAND()*100                A random number greater than or equal to 0 but less than 100 (varies)
=RANDBETWEEN(low, high)    A random integer between the lower number and the higher number, such as RANDBETWEEN(50, 100)
RN & Excel Cont..

1. In a blank cell enter the formula =RAND() and press the Enter key; a number
between 0 and 1 is generated randomly. Then select the cell and drag the fill
handle over the range that you want to contain this formula, and a range of
random numbers is inserted into the Excel sheet. For example:

0.236295  0.398081  0.846024  0.693901
0.454256  0.190768  0.563642  0.068409
0.052247  0.365351  0.065141  0.534623
0.408345  0.470823  0.558154  0.234203
0.413403  0.66717   0.973447  0.206947
0.340083  0.926564  0.482783  0.528636
RN Cont.

2. If you want to insert some random numbers between 0 and 100,
please enter this formula =RAND()*100, and you will
get the following result:

45.4119   13.88595  81.74856  54.24615
41.58475  34.20922  1.120129  42.70324
59.0315   20.30708  51.72761  33.72973
94.31772  72.28139  55.97761  92.67282
1.23235   94.36976  19.70861  27.39474
84.5411   9.224513  19.0643   92.01927
RN Cont.

3. Supposing you need to insert some random integers
between 100 and 10000 in a range, please use this
formula =RANDBETWEEN(100, 10000), and you will get this result:

907  810  430  961
285  885  730  139
238  952  105  381
866  854  143  495
167  873  777  201
328  686  242  651
Using RN IN LAST SLIDE

First, N = 18 MFIs in Kenya:
1. MFI 1
2. MFI 2
3. MFI 3
4. MFI 4
5. MFI 5
6. MFI 6
7. MFI 7
8. MFI 8
9. MFI 9
10. MFI 10
11. MFI 11
12. MFI 12
13. MFI 13
14. MFI 14
15. MFI 15
16. MFI 16
17. MFI 17
18. MFI 18
Suppose you were asked to select a simple random sample of size n = 5
from the 18 cases. Use the RN in the previous slide to choose the items.
Always keep track of where you last used the table and begin the next
selection at that point.
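One way to mimic the random-number-table procedure in code is sketched below. It is an illustration under stated assumptions: two-digit random numbers (00 to 99) stand in for the table, numbers outside 1..18 and repeats are skipped, and the function name is hypothetical.

```python
import random

def draw_srs_by_random_numbers(N, n, seed=None):
    """Pick n distinct labels from 1..N by reading two-digit random numbers."""
    rng = random.Random(seed)
    selected = []
    while len(selected) < n:
        number = rng.randint(0, 99)          # next two-digit random number
        # Skip numbers that match no unit, and units already chosen.
        if 1 <= number <= N and number not in selected:
            selected.append(number)
    return selected

labels = [f"MFI {i}" for i in range(1, 19)]  # the 18 cases listed above
picked = draw_srs_by_random_numbers(N=18, n=5, seed=2025)
print([labels[i - 1] for i in picked])
```

Skipping unusable numbers is wasteful but keeps every unit equally likely, which is the point of the table method.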
Sampling distributions
 Basic Concept
 The ultimate goal of generating a random sample is to make
inferences about the nature of the population from which it is
drawn.
 The key objective is to estimate a numerical measure of a population,
called a population parameter, using a sample statistic.
 Note:
 The value of a population parameter is usually CONSTANT,
albeit unknown, and does not vary from sample to sample.
 Sample statistics usually vary with the sample selected. (For
example, selecting the same sample size from different points
on a RN table will result in different units, hence different
statistics.)
 Since statistics vary from sample to sample, any
inferences based on them will be subject to some
level of uncertainty.
Basic Concepts Cont..

 Why then should one use such a measure to make inferences on the
population given this apparent unreliability?
 The answer lies in the fact that the UNCERTAINTY of a statistic is characterized
by known properties, reflected in what is often called its SAMPLING distribution.
 Each sample contains different elements, so the value of the sample
statistic differs for each sample selected. These statistics provide different
estimates of the parameter. The sampling distribution describes how these
different values are distributed.
 Knowledge of the sampling distribution of a particular statistic provides
information about its performance over the long run.
A sampling distribution of a sample statistic (based on n observations) is the
relative frequency distribution of the statistic theoretically generated by
taking repeated samples of size n from a population of size N and computing
the statistic for each sample.
Sampling distribution of the sample
mean

 Using a Random Number Table, select 5 different samples of size 5 and
calculate the sample mean of each sample.
 Repeat the above, but this time with 5 different samples of size 25, say.
 Find the mean and standard deviation of the sample means for each sample size.
 What do you notice? That is, what happens as n increases?
 The same results would be obtained if a frequency distribution figure was used to
illustrate the results. That is, a lower n results in higher variability in the sample
statistics.
 When a sample is selected, the sampling method may allow the researcher to
determine the sampling distribution of the sample mean $\bar{x}$. The researcher hopes
that the mean of the sampling distribution will be μ, the mean of the population. If
this occurs, then the expected value of the statistic $\bar{x}$ is μ. This characteristic of the
sample mean is that of being an unbiased estimator of μ. In this case, $E(\bar{x}) = \mu$.
 If the variance of the sampling distribution can be determined, then the researcher
is able to determine how variable $\bar{x}$ is when there are repeated samples. The
researcher hopes to have a small variability for the sample means, so most
estimates of μ are close to μ.
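The exercise above can also be run as a small simulation. The sketch below is only illustrative: the population (10,000 values drawn uniformly between 0 and 100) and the 2,000 repetitions are assumptions, chosen so that the pattern is clearly visible.

```python
import random
import statistics

def sample_means(population, n, repetitions, seed=None):
    """Means of repeated simple random samples of size n (without replacement)."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(repetitions)]

# Hypothetical population; any shape would do for the illustration.
pop_rng = random.Random(0)
population = [pop_rng.uniform(0, 100) for _ in range(10_000)]
mu = statistics.mean(population)

means_5 = sample_means(population, 5, 2000, seed=1)
means_25 = sample_means(population, 25, 2000, seed=2)

sd_5 = statistics.stdev(means_5)
sd_25 = statistics.stdev(means_25)

print("population mean:", round(mu, 2))
print("n = 5 : mean of sample means", round(statistics.mean(means_5), 2), "SD", round(sd_5, 2))
print("n = 25: mean of sample means", round(statistics.mean(means_25), 2), "SD", round(sd_25, 2))
```

Both sets of sample means centre on the population mean, while the spread of the means shrinks as n grows.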
Sampling distribution of the mean and
Central Limit Theorem

 We have noted that the sample mean is often used as
a tool to make inferences about the corresponding
population parameter.
 The following theorem provides information on how to
determine the actual sampling distribution of the
sample mean.
 Central Limit Theorem: If n is sufficiently large, then
the mean of a random sample from a population has a
sampling distribution that is approximately normal
regardless of the shape of the relative frequency
distribution of the target population. The larger the sample
size, the better the normal approximation
to the sampling distribution.
Properties of Sampling Distribution of
Sampling Mean
 Assumed normal for large samples.
 If $\bar{x}$ is the mean of a sample of size n from a population
with mean μ and standard deviation σ, then:
 The sampling distribution of $\bar{x}$ has a mean equal to that of the
population from which it was drawn. That is,
$E(\bar{x}) = \mu$
 The sampling distribution of $\bar{x}$ has a standard deviation equal
to the standard deviation of the population from which it was
selected divided by the square root of the sample size n.
Technically, this is sometimes referred to as the Standard Error
(SE):
$$ SE = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} $$
Random sample from a normally
distributed population

                      Normally distributed    Sampling distribution of $\bar{x}$
                      population              when sample is random
No. of elements       N                       n
Mean                  μ                       μ
Standard deviation    σ                       $\sigma_{\bar{x}} = \sigma / \sqrt{n}$

Note: If n/N > 0.05, it may be best to use the
finite population correction factor.
Large random sample from any population

                      Any population    Sampling distribution of $\bar{x}$
                                        when sample is random
No. of elements       N                 n
Mean                  μ                 μ
Standard deviation    σ                 $\sigma_{\bar{x}} = \sigma / \sqrt{n}$

A sample size n of greater than 100 is
generally considered sufficiently large to
use.
Example

 A Nairobi newspaper recently reported that for
families in its circulation zones, the weekly expenditure
on food consumed away from home has a mean of
Ksh. 140 and a standard deviation of Ksh. 40. In order
to check this claim, you have randomly selected 100
families in the area and monitored their
expenditure on food away from home.
 Assuming the paper's claim is true, describe the
sampling distribution of the mean weekly expenditure
for food away from home for a random sample of size
100.
 Assuming the paper's claim is true, what is the
probability that the sample mean weekly expenditure for food
away from home will be at least Ksh. 150?
Solution
 Though there is no information on the shape of the
relative frequency distribution of weekly
expenditure, the CLT enables us to conclude that
the sampling distribution of the sample mean of
weekly expenditure based on 100 observations is
approximately normal. Therefore
 $E(\bar{x}) = 140$ and $\sigma_{\bar{x}} = \sigma / \sqrt{n} = 40 / \sqrt{100} = 4$
 If the paper's claim is true, then the probability
is as follows:
$$ P(\bar{x} \ge 150) = P\left(z \ge \frac{150 - 140}{4}\right) = P(z \ge 2.5) \approx 0.0062 $$
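The calculation can be reproduced without a normal table. The sketch below assumes a population standard deviation of Ksh. 40, which is the value consistent with a standard error of 4 and z = 2.5 in the solution; the helper `normal_cdf` is written from the error function and is not part of the lecture.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu = 140.0      # claimed population mean (Ksh.)
sigma = 40.0    # assumed population standard deviation (Ksh.)
n = 100         # sample size

se = sigma / sqrt(n)          # standard error of the sample mean
z = (150.0 - mu) / se         # z-score of a sample mean of Ksh. 150
p = 1.0 - normal_cdf(z)       # upper-tail probability P(x-bar >= 150)

print("SE =", se, " z =", z, " P =", round(p, 4))
```

A probability this small suggests that a sample mean of Ksh. 150 or more would be rare if the paper's claim were true.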
Sampling Distribution of a
Proportion…

The estimator of a population proportion of
successes is the sample proportion. That
is, we count the number of successes in a
sample and compute:
$$ \hat{p} = \frac{X}{n} $$
X is the number of successes, n is the
sample size.
Normal Approximation to Binomial…

Binomial distribution with n = 20 and p = .5 with
a normal approximation superimposed (μ = 10
and σ = 2.24).

Hence: $\mu = np = 20(.5) = 10$
and $\sigma = \sqrt{np(1-p)} = \sqrt{20(.5)(.5)} = 2.24$
Normal Approximation to Binomial…

 Normal approximation to the binomial works best
when the number of experiments, n, (sample size)
is large, and the probability of success, p, is close
to 0.5
 For the approximation to provide good results two
conditions should be met:
 1) np ≥ 5
 2) n(1–p) ≥ 5
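How close the approximation is can be checked numerically for the case n = 20, p = .5. The sketch below compares an exact binomial probability with its normal approximation; the interval 8 ≤ X ≤ 12 and the continuity correction (using 7.5 and 12.5 as limits) are choices made for this illustration, not taken from the slides.

```python
from math import comb, erf, sqrt

def binom_pmf(k, n, p):
    """Exact binomial probability P(X = k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_cdf(x, mu, sigma):
    """Normal cumulative probability P(X <= x)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

n, p = 20, 0.5
mu = n * p                        # 10
sigma = sqrt(n * p * (1 - p))     # about 2.24

# Both conditions for a good approximation hold here.
conditions_ok = n * p >= 5 and n * (1 - p) >= 5

exact = sum(binom_pmf(k, n, p) for k in range(8, 13))              # P(8 <= X <= 12)
approx = normal_cdf(12.5, mu, sigma) - normal_cdf(7.5, mu, sigma)  # with continuity correction

print(conditions_ok, round(exact, 4), round(approx, 4))
```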
Sampling Distribution of a Sample
Proportion…

Using the laws of expected value and
variance, we can determine the mean,
variance, and standard deviation of $\hat{p}$.
(The standard deviation of $\hat{p}$ is called the
standard error of the proportion.)
Sample proportions can be standardized to a
standard normal distribution using this
formulation:
$$ z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} $$
Calculating a Probability

Consider a population of qualitative data
where p is the proportion having a
particular attribute of interest. Let $\hat{p}$ be
the corresponding proportion in a random
sample of n observations. To find the
probability associated with $\hat{p}$, follow
these steps:

Step 1
 Define the sample proportion of interest in words. This is
important because every qualitative set of data has more than
one category, and you must be sure to identify the category, or
attribute, of interest. Also specify the values of the population
proportion p and the sample size n.

Step 2
 Find the values of the mean and standard
error of the sampling distribution of $\hat{p}$
using
$$ \mu_{\hat{p}} = p, \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} $$
Steps Cont..
Step 3
 Verify that the sampling distribution of $\hat{p}$ is
approximately normal by checking that the
following holds: np > 5 and n(1-p) > 5
Step 4
 Sketch a normal curve, and shade the area
corresponding to the probability of interest
Steps Cont
Step 5
 Calculate the z-scores corresponding to the appropriate values
of $\hat{p}$. (Remember $\mu_{\hat{p}}$ is the same as p.)
$$ z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}} $$

Step 6
 Use table to find the area under the normal curve
corresponding to each calculated z-score.
Step 7
 With the help of the curve sketched in Step 4, find the
probability of interest by adding or subtracting appropriate
areas.

Example

Solution - Describe the sampling distribution of $\hat{p}$
Population: p = .52
Sample: Random, n = 300
 Sampling distribution:
$\mu_{\hat{p}} = p = .52$
$\sigma_{\hat{p}} = \sqrt{p(1-p)/n} = \sqrt{(.52)(.48)/300} = .0288$
Approximately normal because np = 300(.52) = 156 and
n(1-p) = 300(1-.52) = 144 (both
greater than 5)
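The figures in this solution can be recomputed in a few lines. This is a minimal sketch using the values given above (p = .52, n = 300); the variable names are illustrative.

```python
from math import sqrt

p = 0.52   # population proportion
n = 300    # sample size

mean_p_hat = p                        # mean of the sampling distribution
se_p_hat = sqrt(p * (1 - p) / n)      # standard error of the proportion

# Normality check used in the steps: np > 5 and n(1 - p) > 5.
normal_ok = n * p > 5 and n * (1 - p) > 5

print(mean_p_hat, round(se_p_hat, 4), normal_ok)
```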
Sampling distribution of $\hat{p}$
Step 1: Suppose the possible values of $\hat{p}$
are 0/5=0, 1/5=.2, 2/5=.4, 3/5=.6, 4/5=.8, 5/5=1

Binomial probabilities p(x) for n = 5, p = 0.5:
x    p(x)
0    0.03125
1    0.15625
2    0.3125
3    0.3125
4    0.15625
5    0.03125

$\hat{p}$       0       .2      .4     .6     .8      1
P($\hat{p}$)    .03125  .15625  .3125  .3125  .15625  .03125

The above table is the probability distribution of
$\hat{p}$, the proportion of heads in 5 tosses of a fair
coin.
Sampling distribution of $\hat{p}$
(cont.)
$\hat{p}$       0       .2      .4     .6     .8      1
P($\hat{p}$)    .03125  .15625  .3125  .3125  .15625  .03125

$E(\hat{p}) = 0(.03125) + 0.2(.15625) + 0.4(.3125) + 0.6(.3125) + 0.8(.15625) + 1(.03125) = 0.5 = p$ (the prob. of heads)

$Var(\hat{p}) = (0-.5)^2(.03125) + (.2-.5)^2(.15625) + (.4-.5)^2(.3125) + (.6-.5)^2(.3125) + (.8-.5)^2(.15625) + (1-.5)^2(.03125) = .05$

So $SD(\hat{p}) = \sqrt{.05} = .2236$

NOTE THAT $SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{.5 \times .5}{5}} = \frac{.5}{\sqrt{5}} = .2236$
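The same mean, variance and standard deviation can be obtained by building the distribution of the sample proportion from binomial probabilities. The sketch below does this for n = 5 tosses of a fair coin; the dictionary `dist` is simply one convenient way to hold the table.

```python
from math import comb, sqrt

n, p = 5, 0.5

# Probability distribution of the sample proportion x/n, for x = 0..n heads.
dist = {x / n: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

mean = sum(value * prob for value, prob in dist.items())
var = sum((value - mean) ** 2 * prob for value, prob in dist.items())
sd = sqrt(var)

print(dist)
print("E =", mean, " Var =", round(var, 5), " SD =", round(sd, 4))

# The direct calculation agrees with the shortcut sqrt(p(1 - p)/n).
assert abs(sd - sqrt(p * (1 - p) / n)) < 1e-12
```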
Expected Value and Standard Deviation of the Sampling Distribution of $\hat{p}$

$E(\hat{p}) = p$

$SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$

where p is the “success” probability
in the sampled population and n is
the sample size
Shape of Sampling Distribution of $\hat{p}$

The sampling distribution of $\hat{p}$ is
approximately normal when the sample
size n is large enough. n large enough
means np ≥ 10 and n(1-p) ≥ 10
The sampling distribution model for a sample
proportion $\hat{p}$
Provided that the sampled values are independent and the
sample size n is large enough, the sampling distribution of
$\hat{p}$ is modeled by a normal distribution with $E(\hat{p}) = p$ and
standard deviation $SD(\hat{p}) = \sqrt{pq/n}$, that is
$$ \hat{p} \sim N\left(p, \sqrt{\frac{pq}{n}}\right) $$
where q = 1 – p and where n large enough means np ≥ 10 and
nq ≥ 10.
The Central Limit Theorem will be a formal statement of this
fact.
END/QA
