Lecture 13

The document discusses different types of sampling methods used in statistical inference. It introduces simple random sampling, where every member of the population has an equal chance of being selected for the sample. Systematic sampling selects elements at uniform intervals from a sampling frame. Stratified sampling divides the population into homogeneous subgroups and samples separately from each subgroup. Cluster sampling selects clusters randomly and samples from the selected clusters. These sampling methods allow statisticians to make inferences about unknown population parameters from sample data.

Uploaded by

ABHIJIT SAHOO
Copyright
© © All Rights Reserved

Sampling and Sampling Distribution
Introduction
 Statistical inference is the process of converting data into information.
 Parameters describe populations.
 Parameters are almost always unknown.
 We take a random sample of a population to obtain the necessary data.
 We calculate one or more statistics from the sample data.
 For example, to estimate a population mean, we compute the sample mean.
 Although there is very little chance that the sample mean and the population mean are identical,
we would expect them to be quite close. However, for the purposes of statistical inference, we
need to be able to measure how close.
 The sampling distribution provides this service. It plays a crucial role in the process because the
measure of proximity it provides is the key to statistical inference.
Problem Introduction
 Although there are 200 million TV viewers in the United States and somewhat over half that
many TV sets, only about 1000 of those sets are sampled to determine what programs Americans
watch. Why select only about 1000 sets out of 100 million? Because time and the average cost
of an interview prohibit the rating companies from trying to reach millions of people, and since
polls are reasonably accurate, interviewing everybody is unnecessary. In this domain, we will
examine questions such as:
 How many people should be interviewed?
 How should they be selected?
 How do we know when our sample accurately reflects the entire population?
Concept
 In statistical inference we are concerned with populations; the samples are of no interest
to us in their own right. We wish to use a known random sample to extract
information about the unknown population from which it is drawn.
 The information we extract is in the form of summary statistics: a sample mean, a sample
standard deviation, or other measures computed from the sample.
 A statistic such as the sample mean is considered an estimator of a population
parameter—the population mean.
 We will define sample estimators and population parameters. Then we explore the
relationship between statistics and parameters via the sampling distribution.
 Finally, we discuss desirable properties of statistical estimators.
Problem
Deans and other faculty members in professional schools often monitor how well the graduates
of their programs fare in the job market. Information about the types of jobs and their salaries
may provide useful information about the success of a program. In the advertisements for a
large university, the dean of the School of Business claims that the average salary of the
school’s graduates one year after graduation is $800 per week, with a standard deviation of
$100. A second-year student in the business school who has just completed his statistics course
would like to check whether the claim about the mean is correct. He does a survey of 25 people
who graduated one year earlier and determines their weekly salary. He discovers the sample
mean to be $750. To interpret his finding, he needs to calculate the probability that a sample of
25 graduates would have a mean of $750 or less when the population mean is $800 and the
standard deviation is $100. After calculating the probability, he needs to draw some conclusion.
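The probability the student needs can be sketched as follows. This is a minimal illustration, assuming the sample mean of 25 salaries is approximately normal with the dean's claimed mean and standard deviation; the variable names are chosen here for illustration only.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 800.0, 100.0, 25     # claimed mean, claimed SD, sample size
standard_error = sigma / sqrt(n)    # 100 / 5 = 20
z = (750.0 - mu) / standard_error   # (750 - 800) / 20 = -2.5
p = NormalDist().cdf(z)             # P(sample mean <= 750) under the claim
print(f"SE = {standard_error:.1f}, z = {z:.2f}, P = {p:.4f}")
```

Under these assumptions the probability is well below 1%, so a sample mean of $750 would be very unlikely if the claim about the mean were true.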
Reason to Sample
 Time saving
Example: a sample poll using the regular staff and field interviews of a professional polling firm would take only 1 or 2 days.
Think how much time it would take for a population of 100 million.
 Cost Effective
Example: public opinion polls and consumer testing organizations, such as Gallup Polls and Roper ASW, usually contact
fewer than 2000 of the nearly 60 million families in the US. One consumer panel-type organization charges about $40000 to mail
samples and tabulate responses in order to test a product (such as breakfast cereal or perfume). The same product test using all
60 million families would cost $1 billion.
 Physical Impossibility of checking all items
Example-It would be impossible to check all the water in a lake for determining the bacterial level.
 Destructive nature of some tests
Example: in a plant, steel plates, wires, etc. must have a certain minimum tensile strength. To ensure the product meets the
minimum standard, the Quality Assurance Department selects a sample from the current production; each piece is stretched until
it breaks, and the breaking point is noted. If all the units were tested, nothing would be available for sale. For the same reason,
only a sample of photographic film is selected and tested by Kodak to determine the quality of all the film produced.
Types of Sampling
Judgement Sampling
 Personal knowledge and opinion are used to identify the items from the population that are to be
included in the sample. The process of selecting a sample using judgmental sampling involves the
researchers carefully picking and choosing each individual to be a part of the sample. The
researcher’s knowledge is primary in this sampling process as the members of the sample are not
randomly chosen.
 Example: consider a scenario where a panel wants to understand the factors that lead a
person to select ethical hacking as a profession. Ethical hacking is a skill that has recently been
attracting youth, and more and more people are selecting it as a profession. Researchers who understand
what ethical hacking is will be able to decide who should form the sample to learn about it as a profession.
That is when judgmental sampling is implemented. Researchers can easily pick out the participants who
are eligible to be part of the research sample.
Types of Sampling
Random or Probability Sampling
All the items in the population have an equal chance of being chosen in the sample. The following methods can be
adopted for the random sampling-
i) Simple Random Sampling
ii) Systematic Sampling
iii) Stratified Sampling
iv) Cluster Sampling
v) Bootstrap Aggregating (Bagging)
Simple Random Sampling
 Simple random sampling selects samples by methods that allow each possible sample to have an
equal probability of being picked and each item in the entire population to have an equal chance of
being included in the sample.
 To obtain a random sample from the entire population, we need a list of all the elements in the population
of interest. Such a list is called a frame. The frame allows us to draw elements from the population by
randomly generating the numbers of the elements to be included in the sample.
 Simple random sampling can be viewed as the basis for the other random sampling techniques. With
simple random sampling, each unit of the frame is numbered from 1 to N. Next, a table of random
numbers or a random number generator is used to select n items into the sample. A random number
generator is usually a computer program that allows computer-calculated output to yield random numbers.
 Suppose we need a simple random sample of 100 people from a population of 7,000. We make a list of all
7,000 people and assign each person an identification number. This gives us a list of 7,000 numbers—our
frame for the experiment. Then we generate by computer or by other means a set of 100 random numbers
in the range of values from 1 to 7,000. This procedure gives every set of 100 people in the population an
equal chance of being included in the sample.
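The 7,000-person example can be sketched in a few lines. This is only an illustration: the seed and the variable names are assumptions, and the frame is represented simply by the identification numbers 1 to 7,000.

```python
import random

N, n = 7000, 100
rng = random.Random(42)             # arbitrary seed, only for reproducibility
frame = range(1, N + 1)             # identification numbers 1..N
sample_ids = rng.sample(frame, n)   # n distinct IDs, drawn without replacement
print(sorted(sample_ids)[:10])
```

`random.sample` draws without replacement, so no person can be selected twice, and every set of 100 IDs is equally likely.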
Systematic Sampling
 In systematic sampling, elements are selected from the population at a uniform interval that is
measured in time, order or space. If we wanted to interview every twentieth student on a campus, we
would choose a random starting point in the first 20 names in the student directory and then pick
every twentieth name thereafter.
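 Not every possible sample has an equal chance of being selected.
 Suppose you study sales of a particular product at a department store by sampling every Monday. Think about
the problem: if the data follow a weekly pattern, a sample taken at that same interval will not represent the other days.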
 Systematic sampling may require less time and sometimes results in lower costs than simple random
sampling.
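The "every twentieth student" procedure might look like the sketch below. The function name `systematic_sample` and the student directory are hypothetical, introduced only for illustration.

```python
import random

def systematic_sample(frame, k, rng=random):
    """Pick a random start among the first k items, then take every k-th item."""
    start = rng.randrange(k)        # random starting point in 0..k-1
    return frame[start::k]

# Hypothetical directory of 1,000 students
directory = [f"student_{i:04d}" for i in range(1, 1001)]
picked = systematic_sample(directory, 20, random.Random(0))
print(len(picked), picked[:3])
```

Only the starting point is random; once it is chosen, the rest of the sample is fully determined, which is why only 20 different samples are possible here.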
Stratified Sampling
 The technique of stratified sampling is useful when the population can be divided into relatively
homogeneous groups, or strata, and a random sample is drawn from within each stratum. The groups are
mutually exclusive and exhaustive of the population. Example: the strata may be students, people of a
certain community, male or female, socio-economic levels, affiliation to manufacturing or service sectors,
etc.
 Stratified sampling is used because it accurately reflects the characteristics of the target population.
Example: people in a certain economic class may be interested in buying sports cars, and people in a specific
region may have a special choice of music.
 We use stratified sampling when each group has small variation within itself but there is a large variation
between the groups. Example: the choice of perfume varies with gender, the choice of transport varies across
different classes, etc.
 Stratified sampling is necessary when the population is heterogeneous; creating homogeneous strata
before sampling is recommended for precise estimation of population parameters.
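A proportional stratified sample can be sketched as below. This is one common variant (the same sampling fraction in every stratum), assumed here for illustration; the function name and the population of (id, sector) pairs are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)     # group units by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))     # simple random sample per stratum
    return sample

# Hypothetical population: 250 manufacturing units and 750 service units
population = [(i, "manufacturing" if i % 4 == 0 else "service") for i in range(1000)]
chosen = stratified_sample(population, lambda u: u[1], 0.10, random.Random(1))
print(len(chosen))
```

Every stratum contributes to the sample, in proportion to its size in the population.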
Cluster Sampling
 In cluster sampling, we divide the population into groups or clusters and then select a random sample of
these clusters.
 We assume that these clusters are representative of the population as a whole.
 Example- If a market research team is attempting to determine by sampling the average number of
television sets per household in a large city, they could use a city map to divide the territory into blocks
and then choose a certain number of blocks (clusters) for interviewing. Every household in each of these
blocks would be interviewed.
 A well-designed cluster sampling procedure can produce a more precise sample at considerably less cost
than that of simple random sampling.
 In both stratified and cluster sampling, the population is divided into well-defined groups. The major
difference is that in a stratified sample, all strata will be represented in the sample, whereas in a cluster
sampling, not all clusters will be represented.
 Cluster sampling is used when the clusters are large in number. For example, assume that we are interested in
the impact of demonetization on Indian industry. There are a large number of industrial sectors. We can focus on
any two major ones.
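The city-block example can be sketched as follows. The data are made up for illustration: 200 hypothetical blocks, each with a handful of households and a TV count per household.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random and keep every unit inside them."""
    chosen_ids = rng.sample(sorted(clusters), n_clusters)
    return {cid: clusters[cid] for cid in chosen_ids}

# Hypothetical city: block id -> TV sets per household in that block
rng = random.Random(7)
blocks = {b: [rng.randint(0, 4) for _ in range(rng.randint(5, 15))] for b in range(200)}

picked = cluster_sample(blocks, 10, rng)
households = [tv for block in picked.values() for tv in block]
avg_tv = sum(households) / len(households)
print(f"{len(picked)} blocks, {len(households)} households, average {avg_tv:.2f} sets")
```

Unlike the stratified case, most blocks contribute nothing to the sample; the randomness is in which clusters are chosen.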
Bootstrap Aggregating (Bagging)
 Bootstrap Aggregating (Bagging) is sampling with replacement, used in machine learning approaches.
 In Bagging, several samples are generated with replacement from the population and analytical models are
developed using each sample.
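A minimal sketch of the resampling step is shown below. It assumes, as is usual in practice, that the resamples are drawn from the available data set, and it uses the mean of each resample as a stand-in for an "analytical model"; the data values are hypothetical.

```python
import random
from statistics import mean

def bootstrap_samples(data, n_samples, rng):
    """Yield n_samples resamples of len(data), each drawn with replacement."""
    for _ in range(n_samples):
        yield rng.choices(data, k=len(data))

data = [12, 15, 9, 20, 14, 11, 18, 16]    # hypothetical observations
rng = random.Random(3)

# Stand-in "model": the mean of each resample. Bagging then averages the models.
models = [mean(s) for s in bootstrap_samples(data, 200, rng)]
bagged_estimate = mean(models)
print(f"{bagged_estimate:.2f}")
```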
Sampling Distribution
 Sampling distribution refers to the probability distribution of a statistic, such as the sample mean or sample
standard deviation, computed from several random samples of the same size.

 A sampling distribution is the distribution of statistics that would be produced in repeated random
sampling from the same population.

 Sampling distributions are used to calculate the probability that sample statistics could have occurred by
chance and thus to decide whether something that is true of a sample statistic is also likely to be true of a
population parameter.

 They allow the researcher to come to conclusions about a population on the basis of descriptive statistics
about a sample.
Sampling Distribution- Example
 Your sample says that a candidate gets support from 47%.
 Inferential statistics allow you to say that the candidate gets support from 47% of the population with a
margin of error of +/- 4%.
 This means that the support in the population is likely somewhere between 43% and 51%.
 Margin of error is taken directly from a sampling distribution.

[Figure: sampling distribution centered at 47%; 95% of possible sample means fall between 43% and 51%, with your sample mean marked.]
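To see where a margin of about ±4% could come from, here is a rough sketch using the usual normal approximation for a proportion. The slides do not give the sample size, so `n = 600` is purely an assumption chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

p_hat = 0.47                                  # sample proportion from the example
n = 600                                       # hypothetical sample size (not in the slides)
z = NormalDist().inv_cdf(0.975)               # about 1.96 for 95% confidence
margin = z * sqrt(p_hat * (1 - p_hat) / n)    # z times the standard error
low, high = p_hat - margin, p_hat + margin
print(f"margin = {margin:.3f}, interval = ({low:.3f}, {high:.3f})")
```

With this assumed sample size the margin comes out close to 0.04, matching the 43% to 51% range in the example.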
Sampling Distribution

Let’s create a sampling distribution of means…


Take a sample of size 1,500 from the US. Record the mean income. The census said
the mean is $30K.

$30K
Let’s repeat the sampling many times, each time with a sample of size 1,500 from the US, and record the mean incomes.
The sample means would stack up in a normal curve centered at $30K: a normal sampling distribution.
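The repeated-sampling experiment can be imitated on a computer. The sketch below uses a made-up, right-skewed income population with mean $30K (an exponential shape is assumed purely for illustration; it is not census data) and draws 300 samples of size 1,500.

```python
import random
from statistics import mean, stdev

rng = random.Random(2024)
# Hypothetical skewed income population with mean about $30K
population = [rng.expovariate(1 / 30_000) for _ in range(100_000)]

n, repetitions = 1_500, 300
sample_means = [mean(rng.sample(population, n)) for _ in range(repetitions)]

print(f"population mean      : {mean(population):,.0f}")
print(f"mean of sample means : {mean(sample_means):,.0f}")
print(f"SD of sample means   : {stdev(sample_means):,.0f}")
```

The sample means cluster tightly around the population mean, with a spread far smaller than the spread of individual incomes.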
A Sampling Distribution

Say that the standard deviation of this distribution is $10K.

Think back to the empirical rule. What are the odds you would get a sample mean that is more
than $20K off?
The sample means would stack up in a normal curve. A normal sampling distribution.

[Figure: normal sampling distribution centered at $30K, horizontal axis in z-scores from -3z to 3z; the tails beyond -2z and 2z each hold about 2.5% of sample means.]
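The empirical rule answer (about 2.5% in each tail, 5% in total) can be checked against the exact normal tail areas. A short sketch, assuming the sampling distribution is exactly normal with mean $30K and standard deviation $10K:

```python
from statistics import NormalDist

sampling_dist = NormalDist(mu=30_000, sigma=10_000)
# P(sample mean below $10K) + P(sample mean above $50K)
p_off = sampling_dist.cdf(10_000) + (1 - sampling_dist.cdf(50_000))
print(f"P(more than $20K off) = {p_off:.4f}")
```

The exact value is slightly under 5% because the empirical rule rounds the area within two standard deviations to 95%.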


Sampling Distribution

 Our graphic display indicates that chances are good that the mean of our one sample will not
precisely represent the population’s mean. This is called sampling error.

 If we can determine the variability (standard deviation) of the sampling distribution, we can
make estimates of how far off our sample’s mean will be from the population’s mean.
 The standard deviation of a normal sampling distribution is called the standard error.
Sampling Distribution

 Knowing the likely variability of the sample means from repeated sampling gives
us a context within which to judge how much we can trust the number we got from
our sample.

 For example, if the variability is low, we can trust our number more than if
the variability is high.
Sampling Distribution
 Statisticians have found that the standard error of a sampling distribution is quite
directly affected by the number of cases in the sample(s), and the variability of the
population distribution.
 Population Variability:
For example, Americans’ incomes are quite widely distributed, from $0 to Bill Gates’.

 Americans’ car values are less widely distributed, from about $50 to about $50K.

 The standard error of the latter’s sampling distribution will be a lot smaller.
Sampling Distribution

Population Variability:

[Figure: population distributions of car values and of income.]

The standard error of income’s sampling distribution will be a lot
higher than car price’s.
Standard Deviation/Error of the
Sampling Distribution of \(\bar{x}\)
 The standard deviation of the sampling distribution of \(\bar{x}\) is
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
 where σ is the standard deviation of the population and n is the sample size. This formula is used when n/N
≤ 0.05, where N is the population size.
 As a rule of thumb, we will treat any population that is at least 20 times larger than the sample size
as large. In practice, most applications involve populations that qualify as large because if the
population is small, it may be possible to investigate each member of the population, and in so
doing, calculate the parameters precisely.
 The equation is designed for situations in which the population is infinite or we sample from a
finite population with replacement.
Standard Deviation/Error of the
Sampling Distribution of \(\bar{x}\)
If the condition n/N ≤ 0.05 is not satisfied, we use the following formula to
calculate the standard error:
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}} \]
where the factor \(\sqrt{\frac{N - n}{N - 1}}\) is called the finite population correction factor.

 The formula is designed for the case where the population is finite and we sample without
replacement. Example: employees in a given company, the clients of a city
social-service agency, the students in a specific class.
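The two cases (with and without the finite population correction) can be combined in one small helper. This is a sketch only: the function name `standard_error` and the numbers in the usage lines are hypothetical, and the n/N > 0.05 switch simply encodes the rule stated above.

```python
from math import sqrt

def standard_error(sigma, n, N=None):
    """Standard error of the sample mean.

    Applies the finite population correction when a population size N
    is given and n/N exceeds 0.05.
    """
    se = sigma / sqrt(n)
    if N is not None and n / N > 0.05:
        se *= sqrt((N - n) / (N - 1))
    return se

print(standard_error(10.0, 100))            # population treated as infinite
print(standard_error(10.0, 100, N=5000))    # n/N = 0.02, no correction
print(standard_error(10.0, 100, N=500))     # n/N = 0.20, correction applied
```

The correction factor is always below 1, so it shrinks the standard error when the sample is a sizable share of the population.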
Sampling Distribution

If the population income were distributed with mean μ = $30K and standard deviation σ = $10K:

n = 2,500: \(\sigma_{\bar{X}}\) = $10K/50 = $200

n = 25: \(\sigma_{\bar{X}}\) = $10K/5 = $2,000

[Figure: population distribution centered at $30K.]
…the sampling distribution changes for varying sample sizes
Relationships between Population Parameters and the
Sampling Distribution of the Sample Mean
 The expected value of the sample mean is equal to the population mean:
\[ E(\bar{X}) = \mu_{\bar{X}} = \mu_X \]
 The variance of the sample mean is equal to the population variance divided by the
sample size:
\[ V(\bar{X}) = \sigma^2_{\bar{X}} = \frac{\sigma^2_X}{n} \]
 The standard deviation of the sample mean, known as the standard error of the mean, is
equal to the population standard deviation divided by the square root of the sample size:
\[ SD(\bar{X}) = \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} \]
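The variance result follows in one step if we assume the n observations \(X_1, \dots, X_n\) are independent draws from the population, each with variance \(\sigma_X^2\) (the index i is introduced here only for the derivation):

```latex
\[
V(\bar{X}) = V\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
           = \frac{1}{n^2}\sum_{i=1}^{n} V(X_i)
           = \frac{n\,\sigma_X^2}{n^2}
           = \frac{\sigma_X^2}{n},
\qquad
SD(\bar{X}) = \sqrt{V(\bar{X})} = \frac{\sigma_X}{\sqrt{n}}.
\]
```

Independence is what allows the variance of the sum to be written as the sum of the variances.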
Sampling from a Normal Population
When sampling from a normal population with mean μ and standard deviation σ, the
sample mean, \(\bar{X}\), has a normal sampling distribution:
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
This means that, as the sample size increases, the sampling distribution of the sample mean remains
centered on the population mean, but becomes more compactly distributed around that population mean.

[Figure: Sampling Distribution of the Sample Mean. Density f(X) for the normal population and for the sampling distributions with n = 2, n = 4, and n = 16.]

Problem

 The mean wage per hour for all 5000 employees who work at a large company is
$17.50 and the standard deviation is $2.90. Let \(\bar{x}\) be the mean wage per
hour for a random sample of employees selected from this
company. Find the mean and standard deviation of \(\bar{x}\) for a sample size of
a) 30 b) 75 c) 200
As the standard error decreases, the value of the sample mean will
probably be closer to the value of the population mean.
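A quick way to work this problem is sketched below. In all three cases n/N is at most 0.05, so the simple formula σ/√n applies without the finite population correction; the dictionary name `results` is just a convenience.

```python
from math import sqrt

mu, sigma, N = 17.50, 2.90, 5000

results = {}
for n in (30, 75, 200):
    assert n / N <= 0.05              # so no finite population correction is needed
    results[n] = sigma / sqrt(n)      # standard error of the sample mean
    print(f"n = {n:3d}: mean = {mu:.2f}, standard error = {results[n]:.3f}")
```

The mean of the sampling distribution stays at $17.50 for every sample size, while the standard error shrinks as n grows.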
Important Observations

The spread of the sampling distribution of \(\bar{x}\) is smaller than the spread
of the corresponding population distribution, i.e. \(\sigma_{\bar{x}} < \sigma_x\).
The standard deviation of the sampling distribution of \(\bar{x}\) decreases as the sample
size increases.

It is true that sampling more items will decrease the standard error, but this benefit
may not be worth the cost. The increased precision may not be worth the additional
sampling cost. Managers should always assess both the worth and the cost of the
additional precision they will obtain from a larger sample before they commit
resources to take it.
Central Limit Theorem

The sampling distribution of the mean of a random sample drawn from any
population is approximately normal for a sufficiently large sample size. The
larger the sample size, the more closely the sampling distribution of \(\bar{x}\) will
resemble a normal distribution.
The Central Limit Theorem Applies to Sampling
Distributions from Any Population
[Figure: for normal, uniform, skewed, and general populations, the population distribution and the sampling distribution of \(\bar{X}\) for n = 2 and n = 32.]
Central Limit Theorem

 The Central Limit Theorem is the basis for hypothesis tests such as the Z-test and t-test. In
many cases, we will have access to only a sample, and the inferences about the
population have to be made based on sample statistics.
 An important assumption of Central Limit Theorem is that the random variables
have to be independent and identically distributed.
 Regardless of the population distribution, the sampling distribution of the mean of a large
sample (n > 30) will approximately follow the normal distribution, with the same mean μ as the
population and standard error σ/√n.
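The theorem can be illustrated with a small simulation. The sketch below assumes a strongly skewed population (exponential with mean 1 and standard deviation 1, chosen only for illustration) and checks how often the mean of n = 32 observations lands within 1.96 standard errors of the population mean; a normal sampling distribution would give about 95%.

```python
import random
from math import sqrt
from statistics import mean

rng = random.Random(11)
mu = sigma = 1.0                    # an exponential(1) population has mean 1 and SD 1
n, repetitions = 32, 2_000
se = sigma / sqrt(n)                # standard error of the sample mean

means = [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(repetitions)]
within = sum(abs(m - mu) <= 1.96 * se for m in means) / repetitions
print(f"share of sample means within 1.96 standard errors: {within:.3f}")
```

Even though individual observations are far from normal, the share should come out close to 0.95.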
Problem

Suppose that during any hour in a large department store, the average number
of shoppers is 448, with a standard deviation of 21 shoppers. What is the
probability that a random sample of 49 different shopping hours will yield a
sample mean between 441 and 446 shoppers?
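One way to set up this problem is sketched below, assuming (by the Central Limit Theorem, since n = 49 > 30) that the sample mean is approximately normal with mean 448 and standard error 21/√49 = 3.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 448.0, 21.0, 49
x_bar = NormalDist(mu, sigma / sqrt(n))     # standard error = 21 / 7 = 3
p = x_bar.cdf(446) - x_bar.cdf(441)         # area between z = -2.33 and z = -0.67
print(f"P(441 <= sample mean <= 446) = {p:.4f}")
```

Under these assumptions the probability is roughly 0.24.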
Thank You
