Lecture17 Sampling 1

The document discusses the concepts of population, sample, and sampling in the context of statistical analysis, emphasizing the importance of sampling techniques due to the impracticality of analyzing entire populations. It categorizes sampling methods into non-probabilistic and probabilistic approaches, detailing various techniques such as convenience, judgmental, quota, snowball, systematic, simple random, stratified, and cluster sampling. The document highlights the applications of these sampling methods in real-life scenarios, particularly in big data analytics.


Big Data Visual Analytics (CS 661)

Instructor: Soumya Dutta


Department of Computer Science and Engineering
Indian Institute of Technology Kanpur (IITK)
email: [email protected]
Mar 11, 2023
Population, Sample, and Sampling
• The “population” is the entire set of items from which you draw data for a statistical study (the sampling frame)
• A “sample” is a subset of a population
• “Sampling” is the process of selecting a subset from a population; the selected subset is called a sample
• Goals of sampling:
• Select a subset of data from a large population that may be impossible to handle in full; sometimes we cannot even access the entire population
• Reduce the data while keeping a representative selection for statistical analysis and inference about the population
• Save time, space, and money
• Provide a practical approach to solving challenging problems
[Figure: sampling draws a sample from the population; inference goes from the sample back to the population]

IITK CS661: Big Data Visual Analytics: Soumya Dutta 2


Real-life Motivating Applications
• Applications of sampling and sample-based analysis are widespread
• Sampling is everywhere, since it is impossible to keep all the data

Sampling is the fundamental tool for any kind of survey
• Suppose we want to measure the average height of the males in India
• Based on our capability, we can measure heights for 10,000 males/day
• India has a male population of around 717 million*
• It would take 71,700 days, roughly 196 years!
• Would you do it? Do you think this is a practical approach, even with more resources?
• What if tomorrow I want to know the average height of the female population in India?

Sampling to further scientific discovery
• Suppose we wish to predict the climate accurately for the near future, understand fundamental physics, or assess the impact of an asteroid hitting the Earth
• We use large-scale computational simulations that attempt to model these phenomena accurately
• These simulations generate petabytes (10^15 bytes) of data, soon to reach exabytes (10^18 bytes)
• We simply cannot keep or analyze all the data, and even if we tried, the cost and resources required would be prohibitive

How can/should we sample big data to achieve our goals above?

* https://statisticstimes.com/demographics/country/india-sex-ratio.php
Classification of Sampling Techniques
• Non-probabilistic approaches: convenience, judgmental, quota, and snowball sampling
• Items are selected without considering their probability of occurrence
• Probabilistic approaches: simple random, systematic, stratified, and cluster sampling
• Items are selected based on their occurrence in the population
• Prevalent in data science applications
• Give good estimates of statistics
• Advanced/multi-stage approaches: rejection, importance-based, and blue noise sampling, among many other approaches
• Largely probabilistic
• Often data-driven
• Sometimes application-specific


Non-Probabilistic Sampling Approaches


Convenience sampling
• It is one of the easiest and most common forms of sampling
• Sample observations are selected based on ease of accessibility and convenience
• The sample is not a true representation of the population
• Generalization and statistical inference using samples generated by convenience sampling may not be accurate
• Can be used for an initial or informal pilot study
• Also known as grab sampling or accidental sampling



Judgmental sampling
• It is a non-probabilistic method where existing knowledge is used to select sample observations from the population
• The sample is not a true representation of the population
• Generalization and statistical inference using samples generated by judgmental sampling may not be accurate
• Judgmental sampling can be computationally cheaper than other methods and yields the sample set the user is interested in



Quota sampling
• It is a non-probabilistic method where sample observations are selected based on a pre-defined ‘quota’
• First, the population is divided into mutually exclusive groups based on certain characteristics and traits
• Then, judgmental sampling is performed inside each group to select observations that satisfy a pre-defined criterion
• The selected sample may be biased
• This is a non-probabilistic version of stratified sampling



Snowball sampling
• It is a non-probabilistic sampling method where the currently selected observations dictate how subsequent observations will be selected
• This is also known as chain sampling or referral sampling
• The set of sampled observations grows like a rolling snowball, hence the name
• The sampling process starts with a small pool of observations, and the selection then propagates via nominations made by the initial observations
• This method is heavily used in social computing and graph sampling applications
• It produces a biased estimate of the population but can often reveal hidden patterns
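
The propagation-by-nomination idea can be sketched on a graph as follows. This is only an illustrative sketch: the function name `snowball_sample`, the `referrals` parameter (how many neighbours each sampled node nominates), and the ring-shaped toy graph are all assumptions made for the example, not part of the lecture.

```python
import random
from collections import deque

def snowball_sample(graph, seeds, max_size, referrals=2, rng=None):
    """graph: dict node -> list of neighbours; seeds: the initial small pool."""
    rng = rng or random.Random()
    sampled = list(dict.fromkeys(seeds))   # keep order, drop duplicate seeds
    seen = set(sampled)
    queue = deque(sampled)
    while queue and len(sampled) < max_size:
        node = queue.popleft()
        # Each sampled node nominates a few of its not-yet-sampled neighbours.
        candidates = [v for v in graph[node] if v not in seen]
        for v in rng.sample(candidates, min(referrals, len(candidates))):
            if len(sampled) >= max_size:
                break
            seen.add(v)
            sampled.append(v)
            queue.append(v)
    return sampled

# Hypothetical graph: a ring of 30 nodes, each linked to its 2 nearest nodes per side.
graph = {i: [(i + d) % 30 for d in (-2, -1, 1, 2)] for i in range(30)}
sample = snowball_sample(graph, seeds=[0], max_size=10, rng=random.Random(2))
print(sample)
```

Because every new observation is reached through an earlier one, the sample stays concentrated around the seeds, which is one way to see where the bias comes from.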



Probabilistic Sampling Approaches


Systematic Sampling
• Observations or data points are selected at regular intervals from the population
• Steps:
• Calculate the sampling interval I = N/n, where N is the population size and n is the desired sample size
• Draw a random number r (1 <= r <= I) for the starting data point
• Starting from r, draw every Ith data point
• Gives good representativeness of the population in the selected sample, provided the ordering of the data has no periodic pattern that matches the interval
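
The steps above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `systematic_sample` is made up for illustration, the interval is taken as the integer part of N/n, and indices are 0-based, so the random start is drawn from [0, I) instead of [1, I].

```python
import random

def systematic_sample(population, n, rng=None):
    """Select n items at a regular interval I = N // n from a random start."""
    rng = rng or random.Random()
    N = len(population)
    if not 1 <= n <= N:
        raise ValueError("need 1 <= n <= N")
    interval = N // n                      # sampling interval I
    start = rng.randrange(interval)        # random starting point in [0, I)
    # Take every I-th item; trim to n in case N is not a multiple of n.
    return population[start::interval][:n]

points = list(range(81))                   # e.g. a flattened 9 x 9 grid
sample = systematic_sample(points, 9, random.Random(0))
print(sample)
```

Only the starting point is random; once it is fixed, the rest of the sample is fully determined by the interval.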

[Figure: 9 x 9 = 81 data points before sampling; 25 data points selected after systematic sampling]


Simple Random Sampling (SRS)
• The most basic sampling technique, widely used and with favorable properties
• Provides the theoretical basis for the more complicated methods
• Idea: every item/point in the population has an equal probability of being selected
• If we have N points and wish to select a sample of n points (n <= N), each point has probability 1/N of being picked on the first draw and probability n/N of ending up in the sample
• In practice, we can generate random numbers over the indices of the points to select the desired number of points
• Random sampling gives unbiased estimates about the population
• A statistic estimated on the sample closely reflects the corresponding statistic of the population
• For example, the mean, variance, and higher-order moments
[Figure: population with number of points = 100000, mean = 5.0, standard deviation = 2.0; 10% sample with mean = 4.974, standard deviation = 2.008]
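
An experiment of this kind can be sketched with the Python standard library. The population below is synthetic, drawn from a Normal distribution with mean 5 and standard deviation 2 purely as an assumption for illustration, and the seed is fixed only so that the run is repeatable.

```python
import random
import statistics

rng = random.Random(42)                    # fixed seed, for repeatability only
# Hypothetical population: 100,000 draws from Normal(mean=5, sd=2).
population = [rng.gauss(5.0, 2.0) for _ in range(100_000)]

n = len(population) // 10                  # 10% sample
# random.sample draws n distinct items; every subset of size n is equally likely.
sample = rng.sample(population, n)

pop_mean, pop_sd = statistics.fmean(population), statistics.pstdev(population)
smp_mean, smp_sd = statistics.fmean(sample), statistics.stdev(sample)
print(f"population: mean={pop_mean:.3f} sd={pop_sd:.3f}")
print(f"sample:     mean={smp_mean:.3f} sd={smp_sd:.3f}")
```

The sample statistics should land close to the population values, with the gap shrinking as n grows.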
[Figure: 9 x 9 = 81 data points before sampling; 25 data points selected at random after sampling]


Randomization Theory for Simple Random Sampling
• Simple Random Sampling (SRS) gives an unbiased estimator of the population mean
• Let us show that $\bar{y}$ (sample mean) is an unbiased estimator of $\bar{y}_U$ (population mean)
• We are selecting $n$ items from a population of size $N$; $S$ is the set of items selected
• Let $Z_i$ be an indicator random variable defined as follows:

\[ Z_i = \begin{cases} 1 & \text{if item } i \text{ is in the sample} \\ 0 & \text{otherwise} \end{cases} \]

Then we have

\[ \bar{y} = \sum_{i \in S} \frac{y_i}{n} = \sum_{i=1}^{N} Z_i \frac{y_i}{n} \]

When we select $n$ items out of the $N$ items in the population, $\{Z_1, Z_2, \ldots, Z_N\}$ are identically distributed Bernoulli random variables, and the probability that item $i$ is selected is:

\[ P(Z_i = 1) = P(\text{select item } i \text{ in sample}) = \frac{n}{N} \]

We also see that, since $Z_i$ is a Bernoulli random variable,

\[ E[Z_i] = P(Z_i = 1) = \frac{n}{N} \]

and finally, we have

\[ E[\bar{y}] = E\left[ \sum_{i=1}^{N} Z_i \frac{y_i}{n} \right] = \sum_{i=1}^{N} \frac{y_i}{n} E[Z_i] = \sum_{i=1}^{N} \frac{y_i}{n} \cdot \frac{n}{N} = \sum_{i=1}^{N} \frac{y_i}{N} = \bar{y}_U \]
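
The result can be checked exactly on a tiny population by enumerating every possible sample. The sketch below assumes a made-up population of five values and n = 2; it uses exact fractions, so the comparison involves no floating-point error.

```python
from fractions import Fraction
from itertools import combinations

y = [3, 7, 8, 12, 20]                      # hypothetical population values
N, n = len(y), 2

# Under SRS every size-n subset is equally likely, so E[sample mean] is the
# plain average of the sample means over all C(N, n) subsets.
sample_means = [Fraction(sum(s), n) for s in combinations(y, n)]
expected_sample_mean = sum(sample_means) / len(sample_means)
population_mean = Fraction(sum(y), N)

# Inclusion probability of item 0: fraction of subsets that contain index 0.
subsets = list(combinations(range(N), n))
p_include = Fraction(sum(0 in s for s in subsets), len(subsets))

print(expected_sample_mean, population_mean, p_include)
```

Both the expected sample mean and the inclusion probability match the formulas above: the first equals the population mean and the second equals n/N.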


Stratified Sampling
• Classify the population into several homogeneous strata
• This process is called stratification
• Determine the sample size
• Randomly sample points from each stratum
• Proportionate sampling: each stratum contributes in proportion to its share of the population
• Disproportionate sampling: the number drawn from a stratum is not tied to its size (for example, the same number from every stratum)
• Combine the sampled results from all strata
[Figure: a population divided into strata, sampled proportionately and disproportionately]
Cluster Sampling
• The population is first clustered into mutually exclusive, heterogeneous groups
• The clustering is done based on some global criteria
• Each cluster should represent the population as well as possible
• Clusters are internally heterogeneous but externally homogeneous: each cluster contains a mix of items, and the clusters resemble one another
• The sample is then drawn from one or more randomly selected clusters
Source: https://statisticsbyjim.com/basics/cluster-sampling/


Single-stage Cluster Sampling
• After the clusters are formed, one or more clusters are selected at random
• All the data points in the selected clusters are combined for analysis
• This is suitable when the data set is not too large and a subset of clusters can be handled efficiently



Two-stage Cluster Sampling
• After the clusters are formed, one or more clusters are selected at random
• Simple random sampling is performed inside each selected cluster to pick a subset of its data points
• All the selected points are combined for analysis
• This is a more practical approach that can handle relatively large data sets, since two steps of filtering are applied
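
The two stages can be sketched as follows, assuming the clusters have already been formed and are given as a dictionary. The function name `two_stage_cluster_sample` and the toy data are hypothetical.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng=None):
    """clusters: dict mapping cluster id -> list of data points."""
    rng = rng or random.Random()
    # Stage 1: choose whole clusters at random.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for cid in chosen:
        members = clusters[cid]
        # Stage 2: simple random sampling inside each chosen cluster.
        sample.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return chosen, sample

# Hypothetical data: 10 clusters of 100 points each.
clusters = {c: [(c, i) for i in range(100)] for c in range(10)}
chosen, sample = two_stage_cluster_sample(clusters, 3, 20, random.Random(7))
print(chosen, len(sample))
```

Setting `n_per_cluster` to at least the cluster size keeps every point of the chosen clusters, which reduces this to single-stage cluster sampling.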



Cluster Sampling: Advantages
• If the cluster generation process produces clusters that closely resemble the entire population, so that each cluster represents it well, cluster sampling can produce reliable results
• Often suitable for large-scale data sets
• Applicable when the entire population is impossible to access



Advanced Sampling Approaches


Inverse Transform Sampling
• What happens behind the scenes when we generate sample points from a specific type of distribution?
• Set up:
• We have numbers $u$ between 0 and 1 coming from a Uniform(0, 1) distribution
• We want points $x$ that follow an Exponential($\lambda$) distribution
• We want a transformation $x = T(u)$ that maps the uniform numbers to the Exponential distribution
[Figure: PDF of the exponential distribution; CDF of the exponential distribution]

So, we have $u = F_X(x) = 1 - e^{-\lambda x}$, hence $x = F_X^{-1}(u) = -\frac{1}{\lambda} \ln(1 - u)$
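
For the exponential case, the inverse-CDF transformation takes only a few lines. This is a sketch under the assumption that the rate is written as `lam`; the function name `sample_exponential` is made up for the example.

```python
import math
import random

def sample_exponential(lam, n, rng=None):
    """Draw n Exponential(lam) values via x = -ln(1 - u) / lam, u ~ Uniform[0, 1)."""
    rng = rng or random.Random()
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and the log is defined.
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

lam = 2.0
xs = sample_exponential(lam, 50_000, random.Random(0))
mean = sum(xs) / len(xs)                   # should be close to 1 / lam
print(f"sample mean = {mean:.3f}, 1/lambda = {1 / lam:.3f}")
```

The same pattern works for any distribution whose CDF can be inverted in closed form; only the expression inside the list comprehension changes.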


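Remember: Distribution Transformation Property
• If $U$ is a uniform random variable (i.e., $U \sim \text{Uniform}(0, 1)$) and $F_X$ is the CDF of a random variable $X$, then applying the inverse function $F_X^{-1}$ to $U$ yields the random variable $X$ (i.e., $F_X^{-1}(U) \sim X$)
[Figure: a CDF curve with $U$ on the vertical axis (0.0 to 1.0) and $X$ on the horizontal axis (-2.0 to 2.0); $F_X^{-1}$ maps a value of $U$ to a value of $X$]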
Rejection Sampling
• Rejection sampling is a method for generating samples from a distribution with density $f(x)$ by drawing sample points from another distribution $h(x)$ that is easier to sample
• Useful when we do not have $F^{-1}$ or when the CDF of $f(x)$ is difficult to compute
• Steps:
• Generate a sample point $x$ from $h(x)$
• Accept the sample point with acceptance probability $\frac{f(x)}{C \cdot h(x)}$, where $C$ is a constant chosen to ensure $C \cdot h(x) \geq f(x)$ for all $x$
• Intuition: a high acceptance probability for a specific sample point drawn from $h(x)$ indicates that the sample is highly likely under the distribution $f(x)$, so we accept it
• $C$ is a normalizing constant that ensures $\frac{f(x)}{C \cdot h(x)} \leq 1$, as the ratio is interpreted as a probability value
[Figure: density versus values; $f(x)$, the target function, in red and $C \cdot h(x)$, a scaled Normal distribution, in black]
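
The two steps can be sketched as follows. To keep the example short, it assumes a Beta(2, 2) target density on [0, 1] and a Uniform(0, 1) proposal, not the Normal proposal shown in the slide's figure; the names `f`, `h`, and `C` follow the slide, and everything else is illustrative.

```python
import random

def f(x):
    """Target density: Beta(2, 2) on [0, 1], chosen here only as an example."""
    return 6.0 * x * (1.0 - x)

def h(x):
    """Proposal density: Uniform(0, 1)."""
    return 1.0

C = 1.5                                    # max of f is 1.5, so C * h(x) >= f(x)

def rejection_sample(n, rng=None):
    rng = rng or random.Random()
    accepted, tries = [], 0
    while len(accepted) < n:
        x = rng.random()                   # draw a candidate from h
        tries += 1
        if rng.random() < f(x) / (C * h(x)):   # accept with prob f / (C * h)
            accepted.append(x)
    return accepted, tries

samples, tries = rejection_sample(20_000, random.Random(3))
mean = sum(samples) / len(samples)         # Beta(2, 2) has mean 0.5
acceptance_rate = len(samples) / tries     # expected to be near 1 / C
print(f"mean = {mean:.3f}, acceptance rate = {acceptance_rate:.3f}")
```

The closer $C \cdot h(x)$ hugs $f(x)$, the smaller $C$ can be and the fewer candidates are wasted.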
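Why Does Rejection Sampling Work?
• We have $x \sim h(x)$, and the PDF of the target is $f(x)$. We do not know how to sample from $f(x)$ directly!
• Acceptance probability for a given point: $P(\text{accept} \mid x) = \frac{f(x)}{C \cdot h(x)}$
• Overall acceptance probability: $P(\text{accept}) = \int \frac{f(x)}{C \cdot h(x)} \, h(x) \, dx = \frac{1}{C}$, since $f(x)$ integrates to 1
• From Bayes’ theorem:

\[ p(x \mid \text{accept}) = \frac{P(\text{accept} \mid x) \, h(x)}{P(\text{accept})} = \frac{\frac{f(x)}{C \cdot h(x)} \, h(x)}{1 / C} \]

So, $p(x \mid \text{accept}) = f(x)$: the accepted points follow the target distribution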


Blue Noise Sampling
• White noise: random noise with a flat power spectral density (equal energy at all frequencies)
• Blue noise: characterized by a power spectral density that increases with frequency
• Has little energy at lower frequencies and progressively more energy at higher frequencies
• Blue noise sampling: a distribution of points where the energy at low frequencies is minimized, creating points that are evenly distributed, without large gaps or clumps, and visually pleasing
[Figure: point sets produced by systematic, random, and blue noise sampling]
Blue Noise Sampling
• Poisson disk / dart throwing for blue noise sampling
• No two samples may lie within a distance r of each other
• Candidate points are picked from a uniform distribution; a candidate that obeys the minimum-distance property with respect to the points currently in the set is kept, while the others are discarded
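
A naive version of dart throwing on the unit square can be sketched as below. The function name `dart_throwing` and the fixed budget `max_tries` are assumptions made for the example; each candidate is compared against every kept point, so this brute-force form is only practical for small point sets.

```python
import math
import random

def dart_throwing(r, max_tries=5000, rng=None):
    """Poisson-disk points in the unit square by naive dart throwing."""
    rng = rng or random.Random()
    points = []
    for _ in range(max_tries):
        candidate = (rng.random(), rng.random())   # uniform candidate
        # Keep the candidate only if it is at least r away from every kept point.
        if all(math.dist(candidate, p) >= r for p in points):
            points.append(candidate)
    return points

points = dart_throwing(0.1, rng=random.Random(5))
print(len(points), "points kept")
```

As the square fills up, most darts are rejected, which is why practical implementations usually add a spatial grid or a smarter candidate generator.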
