Lecture17 Sampling 1
• A “sample” is a subset of a population
• “Sampling” is the process of selecting such a subset from a population
• Goal of sampling:
• The primary objective of sampling is to select a subset of data from a large population that may be impossible to handle in full; sometimes we cannot even access the entire population
• Data reduction and representative selection for performing statistical analysis and inference about the population
• Save time, space, and money
• Practical approach for solving challenging problems
[Figure: a sample is drawn from the population by sampling; inference from the sample is made about the population]
Sampling is the fundamental tool for any kind of survey
• Suppose we want to measure the average height of the males in India
• Based on our capability, we can measure heights for 10,000 males/day
• India has a male population of around 717 million*
• It would take 71,700 days, roughly 196 years!
• Would you do it? Do you think this is a practical approach, even with more resources?
• What if tomorrow I want to know the average height of the female population in India?
Sampling to further scientific discovery
• Suppose we wish to predict climate accurately for the near future, or want to understand fundamental physics, or want to assess the impact of an asteroid hitting our earth!
• We use large-scale computational simulations that attempt to model these phenomena accurately
• These simulations generate petabytes ($10^{15}$ bytes) of data, soon to reach exabytes ($10^{18}$ bytes)
• We simply cannot keep/analyze all the data, and even if we try, the cost and resources will be prohibitive
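As a rough check of the height-survey arithmetic, the short Python sketch below recomputes the time a full census would take. The function name `census_duration` is hypothetical and introduced only for illustration; the population and throughput figures are the ones quoted in the bullets, and a 365-day year is assumed.

```python
def census_duration(population, per_day, days_per_year=365):
    """Return (days, years) needed to measure every member of the population."""
    days = population / per_day
    return days, days / days_per_year

# Figures quoted in the slide: about 717 million males, 10,000 measurements/day.
days, years = census_duration(717_000_000, 10_000)
print(f"{days:,.0f} days, about {years:.1f} years")

# For comparison: a sample of 10,000 people would take a single day.
sample_days, _ = census_duration(10_000, 10_000)
print(f"sample of 10,000: {sample_days:.0f} day")
```

The gap between the two numbers is the practical argument for sampling: the census cost grows with the population size, while the sample cost depends only on the sample size.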
[Figure: 9 × 9 = 81 data points before sampling; 25 data points selected after sampling]
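Let
$$Z_i = \begin{cases} 1 & \text{if item } i \text{ is in the sample} \\ 0 & \text{otherwise} \end{cases}$$
Then we have
$$\bar{y} = \sum_{i \in S} \frac{y_i}{n} = \sum_{i=1}^{N} Z_i \frac{y_i}{n}$$
When we select $n$ items out of the $N$ items in the population, $\{Z_1, Z_2, \ldots, Z_N\}$ are identically distributed Bernoulli random variables, and the probability that item $i$ is selected is:
$$P(Z_i = 1) = P(\text{select item } i \text{ in sample}) = \frac{n}{N}$$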
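$$E[Z_i] = P(Z_i = 1) = \frac{n}{N}$$
and finally, we have
$$E[\bar{y}] = E\left[\sum_{i=1}^{N} Z_i \frac{y_i}{n}\right] = \sum_{i=1}^{N} \frac{y_i}{n} E[Z_i] = \sum_{i=1}^{N} \frac{y_i}{n} \cdot \frac{n}{N} = \sum_{i=1}^{N} \frac{y_i}{N} = \bar{y}_U$$
where $\bar{y}_U$ is the population mean.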
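The result can be checked exhaustively on a tiny population. The sketch below assumes that every subset of size n is equally likely (simple random sampling without replacement), which the slide does not state explicitly. The function name `enumerate_srs` and the six height values are hypothetical, chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

def enumerate_srs(y, n):
    """Enumerate all size-n samples of y; return inclusion probabilities and E[sample mean]."""
    N = len(y)
    samples = list(combinations(range(N), n))
    # P(Z_i = 1): fraction of all possible samples that contain item i
    incl = [Fraction(sum(i in s for s in samples), len(samples)) for i in range(N)]
    # E[y-bar]: average of the sample mean over all equally likely samples
    exp_mean = sum(Fraction(sum(y[i] for i in s), n) for s in samples) / len(samples)
    return incl, exp_mean

y = [150, 160, 165, 170, 172, 180]   # hypothetical population, N = 6
incl, exp_mean = enumerate_srs(y, n=3)
pop_mean = Fraction(sum(y), len(y))

print(incl[0] == Fraction(3, 6))     # inclusion probability equals n/N
print(exp_mean == pop_mean)          # the sample mean is unbiased
```

Exact fractions are used instead of floating point so that both equalities can be compared without a tolerance.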
[Figure: population divided into strata; panels: proportionate sampling, disproportionate sampling]
IITK CS661: Big Data Visual Analytics: Soumya Dutta 25
Cluster Sampling
• The population is first divided into mutually exclusive, heterogeneous groups (clusters)
• The clustering is done based on some global criteria
• Each cluster must represent the population as well as possible
• Clusters are internally heterogeneous but externally homogeneous (similar to one another)
• A sample is then drawn from one or more randomly selected clusters
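The procedure above can be sketched in a few lines of Python. This is a minimal one-stage variant, assuming every unit of each chosen cluster is kept; the slide also allows drawing a further sample inside the chosen clusters. The function name `cluster_sample` and the school/student data are hypothetical.

```python
import random

def cluster_sample(clusters, k, seed=None):
    """Pick k clusters at random and return their names plus all of their units."""
    rng = random.Random(seed)
    # Randomly select k cluster names without replacement
    chosen = rng.sample(sorted(clusters), k)
    # One-stage assumption: keep every unit of each selected cluster
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical population already grouped into mutually exclusive clusters
clusters = {
    "school_A": ["a1", "a2", "a3"],
    "school_B": ["b1", "b2"],
    "school_C": ["c1", "c2", "c3", "c4"],
    "school_D": ["d1", "d2"],
}
chosen, units = cluster_sample(clusters, k=2, seed=1)
print(chosen, units)
```

Only the cluster names are randomized, so the cost of sampling is driven by the number of clusters visited rather than by the size of the population.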
[Figure: a value $U$ on the vertical axis (0.0 to 1.0) is mapped through $F_X^{-1}$ to a value on the $X$ axis]
Rejection Sampling
• Rejection sampling is a method for generating samples from a distribution with density $f(x)$ by drawing sample points from another distribution $h(x)$ that is easier to sample
• Useful when we do not have $F_X^{-1}$ or the CDF is difficult to compute
• Steps:
• Generate a sample point $x$ from $h(x)$
• Accept the sample point with acceptance probability $\frac{f(x)}{C \, h(x)}$
• $C$ is a constant chosen to ensure $C \, h(x) \geq f(x)$
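The steps can be sketched in Python as follows. The target density f(x) = 2x on [0, 1], the uniform proposal h(x) = 1, and the constant C = 2 are assumptions made only for this illustration, as is the function name `rejection_sample`; they are not taken from the lecture.

```python
import random

def rejection_sample(f, h, sample_h, C, n, seed=0):
    """Draw n points from density f using proposal density h, where C*h(x) >= f(x)."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n:
        x = sample_h(rng)                     # step 1: draw a candidate from h
        if rng.random() < f(x) / (C * h(x)):  # step 2: accept with prob f(x) / (C*h(x))
            accepted.append(x)
    return accepted

def f(x):
    return 2.0 * x      # hypothetical target density on [0, 1]

def h(x):
    return 1.0          # uniform proposal density on [0, 1]

def sample_h(rng):
    return rng.random() # draws from the uniform proposal

draws = rejection_sample(f, h, sample_h, C=2.0, n=5000)
mean = sum(draws) / len(draws)
print(f"sample mean = {mean:.3f} (theoretical mean of f is 2/3)")
```

With these choices the acceptance probability simplifies to x, so candidates near 1 are kept more often than candidates near 0, which is what reshapes the uniform draws into the target density.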
So,