Lecture17 Sampling 1

The document discusses the concepts of population, sample, and sampling in the context of statistical analysis, emphasizing the importance of sampling techniques due to the impracticality of analyzing entire populations. It categorizes sampling methods into non-probabilistic and probabilistic approaches, detailing various techniques such as convenience, judgmental, quota, snowball, systematic, simple random, stratified, and cluster sampling. The document highlights the applications of these sampling methods in real-life scenarios, particularly in big data analytics.

Uploaded by

okstudyshivi

Big Data Visual Analytics (CS 661)

Instructor: Soumya Dutta

Department of Computer Science and Engineering
Indian Institute of Technology Kanpur (IITK)
email: [email protected]
Mar 11, 2023
Population, Sample, and Sampling
• “Population” is the entire set of items from which you draw data for a statistical study (sampling frame)
• A “sample” is a subset of a population
• “Sampling” is the process of selecting a subset from a population; the selected subset is called the sample
• Goal of sampling:
• The primary objective of sampling is to select a subset of data from a large population that may be impossible to handle in full, and to which we sometimes cannot even get complete access
• Data reduction and representative selection for performing statistical analysis and inference about the population
• Save time, space, and money
• Practical approach for solving challenging problems
[Figure: sampling draws a sample from the population; inference from the sample is made about the population]

IITK CS661: Big Data Visual Analytics: Soumya Dutta 2

Real-life Motivating Applications
• Applications of sampling and sample-based analysis are widespread
• Sampling is everywhere since it is impossible to keep all the data

Sampling is the fundamental tool for any kind of survey
• Suppose we want to measure the average height of the males in India
• Based on our capability, we can measure heights for 10,000 males/day
• India has a male population of around 717 million*
• It would take 71,700 days, roughly 196 years!
• Would you do it? Do you think this is a practical approach even with more resources?
• What if tomorrow I want to know the average height of the female population in India?

Sampling to further scientific discovery
• Suppose we wish to predict climate accurately for the near future, or want to understand fundamental physics, or want to assess the impact of an asteroid hitting our earth!
• We use large-scale computational simulations that attempt to model these phenomena accurately
• These simulations generate petabytes ($10^{15}$ bytes) of data, soon to reach exabytes ($10^{18}$ bytes)
• We simply cannot keep/analyze all the data, and even if we try, the cost and resources will be prohibitive

How can/should we sample big data to achieve our goals above?

* https://statisticstimes.com/demographics/country/india-sex-ratio.php

Classification of Sampling Techniques
• Non-probabilistic approaches
• Items are selected without considering their probability of occurrence
• Probabilistic approaches
• Items are selected based on their occurrence in the population
• Prevalent in Data Science applications
• Give good estimates of population statistics
• Advanced approaches
• Largely probabilistic
• Often data-driven
• Sometimes application-specific

Non-probabilistic Approaches: Convenience Sampling, Judgmental Sampling, Quota Sampling, Snowball Sampling
Probabilistic Approaches: Simple Random Sampling, Systematic Sampling, Stratified Sampling, Cluster Sampling
Advanced/Multi-stage Approaches: Rejection Sampling, Importance-based Sampling, Blue Noise Sampling, many other approaches

Non-Probabilistic Sampling Approaches

Convenience sampling
• It is one of the easiest and most common forms of sampling
• Sample observations are selected based on ease of accessibility and convenience
• The sample is not a true representation of the population
• Generalization and statistical inference using samples generated by convenience sampling may not be accurate
• Can be used for an initial or informal pilot study
• Also known as grab sampling or accidental sampling

Judgmental sampling
• It is a non-probabilistic method where existing knowledge is used to select sample observations from the population
• The sample is not a true representation of the population
• Generalization and statistical inference using samples generated by judgmental sampling may not be accurate
• Judgmental sampling can be computationally less expensive than other methods and yields a sample concentrated on what the user is interested in

Quota sampling
• It is a non-probabilistic method where sample observations are selected based on some pre-defined ‘quota’
• First, the population is divided into mutually exclusive groups based on certain characteristics and traits
• Then judgmental sampling is performed inside each group to select observations until the pre-defined quota is satisfied
• May introduce bias in the selected sample
• This is a non-probabilistic version of stratified sampling

Snowball sampling
• It is a non-probabilistic sampling method where the currently selected observations dictate how subsequent observations will be selected
• This is also known as chain sampling or referral sampling
• The set of sampled observations grows like a rolling snowball, hence the name
• The sampling process starts with a small pool of observations, and the selection then propagates via nominations made by the observations already selected
• This method is heavily used in social computing and graph sampling applications
• Produces a biased estimate of the population but can often reveal hidden patterns
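The propagation by nomination described above can be sketched on a graph. This is only an illustration under stated assumptions, not a method taken from the lecture: the function name `snowball_sample`, the adjacency-dictionary representation, and the `referrals` cap on nominations per node are all hypothetical choices made for the example.

```python
import random

def snowball_sample(graph, seeds, target_size, referrals=2, seed=None):
    """Grow a sample wave by wave: each sampled node nominates unseen neighbours."""
    rng = random.Random(seed)
    sampled = list(dict.fromkeys(seeds))   # initial pool, duplicates removed
    seen = set(sampled)
    frontier = list(sampled)
    while frontier and len(sampled) < target_size:
        next_frontier = []
        for node in frontier:
            # Neighbours not yet sampled are the possible nominations.
            candidates = [v for v in dict.fromkeys(graph.get(node, []))
                          if v not in seen]
            rng.shuffle(candidates)
            for v in candidates[:referrals]:
                if len(sampled) >= target_size:
                    break
                seen.add(v)
                sampled.append(v)
                next_frontier.append(v)
        frontier = next_frontier
    return sampled

# Hypothetical toy graph: 20 nodes arranged in a ring.
ring = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
picked = snowball_sample(ring, seeds=[0], target_size=8, seed=1)
```

Every sampled node is reachable from the seed, which is exactly why the result is biased toward the seed's neighbourhood.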

Probabilistic Sampling Approaches

Systematic Sampling
• Observations or data points are selected at regular intervals from the population
• Steps:
• Calculate the sampling interval $I = N/n$, where $N$ is the population size and $n$ is the desired sample size
• Draw a random number $r \le I$ as the starting data point
• Starting from $r$, draw every $I$-th data point
• Gives a sample that represents the population well, provided the ordering of the data has no periodic pattern that coincides with the interval
[Figure: 9 × 9 = 81 data points before sampling; 25 data points selected after sampling]
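The three steps can be sketched in a few lines. This is a minimal illustration assuming the population is an indexable list; the function name `systematic_sample`, the rounding-down of the interval, and the toy population are assumptions for the example, and indices are 0-based here.

```python
import random

def systematic_sample(items, n, seed=None):
    """Select n items at a regular interval after a random start."""
    N = len(items)
    if not 0 < n <= N:
        raise ValueError("n must satisfy 0 < n <= len(items)")
    rng = random.Random(seed)
    interval = N // n                 # sampling interval I = N / n, rounded down
    start = rng.randrange(interval)   # random 0-based start among the first I items
    # Every I-th item from the start; truncate in case N is not a multiple of n.
    return items[start::interval][:n]

population = list(range(1, 101))      # hypothetical population: IDs 1..100
sample = systematic_sample(population, n=10, seed=42)
```

Only the starting point is random, so the whole sample is determined by a single draw.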

Simple Random Sampling (SRS)
• The most basic sampling technique, widely used and with favorable properties
• Provides the theoretical basis for the more complicated methods
• Idea: every item/point in the population has equal probability of being selected
• If we have $N$ points and we wish to select a sample of $n$ points ($n \le N$), then each point has probability $1/N$ of being picked on the first draw and probability $n/N$ of ending up in the sample
• In practice, we can generate random numbers over the indices of the points to select the desired number of points
• Random sampling gives unbiased estimates about the population
• A statistic estimated on the sample faithfully reflects the corresponding statistic of the population
• Mean, variance, higher-order moments, etc.
[Figure: a population of 100,000 points with mean 5.0 and standard deviation 2.0; a 10% sample has mean 4.974 and standard deviation 2.008]
• Randomly select points from the population
[Figure: 9 × 9 = 81 data points before sampling; 25 data points selected after sampling]
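The experiment in the figure can be approximated with the standard library. This is a sketch, not the instructor's code: the seeds, the Gaussian population, and the helper name `simple_random_sample` are assumptions, so the exact numbers will differ from those on the slide.

```python
import random
import statistics

def simple_random_sample(items, n, seed=None):
    """Draw n distinct items; every size-n subset is equally likely."""
    return random.Random(seed).sample(items, n)

# Hypothetical population: 100,000 values with mean 5.0 and std. dev. 2.0.
rng = random.Random(0)
population = [rng.gauss(5.0, 2.0) for _ in range(100_000)]

# A 10% simple random sample, drawn without replacement.
sample = simple_random_sample(population, 10_000, seed=1)

pop_mean, pop_std = statistics.mean(population), statistics.pstdev(population)
smp_mean, smp_std = statistics.mean(sample), statistics.stdev(sample)
```

The sample mean and standard deviation should land close to the population values, in line with the unbiasedness claim above.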

Randomization Theory for Simple Random Sampling
• Simple Random Sampling (SRS) gives an unbiased estimator of the population mean
• Let us show that $\bar{y}$ (sample mean) is an unbiased estimator of $\bar{y}_U$ (population mean)
• We are selecting $n$ items from a population of size $N$; $S$ is the set of items selected
• Let $Z_i$ be an indicator random variable defined as follows:

$$Z_i = \begin{cases} 1 & \text{if item } i \text{ is in the sample} \\ 0 & \text{otherwise} \end{cases}$$

Then we have

$$\bar{y} = \sum_{i \in S} \frac{y_i}{n} = \sum_{i=1}^{N} Z_i \frac{y_i}{n}$$

When we select $n$ items out of the $N$ items in the population, $\{Z_1, Z_2, \ldots, Z_N\}$ are identically distributed Bernoulli random variables, and the probability that item $i$ is selected is:

$$P(Z_i = 1) = P(\text{select item } i \text{ in sample}) = \frac{n}{N}$$

We also see that, since $Z_i$ is a Bernoulli random variable,

$$E[Z_i] = P(Z_i = 1) = \frac{n}{N}$$

and finally, we have

$$E[\bar{y}] = E\left[\sum_{i=1}^{N} Z_i \frac{y_i}{n}\right] = \sum_{i=1}^{N} E[Z_i] \frac{y_i}{n} = \sum_{i=1}^{N} \frac{n}{N} \frac{y_i}{n} = \sum_{i=1}^{N} \frac{y_i}{N} = \bar{y}_U$$
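For a small population the result can be checked by brute force, by enumerating every possible sample instead of simulating. The five values below are hypothetical, chosen only to keep the enumeration tiny.

```python
from itertools import combinations
from statistics import mean

y = [2.0, 3.0, 5.0, 7.0, 11.0]     # hypothetical population values y_1..y_N
N, n = len(y), 2

# Under SRS without replacement, every size-n index set S is equally likely.
all_samples = list(combinations(range(N), n))

# E[y_bar]: average of the sample mean over all equally likely samples.
expected_sample_mean = mean(mean(y[i] for i in S) for S in all_samples)
population_mean = mean(y)

# P(Z_i = 1): fraction of samples that contain item i (checked here for i = 0).
inclusion_prob = sum(1 for S in all_samples if 0 in S) / len(all_samples)
```

Here `expected_sample_mean` should equal `population_mean`, and `inclusion_prob` should equal `n / N`, matching the two identities in the derivation.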

Stratified Sampling
• Classify the population into several homogeneous strata
• This process is called stratification
• Determine the sample size
• Randomly sample points from each stratum
• Proportionate sampling: the number of points drawn from each stratum is proportional to the size of that stratum
• Disproportionate sampling: the number of points drawn from each stratum is not tied to the size of that stratum
• Combine the sampled results from all strata
[Figure: a population divided into strata, sampled proportionately and disproportionately]
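Proportionate stratified sampling can be sketched as follows. The function name `stratified_sample`, the `key` callback that assigns each item to a stratum, and the rule of taking at least one item per stratum are assumptions made for this illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(items, key, fraction, seed=None):
    """Proportionate stratified sampling: draw the same fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:                       # stratification step
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))   # at least one per stratum
        sample.extend(rng.sample(members, k))        # SRS inside the stratum
    return sample

# Hypothetical population with three strata of sizes 60, 30 and 10.
population = ([("A", i) for i in range(60)]
              + [("B", i) for i in range(30)]
              + [("C", i) for i in range(10)])
sample = stratified_sample(population, key=lambda item: item[0],
                           fraction=0.1, seed=0)
per_stratum = Counter(label for label, _ in sample)
```

A disproportionate variant would replace the single `fraction` with a per-stratum sample size, for example to over-sample a small but important stratum.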

Cluster Sampling
• The population is first clustered into mutually exclusive heterogeneous groups
• The clustering is done based on some global criteria
• Each cluster must represent the population as well as possible
• Clusters are internally heterogeneous but externally homogeneous, i.e., each cluster is diverse inside while the clusters resemble one another
• Then the sample is selected from one or more randomly selected clusters
Source: https://statisticsbyjim.com/basics/cluster-sampling/

Single-stage Cluster Sampling
• After the clusters are formed, either a single cluster or a set of clusters is selected at random
• All the data points in the selected clusters are combined for analysis
• This is suitable when the data set is not too large and a subset of clusters can be handled efficiently

Two-stage Cluster Sampling
• After the clusters are formed, either a single cluster or a set of clusters is selected at random
• Simple random sampling is performed inside each selected cluster to pick a subset of its data points
• All the selected points are combined for analysis
• This is a more practical approach that can handle relatively large data sets, as two steps of filtering are applied
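Both variants can be sketched with one function. This is an illustrative sketch under stated assumptions: the clusters are taken as already formed and passed in as a dictionary, and the names `cluster_sample`, `n_clusters`, and `per_cluster` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Stage 1: pick clusters at random. Stage 2 (optional): SRS inside each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # random subset of clusters
    sample = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample.extend(members)                      # single-stage: keep all points
        else:
            k = min(per_cluster, len(members))
            sample.extend(rng.sample(members, k))       # two-stage: SRS per cluster
    return chosen, sample

# Hypothetical data: 10 clusters of 100 points each.
clusters = {f"region_{i}": list(range(i * 100, (i + 1) * 100)) for i in range(10)}

chosen_1, single_stage = cluster_sample(clusters, n_clusters=3, seed=0)
chosen_2, two_stage = cluster_sample(clusters, n_clusters=3, per_cluster=5, seed=0)
```

With these inputs the single-stage sample holds 300 points and the two-stage sample holds 15, which shows the extra data reduction from the second filtering step.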

Cluster Sampling: Advantages
• If the cluster generation process produces clusters that closely resemble the entire population, so that each cluster represents it well, then cluster sampling can produce reliable results
• Often suitable for large-scale data sets
• Applicable when the entire population is impossible to access

Advanced Sampling Approaches

Inverse Transform Sampling
• What happens behind the scenes when we generate sample points from a specific type of distribution
• Set up:
• We have numbers $u$ between 0 and 1 coming from a Uniform distribution
• We want points that follow the Exponential($\lambda$) distribution
[Figure: PDF of the exponential distribution; CDF of the exponential distribution]
• The CDF of the Exponential($\lambda$) distribution is $F(x) = 1 - e^{-\lambda x}$ for $x \ge 0$
• We want $F^{-1}$ so that we can transform uniform numbers into Exponential-distributed numbers
• So, we have $u = 1 - e^{-\lambda x}$, hence $x = F^{-1}(u) = -\frac{\ln(1 - u)}{\lambda}$
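For the exponential distribution the inverse CDF has the closed form $x = -\ln(1 - u)/\lambda$, so the transform fits in one line. A minimal sketch, assuming rate parameter `lam` and a hypothetical helper name `sample_exponential`:

```python
import math
import random
import statistics

def sample_exponential(lam, size, seed=None):
    """Inverse transform sampling for Exponential(lam)."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so 1 - u lies in (0, 1] and log is defined.
    return [-math.log(1.0 - rng.random()) / lam for _ in range(size)]

draws = sample_exponential(lam=2.0, size=100_000, seed=0)
draws_mean = statistics.mean(draws)   # should be close to 1 / lam = 0.5
```

The same recipe works for any distribution whose CDF can be inverted, analytically or numerically.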

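Remember: Distribution Transformation Property
• If $U$ is a uniform random variable (i.e., $U \sim \text{Uniform}(0, 1)$) and $F_X$ is the CDF of a random variable $X$, then applying the inverse function $F_X^{-1}$ to $U$ yields a random variable distributed as $X$ (i.e., $F_X^{-1}(U) \sim X$)
[Figure: the CDF $F_X$ with $X$ on the horizontal axis (-2.0 to 2.0) and $U$ on the vertical axis (0.0 to 1.0); $F_X^{-1}$ maps values of $U$ to values of $X$]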

Rejection Sampling
• Rejection sampling is a method for generating samples from a distribution with density $f(x)$ by drawing sample points from another distribution $h(x)$ that is easier to sample
• Useful when we do not have $F^{-1}$ or the CDF is difficult to compute
• Steps:
• Generate a sample point $x$ from $h(x)$
• Accept the sample point with acceptance probability $\frac{f(x)}{C \cdot h(x)}$
• $C$ is a normalizing constant chosen to ensure $C \cdot h(x) \ge f(x)$, so that the ratio can be interpreted as a probability value
• Intuition: a high acceptance probability for a specific sample point drawn from $h(x)$ indicates that the sample is highly likely under the distribution $f(x)$, and so we accept it
[Figure: density versus values; $f(x)$ plotted in red and $C \cdot h(x)$, a scaled Normal distribution, plotted in black]
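The two steps can be sketched with a simple target. The choices here are assumptions for illustration and differ from the slide's figure, which uses a Normal proposal: the target is the Beta(2, 2) density $f(x) = 6x(1 - x)$ on $[0, 1]$, the proposal $h$ is Uniform(0, 1), and $C = 1.5$, which is the maximum of $f$.

```python
import random
import statistics

def target_pdf(x):
    """Assumed target density f(x): Beta(2, 2) on [0, 1]."""
    return 6.0 * x * (1.0 - x)

def rejection_sample(size, seed=None):
    """Draw from f using a Uniform(0, 1) proposal h(x) = 1 and constant C = 1.5."""
    rng = random.Random(seed)
    C = 1.5                      # f peaks at 1.5 (x = 0.5), so C * h(x) >= f(x)
    accepted = []
    while len(accepted) < size:
        x = rng.random()                         # step 1: draw x from h
        accept_prob = target_pdf(x) / (C * 1.0)  # f(x) / (C * h(x)), h(x) = 1
        if rng.random() < accept_prob:           # step 2: accept with that probability
            accepted.append(x)
    return accepted

draws = rejection_sample(20_000, seed=0)
draws_mean = statistics.mean(draws)   # Beta(2, 2) has mean 0.5
```

On average a fraction $1/C$ of the proposals is accepted, so a tighter envelope (smaller $C$) wastes fewer draws.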
Why Does Rejection Sampling Work?
• We have and the PDF of is . We do not know !
• General form of NC is normalizing constant
• General form of
• From Bayes’ theorem:
=

So,

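The formulas for this argument did not survive in the bullets above, so the following is the standard textbook version of the Bayes' theorem argument, offered as a sketch rather than a reconstruction of the original slide. It assumes $x$ is drawn from $h(x)$, accepted with probability $f(x) / (C \cdot h(x))$, and that $f$ integrates to 1.

```latex
% Probability of accepting a proposal, averaged over x ~ h:
P(\text{accept}) = \int \frac{f(x)}{C \, h(x)} \, h(x) \, dx
                 = \frac{1}{C} \int f(x) \, dx = \frac{1}{C}

% Density of x given that it was accepted (Bayes' theorem):
p(x \mid \text{accept})
  = \frac{P(\text{accept} \mid x) \, h(x)}{P(\text{accept})}
  = \frac{\dfrac{f(x)}{C \, h(x)} \, h(x)}{1 / C}
  = f(x)
```

If only an unnormalized density $\tilde{f}(x) = NC \cdot f(x)$ is available and $C \cdot h(x) \ge \tilde{f}(x)$, the same steps give $P(\text{accept}) = NC / C$, and the unknown normalizing constant $NC$ cancels, so the accepted points still follow $f(x)$.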

Blue Noise Sampling
• White noise: random noise with equal energy at all frequencies
• Blue noise: characterized by a power spectral density that increases with frequency
• Has little energy at lower frequencies and progressively more energy at higher frequencies
• Blue noise sampling: a distribution of points where the energy at low frequencies is minimized, creating points that are evenly distributed and visually pleasing
[Figure: point sets produced by systematic, random, and blue noise sampling]
• Poisson disk / dart throwing for blue noise sampling
• No two samples within a radius $r$ of each other are allowed
• Sample points are picked from a uniform distribution; a candidate that obeys the minimum-distance property with respect to the sample points currently in the set is kept, and the others are discarded
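Dart throwing as described above can be sketched on the unit square. This is a naive illustration: the function name `dart_throwing`, the fixed attempt budget `max_attempts`, and the unit-square domain are assumptions, and each candidate is checked against every kept point, which is quadratic in the number of points.

```python
import math
import random
from itertools import combinations

def dart_throwing(r, max_attempts=2000, seed=None):
    """Poisson-disk sampling by dart throwing on the unit square."""
    rng = random.Random(seed)
    points = []
    for _ in range(max_attempts):
        candidate = (rng.random(), rng.random())      # uniform candidate
        # Keep the candidate only if it is at least r away from every kept point.
        if all(math.dist(candidate, p) >= r for p in points):
            points.append(candidate)
    return points

points = dart_throwing(r=0.1, seed=0)
min_gap = min(math.dist(p, q) for p, q in combinations(points, 2))
```

Practical implementations usually add a spatial grid so that each candidate is compared only with nearby points.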
