
Sampling and Standard Error

This lecture discusses sampling and standard error. It begins with a review of inferential statistics using random samples to make inferences about populations. The lecture then covers probability sampling methods like simple random sampling and stratified sampling. Examples of temperature data from various US cities are presented. The central concept discussed is the standard error of the mean (SEM), which is the standard deviation of sample means. It is shown that as sample size increases, the SEM decreases, allowing for tighter confidence intervals around the population mean when estimating from a single sample.


Lecture: Sampling and Standard Error

6.0002 LECTURE 8
1
Announcements
§Relevant reading: Chapter 17
§No lecture Wednesday of next week!

Recall Inferential Statistics
§Inferential statistics: making inferences about a
population by examining one or more random
samples drawn from that population
§With Monte Carlo simulation we can generate lots of
random samples, and use them to compute confidence
intervals
§But suppose we can’t create samples by simulation?
◦ “According to the most recent poll Clinton leads Trump by
3.2 percentage points in swing states. The registered
voter sample is 835 with a margin of error of plus or
minus 4 percentage points.” – October 2016

Probability Sampling
§Each member of the population has a nonzero
probability of being included in a sample
§Simple random sampling: each member has an equal
chance of being chosen
§Not always appropriate
◦ Are MIT undergraduates nerds?
◦ Consider a random sample of 100 students

Stratified Sampling

§Stratified sampling
◦ Partition population into subgroups
◦ Take a simple random sample from each subgroup

Stratified Sampling
§When there are small subgroups that should be
represented
§When it is important that subgroups be represented
proportionally to their size in the population
§Can be used to reduce the needed sample size
◦ Variability of subgroups is less than that of the entire
population
§Requires care to do properly
§We’ll stick to simple random samples
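
The two steps of stratified sampling (partition, then take a simple random sample from each subgroup) can be sketched as follows. This is only an illustration under stated assumptions: the names `stratified_sample`, `stratum_of`, and `fraction` are hypothetical and not part of the lecture code, and the population is a made-up list of labeled members.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, fraction):
    """Draw the same fraction from every stratum (proportional allocation)."""
    # Step 1: partition the population into subgroups.
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    # Step 2: take a simple random sample from each subgroup.
    sample = []
    for members in strata.values():
        # Keep at least one member so small subgroups are still represented.
        k = max(1, round(fraction * len(members)))
        sample.extend(random.sample(members, k))
    return sample

# Hypothetical population: 900 members of group 'A' and 100 of group 'B'.
population = [('A', i) for i in range(900)] + [('B', i) for i in range(100)]
sample = stratified_sample(population, lambda m: m[0], 0.1)
counts = {g: sum(1 for m in sample if m[0] == g) for g in ('A', 'B')}
print(counts)  # {'A': 90, 'B': 10}
```

Because each subgroup is sampled separately, group 'B' always receives its proportional share, which a simple random sample of 100 members would only match on average.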

Data
§From U.S. National Centers for Environmental
Information (NCEI)
§Daily high and low temperatures for
◦ 21 different US cities
◦ ALBUQUERQUE, BALTIMORE, BOSTON, CHARLOTTE, CHICAGO,
DALLAS, DETROIT, LAS VEGAS, LOS ANGELES, MIAMI, NEW ORLEANS,
NEW YORK, PHILADELPHIA, PHOENIX, PORTLAND, SAN DIEGO, SAN
FRANCISCO, SAN JUAN, SEATTLE, ST LOUIS, TAMPA
◦ 1961 – 2015
◦ 421,848 data points (examples)
§Let’s use some code to look at the data

New in Code
§numpy.std is a function in the numpy module that
returns the standard deviation
§random.sample(population, sampleSize) returns a list
containing sampleSize randomly chosen distinct
elements of population
◦ Sampling without replacement
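
A small sketch of both ideas on a toy population. To stay within the standard library it uses `statistics.pstdev`, which computes the population standard deviation (divide by N), the same quantity `numpy.std` returns with its default arguments. The toy population is an assumption made for illustration.

```python
import random
import statistics

population = list(range(1, 11))   # toy population: 1, 2, ..., 10

# Sampling without replacement: the chosen elements are all distinct.
sample = random.sample(population, 4)
print(sample, len(set(sample)) == len(sample))

# Population standard deviation of the toy population.
pop_sd = statistics.pstdev(population)
print(pop_sd)
```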

Histogram of Entire Population

σ = ~9.4

Histogram of Random Sample of Size 100

σ = ~10.4

Means and Standard Deviations
§Population mean = 16.3
§Sample mean = 17.1
§Standard deviation of population = 9.44
§Standard deviation of sample = 10.4
§A happy accident, or something we should expect?
§Let’s try it 1000 times and plot the results

New in Code
§pylab.axvline(x = popMean, color = 'r') draws a red
vertical line at popMean on the x-axis
§There’s also a pylab.axhline function

Try It 1000 Times


Mean of sample means = 16.3
Standard deviation of sample means = 0.94

What’s the 95% confidence interval?
16.28 ± 1.96*0.94
14.4 to 18.1

Includes population mean, but pretty wide
Suppose we want a tighter bound?
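
The interval arithmetic can be reproduced with a short helper. The function name `confidence_interval` is illustrative, not part of the lecture code; the inputs are the mean and standard deviation of the sample means reported above.

```python
def confidence_interval(mean, sd, z=1.96):
    """Return (low, high) for mean +/- z standard deviations."""
    half_width = z * sd
    return mean - half_width, mean + half_width

low, high = confidence_interval(16.28, 0.94)
print(round(low, 2), round(high, 2))  # 14.44 18.12
```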

Getting a Tighter Bound
§Will drawing more samples help?
◦ Let’s try increasing from 1000 to 2000
◦ Standard deviation goes from 0.943 to 0.946
§How about larger samples?
◦ Let’s try increasing sample size from 100 to 200
◦ Standard deviation goes from 0.943 to 0.662
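
The same experiment can be sketched without the temperature file. The population here is synthetic (normally distributed with mean 16.3 and standard deviation 9.44, an assumption standing in for the NCEI data), and `sd_of_sample_means` is a hypothetical helper name.

```python
import random
import statistics

random.seed(0)
# Synthetic stand-in for the temperature data (an assumption).
population = [random.gauss(16.3, 9.44) for _ in range(100_000)]

def sd_of_sample_means(num_samples, sample_size):
    """Standard deviation of the means of num_samples random samples."""
    means = [statistics.fmean(random.sample(population, sample_size))
             for _ in range(num_samples)]
    return statistics.pstdev(means)

base = sd_of_sample_means(1000, 100)            # 1000 samples of size 100
more_samples = sd_of_sample_means(2000, 100)    # more samples, same size
larger_samples = sd_of_sample_means(1000, 200)  # same count, larger samples
print(round(base, 3), round(more_samples, 3), round(larger_samples, 3))
```

Doubling the number of samples should leave the standard deviation roughly unchanged, while doubling the sample size should shrink it by a factor of about 1/√2.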

Error Bars, a Digression
§Graphical representation of the variability of data
§Way to visualize uncertainty

When confidence intervals don’t overlap, we can conclude that the means are statistically significantly different at the 95% level.

https://upload.wikimedia.org/wikipedia/commons/1/1d/Pulse_Rate_Error_Bar_By_Exercise_Level.png
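
The overlap rule can be checked numerically. This is a sketch with made-up group means and standard errors; `interval` and `intervals_overlap` are hypothetical helper names.

```python
def interval(mean, se, z=1.96):
    """95% confidence interval as a (low, high) pair."""
    return mean - z * se, mean + z * se

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical group means and standard errors, for illustration only.
group_a = interval(70.0, 1.0)   # roughly (68.04, 71.96)
group_b = interval(80.0, 1.5)   # roughly (77.06, 82.94)
group_c = interval(72.0, 1.0)   # roughly (70.04, 73.96)
print(intervals_overlap(group_a, group_b))  # False
print(intervals_overlap(group_a, group_c))  # True
```

The rule only works in one direction: non-overlapping intervals indicate a significant difference, but overlapping intervals do not by themselves show that there is none.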

Let’s Look at Error Bars for Temperatures

pylab.errorbar(xVals, sizeMeans,
               yerr = 1.96*pylab.array(sizeSDs),
               fmt = 'o',
               label = '95% Confidence Interval')

Sample Size and Standard Deviation

Larger Samples Seem to Be Better
§Going from a sample size of 50 to 600 reduced the
confidence interval from about 1.2C to about 0.34C.
§But we are now looking at 600*100 = 60k examples
◦ What has sampling bought us?
◦ Absolutely Nothing!
◦ Entire population contained ~422k samples

What Can We Conclude from 1 Sample?
§More than you might think
§Thanks to the Central Limit
Theorem

Recall Central Limit Theorem
§Given a sufficiently large sample:
◦ 1) The means of the samples in a set of samples (the
sample means) will be approximately normally
distributed,
◦ 2) This normal distribution will have a mean close to the
mean of the population, and
◦ 3) The variance of the sample means will be close to the
variance of the population divided by the sample size.
§Time to use the 3rd feature
§Compute standard error of the mean (SEM or SE)
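
The second and third properties can be checked by simulation. A minimal sketch, assuming a synthetic uniform population rather than the temperature data; all names are illustrative.

```python
import random
import statistics

random.seed(1)
# Synthetic population, uniform on [0, 1) (an assumption for illustration).
population = [random.random() for _ in range(50_000)]
pop_mean = statistics.fmean(population)
pop_var = statistics.pvariance(population)

sample_size = 50
means = [statistics.fmean(random.sample(population, sample_size))
         for _ in range(5_000)]

mean_of_means = statistics.fmean(means)
var_of_means = statistics.pvariance(means)
predicted_var = pop_var / sample_size   # property 3 of the theorem

print(mean_of_means, pop_mean)
print(var_of_means, predicted_var)
```

Taking the square root of property 3 gives the standard error of the mean: the standard deviation of the sample means is close to the population standard deviation divided by the square root of the sample size.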

Standard Error of the Mean

$$SE = \frac{\sigma}{\sqrt{n}}$$

def sem(popSD, sampleSize):
    return popSD/sampleSize**0.5

§Does it work?
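
One quick check is to plug in the population standard deviation reported earlier (9.44) and compare with the standard deviations of sample means measured earlier in the lecture (0.943 for samples of size 100, 0.662 for size 200). The snake_case names below are a restatement of the slide's function, not new lecture code.

```python
def sem(pop_sd, sample_size):
    """Standard error of the mean: population SD over sqrt(sample size)."""
    return pop_sd / sample_size ** 0.5

for n in (100, 200):
    print(n, round(sem(9.44, n), 3))
# 100 0.944
# 200 0.668
```

Both predictions land within about 0.01 of the simulated values.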

Testing the SEM
import random
import numpy
import pylab

# getHighs() is not shown on these slides; it is assumed to return the
# list of daily high temperatures. sem() is the function defined above.
sampleSizes = (25, 50, 100, 200, 300, 400, 500, 600)
numTrials = 50
population = getHighs()
popSD = numpy.std(population)
sems = []
sampleSDs = []
for size in sampleSizes:
    # SEM predicted from the population standard deviation
    sems.append(sem(popSD, size))
    means = []
    for t in range(numTrials):
        sample = random.sample(population, size)
        means.append(sum(sample)/len(sample))
    # Observed standard deviation of the numTrials sample means
    sampleSDs.append(numpy.std(means))
pylab.plot(sampleSizes, sampleSDs,
           label = 'Std of ' + str(numTrials) + ' means')
pylab.plot(sampleSizes, sems, 'r--', label = 'SEM')
pylab.xlabel('Sample Size')
pylab.ylabel('Std and SEM')
pylab.title('SD for ' + str(numTrials) + ' Means and SEM')
pylab.legend()

Standard Error of the Mean

$$SE = \frac{\sigma}{\sqrt{n}}$$

But we don’t know the standard deviation of the population.
How might we approximate it?

Sample SD vs. Population SD

The Point
§Once sample reaches a reasonable size, sample
standard deviation is a pretty good approximation to
population standard deviation
§True only for this example?
◦ Distribution of population?
◦ Size of population?
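
The first claim can be explored with a small simulation. This sketch assumes a synthetic normally distributed population in place of the temperature data, and `mean_relative_error` is a hypothetical helper name.

```python
import random
import statistics

random.seed(2)
# Synthetic population (an assumption; not the temperature data).
population = [random.gauss(0, 1) for _ in range(100_000)]
pop_sd = statistics.pstdev(population)

def mean_relative_error(sample_size, num_trials=200):
    """Average of |sample SD - population SD| / population SD over trials."""
    errors = []
    for _ in range(num_trials):
        sample = random.sample(population, sample_size)
        errors.append(abs(statistics.pstdev(sample) - pop_sd) / pop_sd)
    return statistics.fmean(errors)

errors_by_size = {n: mean_relative_error(n) for n in (10, 100, 1000)}
for n, err in errors_by_size.items():
    print(n, round(err, 3))
```

The relative error should fall steadily as the sample size grows; whether that holds for other distributions and population sizes is the question the next slides take up.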

Looking at Distributions

def plotDistributions():
    uniform, normal, exp = [], [], []
    for i in range(100000):
        uniform.append(random.random())
        normal.append(random.gauss(0, 1))
        exp.append(random.expovariate(0.5))
    makeHist(uniform, 'Uniform', 'Value', 'Frequency')
    pylab.figure()
    makeHist(normal, 'Gaussian', 'Value', 'Frequency')
    pylab.figure()
    makeHist(exp, 'Exponential', 'Value', 'Frequency')

Three Different Distributions

[Figure: histograms of values drawn with random.random(), random.gauss(0, 1), and random.expovariate(0.5)]

Does Distribution Matter?

Skew, a measure of the asymmetry of a probability distribution, matters
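
Skew can be estimated directly from data. A sketch, assuming the usual moment-based definition (the third standardized moment) and the same three generators used above; the function name `skewness` is illustrative.

```python
import random
import statistics

def skewness(values):
    """Moment-based skewness: mean of ((x - mean) / sd) cubed."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((x - mu) / sd) ** 3 for x in values])

random.seed(3)
n = 20_000
skews = {
    'uniform': skewness([random.random() for _ in range(n)]),
    'gaussian': skewness([random.gauss(0, 1) for _ in range(n)]),
    'exponential': skewness([random.expovariate(0.5) for _ in range(n)]),
}
for name, value in skews.items():
    print(name, round(value, 2))
```

The uniform and Gaussian distributions are symmetric, so their estimates should sit near 0, while the exponential distribution has a theoretical skewness of 2.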

Does Population Size Matter?

To Estimate Mean from a Single Sample
§1) Choose sample size based on estimate of skew in
population
§2) Choose a random sample from the population
§3) Compute the mean and standard deviation of that
sample
§4) Use the standard deviation of that sample to
estimate the SE
§5) Use the estimated SE to generate confidence
intervals around the sample mean
Works great when we choose independent random samples.
Not always so easy to do, as political pollsters keep learning.
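
Steps 3 through 5 can be sketched as a single function. The name `estimate_mean` is hypothetical, and the five-element sample is deliberately tiny so the arithmetic is easy to follow; a real sample would need to be far larger for the approximation to hold.

```python
import statistics

def estimate_mean(sample, z=1.96):
    """Return the sample mean, estimated SE, and a confidence interval."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Step 4: the sample SD stands in for the unknown population SD.
    se = statistics.pstdev(sample) / n ** 0.5
    # Step 5: interval of z estimated standard errors around the mean.
    return mean, se, (mean - z * se, mean + z * se)

# Tiny hypothetical sample, for illustration only.
mean, se, (low, high) = estimate_mean([1, 2, 3, 4, 5])
print(mean, round(se, 4), round(low, 2), round(high, 2))
```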
Are 200 Samples Enough?

numBad = 0
for t in range(numTrials):
    sample = random.sample(temps, sampleSize)
    sampleMean = sum(sample)/sampleSize
    se = numpy.std(sample)/sampleSize**0.5
    if abs(popMean - sampleMean) > 1.96*se:
        numBad += 1
print('Fraction outside 95% confidence interval =',
      numBad/numTrials)

Fraction outside 95% confidence interval = 0.0511
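
A self-contained variant of this experiment, for readers without the temperature file. The population is synthetic (normal with mean 16.3 and standard deviation 9.44, an assumption), the trial count of 2,000 is arbitrary, and `statistics.pstdev` plays the role of `numpy.std`.

```python
import random
import statistics

random.seed(4)
# Synthetic stand-in for the temperature data (an assumption).
temps = [random.gauss(16.3, 9.44) for _ in range(50_000)]
pop_mean = statistics.fmean(temps)

sample_size, num_trials = 200, 2_000
num_bad = 0
for _ in range(num_trials):
    sample = random.sample(temps, sample_size)
    sample_mean = statistics.fmean(sample)
    # Estimate the SE from the sample alone, as in the procedure above.
    se = statistics.pstdev(sample) / sample_size ** 0.5
    if abs(pop_mean - sample_mean) > 1.96 * se:
        num_bad += 1
fraction_outside = num_bad / num_trials
print('Fraction outside 95% confidence interval =', fraction_outside)
```

If the single-sample procedure is sound, the fraction should come out close to 0.05.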

MIT OpenCourseWare
https://ocw.mit.edu

6.0002 Introduction to Computational Thinking and Data Science

Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
