Sampling in Python
Sampling and point estimates
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Estimating the population of France
A census asks every household how many
people live there.
There are lots of people in France
Censuses are really expensive!
Sampling households
Cheaper to ask a small number of households
and use statistics to estimate the population
Population vs. sample
The population is the complete dataset
Coffee rating dataset
total_cup_points variety country_of_origin aroma flavor aftertaste body balance
90.58 NA Ethiopia 8.67 8.83 8.67 8.50 8.42
89.92 Other Ethiopia 8.75 8.67 8.50 8.42 8.42
... ... ... ... ... ... ... ...
73.75 NA Vietnam 6.75 6.67 6.5 6.92 6.83
1338 rows
Points vs. flavor: population
pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]
total_cup_points flavor
0 90.58 8.83
1 89.92 8.67
2 89.75 8.50
3 89.00 8.58
4 88.83 8.50
... ... ...
1333 78.75 7.58
1334 78.08 7.67
1335 77.17 7.33
1336 75.08 6.83
1337 73.75 6.67
Points vs. flavor: 10 row sample
pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)
total_cup_points flavor
1088 80.33 7.17
1157 79.67 7.42
1267 76.17 7.33
506 83.00 7.67
659 82.50 7.42
817 81.92 7.50
1050 80.67 7.42
685 82.42 7.50
1027 80.92 7.25
62 85.58 8.17
Python sampling for Series
Use .sample() for pandas DataFrames and Series
cup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)
1088 80.33
1157 79.67
1267 76.17
... ...
685 82.42
1027 80.92
62 85.58
Name: total_cup_points, dtype: float64
Population parameters & point estimates
A population parameter is a calculation made on the population dataset
import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])
82.15120328849028
np.mean(cup_points_samp)
81.31800000000001
Point estimates with pandas
pts_vs_flavor_pop['flavor'].mean()
7.526046337817639
pts_vs_flavor_samp['flavor'].mean()
7.485000000000001
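To make the distinction concrete, here is a minimal sketch on made-up data (the scores are hypothetical, not the coffee ratings). The mean of every value is the population parameter; the mean of a 10-value random sample is a point estimate of it. Plain Python is used only to show the mechanics.

```python
import random
import statistics

# Hypothetical population: 1,000 made-up scores (not the coffee data).
rng = random.Random(42)
population = [rng.gauss(82, 2.7) for _ in range(1000)]

# Population parameter: a calculation on every value in the population.
population_mean = statistics.fmean(population)

# Point estimate: the same calculation on a random sample of 10 values.
sample = rng.sample(population, k=10)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two numbers are usually close but not equal, which is the gap the rest of the course tries to quantify.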
Convenience sampling
The Literary Digest election prediction
Finding the mean age of French people
Survey 10 people at Disneyland Paris
Mean age of 24.6 years
How accurate was the survey?
Year  Average French Age
1975  31.6
1985  33.6
1995  36.2
2005  38.9
2015  41.2
24.6 years is a poor estimate
People who visit Disneyland aren't representative of the whole population
Convenience sampling coffee ratings
coffee_ratings["total_cup_points"].mean()
82.15120328849028
coffee_ratings_first10 = coffee_ratings.head(10)
coffee_ratings_first10["total_cup_points"].mean()
89.1
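The gap between 82.15 and 89.1 comes from the row order: the first rows of a ranked table are not typical. A small sketch of the same effect, assuming a hypothetical list of scores sorted from best to worst (the data is invented for illustration):

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical scores sorted from best to worst, like a ranked ratings table.
scores = sorted((rng.gauss(82, 2.7) for _ in range(1000)), reverse=True)

population_mean = statistics.fmean(scores)

# Convenience sample: just take the first 10 rows.
first10_mean = statistics.fmean(scores[:10])

# Simple random sample: every row has the same chance of being picked.
random10_mean = statistics.fmean(rng.sample(scores, k=10))

print(f"population:         {population_mean:.2f}")
print(f"first 10 rows:      {first10_mean:.2f}")
print(f"random 10 rows:     {random10_mean:.2f}")
```

The first-10 mean overshoots because it only sees the top of the ranking, while the random sample has no such systematic pull.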
Visualizing selection bias
import matplotlib.pyplot as plt
import numpy as np
coffee_ratings["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()
Distribution of a population and of a convenience sample
[Figure: two histograms, panels "Population" and "Convenience sample"]
Visualizing selection bias for a random sample
coffee_sample = coffee_ratings.sample(n=10)
coffee_sample["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()
Distribution of a population and of a simple random sample
[Figure: two histograms, panels "Population" and "Random sample"]
Pseudo-random number generation
What does random mean?
{adjective} made, done, happening, or chosen without method or conscious decision.
1 Oxford Languages
True random numbers
Generated from physical processes, like flipping coins
Hotbits uses radioactive decay
1 https://www.fourmilab.ch/hotbits 2 https://www.random.org
Pseudo-random number generation
Pseudo-random number generation is cheap and fast
Next "random" number calculated from previous "random" number
Pseudo-random number generation example
seed = 1
calc_next_random(seed)
calc_next_random(3)
calc_next_random(2)
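The slides do not define calc_next_random. As one possible illustration, the sketch below assumes it is a linear congruential generator; the constants and the helper lcg_sequence are hypothetical choices made for this example, not what NumPy or the course actually uses.

```python
def calc_next_random(previous, a=1103515245, c=12345, m=2**31):
    """Return the next value of a linear congruential generator (assumed form)."""
    return (a * previous + c) % m


def lcg_sequence(seed, n):
    """Hypothetical helper: feed each output back in as the next input."""
    values = []
    current = seed
    for _ in range(n):
        current = calc_next_random(current)
        values.append(current)
    return values


print(lcg_sequence(seed=1, n=3))
print(lcg_sequence(seed=1, n=3))  # same seed, same "random" numbers
print(lcg_sequence(seed=2, n=3))  # different seed, different numbers
```

Because each value is computed from the previous one, the whole sequence is determined by the seed.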
Random number generating functions
Prefix the function name with numpy.random, such as numpy.random.beta()
Visualizing random numbers
randoms = np.random.beta(a=2, b=2, size=5000)
randoms
Random numbers seeds
np.random.seed(20000229) np.random.seed(20000229)
Using a different seed
np.random.seed(20000229) np.random.seed(20041004)
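The effect of the seed can be checked directly. This sketch reuses the two seed values shown above; the choice of np.random.normal and its loc, scale, and size arguments is an assumption made for the example.

```python
import numpy as np

np.random.seed(20000229)
first = np.random.normal(loc=2, scale=1.5, size=3)

np.random.seed(20000229)  # same seed: the sequence restarts from the same point
second = np.random.normal(loc=2, scale=1.5, size=3)

np.random.seed(20041004)  # different seed: a different sequence
third = np.random.normal(loc=2, scale=1.5, size=3)

print(np.array_equal(first, second))
print(np.array_equal(first, third))
```

Setting the seed before sampling is what makes a random analysis reproducible.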
Simple random and systematic sampling
Simple random sampling
Simple random sampling of coffees
Simple random sampling with pandas
coffee_ratings.sample(n=5, random_state=19000113)
Systematic sampling
Systematic sampling - defining the interval
sample_size = 5
pop_size = len(coffee_ratings)
print(pop_size)
1338
interval = pop_size // sample_size
print(interval)
267
Systematic sampling - selecting the rows
coffee_ratings.iloc[::interval]
body balance
0 8.50 8.42
267 7.75 7.75
534 7.92 7.83
801 7.50 7.33
1068 7.17 7.25
The trouble with systematic sampling
coffee_ratings_with_id = coffee_ratings.reset_index()
coffee_ratings_with_id.plot(x="index", y="aftertaste", kind="scatter")
plt.show()
Systematic sampling is only safe if we don't see a pattern in this scatter plot
Making systematic sampling safe
shuffled = coffee_ratings.sample(frac=1)
shuffled = shuffled.reset_index(drop=True).reset_index()
shuffled.plot(x="index", y="aftertaste", kind="scatter")
plt.show()
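Putting the pieces together, a minimal sketch of shuffle-then-systematic sampling on a small hypothetical table (the DataFrame and its score column are invented). When the number of rows is not an exact multiple of the interval, the slice can return one extra row, so this sketch trims the result with .head().

```python
import pandas as pd

# Hypothetical table of 23 rows standing in for coffee_ratings.
df = pd.DataFrame({"score": range(100, 123)})

sample_size = 5
interval = len(df) // sample_size  # 23 // 5 = 4

# Shuffle first so any ordering pattern in the rows is broken up,
# then renumber the rows and take every interval-th one.
shuffled = df.sample(frac=1, random_state=2021).reset_index(drop=True)
systematic = shuffled.iloc[::interval].head(sample_size)

print(systematic)
```

After shuffling, systematic sampling is equivalent to simple random sampling, so the shuffle is the step that makes it safe.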
Stratified and weighted random sampling
Coffees by country
top_counts = coffee_ratings['country_of_origin'].value_counts()
top_counts.head(6)
country_of_origin
Mexico 236
Colombia 183
Guatemala 181
Brazil 132
Taiwan 75
United States (Hawaii) 73
dtype: int64
1 The dataset lists Hawaii and Taiwan as countries for convenience, as they are notable coffee-growing regions.
Filtering for 6 countries
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]
top_counted_subset = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[top_counted_subset]
Counts of a simple random sample
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)
coffee_ratings_samp['country_of_origin'].value_counts(normalize=True)
country_of_origin
Mexico 0.250000
Guatemala 0.204545
Colombia 0.181818
Brazil 0.181818
United States (Hawaii) 0.102273
Taiwan 0.079545
dtype: float64
Comparing proportions
Population: 10% simple random sample:
Proportional stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=0.1, random_state=2021)
coffee_ratings_strat['country_of_origin'].value_counts(normalize=True)
Mexico 0.272727
Guatemala 0.204545
Colombia 0.204545
Brazil 0.147727
Taiwan 0.090909
United States (Hawaii) 0.079545
Name: country_of_origin, dtype: float64
Equal counts stratified sampling
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
.sample(n=15, random_state=2021)
coffee_ratings_eq['country_of_origin'].value_counts(normalize=True)
Taiwan 0.166667
Brazil 0.166667
United States (Hawaii) 0.166667
Guatemala 0.166667
Mexico 0.166667
Colombia 0.166667
Name: country_of_origin, dtype: float64
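The two stratified variants differ only in whether frac or n is passed to the grouped .sample() call. A small sketch on a hypothetical table (the country labels A, B, C and the score column are invented) makes the resulting group sizes easy to check by hand.

```python
import pandas as pd

# Hypothetical data: 60 rows from country A, 30 from B, 10 from C.
df = pd.DataFrame({
    "country": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,
    "score": range(100),
})

# Proportional: 10% of each group, so group shares match the population.
proportional = df.groupby("country").sample(frac=0.1, random_state=2021)

# Equal counts: the same number of rows from every group.
equal = df.groupby("country").sample(n=5, random_state=2021)

print(proportional["country"].value_counts().sort_index().to_dict())
print(equal["country"].value_counts().sort_index().to_dict())
```

Proportional sampling preserves the population mix; equal counts deliberately over-represents small groups so that each one has enough rows to analyze.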
Weighted random sampling
Specify weights to adjust the relative probability of a row being sampled
import numpy as np
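coffee_ratings_weight = coffee_ratings_top.copy()  # copy, so adding a column does not modify a slice
condition = coffee_ratings_weight['country_of_origin'] == "Taiwan"
coffee_ratings_weight['weight'] = np.where(condition, 2, 1)  # Taiwan rows get double weight
coffee_ratings_weight = coffee_ratings_weight.sample(frac=0.1, weights="weight")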
Weighted random sampling results
10% weighted sample:
coffee_ratings_weight['country_of_origin'].value_counts(normalize=True)
Brazil 0.261364
Mexico 0.204545
Guatemala 0.204545
Taiwan 0.170455
Colombia 0.090909
United States (Hawaii) 0.068182
Name: country_of_origin, dtype: float64
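Taiwan makes up about 9% of the six-country subset but about 17% of the weighted sample, roughly double, which is what a weight of 2 should do. The same effect can be reproduced on a hypothetical table (group sizes and labels below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical population: 10% of the rows are "Taiwan".
df = pd.DataFrame({"country": ["Taiwan"] * 1000 + ["Other"] * 9000})

# Give Taiwan rows twice the relative chance of being drawn.
df["weight"] = np.where(df["country"] == "Taiwan", 2, 1)

weighted = df.sample(frac=0.1, weights="weight", random_state=2021)

population_share = (df["country"] == "Taiwan").mean()
sample_share = (weighted["country"] == "Taiwan").mean()
print(f"population share: {population_share:.3f}")
print(f"weighted sample share: {sample_share:.3f}")
```

The weights are relative, so only their ratios matter; pandas normalizes them to sum to 1 before sampling.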
Cluster sampling
Stratified sampling vs. cluster sampling
Stratified sampling
Split the population into subgroups
Use simple random sampling on every subgroup
Cluster sampling
Use simple random sampling to pick some subgroups
Use simple random sampling on only those subgroups
Varieties of coffee
varieties_pop = list(coffee_ratings['variety'].unique())
Stage 1: sampling for subgroups
import random
varieties_samp = random.sample(varieties_pop, k=3)
Stage 2: sampling each group
variety_condition = coffee_ratings['variety'].isin(varieties_samp)
coffee_ratings_cluster = coffee_ratings[variety_condition].copy()
# Drop the varieties that were not picked (requires 'variety' to be a categorical column)
coffee_ratings_cluster['variety'] = coffee_ratings_cluster['variety'].cat.remove_unused_categories()
coffee_ratings_cluster.groupby("variety")\
    .sample(n=5, random_state=2021)
Stage 2 output
total_cup_points variety country_of_origin ...
variety
Bourbon 575 82.83 Bourbon Guatemala
560 82.83 Bourbon Guatemala
524 83.00 Bourbon Guatemala
1140 79.83 Bourbon Guatemala
318 83.67 Bourbon Brazil
Hawaiian Kona 1291 73.67 Hawaiian Kona United States (Hawaii)
1266 76.25 Hawaiian Kona United States (Hawaii)
488 83.08 Hawaiian Kona United States (Hawaii)
461 83.17 Hawaiian Kona United States (Hawaii)
117 84.83 Hawaiian Kona United States (Hawaii)
SL28 137 84.67 SL28 Kenya
452 83.17 SL28 Kenya
224 84.17 SL28 Kenya
66 85.50 SL28 Kenya
559 82.83 SL28 Kenya
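The two stages can be sketched end to end on a small hypothetical table (the variety names reuse labels from the output above, but the rows and scores are invented). Plain string labels are used here, so the remove_unused_categories step is not needed in this sketch.

```python
import random
import pandas as pd

# Hypothetical data: four varieties with 20 rows each.
df = pd.DataFrame({
    "variety": ["Bourbon"] * 20 + ["Caturra"] * 20 + ["SL28"] * 20 + ["Typica"] * 20,
    "score": range(80),
})

# Stage 1: randomly pick 2 of the 4 subgroups (clusters).
random.seed(19790801)
varieties_pop = list(df["variety"].unique())
varieties_samp = random.sample(varieties_pop, k=2)

# Stage 2: simple random sample of 5 rows within each chosen cluster only.
in_cluster = df["variety"].isin(varieties_samp)
cluster_sample = df[in_cluster].groupby("variety").sample(n=5, random_state=2021)

print(varieties_samp)
print(cluster_sample)
```

Rows from the clusters that were not picked in stage 1 can never appear in the result, which is the trade-off cluster sampling makes for lower cost.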
Multistage sampling
Cluster sampling is a type of multistage sampling
Can have > 2 stages
E.g., countrywide surveys may sample states, counties, cities, and neighborhoods
Comparing sampling methods
Review of sampling techniques - setup
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]
subset_condition = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[subset_condition]
coffee_ratings_top.shape
(880, 8)
Review of simple random sampling
coffee_ratings_srs = coffee_ratings_top.sample(frac=1/3, random_state=2021)
coffee_ratings_srs.shape
(293, 8)
Review of stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=1/3, random_state=2021)
coffee_ratings_strat.shape
(293, 8)
Review of cluster sampling
import random
top_countries_samp = random.sample(top_counted_countries, k=2)
top_condition = coffee_ratings_top['country_of_origin'].isin(top_countries_samp)
coffee_ratings_cluster = coffee_ratings_top[top_condition]
coffee_ratings_cluster['country_of_origin'] = coffee_ratings_cluster['country_of_origin']\
.cat.remove_unused_categories()
coffee_ratings_clust = coffee_ratings_cluster.groupby("country_of_origin")\
.sample(n=len(coffee_ratings_top) // 6)
coffee_ratings_clust.shape
(292, 8)
Calculating mean cup points
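Population:
coffee_ratings_top['total_cup_points'].mean()
81.94700000000002
Simple random sample:
coffee_ratings_srs['total_cup_points'].mean()
81.95982935153583
Stratified sample:
coffee_ratings_strat['total_cup_points'].mean()
81.92566552901025
Cluster sample:
coffee_ratings_clust['total_cup_points'].mean()
82.03246575342466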
Mean cup points by country: simple random
Population:
coffee_ratings_top.groupby("country_of_origin")['total_cup_points'].mean()
Simple random sample:
coffee_ratings_srs.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin         Population   Simple random sample
Brazil                     82.405909              82.414878
Colombia                   83.106557              82.925536
Guatemala                  81.846575              82.045385
Mexico                     80.890085              81.100714
Taiwan                     82.001333              81.744333
United States (Hawaii)     81.820411              82.008000
Mean cup points by country: stratified
Population:
coffee_ratings_top.groupby("country_of_origin")['total_cup_points'].mean()
Stratified sample:
coffee_ratings_strat.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin         Population   Stratified sample
Brazil                     82.405909           82.499773
Colombia                   83.106557           83.288197
Guatemala                  81.846575           81.727667
Mexico                     80.890085           80.994684
Taiwan                     82.001333           81.846800
United States (Hawaii)     81.820411           81.051667
Mean cup points by country: cluster
Population:
coffee_ratings_top.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.405909
Colombia                  83.106557
Guatemala                 81.846575
Mexico                    80.890085
Taiwan                    82.001333
United States (Hawaii)    81.820411
Name: total_cup_points, dtype: float64
Cluster sample:
coffee_ratings_clust.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Colombia    83.128904
Mexico      80.936027
Name: total_cup_points, dtype: float64
Relative error of point estimates
Sample size is number of rows
len(coffee_ratings.sample(n=300))
300
len(coffee_ratings.sample(frac=0.25))
334
Various sample sizes
coffee_ratings['total_cup_points'].mean()
82.15120328849028
coffee_ratings.sample(n=10)['total_cup_points'].mean()
83.027
coffee_ratings.sample(n=100)['total_cup_points'].mean()
82.4897
coffee_ratings.sample(n=1000)['total_cup_points'].mean()
82.1186
Relative errors
Population parameter:
population_mean = coffee_ratings['total_cup_points'].mean()
Point estimate:
sample_mean = coffee_ratings.sample(n=sample_size)['total_cup_points'].mean()
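Relative error is commonly defined as the absolute difference between the population parameter and the point estimate, divided by the population parameter, and expressed as a percentage. A minimal sketch under that assumption, using the means reported on the previous slide (the function name relative_error_pct is hypothetical):

```python
def relative_error_pct(population_value, estimate):
    """Absolute difference relative to the population value, in percent."""
    return 100 * abs(population_value - estimate) / abs(population_value)


population_mean = 82.15120328849028  # mean of all rows, from above

# Sample means reported above for n = 10, 100, and 1000.
for n, sample_mean in [(10, 83.027), (100, 82.4897), (1000, 82.1186)]:
    error = relative_error_pct(population_mean, sample_mean)
    print(f"n={n}: {error:.3f}%")
```

For these three samples the relative error shrinks as the sample size grows, although any single random sample can buck the trend.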
Relative error vs. sample size
import matplotlib.pyplot as plt
errors.plot(x="sample_size",
y="relative_error",
kind="line")
plt.show()
Properties:
Creating a sampling distribution
Same code, different answer
coffee_ratings.sample(n=30)['total_cup_points'].mean()
82.53066666666668
coffee_ratings.sample(n=30)['total_cup_points'].mean()
81.97566666666667
coffee_ratings.sample(n=30)['total_cup_points'].mean()
82.68
coffee_ratings.sample(n=30)['total_cup_points'].mean()
81.675
Same code, 1000 times
mean_cup_points_1000 = []
for i in range(1000):
    mean_cup_points_1000.append(
        coffee_ratings.sample(n=30)['total_cup_points'].mean()
    )
print(mean_cup_points_1000)
Distribution of sample means for size 30
import matplotlib.pyplot as plt
plt.hist(mean_cup_points_1000, bins=30)
plt.show()
Different sample sizes
Sample size: 6 Sample size: 150
Approximate sampling distributions
4 dice
dice = expand_grid(
    {'die1': [1, 2, 3, 4, 5, 6],
     'die2': [1, 2, 3, 4, 5, 6],
     'die3': [1, 2, 3, 4, 5, 6],
     'die4': [1, 2, 3, 4, 5, 6]
    }
)
      die1  die2  die3  die4
0        1     1     1     1
1        1     1     1     2
2        1     1     1     3
3        1     1     1     4
4        1     1     1     5
...    ...   ...   ...   ...
1291     6     6     6     2
1292     6     6     6     3
1293     6     6     6     4
1294     6     6     6     5
1295     6     6     6     6
[1296 rows x 4 columns]
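expand_grid is not a pandas function and the slides do not show its definition. One way such a helper could be written, assuming it should return every combination of the supplied values as rows, is with itertools.product:

```python
import itertools
import pandas as pd


def expand_grid(data_dict):
    """Assumed behavior: one row per combination of the input values."""
    rows = list(itertools.product(*data_dict.values()))
    return pd.DataFrame(rows, columns=list(data_dict.keys()))


faces = [1, 2, 3, 4, 5, 6]
dice = expand_grid({"die1": faces, "die2": faces, "die3": faces, "die4": faces})
print(dice.shape)  # (1296, 4), since 6**4 = 1296
```

The last column varies fastest, which matches the row order in the output shown above.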
Mean roll
dice['mean_roll'] = (dice['die1'] +
                     dice['die2'] +
                     dice['die3'] +
                     dice['die4']) / 4
print(dice)
      die1  die2  die3  die4  mean_roll
0        1     1     1     1       1.00
1        1     1     1     2       1.25
2        1     1     1     3       1.50
3        1     1     1     4       1.75
4        1     1     1     5       2.00
...    ...   ...   ...   ...        ...
1291     6     6     6     2       5.00
1292     6     6     6     3       5.25
1293     6     6     6     4       5.50
1294     6     6     6     5       5.75
1295     6     6     6     6       6.00
Exact sampling distribution
dice['mean_roll'] = dice['mean_roll'].astype('category')
dice['mean_roll'].value_counts(sort=False).plot(kind="bar")
The number of outcomes increases fast
import pandas as pd
import matplotlib.pyplot as plt

n_dice = list(range(1, 101))
n_outcomes = []
for n in n_dice:
    n_outcomes.append(6**n)  # 6 possible faces per die
outcomes = pd.DataFrame(
    {"n_dice": n_dice,
     "n_outcomes": n_outcomes})
outcomes.plot(x="n_dice",
              y="n_outcomes",
              kind="scatter")
plt.show()
Simulating the mean of four dice rolls
import numpy as np
sample_means_1000 = []
for i in range(1000):
    sample_means_1000.append(
        np.random.choice(list(range(1, 7)), size=4, replace=True).mean()
    )
print(sample_means_1000)
[3.25, 3.25, 1.75, 2.0, 2.0, 1.0, 1.0, 2.75, 2.75, 2.5, 3.0, 2.0, 2.75,
...
1.25, 2.0, 2.5, 2.5, 3.75, 1.5, 1.75, 2.25, 2.0, 1.5, 3.25, 3.0, 3.5]
Approximate sampling distribution
plt.hist(sample_means_1000, bins=20)
Standard errors and the Central Limit Theorem
Sampling distribution of mean cup points
Sample size: 5 Sample size: 20
Consequences of the central limit theorem
Population & sampling distribution means
coffee_ratings['total_cup_points'].mean()
82.15120328849028
Use np.mean() on each approximate sampling distribution:
Sample size  Mean sample mean
5            82.18420719999999
20           82.1558634
80           82.14510154999999
320          82.154017925
Population & sampling distribution standard deviations
coffee_ratings['total_cup_points'].std(ddof=0)
2.685858187306438
Sample size  Std dev sample mean
5            1.1886358227738543
20           0.5940321141669805
80           0.2934024263916487
Population standard deviation over square root sample size
Sample size Std dev sample mean Calculation Result
5 1.1886358227738543 2.685858187306438 / sqrt(5) 1.201
Standard error
Standard deviation of the sampling distribution
Important tool in understanding sampling variability
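The relationship between the standard error and the population standard deviation divided by the square root of the sample size can be checked by simulation. The sketch below uses a hypothetical uniform population (not the coffee data) and plain Python, and compares the standard deviation of 2,000 simulated sample means with the predicted value.

```python
import math
import random
import statistics

rng = random.Random(2021)

# Hypothetical population of 5,000 values.
population = [rng.uniform(0, 100) for _ in range(5000)]
sigma = statistics.pstdev(population)  # population std dev (like ddof=0)

n = 25
sample_means = [
    statistics.fmean(rng.sample(population, k=n)) for _ in range(2000)
]

simulated_se = statistics.stdev(sample_means)  # std dev of the sampling distribution
predicted_se = sigma / math.sqrt(n)            # sigma / sqrt(n)

print(f"simulated: {simulated_se:.3f}")
print(f"predicted: {predicted_se:.3f}")
```

The two values agree only approximately, since the simulated one comes from a finite number of replicates.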
Introduction to bootstrapping
With or without
Sampling without replacement: Sampling with replacement ("resampling"):
Simple random sampling without replacement
Population: Sample:
Simple random sampling with replacement
Population: Resample:
Why sample with replacement?
coffee_ratings: a sample of a larger population of all coffees
Each coffee in our sample represents many different hypothetical population coffees
Coffee data preparation
coffee_focus = coffee_ratings[["variety", "country_of_origin", "flavor"]]
coffee_focus = coffee_focus.reset_index()
Resampling with .sample()
coffee_resamp = coffee_focus.sample(frac=1, replace=True)
Repeated coffees
coffee_resamp["index"].value_counts()
658     5
167     4
363     4
357     4
1047    4
       ..
771     1
770     1
766     1
764     1
0       1
Name: index, Length: 868, dtype: int64
Missing coffees
num_unique_coffees = len(coffee_resamp.drop_duplicates(subset="index"))
868
len(coffee_ratings) - num_unique_coffees
470
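Having roughly a third of the rows go missing is typical of resampling with replacement: for a large dataset, about 1 - 1/e, or 63%, of the original rows are expected to appear at least once. A plain-Python sketch of the same count, using row IDs only (the seed is arbitrary, so the exact numbers will differ from those above):

```python
import random

rng = random.Random(2021)
row_ids = list(range(1338))  # one ID per original row

# Resample: draw as many rows as the original, with replacement.
resample = rng.choices(row_ids, k=len(row_ids))

num_unique = len(set(resample))
num_missing = len(row_ids) - num_unique

print(num_unique, num_missing)
print(round(num_unique / len(row_ids), 3))
```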
Bootstrapping
The opposite of sampling from a population
Bootstrapping process
1. Make a resample of the same size as the original sample
2. Calculate the statistic of interest for this bootstrap sample
3. Repeat steps 1 and 2 many times
The resulting statistics are bootstrap statistics, and they form a bootstrap distribution
Bootstrapping coffee mean flavor
import numpy as np
mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )
Bootstrap distribution histogram
import matplotlib.pyplot as plt
plt.hist(mean_flavors_1000)
plt.show()
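The bootstrapping process can also be wrapped in a small reusable function. This is a plain-Python sketch on a hypothetical sample of 500 values (not the coffee data); the function name bootstrap_means is invented for the example.

```python
import random
import statistics

rng = random.Random(2021)

# Hypothetical sample of 500 flavor-like scores.
sample = [rng.gauss(7.5, 0.35) for _ in range(500)]


def bootstrap_means(data, n_resamples, rng):
    """Return the mean of each of n_resamples resamples of data."""
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = rng.choices(data, k=len(data))
        # Calculate the statistic of interest for this resample.
        means.append(statistics.fmean(resample))
    return means


boot = bootstrap_means(sample, 1000, rng)
print(f"sample mean:          {statistics.fmean(sample):.4f}")
print(f"mean of bootstrap:    {statistics.fmean(boot):.4f}")
print(f"std dev of bootstrap: {statistics.stdev(boot):.4f}")
```

The bootstrap distribution centers on the sample mean, not necessarily on the unknown population mean.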
Comparing sampling and bootstrap distributions
Coffee focused subset
coffee_sample = coffee_ratings[["variety", "country_of_origin", "flavor"]]\
.reset_index().sample(n=500)
The bootstrap of mean coffee flavors
import numpy as np
mean_flavors_5000 = []
for i in range(5000):
    mean_flavors_5000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )
bootstrap_distn = mean_flavors_5000
Mean flavor bootstrap distribution
import matplotlib.pyplot as plt
plt.hist(bootstrap_distn, bins=15)
plt.show()
Sample, bootstrap distribution, population means
Sample mean:
coffee_sample['flavor'].mean()
7.5132200000000005
Estimated population mean:
np.mean(bootstrap_distn)
7.513357731999999
True population mean:
coffee_ratings['flavor'].mean()
7.526046337817639
Interpreting the means
Bootstrap distribution mean:
Sample sd vs. bootstrap distribution sd
Sample standard deviation:
0.3540883911928703
Estimated population standard deviation?
0.015768474367958217
Sample, bootstrap dist'n, pop'n standard deviations
Sample standard deviation: Estimated population standard deviation:
coffee_ratings['flavor'].std(ddof=0)
0.3525938058821761
Interpreting the standard errors
Estimated standard error → standard deviation of the bootstrap distribution for a sample statistic
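Since the standard error is approximately the population standard deviation divided by the square root of the sample size, multiplying the bootstrap standard error by the square root of the sample size gives an estimate of the population standard deviation. A short check with the numbers reported above (the bootstrap standard deviation 0.0158 and the 500-row sample):

```python
import math

standard_error = 0.015768474367958217  # std dev of the bootstrap distribution, from above
sample_size = 500                       # rows in coffee_sample

estimated_population_sd = standard_error * math.sqrt(sample_size)
print(round(estimated_population_sd, 4))  # 0.3526
```

This is why the bootstrap distribution's standard deviation, which looks far too small to be a population standard deviation, is still useful: it is the standard error, one square-root factor away.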
Confidence intervals
Confidence intervals
"Values within one standard deviation of the mean" includes a large number of values from each of these distributions
Predicting the weather
Rapid City, South Dakota, in the United States has the least predictable weather
Our weather prediction
Point estimate = 47°F (8.3°C)
Range of plausible high temperature values = 40 to 54°F (4.4 to 12.8°C)
We just reported a confidence interval!
40 to 54°F is a confidence interval
Sometimes written as 47°F (40°F, 54°F) or 47°F [40°F, 54°F]
Bootstrap distribution of mean flavor
import matplotlib.pyplot as plt
plt.hist(coffee_boot_distn, bins=15)
plt.show()
Mean of the resamples
import numpy as np
np.mean(coffee_boot_distn)
7.513452892
Mean plus or minus one standard deviation
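np.mean(coffee_boot_distn)
7.513452892
np.mean(coffee_boot_distn) - np.std(coffee_boot_distn, ddof=1)
7.497385709174466
np.mean(coffee_boot_distn) + np.std(coffee_boot_distn, ddof=1)
7.529520074825534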
Quantile method for confidence intervals
np.quantile(coffee_boot_distn, 0.025)
7.4817195
np.quantile(coffee_boot_distn, 0.975)
7.5448805
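The quantile method can be sketched end to end on a hypothetical sample (the data, seed, and number of resamples below are invented). The 0.025 and 0.975 quantiles of the bootstrap distribution bound the middle 95% of the bootstrap means.

```python
import numpy as np

rng = np.random.default_rng(2021)

# Hypothetical sample of 500 flavor-like scores.
sample = rng.normal(loc=7.5, scale=0.35, size=500)

# Bootstrap distribution of the mean: resample with replacement, same size.
boot_distn = [
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
]

# 95% confidence interval: cut off 2.5% in each tail.
lower, upper = np.quantile(boot_distn, [0.025, 0.975])
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```

Other confidence levels follow the same pattern, for example 0.05 and 0.95 for a 90% interval.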
Inverse cumulative distribution function
PDF: The bell curve
Standard error method for confidence interval
point_estimate = np.mean(coffee_boot_distn)
7.513452892
0.016067182825533724
(7.481961792328933, 7.544943991671067)
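The slide shows only the results. One way to arrive at such an interval is to apply the inverse cumulative distribution function of a normal distribution centered on the point estimate, with the standard error as its scale. The sketch below uses the standard library's statistics.NormalDist as a stand-in, which is an assumption about the method rather than the code behind the slide.

```python
from statistics import NormalDist

point_estimate = 7.513452892        # mean of the bootstrap distribution, from above
std_error = 0.016067182825533724    # std dev of the bootstrap distribution, from above

# Normal distribution centered on the estimate, with the standard error as its spread.
dist = NormalDist(mu=point_estimate, sigma=std_error)

# Inverse CDF at 2.5% and 97.5% gives the bounds of a 95% interval.
lower = dist.inv_cdf(0.025)
upper = dist.inv_cdf(0.975)
print((lower, upper))
```

This method relies on the bootstrap distribution being approximately normal, whereas the quantile method does not.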
Congratulations!
Recap
Chapter 1 Chapter 3
The most important things
The std. deviation of a bootstrap statistic is a good approximation of the standard error
Can assume bootstrap distributions are normally distributed for confidence intervals
What's next?
Experimental Design in Python and Customer Analytics and A/B Testing in Python
Hypothesis Testing in Python
Happy learning!