Sampling in Python

Sampling in Python is the cornerstone of inferential statistics and hypothesis testing. It's a powerful skill used in survey analysis and experimental design to draw conclusions without surveying an entire population. In this Sampling in Python course, you'll discover when to use sampling and how to perform common types of sampling, from simple random sampling to more complex methods like stratified and cluster sampling.


Sampling and point estimates
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Estimating the population of France
A census asks every household how many
people live there.

There are lots of people in France
Censuses are really expensive!

Sampling households
Cheaper to ask a small number of households
and use statistics to estimate the population

Working with a subset of the whole population is called sampling

Population vs. sample
The population is the complete dataset

Doesn't have to refer to people

Typically, don't know what the whole population is

The sample is the subset of data you calculate on

Coffee rating dataset
total_cup_points variety country_of_origin aroma flavor aftertaste body balance
90.58 NA Ethiopia 8.67 8.83 8.67 8.50 8.42
89.92 Other Ethiopia 8.75 8.67 8.50 8.42 8.42
... ... ... ... ... ... ... ...
73.75 NA Vietnam 6.75 6.67 6.5 6.92 6.83

Each row represents 1 coffee

1338 rows

We'll treat this as the population

Points vs. flavor: population
pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]

total_cup_points flavor
0 90.58 8.83
1 89.92 8.67
2 89.75 8.50
3 89.00 8.58
4 88.83 8.50
... ... ...
1333 78.75 7.58
1334 78.08 7.67
1335 77.17 7.33
1336 75.08 6.83
1337 73.75 6.67

[1338 rows x 2 columns]

Points vs. flavor: 10 row sample
pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)

total_cup_points flavor
1088 80.33 7.17
1157 79.67 7.42
1267 76.17 7.33
506 83.00 7.67
659 82.50 7.42
817 81.92 7.50
1050 80.67 7.42
685 82.42 7.50
1027 80.92 7.25
62 85.58 8.17

[10 rows x 2 columns]

Python sampling for Series
Use .sample() for pandas DataFrames and Series

cup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)

1088 80.33
1157 79.67
1267 76.17
... ...
685 82.42
1027 80.92
62 85.58
Name: total_cup_points, dtype: float64

Population parameters & point estimates
A population parameter is a calculation made on the population dataset

import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])

82.15120328849028

A point estimate or sample statistic is a calculation made on the sample dataset

np.mean(cup_points_samp)

81.31800000000001

Point estimates with pandas
pts_vs_flavor_pop['flavor'].mean()

7.526046337817639

pts_vs_flavor_samp['flavor'].mean()

7.485000000000001

Convenience sampling
The Literary Digest election prediction

Prediction: Landon gets 57%; Roosevelt gets 43%

Actual results: Landon got 38%; Roosevelt got 62%

Sample not representative of population, causing sample bias

Collecting data by the easiest method is called convenience sampling

Finding the mean age of French people
Survey 10 people at Disneyland Paris
Mean age of 24.6 years

Will this be a good estimate for all of France?

1 Image by Sean MacEntee

How accurate was the survey?
24.6 years is a poor estimate
People who visit Disneyland aren't representative of the whole population

Year Average French Age
1975 31.6
1985 33.6
1995 36.2
2005 38.9
2015 41.2

Convenience sampling coffee ratings
coffee_ratings["total_cup_points"].mean()

82.15120328849028

coffee_ratings_first10 = coffee_ratings.head(10)

coffee_ratings_first10["total_cup_points"].mean()

89.1

Visualizing selection bias
import matplotlib.pyplot as plt
import numpy as np
coffee_ratings["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()

coffee_ratings_first10["total_cup_points"].hist(bins=np.arange(59, 93, 2))


plt.show()

Distribution of a population and of a convenience sample
[Figure: two histogram panels, "Population" and "Convenience sample"]

Visualizing selection bias for a random sample
coffee_sample = coffee_ratings.sample(n=10)
coffee_sample["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()

Distribution of a population and of a simple random sample
[Figure: two histogram panels, "Population" and "Random sample"]

Pseudo-random number generation
What does random mean?
{adjective} made, done, happening, or chosen without method or conscious decision.

1 Oxford Languages

True random numbers
Generated from physical processes, like flipping coins
Hotbits uses radioactive decay

RANDOM.ORG uses atmospheric noise

True randomness is expensive

1 https://www.fourmilab.ch/hotbits 2 https://www.random.org

Pseudo-random number generation
Pseudo-random number generation is cheap and fast
Next "random" number calculated from previous "random" number

The first "random" number calculated from a seed

The same seed value yields the same random numbers

Pseudo-random number generation example
seed = 1
calc_next_random(seed)

calc_next_random(3)

calc_next_random(2)
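
The slide leaves `calc_next_random` undefined. As a rough illustration only, here is a toy linear congruential generator that plays that role. The constants `a`, `c` and `m` and the helper `generate` are assumptions chosen for the sketch; this is not the algorithm NumPy uses.

```python
# Toy linear congruential generator (LCG): an illustration only, not NumPy's algorithm.
def calc_next_random(previous, a=1664525, c=1013904223, m=2**32):
    """Return the next pseudo-random integer computed from the previous one."""
    return (a * previous + c) % m


def generate(seed, count):
    """Generate `count` pseudo-random integers, starting from `seed`."""
    values = []
    current = seed
    for _ in range(count):
        current = calc_next_random(current)
        values.append(current)
    return values


first_run = generate(seed=1, count=3)
second_run = generate(seed=1, count=3)
other_seed = generate(seed=2, count=3)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # different seed, different sequence
```

Each value is computed from the one before it, so the whole sequence is fixed once the seed is chosen.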

Random number generating functions
Prepend with numpy.random, such as numpy.random.beta()

function distribution
.beta Beta
.binomial Binomial
.chisquare Chi-squared
.exponential Exponential
.f F
.gamma Gamma
.geometric Geometric
.hypergeometric Hypergeometric
.lognormal Lognormal
.negative_binomial Negative binomial
.normal Normal
.poisson Poisson
.standard_t t
.uniform Uniform

Visualizing random numbers
randoms = np.random.beta(a=2, b=2, size=5000)
randoms

array([0.6208281 , 0.73216171, 0.44298403, ...,
       0.13411873, 0.52198411, 0.72355098])

plt.hist(randoms, bins=np.arange(0, 1, 0.05))


plt.show()

Random number seeds
First run:

np.random.seed(20000229)

np.random.normal(loc=2, scale=1.5, size=2)

array([-0.59030264, 1.87821258])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.52619561, 4.9684949 ])

Second run, same seed:

np.random.seed(20000229)

np.random.normal(loc=2, scale=1.5, size=2)

array([-0.59030264, 1.87821258])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.52619561, 4.9684949 ])

Using a different seed
Seed 20000229:

np.random.seed(20000229)

np.random.normal(loc=2, scale=1.5, size=2)

array([-0.59030264, 1.87821258])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.52619561, 4.9684949 ])

Seed 20041004:

np.random.seed(20041004)

np.random.normal(loc=2, scale=1.5, size=2)

array([1.09364337, 4.55285159])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.67038916, 2.36677492])
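
NumPy also provides a generator object through `np.random.default_rng`, which keeps the seed local to the object instead of setting global state. The sketch below reuses the seeds from the slides and shows the same behavior. The variable names are illustrative, and the values drawn will differ from the `np.random.seed` outputs above because the two interfaces use different algorithms.

```python
import numpy as np

# Two generators built from the same seed produce identical draws.
rng_a = np.random.default_rng(20000229)
rng_b = np.random.default_rng(20000229)
draws_a = rng_a.normal(loc=2, scale=1.5, size=2)
draws_b = rng_b.normal(loc=2, scale=1.5, size=2)
print(np.array_equal(draws_a, draws_b))

# A generator built from a different seed produces different draws.
rng_c = np.random.default_rng(20041004)
draws_c = rng_c.normal(loc=2, scale=1.5, size=2)
print(np.array_equal(draws_a, draws_c))
```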

Simple random and systematic sampling
Simple random sampling

Simple random sampling of coffees

Simple random sampling with pandas
coffee_ratings.sample(n=5, random_state=19000113)

total_cup_points variety country_of_origin aroma flavor \


437 83.25 None Colombia 7.92 7.75
285 83.83 Yellow Bourbon Brazil 7.92 7.50
784 82.08 None Colombia 7.50 7.42
648 82.58 Caturra Colombia 7.58 7.50
155 84.58 Caturra Colombia 7.42 7.67

aftertaste body balance


437 7.25 7.83 7.58
285 7.33 8.17 7.50
784 7.42 7.67 7.42
648 7.42 7.67 7.42
155 7.75 8.08 7.83

Systematic sampling

Systematic sampling - defining the interval
sample_size = 5
pop_size = len(coffee_ratings)
print(pop_size)

1338

interval = pop_size // sample_size


print(interval)

267

Systematic sampling - selecting the rows
coffee_ratings.iloc[::interval]

total_cup_points variety country_of_origin aroma flavor aftertaste \


0 90.58 None Ethiopia 8.67 8.83 8.67
267 83.92 None Colombia 7.83 7.75 7.58
534 82.92 Bourbon El Salvador 7.50 7.50 7.75
801 82.00 Typica Taiwan 7.33 7.50 7.17
1068 80.50 Other Taiwan 7.17 7.17 7.17

body balance
0 8.50 8.42
267 7.75 7.75
534 7.92 7.83
801 7.50 7.33
1068 7.17 7.25

The trouble with systematic sampling
coffee_ratings_with_id = coffee_ratings.reset_index()
coffee_ratings_with_id.plot(x="index", y="aftertaste", kind="scatter")
plt.show()

Systematic sampling is only safe if we don't see a pattern in this scatter plot

Making systematic sampling safe
shuffled = coffee_ratings.sample(frac=1)
shuffled = shuffled.reset_index(drop=True).reset_index()
shuffled.plot(x="index", y="aftertaste", kind="scatter")
plt.show()

Shuffling rows + systematic sampling is the same as simple random sampling
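
A minimal end-to-end sketch of systematic sampling, assuming a synthetic stand-in for `coffee_ratings` (the `coffees` DataFrame and its values are made up for illustration). With 1338 rows and an interval of 267, a plain step slice also picks up row 1335, so the sketch trims the result to `sample_size` rows. That trimming is one possible choice, not something the slides prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for coffee_ratings: 1338 rows sorted by score, best first.
rng = np.random.default_rng(0)
scores = np.sort(rng.normal(loc=82, scale=2.7, size=1338))[::-1]
coffees = pd.DataFrame({"total_cup_points": scores})

sample_size = 5
interval = len(coffees) // sample_size  # 1338 // 5 == 267

# Slicing with a step can return one extra row (here index 1335),
# so keep only the first sample_size rows.
systematic = coffees.iloc[::interval].head(sample_size)

# Shuffle first to remove any ordering pattern, then sample systematically.
shuffled = coffees.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled_systematic = shuffled.iloc[::interval].head(sample_size)

print(list(systematic.index))
print(systematic["total_cup_points"].is_monotonic_decreasing)
```

On the sorted data the systematic sample inherits the ordering of the rows, which is the pattern the scatter plot is meant to reveal. After shuffling, that ordering is gone.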

Stratified and weighted random sampling
Coffees by country
top_counts = coffee_ratings['country_of_origin'].value_counts()
top_counts.head(6)

country_of_origin
Mexico 236
Colombia 183
Guatemala 181
Brazil 132
Taiwan 75
United States (Hawaii) 73
dtype: int64

1 The dataset lists Hawaii and Taiwan as countries for convenience, as they are notable coffee-growing regions.

Filtering for 6 countries
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]

top_counted_subset = coffee_ratings['country_of_origin'].isin(top_counted_countries)

coffee_ratings_top = coffee_ratings[top_counted_subset]

Counts of a simple random sample
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)

coffee_ratings_samp['country_of_origin'].value_counts(normalize=True)

country_of_origin
Mexico 0.250000
Guatemala 0.204545
Colombia 0.181818
Brazil 0.181818
United States (Hawaii) 0.102273
Taiwan 0.079545
dtype: float64

Comparing proportions
Population:

Mexico 0.268182
Colombia 0.207955
Guatemala 0.205682
Brazil 0.150000
Taiwan 0.085227
United States (Hawaii) 0.082955
Name: country_of_origin, dtype: float64

10% simple random sample:

Mexico 0.250000
Guatemala 0.204545
Colombia 0.181818
Brazil 0.181818
United States (Hawaii) 0.102273
Taiwan 0.079545
Name: country_of_origin, dtype: float64

Proportional stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=0.1, random_state=2021)

coffee_ratings_strat['country_of_origin'].value_counts(normalize=True)

Mexico 0.272727
Guatemala 0.204545
Colombia 0.204545
Brazil 0.147727
Taiwan 0.090909
United States (Hawaii) 0.079545
Name: country_of_origin, dtype: float64

Equal counts stratified sampling
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
.sample(n=15, random_state=2021)

coffee_ratings_eq['country_of_origin'].value_counts(normalize=True)

Taiwan 0.166667
Brazil 0.166667
United States (Hawaii) 0.166667
Guatemala 0.166667
Mexico 0.166667
Colombia 0.166667
Name: country_of_origin, dtype: float64
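
To see the two stratified variants side by side without the coffee data, here is a small sketch on a made-up population (the `population` DataFrame, its group sizes and the `row_id` column are assumptions for illustration). It relies on `groupby(...).sample(...)`, as the slides do.

```python
import pandas as pd

# Hypothetical population: three groups of unequal size.
population = pd.DataFrame({
    "country": ["Mexico"] * 600 + ["Brazil"] * 300 + ["Taiwan"] * 100,
    "row_id": range(1000),
})

# Proportional stratified sampling: 10% of every group.
proportional = population.groupby("country").sample(frac=0.1, random_state=2021)

# Equal-counts stratified sampling: 15 rows from every group.
equal_counts = population.groupby("country").sample(n=15, random_state=2021)

print(proportional["country"].value_counts().to_dict())
print(equal_counts["country"].value_counts().to_dict())
```

Proportional sampling preserves each group's share of the population exactly, while equal counts give every group the same weight regardless of its size.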

Weighted random sampling
Specify weights to adjust the relative probability of a row being sampled

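import numpy as np
# Copy so that adding a column does not modify coffee_ratings_top
coffee_ratings_weight = coffee_ratings_top.copy()
condition = coffee_ratings_weight['country_of_origin'] == "Taiwan"

# Taiwanese coffees get weight 2, all other coffees get weight 1
coffee_ratings_weight['weight'] = np.where(condition, 2, 1)

coffee_ratings_weight = coffee_ratings_weight.sample(frac=0.1, weights="weight")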

Weighted random sampling results
10% weighted sample:

coffee_ratings_weight['country_of_origin'].value_counts(normalize=True)

Brazil 0.261364
Mexico 0.204545
Guatemala 0.204545
Taiwan 0.170455
Colombia 0.090909
United States (Hawaii) 0.068182
Name: country_of_origin, dtype: float64
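
As a rough check on the Taiwan share, the arithmetic below computes the probability that a single weighted draw is a Taiwanese coffee, using the country counts shown earlier. This is only a first-draw approximation (the actual sample is drawn without replacement and is subject to sampling noise), so the observed 0.170 should not be expected to match it exactly.

```python
# Counts per country taken from the value_counts() output shown earlier.
counts = {
    "Mexico": 236, "Colombia": 183, "Guatemala": 181,
    "Brazil": 132, "Taiwan": 75, "United States (Hawaii)": 73,
}
# Weight 2 for Taiwan, 1 for everything else, as in the slide.
weights = {country: (2 if country == "Taiwan" else 1) for country in counts}

total_rows = sum(counts.values())
total_weight = sum(counts[c] * weights[c] for c in counts)

unweighted_share = counts["Taiwan"] / total_rows
weighted_share = counts["Taiwan"] * weights["Taiwan"] / total_weight

print(total_rows)                  # 880
print(round(unweighted_share, 3))  # 0.085
print(round(weighted_share, 3))    # 0.157
```

Doubling the weight roughly doubles Taiwan's expected share, from about 8.5% to about 15.7%.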

Cluster sampling
Stratified sampling vs. cluster sampling
Stratified sampling
Split the population into subgroups

Use simple random sampling on every subgroup

Cluster sampling
Use simple random sampling to pick some subgroups

Use simple random sampling on only those subgroups

Varieties of coffee
varieties_pop = list(coffee_ratings['variety'].unique())

[None, 'Other', 'Bourbon', 'Catimor',


'Ethiopian Yirgacheffe','Caturra',
'SL14', 'Sumatra', 'SL34', 'Hawaiian Kona',
'Yellow Bourbon', 'SL28', 'Gesha', 'Catuai',
'Pacamara', 'Typica', 'Sumatra Lintong',
'Mundo Novo', 'Java', 'Peaberry', 'Pacas',
'Mandheling', 'Ruiru 11', 'Arusha',
'Ethiopian Heirlooms', 'Moka Peaberry',
'Sulawesi', 'Blue Mountain', 'Marigojipe',
'Pache Comun']

Stage 1: sampling for subgroups
import random
varieties_samp = random.sample(varieties_pop, k=3)

['Hawaiian Kona', 'Bourbon', 'SL28']

Stage 2: sampling each group
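variety_condition = coffee_ratings['variety'].isin(varieties_samp)
# Copy the filtered rows so the column assignment below does not act on a view
coffee_ratings_cluster = coffee_ratings[variety_condition].copy()

# Assumes 'variety' has categorical dtype; drops the categories that were filtered out
coffee_ratings_cluster['variety'] = coffee_ratings_cluster['variety'].cat.remove_unused_categories()

coffee_ratings_cluster.groupby("variety")\
    .sample(n=5, random_state=2021)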

Stage 2 output
total_cup_points variety country_of_origin ...
variety
Bourbon 575 82.83 Bourbon Guatemala
560 82.83 Bourbon Guatemala
524 83.00 Bourbon Guatemala
1140 79.83 Bourbon Guatemala
318 83.67 Bourbon Brazil
Hawaiian Kona 1291 73.67 Hawaiian Kona United States (Hawaii)
1266 76.25 Hawaiian Kona United States (Hawaii)
488 83.08 Hawaiian Kona United States (Hawaii)
461 83.17 Hawaiian Kona United States (Hawaii)
117 84.83 Hawaiian Kona United States (Hawaii)
SL28 137 84.67 SL28 Kenya
452 83.17 SL28 Kenya
224 84.17 SL28 Kenya
66 85.50 SL28 Kenya
559 82.83 SL28 Kenya

Multistage sampling
Cluster sampling is a type of multistage sampling
Can have > 2 stages

E.g., countrywide surveys may sample states, counties, cities, and neighborhoods
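
The two stages can be sketched on a small made-up population as follows. The `population` DataFrame, the variety list and the seeds are assumptions for illustration, not the course data.

```python
import random
import pandas as pd

# Hypothetical population: 6 varieties with 50 coffees each.
varieties = ["Bourbon", "Caturra", "Typica", "Gesha", "SL28", "Catuai"]
population = pd.DataFrame({
    "variety": [v for v in varieties for _ in range(50)],
    "row_id": range(300),
})

# Stage 1: pick 3 subgroups (clusters) at random.
random.seed(2021)
varieties_samp = random.sample(varieties, k=3)

# Stage 2: simple random sample of 5 rows within each chosen cluster.
# .copy() gives an independent DataFrame rather than a view of the population.
chosen = population[population["variety"].isin(varieties_samp)].copy()
cluster_sample = chosen.groupby("variety").sample(n=5, random_state=2021)

print(sorted(cluster_sample["variety"].unique()) == sorted(varieties_samp))
print(len(cluster_sample))  # 15
```

Only the three chosen varieties appear in the result, which is what distinguishes cluster sampling from stratified sampling.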

Comparing sampling methods
Review of sampling techniques - setup
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]

subset_condition = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[subset_condition]

coffee_ratings_top.shape

(880, 8)

Review of simple random sampling
coffee_ratings_srs = coffee_ratings_top.sample(frac=1/3, random_state=2021)

coffee_ratings_srs.shape

(293, 8)

Review of stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=1/3, random_state=2021)

coffee_ratings_strat.shape

(293, 8)

Review of cluster sampling
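import random
top_countries_samp = random.sample(top_counted_countries, k=2)
top_condition = coffee_ratings_top['country_of_origin'].isin(top_countries_samp)
# Copy the filtered rows so the column assignment below does not act on a view
coffee_ratings_cluster = coffee_ratings_top[top_condition].copy()
# Assumes 'country_of_origin' has categorical dtype
coffee_ratings_cluster['country_of_origin'] = coffee_ratings_cluster['country_of_origin']\
    .cat.remove_unused_categories()

# Raises ValueError if a chosen country has fewer rows than requested (e.g. Taiwan, 75 rows)
coffee_ratings_clust = coffee_ratings_cluster.groupby("country_of_origin")\
    .sample(n=len(coffee_ratings_top) // 6)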

coffee_ratings_clust.shape

(292, 8)

Calculating mean cup points
Population:

coffee_ratings_top['total_cup_points'].mean()

81.94700000000002

Simple random sample:

coffee_ratings_srs['total_cup_points'].mean()

81.95982935153583

Stratified sample:

coffee_ratings_strat['total_cup_points'].mean()

81.92566552901025

Cluster sample:

coffee_ratings_clust['total_cup_points'].mean()

82.03246575342466

Mean cup points by country: simple random
Population:

coffee_ratings_top.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Brazil 82.405909
Colombia 83.106557
Guatemala 81.846575
Mexico 80.890085
Taiwan 82.001333
United States (Hawaii) 81.820411
Name: total_cup_points, dtype: float64

Simple random sample:

coffee_ratings_srs.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Brazil 82.414878
Colombia 82.925536
Guatemala 82.045385
Mexico 81.100714
Taiwan 81.744333
United States (Hawaii) 82.008000
Name: total_cup_points, dtype: float64

Mean cup points by country: stratified
Population:

coffee_ratings_top.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Brazil 82.405909
Colombia 83.106557
Guatemala 81.846575
Mexico 80.890085
Taiwan 82.001333
United States (Hawaii) 81.820411
Name: total_cup_points, dtype: float64

Stratified sample:

coffee_ratings_strat.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Brazil 82.499773
Colombia 83.288197
Guatemala 81.727667
Mexico 80.994684
Taiwan 81.846800
United States (Hawaii) 81.051667
Name: total_cup_points, dtype: float64

Mean cup points by country: cluster
Population:

coffee_ratings_top.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Brazil 82.405909
Colombia 83.106557
Guatemala 81.846575
Mexico 80.890085
Taiwan 82.001333
United States (Hawaii) 81.820411
Name: total_cup_points, dtype: float64

Cluster sample:

coffee_ratings_clust.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Colombia 83.128904
Mexico 80.936027
Name: total_cup_points, dtype: float64

Relative error of point estimates
Sample size is number of rows
len(coffee_ratings.sample(n=300))

300

len(coffee_ratings.sample(frac=0.25))

334

Various sample sizes
coffee_ratings['total_cup_points'].mean()

82.15120328849028

coffee_ratings.sample(n=10)['total_cup_points'].mean()

83.027

coffee_ratings.sample(n=100)['total_cup_points'].mean()

82.4897

coffee_ratings.sample(n=1000)['total_cup_points'].mean()

82.1186

Relative errors
Population parameter:

population_mean = coffee_ratings['total_cup_points'].mean()

Point estimate:

sample_mean = coffee_ratings.sample(n=sample_size)['total_cup_points'].mean()

Relative error as a percentage:

rel_error_pct = 100 * abs(population_mean-sample_mean) / population_mean

Relative error vs. sample size
import matplotlib.pyplot as plt
errors.plot(x="sample_size",
y="relative_error",
kind="line")
plt.show()

Properties:

Really noisy, particularly for small samples

Amplitude is initially steep, then flattens

Relative error decreases to zero (when the sample size = population size)
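
The `errors` DataFrame plotted above is not defined on the slide. One way it might be built is sketched below, using a synthetic population in place of `coffee_ratings['total_cup_points']`. The step of 7 between sample sizes and the per-size `random_state` are arbitrary choices for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the population column.
rng = np.random.default_rng(0)
population = pd.Series(rng.normal(loc=82, scale=2.7, size=1338))
population_mean = population.mean()

rows = []
for sample_size in range(1, len(population) + 1, 7):
    # Point estimate from one simple random sample of this size.
    sample_mean = population.sample(n=sample_size, random_state=sample_size).mean()
    rel_error_pct = 100 * abs(population_mean - sample_mean) / population_mean
    rows.append({"sample_size": sample_size, "relative_error": rel_error_pct})

errors = pd.DataFrame(rows)
print(errors.head())

# Sampling every row without replacement reproduces the population mean
# (up to floating-point rounding), so the relative error is essentially zero.
full_sample_mean = population.sample(n=len(population), random_state=0).mean()
print(100 * abs(population_mean - full_sample_mean) / population_mean)
```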

Creating a sampling distribution
Same code, different answer
coffee_ratings.sample(n=30)['total_cup_points'].mean()

82.53066666666668

coffee_ratings.sample(n=30)['total_cup_points'].mean()

81.97566666666667

coffee_ratings.sample(n=30)['total_cup_points'].mean()

82.68

coffee_ratings.sample(n=30)['total_cup_points'].mean()

81.675

Same code, 1000 times
mean_cup_points_1000 = []
for i in range(1000):
    mean_cup_points_1000.append(
        coffee_ratings.sample(n=30)['total_cup_points'].mean()
    )
print(mean_cup_points_1000)

[82.11933333333333, 82.55300000000001, 82.07266666666668, 81.76966666666667,


...
82.74166666666666, 82.45033333333335, 81.77199999999999, 82.8163333333333]

Distribution of sample means for size 30
import matplotlib.pyplot as plt
plt.hist(mean_cup_points_1000, bins=30)
plt.show()

A sampling distribution is a distribution of replicates of point estimates.

Different sample sizes
Sample size: 6 Sample size: 150

Approximate sampling distributions
4 dice
dice = expand_grid(
    {'die1': [1, 2, 3, 4, 5, 6],
     'die2': [1, 2, 3, 4, 5, 6],
     'die3': [1, 2, 3, 4, 5, 6],
     'die4': [1, 2, 3, 4, 5, 6]
    }
)

die1 die2 die3 die4
0 1 1 1 1
1 1 1 1 2
2 1 1 1 3
3 1 1 1 4
4 1 1 1 5
... ... ... ... ...
1291 6 6 6 2
1292 6 6 6 3
1293 6 6 6 4
1294 6 6 6 5
1295 6 6 6 6

[1296 rows x 4 columns]
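
`expand_grid` is not a built-in pandas function, and the slides do not show its definition. A minimal sketch of a helper with that behavior, assuming it should return every combination of the input values with the last column varying fastest, could use `itertools.product`:

```python
from itertools import product

import pandas as pd


def expand_grid(data_dict):
    """Return a DataFrame with one row per combination of the input values."""
    rows = product(*data_dict.values())
    return pd.DataFrame(list(rows), columns=list(data_dict.keys()))


faces = [1, 2, 3, 4, 5, 6]
dice = expand_grid({"die1": faces, "die2": faces, "die3": faces, "die4": faces})

print(dice.shape)  # (1296, 4)
print(dice.iloc[0].tolist(), dice.iloc[-1].tolist())
```

Four dice with six faces each give 6**4 = 1296 rows, matching the table above.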

Mean roll
dice['mean_roll'] = (dice['die1'] +
                     dice['die2'] +
                     dice['die3'] +
                     dice['die4']) / 4
print(dice)

die1 die2 die3 die4 mean_roll
0 1 1 1 1 1.00
1 1 1 1 2 1.25
2 1 1 1 3 1.50
3 1 1 1 4 1.75
4 1 1 1 5 2.00
... ... ... ... ... ...
1291 6 6 6 2 5.00
1292 6 6 6 3 5.25
1293 6 6 6 4 5.50
1294 6 6 6 5 5.75
1295 6 6 6 6 6.00

[1296 rows x 5 columns]

Exact sampling distribution
dice['mean_roll'] = dice['mean_roll'].astype('category')
dice['mean_roll'].value_counts(sort=False).plot(kind="bar")

The number of outcomes increases fast
import pandas as pd
import matplotlib.pyplot as plt

n_dice = list(range(1, 101))
n_outcomes = []
for n in n_dice:
    n_outcomes.append(6**n)

outcomes = pd.DataFrame(
    {"n_dice": n_dice,
     "n_outcomes": n_outcomes})

outcomes.plot(x="n_dice",
              y="n_outcomes",
              kind="scatter")
plt.show()

Simulating the mean of four dice rolls
import numpy as np

np.random.choice(list(range(1, 7)), size=4, replace=True).mean()

Simulating the mean of four dice rolls
import numpy as np
sample_means_1000 = []
for i in range(1000):
    sample_means_1000.append(
        np.random.choice(list(range(1, 7)), size=4, replace=True).mean()
    )
print(sample_means_1000)

[3.25, 3.25, 1.75, 2.0, 2.0, 1.0, 1.0, 2.75, 2.75, 2.5, 3.0, 2.0, 2.75,
...
1.25, 2.0, 2.5, 2.5, 3.75, 1.5, 1.75, 2.25, 2.0, 1.5, 3.25, 3.0, 3.5]

Approximate sampling distribution
plt.hist(sample_means_1000, bins=20)

Standard errors and the Central Limit Theorem
Sampling distribution of mean cup points
Sample size: 5 Sample size: 20

Sample size: 80 Sample size: 320

Consequences of the central limit theorem

Averages of independent samples have approximately normal distributions.

As the sample size increases,

The distribution of the averages gets closer to being normally distributed

The width of the sampling distribution gets narrower
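
Both consequences can be checked with a small simulation. The sketch below assumes a deliberately skewed synthetic population (exponential) rather than the coffee data, draws replicates with replacement for speed, and uses a hand-written `skewness` helper. All of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)


def skewness(values):
    """Sample skewness: mean of the cubed standardized values (0 for a symmetric shape)."""
    values = np.asarray(values)
    standardized = (values - values.mean()) / values.std()
    return np.mean(standardized ** 3)


# Hypothetical right-skewed population (exponential), far from normal.
population = rng.exponential(scale=1.0, size=100_000)

results = {}
for sample_size in [5, 20, 80, 320]:
    # 2000 replicate sample means, drawn with replacement for speed.
    samples = rng.choice(population, size=(2000, sample_size), replace=True)
    sample_means = samples.mean(axis=1)
    results[sample_size] = (skewness(sample_means), sample_means.std(ddof=1))

for sample_size, (skew, spread) in results.items():
    print(sample_size, round(skew, 2), round(spread, 3))
```

As the sample size grows, the skewness of the sample means moves toward zero (closer to normal) and their spread shrinks (narrower distribution).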

Population & sampling distribution means
coffee_ratings['total_cup_points'].mean()

82.15120328849028

Use np.mean() on each approximate sampling distribution:

Sample size Mean sample mean
5 82.18420719999999
20 82.1558634
80 82.14510154999999
320 82.154017925

Population & sampling distribution standard deviations
coffee_ratings['total_cup_points'].std(ddof=0)

2.685858187306438

Sample size Std dev sample mean
5 1.1886358227738543
20 0.5940321141669805
80 0.2934024263916487
320 0.13095083089190876

Specify ddof=0 when calling .std() on populations

Specify ddof=1 when calling np.std() on samples or sampling distributions

Population standard deviation over square root of sample size
Sample size Std dev sample mean Calculation Result
5 1.1886358227738543 2.685858187306438 / sqrt(5) 1.201

20 0.5940321141669805 2.685858187306438 / sqrt(20) 0.601

80 0.2934024263916487 2.685858187306438 / sqrt(80) 0.300

320 0.13095083089190876 2.685858187306438 / sqrt(320) 0.150

Standard error
Standard deviation of the sampling distribution
Important tool in understanding sampling variability
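
The relationship between the standard error and the population standard deviation divided by the square root of the sample size can be checked numerically. This sketch assumes a synthetic normal population and 2000 replicates. Those numbers are arbitrary and only meant to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population with a known standard deviation.
population = rng.normal(loc=82, scale=2.7, size=50_000)
pop_sd = population.std(ddof=0)

sample_size = 80
# Approximate sampling distribution: 2000 replicate sample means.
sample_means = [
    rng.choice(population, size=sample_size, replace=False).mean()
    for _ in range(2000)
]

# Standard error: standard deviation of the sampling distribution.
standard_error = np.std(sample_means, ddof=1)
# Prediction: population standard deviation over square root of sample size.
predicted = pop_sd / np.sqrt(sample_size)

print(round(standard_error, 3), round(predicted, 3))
```

The two printed values should be close, though not identical, because the sampling distribution is itself only approximated.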

Introduction to bootstrapping
With or without
Sampling without replacement: Sampling with replacement ("resampling"):

Simple random sampling without replacement
Population: Sample:

Simple random sampling with replacement
Population: Resample:

Why sample with replacement?
coffee_ratings : a sample of a larger population of all coffees

Each coffee in our sample represents many different hypothetical population coffees

Sampling with replacement is a proxy

Coffee data preparation
coffee_focus = coffee_ratings[["variety", "country_of_origin", "flavor"]]
coffee_focus = coffee_focus.reset_index()

index variety country_of_origin flavor


0 0 None Ethiopia 8.83
1 1 Other Ethiopia 8.67
2 2 Bourbon Guatemala 8.50
3 3 None Ethiopia 8.58
4 4 Other Ethiopia 8.50
... ... ... ... ...
1333 1333 None Ecuador 7.58
1334 1334 None Ecuador 7.67
1335 1335 None United States 7.33
1336 1336 None India 6.83
1337 1337 None Vietnam 6.67

[1338 rows x 4 columns]

Resampling with .sample()
coffee_resamp = coffee_focus.sample(frac=1, replace=True)

index variety country_of_origin flavor


1140 1140 Bourbon Guatemala 7.25
57 57 Bourbon Guatemala 8.00
1152 1152 Bourbon Mexico 7.08
621 621 Caturra Thailand 7.50
44 44 SL28 Kenya 8.08
... ... ... ... ...
996 996 Typica Mexico 7.33
1090 1090 Bourbon Guatemala 7.33
918 918 Other Guatemala 7.42
249 249 Caturra Colombia 7.67
467 467 Caturra Colombia 7.50

[1338 rows x 4 columns]

Repeated coffees
coffee_resamp["index"].value_counts()

658 5
167 4
363 4
357 4
1047 4
..
771 1
770 1
766 1
764 1
0 1
Name: index, Length: 868, dtype: int64

Missing coffees
num_unique_coffees = len(coffee_resamp.drop_duplicates(subset="index"))

868

len(coffee_ratings) - num_unique_coffees

470
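
A short calculation shows why roughly a third of the coffees go missing. When n rows are resampled with replacement, the chance that a particular row is never drawn is (1 - 1/n)**n, which is close to 1/e. The sketch below works this out for n = 1338; the variable names are illustrative.

```python
n = 1338  # number of rows in the original sample

# Probability that a given row is never picked in n draws with replacement.
prob_missing = (1 - 1 / n) ** n

expected_missing = n * prob_missing
expected_unique = n - expected_missing

print(round(prob_missing, 3))   # 0.368
print(round(expected_unique))   # 846
```

The observed 868 unique coffees is in the same range as this expectation; each resample will give a slightly different count.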

Bootstrapping
The opposite of sampling from a population

Sampling: going from a population to a smaller sample

Bootstrapping: building up a theoretical population from the sample

Bootstrapping use case:

Develop understanding of sampling variability using a single sample

Bootstrapping process
1. Make a resample of the same size as the original sample
2. Calculate the statistic of interest for this bootstrap sample

3. Repeat steps 1 and 2 many times

The resulting statistics are bootstrap statistics, and they form a bootstrap distribution
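
The three steps can be sketched on a synthetic sample as follows. The `sample` DataFrame, its 500 made-up flavor scores and the per-iteration `random_state` are assumptions for illustration; they stand in for a real single sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)

# Hypothetical sample of 500 flavor scores.
sample = pd.DataFrame({"flavor": rng.normal(loc=7.5, scale=0.35, size=500)})

bootstrap_means = []
for i in range(1000):
    # Step 1: resample with replacement, same size as the original sample.
    resample = sample.sample(frac=1, replace=True, random_state=i)
    # Step 2: calculate the statistic of interest for this resample.
    bootstrap_means.append(resample["flavor"].mean())
# Step 3 is the loop itself: repeat steps 1 and 2 many times.

# The spread of the bootstrap distribution estimates the standard error.
standard_error = np.std(bootstrap_means, ddof=1)
print(round(np.mean(bootstrap_means), 3), round(sample["flavor"].mean(), 3))
print(round(standard_error * np.sqrt(len(sample)), 3), round(sample["flavor"].std(), 3))
```

The first line compares the bootstrap distribution mean with the sample mean. The second compares the standard error times the square root of the sample size with the sample standard deviation.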

Bootstrapping coffee mean flavor
import numpy as np
mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )

Bootstrap distribution histogram
import matplotlib.pyplot as plt
plt.hist(mean_flavors_1000)
plt.show()

Comparing sampling and bootstrap distributions
Coffee focused subset
coffee_sample = coffee_ratings[["variety", "country_of_origin", "flavor"]]\
.reset_index().sample(n=500)

index variety country_of_origin flavor


132 132 Other Costa Rica 7.58
51 51 None United States (Hawaii) 8.17
42 42 Yellow Bourbon Brazil 7.92
569 569 Bourbon Guatemala 7.67
.. ... ... ... ...
643 643 Catuai Costa Rica 7.42
356 356 Caturra Colombia 7.58
494 494 None Indonesia 7.58
169 169 None Brazil 7.81

[500 rows x 4 columns]

The bootstrap of mean coffee flavors
import numpy as np
mean_flavors_5000 = []
for i in range(5000):
    mean_flavors_5000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )
bootstrap_distn = mean_flavors_5000

Mean flavor bootstrap distribution
import matplotlib.pyplot as plt
plt.hist(bootstrap_distn, bins=15)
plt.show()

Sample, bootstrap distribution, population means
Sample mean:

coffee_sample['flavor'].mean()

7.5132200000000005

Estimated population mean:

np.mean(bootstrap_distn)

7.513357731999999

True population mean:

coffee_ratings['flavor'].mean()

7.526046337817639

Interpreting the means
Bootstrap distribution mean:

Usually close to the sample mean

May not be a good estimate of the population mean


Bootstrapping cannot correct biases from sampling

Sample sd vs. bootstrap distribution sd
Sample standard deviation:

coffee_sample['flavor'].std()

0.3540883911928703

Estimated population standard deviation?

np.std(bootstrap_distn, ddof=1)

0.015768474367958217

Sample, bootstrap dist'n, pop'n standard deviations
Sample standard deviation:

coffee_sample['flavor'].std()

0.3540883911928703

True standard deviation:

coffee_ratings['flavor'].std(ddof=0)

0.34125481224622645

Estimated population standard deviation:

standard_error = np.std(bootstrap_distn, ddof=1)

Standard error is the standard deviation of the statistic of interest

standard_error * np.sqrt(500)

0.3525938058821761

Standard error times square root of sample size estimates the population standard deviation

Interpreting the standard errors
Estimated standard error → standard deviation of the bootstrap distribution for a sample statistic

Population std. dev ≈ Std. Error × √Sample size

Confidence intervals
Confidence intervals
"Values within one standard deviation of the mean" includes a large number of values from
each of these distributions

We'll define a related concept called a confidence interval

Predicting the weather
Rapid City, South Dakota in the United States has the least predictable weather

Our job is to predict the high temperature there tomorrow

Our weather prediction
Point estimate = 47°F (8.3°C)
Range of plausible high temperature values = 40 to 54°F (4.4 to 12.2°C)

We just reported a confidence interval!
40 to 54°F is a confidence interval
Sometimes written as 47°F (40°F, 54°F) or 47°F [40°F, 54°F]

... or, 47 ± 7°F

7°F is the margin of error

Bootstrap distribution of mean flavor
import matplotlib.pyplot as plt
plt.hist(coffee_boot_distn, bins=15)
plt.show()

Mean of the resamples
import numpy as np
np.mean(coffee_boot_distn)

7.513452892

Mean plus or minus one standard deviation
np.mean(coffee_boot_distn)

7.513452892

np.mean(coffee_boot_distn) - np.std(coffee_boot_distn, ddof=1)

7.497385709174466

np.mean(coffee_boot_distn) + np.std(coffee_boot_distn, ddof=1)

7.529520074825534

Quantile method for confidence intervals
np.quantile(coffee_boot_distn, 0.025)

7.4817195

np.quantile(coffee_boot_distn, 0.975)

7.5448805

Inverse cumulative distribution function
PDF: The bell curve

CDF: integrate to get area under bell curve

Inv. CDF: flip x and y axes

Implemented in Python with

from scipy.stats import norm


norm.ppf(quantile, loc=0, scale=1)
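
A quick illustration of `norm.ppf` as the inverse of the CDF, using the standard normal distribution. The variable names are illustrative.

```python
from scipy.stats import norm

# For a standard normal, the 2.5% and 97.5% quantiles are about -1.96 and +1.96.
lower = norm.ppf(0.025, loc=0, scale=1)
upper = norm.ppf(0.975, loc=0, scale=1)
print(round(lower, 2), round(upper, 2))  # -1.96 1.96

# Applying the CDF to a quantile recovers the original probability.
print(round(norm.cdf(upper), 3))  # 0.975
```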

Standard error method for confidence interval
point_estimate = np.mean(coffee_boot_distn)

7.513452892

std_error = np.std(coffee_boot_distn, ddof=1)

0.016067182825533724

from scipy.stats import norm


lower = norm.ppf(0.025, loc=point_estimate, scale=std_error)
upper = norm.ppf(0.975, loc=point_estimate, scale=std_error)
print((lower, upper))

(7.481961792328933, 7.544943991671067)
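
To compare the quantile method and the standard error method directly, the sketch below applies both to the same bootstrap distribution. The distribution here is synthetic (`boot_distn`, drawn from a normal distribution with made-up parameters), so the two intervals agree closely by construction. With a real bootstrap distribution they can differ more if its shape is not close to normal.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Hypothetical bootstrap distribution of a mean (roughly normal by construction).
boot_distn = rng.normal(loc=7.51, scale=0.016, size=5000)

# Quantile method: middle 95% of the bootstrap distribution.
quantile_ci = (np.quantile(boot_distn, 0.025), np.quantile(boot_distn, 0.975))

# Standard error method: normal quantiles centered on the point estimate.
point_estimate = np.mean(boot_distn)
std_error = np.std(boot_distn, ddof=1)
se_ci = (
    norm.ppf(0.025, loc=point_estimate, scale=std_error),
    norm.ppf(0.975, loc=point_estimate, scale=std_error),
)

print(np.round(quantile_ci, 3), np.round(se_ci, 3))
```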

Congratulations!
Recap
Chapter 1

Sampling basics
Selection bias
Pseudo-random numbers

Chapter 2

Simple random sampling
Systematic sampling
Stratified sampling
Cluster sampling

Chapter 3

Sample size and population parameters
Creating sampling distributions
Approximate vs. actual sampling dist'ns
Central limit theorem

Chapter 4

Bootstrapping from a single sample
Standard error
Confidence intervals

The most important things

The std. deviation of a bootstrap statistic is a good approximation of the standard error

Can assume bootstrap distributions are normally distributed for confidence intervals

What's next?
Experimental Design in Python and Customer Analytics and A/B Testing in Python
Hypothesis Testing in Python

Foundations of Probability in Python and Bayesian Data Analysis in Python

Happy learning!