Sampling in Python

Sampling in Python is the cornerstone of inferential statistics and hypothesis testing. It's a powerful skill used in survey analysis and experimental design to draw conclusions without surveying an entire population. In this Sampling in Python course, you’ll discover when to use sampling and how to perform common types of sampling, from simple random sampling to more complex methods like stratified and cluster sampling.

Uploaded by

jcmayac


Sampling and point estimates
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Estimating the population of France
A census asks every household how many
people live there.

SAMPLING IN PYTHON
There are lots of people in France
Censuses are really expensive!

SAMPLING IN PYTHON
Sampling households
Cheaper to ask a small number of households and use statistics to estimate the population

Working with a subset of the whole population is called sampling

SAMPLING IN PYTHON
Population vs. sample
The population is the complete dataset

Doesn't have to refer to people

Typically, don't know what the whole population is

The sample is the subset of data you calculate on

SAMPLING IN PYTHON
Coffee rating dataset
total_cup_points  variety  country_of_origin  aroma  flavor  aftertaste  body  balance
90.58             NA       Ethiopia           8.67   8.83    8.67        8.50  8.42
89.92             Other    Ethiopia           8.75   8.67    8.50        8.42  8.42
...               ...      ...                ...    ...     ...         ...   ...
73.75             NA       Vietnam            6.75   6.67    6.5         6.92  6.83

Each row represents 1 coffee

1338 rows

We'll treat this as the population

SAMPLING IN PYTHON
Points vs. flavor: population
pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]

total_cup_points flavor
0 90.58 8.83
1 89.92 8.67
2 89.75 8.50
3 89.00 8.58
4 88.83 8.50
... ... ...
1333 78.75 7.58
1334 78.08 7.67
1335 77.17 7.33
1336 75.08 6.83
1337 73.75 6.67

[1338 rows x 2 columns]

SAMPLING IN PYTHON
Points vs. flavor: 10 row sample
pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)

total_cup_points flavor
1088 80.33 7.17
1157 79.67 7.42
1267 76.17 7.33
506 83.00 7.67
659 82.50 7.42
817 81.92 7.50
1050 80.67 7.42
685 82.42 7.50
1027 80.92 7.25
62 85.58 8.17

[10 rows x 2 columns]

SAMPLING IN PYTHON
Python sampling for Series
Use .sample() for pandas DataFrames and Series

cup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)

1088 80.33
1157 79.67
1267 76.17
... ...
685 82.42
1027 80.92
62 85.58
Name: total_cup_points, dtype: float64

SAMPLING IN PYTHON
Population parameters & point estimates
A population parameter is a calculation made on the population dataset

import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])

82.15120328849028

A point estimate or sample statistic is a calculation made on the sample dataset

np.mean(cup_points_samp)

81.31800000000001

SAMPLING IN PYTHON
Point estimates with pandas
pts_vs_flavor_pop['flavor'].mean()

7.526046337817639

pts_vs_flavor_samp['flavor'].mean()

7.485000000000001
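
The difference between a population parameter and a point estimate can be tried out without the coffee data. The sketch below is only an illustration: the `population` DataFrame is synthetic (normally distributed scores chosen to resemble `total_cup_points`), and the seed values are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the coffee data: 1338 synthetic scores.
rng = np.random.default_rng(seed=42)
population = pd.DataFrame({
    "total_cup_points": rng.normal(loc=82, scale=2.7, size=1338).round(2)
})

# Population parameter: the calculation uses every row.
pop_mean = population["total_cup_points"].mean()

# Point estimate: the same calculation on a random 10-row sample.
sample = population.sample(n=10, random_state=42)
samp_mean = sample["total_cup_points"].mean()

print(f"population mean: {pop_mean:.3f}")
print(f"sample mean:     {samp_mean:.3f}")
```

The two printed values should be close but not identical, which is the same pattern the slides show for `flavor`.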

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Convenience sampling
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
The Literary Digest election prediction

Prediction: Landon gets 57%; Roosevelt gets 43%

Actual results: Landon got 38%; Roosevelt got 62%

Sample not representative of population, causing sample bias

Collecting data by the easiest method is called convenience sampling

SAMPLING IN PYTHON
Finding the mean age of French people
Survey 10 people at Disneyland Paris
Mean age of 24.6 years

Will this be a good estimate for all of France?

1 Image by Sean MacEntee

SAMPLING IN PYTHON
How accurate was the survey?
24.6 years is a poor estimate

People who visit Disneyland aren't representative of the whole population

Year  Average French Age
1975  31.6
1985  33.6
1995  36.2
2005  38.9
2015  41.2

SAMPLING IN PYTHON
Convenience sampling coffee ratings
coffee_ratings["total_cup_points"].mean()

82.15120328849028

coffee_ratings_first10 = coffee_ratings.head(10)

coffee_ratings_first10["total_cup_points"].mean()

89.1
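
Why the first 10 rows overshoot can be reproduced on made-up data. This is a sketch under one stated assumption: the table is sorted from the highest score to the lowest, as the coffee rows appear to be. The `ratings` DataFrame and its values are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical scores, sorted from best to worst (assumed ordering).
scores = np.sort(rng.normal(loc=82, scale=2.7, size=1338))[::-1]
ratings = pd.DataFrame({"total_cup_points": scores})

pop_mean = ratings["total_cup_points"].mean()

# Convenience sample: the first 10 rows are the 10 best scores.
convenience_mean = ratings.head(10)["total_cup_points"].mean()

# Simple random sample: every row is equally likely to be picked.
random_mean = ratings.sample(n=10, random_state=0)["total_cup_points"].mean()

print(f"population:  {pop_mean:.2f}")
print(f"convenience: {convenience_mean:.2f}")
print(f"random:      {random_mean:.2f}")
```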

SAMPLING IN PYTHON
Visualizing selection bias
import matplotlib.pyplot as plt
import numpy as np
coffee_ratings["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()

coffee_ratings_first10["total_cup_points"].hist(bins=np.arange(59, 93, 2))

plt.show()

SAMPLING IN PYTHON
Distribution of a population and of a convenience sample
[Figure: two histograms of total_cup_points. Left panel: "Population". Right panel: "Convenience sample".]

SAMPLING IN PYTHON
Visualizing selection bias for a random sample
coffee_sample = coffee_ratings.sample(n=10)
coffee_sample["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()

SAMPLING IN PYTHON
Distribution of a population and of a simple random sample
[Figure: two histograms of total_cup_points. Left panel: "Population". Right panel: "Random sample".]

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Pseudo-random number generation
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
What does random mean?
{adjective} made, done, happening, or chosen without method or conscious decision.

1 Oxford Languages

SAMPLING IN PYTHON
True random numbers
Generated from physical processes, like flipping coins
Hotbits uses radioactive decay

RANDOM.ORG uses atmospheric noise

True randomness is expensive

1 https://www.fourmilab.ch/hotbits 2 https://www.random.org

SAMPLING IN PYTHON
Pseudo-random number generation
Pseudo-random number generation is cheap and fast
Next "random" number calculated from previous "random" number

The first "random" number calculated from a seed

The same seed value yields the same random numbers

SAMPLING IN PYTHON
Pseudo-random number generation example
seed = 1
calc_next_random(seed)

calc_next_random(3)

calc_next_random(2)
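
The slides do not say how `calc_next_random` is implemented. As a purely illustrative assumption, the sketch below defines it as a toy linear congruential generator (the constants `a`, `c`, and `m` are hypothetical choices, not the course's) to show that each value is computed from the previous one and that the seed fixes the whole sequence.

```python
def calc_next_random(previous, a=1103515245, c=12345, m=2**31):
    """Toy linear congruential generator (an assumed implementation)."""
    return (a * previous + c) % m


def random_sequence(seed, n):
    """Return n pseudo-random integers, each computed from the one before."""
    values = []
    current = seed
    for _ in range(n):
        current = calc_next_random(current)
        values.append(current)
    return values


first_run = random_sequence(seed=1, n=5)
second_run = random_sequence(seed=1, n=5)
other_seed = random_sequence(seed=2, n=5)

print(first_run == second_run)  # True: same seed, same numbers
print(first_run == other_seed)  # False: different seed, different numbers
```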

SAMPLING IN PYTHON
Random number generating functions
Prepend with numpy.random, such as numpy.random.beta()

function            distribution
.beta               Beta
.binomial           Binomial
.chisquare          Chi-squared
.exponential        Exponential
.f                  F
.gamma              Gamma
.geometric          Geometric
.hypergeometric     Hypergeometric
.lognormal          Lognormal
.negative_binomial  Negative binomial
.normal             Normal
.poisson            Poisson
.standard_t         t
.uniform            Uniform

SAMPLING IN PYTHON
Visualizing random numbers
randoms = np.random.beta(a=2, b=2, size=5000)
randoms

array([0.6208281 , 0.73216171, 0.44298403, ...,

0.13411873, 0.52198411, 0.72355098])

plt.hist(randoms, bins=np.arange(0, 1, 0.05))

plt.show()

SAMPLING IN PYTHON
Random number seeds
First run:

np.random.seed(20000229)

np.random.normal(loc=2, scale=1.5, size=2)

array([-0.59030264, 1.87821258])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.52619561, 4.9684949 ])

Second run, same seed:

np.random.seed(20000229)

np.random.normal(loc=2, scale=1.5, size=2)

array([-0.59030264, 1.87821258])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.52619561, 4.9684949 ])

SAMPLING IN PYTHON
Using a different seed
Seed 20000229:

np.random.seed(20000229)

np.random.normal(loc=2, scale=1.5, size=2)

array([-0.59030264, 1.87821258])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.52619561, 4.9684949 ])

Seed 20041004:

np.random.seed(20041004)

np.random.normal(loc=2, scale=1.5, size=2)

array([1.09364337, 4.55285159])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.67038916, 2.36677492])
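
The seed behavior can be checked programmatically rather than by eye. In this sketch, `draw_pairs` is a hypothetical helper name; it wraps the same `np.random.seed` and `np.random.normal` calls used on the slides.

```python
import numpy as np


def draw_pairs(seed):
    """Seed NumPy's global generator, then draw two pairs of normal values."""
    np.random.seed(seed)
    first = np.random.normal(loc=2, scale=1.5, size=2)
    second = np.random.normal(loc=2, scale=1.5, size=2)
    return first, second


run_a = draw_pairs(20000229)
run_b = draw_pairs(20000229)
run_c = draw_pairs(20041004)

# Same seed: both pairs match. Different seed: they do not.
print(np.array_equal(run_a[0], run_b[0]), np.array_equal(run_a[1], run_b[1]))
print(np.array_equal(run_a[0], run_c[0]))
```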

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Simple random and systematic sampling
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Simple random sampling

SAMPLING IN PYTHON
Simple random sampling of coffees

SAMPLING IN PYTHON
Simple random sampling with pandas
coffee_ratings.sample(n=5, random_state=19000113)

total_cup_points variety country_of_origin aroma flavor \

437 83.25 None Colombia 7.92 7.75
285 83.83 Yellow Bourbon Brazil 7.92 7.50
784 82.08 None Colombia 7.50 7.42
648 82.58 Caturra Colombia 7.58 7.50
155 84.58 Caturra Colombia 7.42 7.67

aftertaste body balance

437 7.25 7.83 7.58
285 7.33 8.17 7.50
784 7.42 7.67 7.42
648 7.42 7.67 7.42
155 7.75 8.08 7.83

SAMPLING IN PYTHON
Systematic sampling

SAMPLING IN PYTHON
Systematic sampling - defining the interval
sample_size = 5
pop_size = len(coffee_ratings)
print(pop_size)

1338

interval = pop_size // sample_size

print(interval)

267

SAMPLING IN PYTHON
Systematic sampling - selecting the rows
coffee_ratings.iloc[::interval]

total_cup_points variety country_of_origin aroma flavor aftertaste \

0 90.58 None Ethiopia 8.67 8.83 8.67
267 83.92 None Colombia 7.83 7.75 7.58
534 82.92 Bourbon El Salvador 7.50 7.50 7.75
801 82.00 Typica Taiwan 7.33 7.50 7.17
1068 80.50 Other Taiwan 7.17 7.17 7.17

body balance
0 8.50 8.42
267 7.75 7.75
534 7.92 7.83
801 7.50 7.33
1068 7.17 7.25
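
One detail worth checking when reusing this pattern: if the population size is not an exact multiple of the interval, a step slice can return one row more than the requested sample size. The sketch below uses a hypothetical 1338-row `population` with only a `row_id` column; trimming with `.head()` is one possible way to handle the extra row, not something the slides prescribe.

```python
import pandas as pd

# Hypothetical population with the same number of rows as the coffee data.
population = pd.DataFrame({"row_id": range(1338)})

sample_size = 5
interval = len(population) // sample_size  # 1338 // 5 == 267

# Every 267th row: positions 0, 267, ..., 1335 all fit inside 1338 rows.
every_nth = population.iloc[::interval]
print(list(every_nth.index))  # [0, 267, 534, 801, 1068, 1335]

# Optionally keep only the first sample_size rows.
trimmed = every_nth.head(sample_size)
print(len(trimmed))  # 5
```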

SAMPLING IN PYTHON
The trouble with systematic sampling
coffee_ratings_with_id = coffee_ratings.reset_index()
coffee_ratings_with_id.plot(x="index", y="aftertaste", kind="scatter")
plt.show()

Systematic sampling is only safe if we don't see a pattern in this scatter plot

SAMPLING IN PYTHON
Making systematic sampling safe
shuffled = coffee_ratings.sample(frac=1)
shuffled = shuffled.reset_index(drop=True).reset_index()
shuffled.plot(x="index", y="aftertaste", kind="scatter")
plt.show()

Shuffling rows + systematic sampling is the same as simple random sampling

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Stratified and weighted random sampling
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Coffees by country
top_counts = coffee_ratings['country_of_origin'].value_counts()
top_counts.head(6)

country_of_origin
Mexico 236
Colombia 183
Guatemala 181
Brazil 132
Taiwan 75
United States (Hawaii) 73
dtype: int64

1 The dataset lists Hawaii and Taiwan as countries for convenience, as they are notable coffee-growing regions.

SAMPLING IN PYTHON
Filtering for 6 countries
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]

top_counted_subset = coffee_ratings['country_of_origin'].isin(top_counted_countries)

coffee_ratings_top = coffee_ratings[top_counted_subset]

SAMPLING IN PYTHON
Counts of a simple random sample
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)

coffee_ratings_samp['country_of_origin'].value_counts(normalize=True)

country_of_origin
Mexico 0.250000
Guatemala 0.204545
Colombia 0.181818
Brazil 0.181818
United States (Hawaii) 0.102273
Taiwan 0.079545
dtype: float64

SAMPLING IN PYTHON
Comparing proportions
country_of_origin       Population  10% simple random sample
Mexico                  0.268182    0.250000
Colombia                0.207955    0.181818
Guatemala               0.205682    0.204545
Brazil                  0.150000    0.181818
Taiwan                  0.085227    0.079545
United States (Hawaii)  0.082955    0.102273

SAMPLING IN PYTHON
Proportional stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=0.1, random_state=2021)

coffee_ratings_strat['country_of_origin'].value_counts(normalize=True)

Mexico 0.272727
Guatemala 0.204545
Colombia 0.204545
Brazil 0.147727
Taiwan 0.090909
United States (Hawaii) 0.079545
Name: country_of_origin, dtype: float64

SAMPLING IN PYTHON
Equal counts stratified sampling
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
.sample(n=15, random_state=2021)

coffee_ratings_eq['country_of_origin'].value_counts(normalize=True)

Taiwan 0.166667
Brazil 0.166667
United States (Hawaii) 0.166667
Guatemala 0.166667
Mexico 0.166667
Colombia 0.166667
Name: country_of_origin, dtype: float64
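
Both flavors of stratified sampling can be compared on a small made-up table. The group names and sizes below are assumptions chosen so the expected counts are easy to verify by hand; only `groupby(...).sample(...)` comes from the slides.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2021)

# Hypothetical population with deliberately unequal group sizes.
countries = np.repeat(["Mexico", "Colombia", "Taiwan"], [600, 300, 100])
population = pd.DataFrame({
    "country_of_origin": countries,
    "total_cup_points": rng.normal(82, 2.7, size=len(countries)),
})

# Proportional: 10% of each group, so group shares are preserved.
proportional = population.groupby("country_of_origin").sample(
    frac=0.1, random_state=2021
)

# Equal counts: 15 rows per group, regardless of group size.
equal_counts = population.groupby("country_of_origin").sample(
    n=15, random_state=2021
)

print(proportional["country_of_origin"].value_counts())
print(equal_counts["country_of_origin"].value_counts())
```

With these sizes the proportional sample holds 60, 30, and 10 rows, while the equal-counts sample holds 15 rows from each country.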

SAMPLING IN PYTHON
Weighted random sampling
Specify weights to adjust the relative probability of a row being sampled

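import numpy as np
# Copy so the new column is not written to a slice of coffee_ratings
coffee_ratings_weight = coffee_ratings_top.copy()
condition = coffee_ratings_weight['country_of_origin'] == "Taiwan"

# Taiwanese coffees get twice the relative sampling weight of the others
coffee_ratings_weight['weight'] = np.where(condition, 2, 1)

coffee_ratings_weight = coffee_ratings_weight.sample(frac=0.1, weights="weight")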

SAMPLING IN PYTHON
Weighted random sampling results
10% weighted sample:

coffee_ratings_weight['country_of_origin'].value_counts(normalize=True)

Brazil 0.261364
Mexico 0.204545
Guatemala 0.204545
Taiwan 0.170455
Colombia 0.090909
United States (Hawaii) 0.068182
Name: country_of_origin, dtype: float64
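
The effect of the weights is easier to see next to an unweighted sample of the same size. This is a sketch on synthetic data: the 9000/1000 split between the two countries is an assumption made so that Taiwan starts at a 10% share.

```python
import numpy as np
import pandas as pd

# Hypothetical population: Taiwan is 10% of the rows.
countries = np.repeat(["Mexico", "Taiwan"], [9000, 1000])
population = pd.DataFrame({"country_of_origin": countries})

# Taiwan rows are twice as likely to be picked as the others.
is_taiwan = population["country_of_origin"] == "Taiwan"
population["weight"] = np.where(is_taiwan, 2, 1)

weighted = population.sample(frac=0.1, weights="weight", random_state=2021)
unweighted = population.sample(frac=0.1, random_state=2021)

taiwan_share_weighted = (weighted["country_of_origin"] == "Taiwan").mean()
taiwan_share_unweighted = (unweighted["country_of_origin"] == "Taiwan").mean()

print(f"unweighted Taiwan share: {taiwan_share_unweighted:.3f}")
print(f"weighted Taiwan share:   {taiwan_share_weighted:.3f}")
```

The weighted share should land well above 10%, although not at exactly double, because the weights are relative to the other rows.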

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Cluster sampling
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Stratified sampling vs. cluster sampling
Stratified sampling
Split the population into subgroups

Use simple random sampling on every subgroup

Cluster sampling
Use simple random sampling to pick some subgroups

Use simple random sampling on only those subgroups

SAMPLING IN PYTHON
Varieties of coffee
varieties_pop = list(coffee_ratings['variety'].unique())

[None, 'Other', 'Bourbon', 'Catimor',

'Ethiopian Yirgacheffe','Caturra',
'SL14', 'Sumatra', 'SL34', 'Hawaiian Kona',
'Yellow Bourbon', 'SL28', 'Gesha', 'Catuai',
'Pacamara', 'Typica', 'Sumatra Lintong',
'Mundo Novo', 'Java', 'Peaberry', 'Pacas',
'Mandheling', 'Ruiru 11', 'Arusha',
'Ethiopian Heirlooms', 'Moka Peaberry',
'Sulawesi', 'Blue Mountain', 'Marigojipe',
'Pache Comun']

SAMPLING IN PYTHON
Stage 1: sampling for subgroups
import random
varieties_samp = random.sample(varieties_pop, k=3)

['Hawaiian Kona', 'Bourbon', 'SL28']

SAMPLING IN PYTHON
Stage 2: sampling each group
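variety_condition = coffee_ratings['variety'].isin(varieties_samp)
# Copy so the column assignment below does not write to a slice of coffee_ratings
coffee_ratings_cluster = coffee_ratings[variety_condition].copy()

# Drop categories that no longer appear (assumes 'variety' has a categorical dtype)
coffee_ratings_cluster['variety'] = coffee_ratings_cluster['variety'].cat.remove_unused_categories()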

coffee_ratings_cluster.groupby("variety")\
.sample(n=5, random_state=2021)

SAMPLING IN PYTHON
Stage 2 output
total_cup_points variety country_of_origin ...
variety
Bourbon 575 82.83 Bourbon Guatemala
560 82.83 Bourbon Guatemala
524 83.00 Bourbon Guatemala
1140 79.83 Bourbon Guatemala
318 83.67 Bourbon Brazil
Hawaiian Kona 1291 73.67 Hawaiian Kona United States (Hawaii)
1266 76.25 Hawaiian Kona United States (Hawaii)
488 83.08 Hawaiian Kona United States (Hawaii)
461 83.17 Hawaiian Kona United States (Hawaii)
117 84.83 Hawaiian Kona United States (Hawaii)
SL28 137 84.67 SL28 Kenya
452 83.17 SL28 Kenya
224 84.17 SL28 Kenya
66 85.50 SL28 Kenya
559 82.83 SL28 Kenya

SAMPLING IN PYTHON
Multistage sampling
Cluster sampling is a type of multistage sampling
Can have > 2 stages

E.g., countrywide surveys may sample states, counties, cities, and neighborhoods
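
The two stages can be put together in one short script. The sketch below runs on a synthetic table (six made-up variety labels, 600 rows); the column names mirror the coffee data, but the values and seeds are assumptions.

```python
import random

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2021)
varieties = ["Bourbon", "Caturra", "Typica", "SL28", "Gesha", "Catuai"]

# Hypothetical population: each row gets one of six variety labels.
population = pd.DataFrame({
    "variety": rng.choice(varieties, size=600),
    "total_cup_points": rng.normal(82, 2.7, size=600),
})

# Stage 1: simple random sampling to pick 3 of the subgroups.
random.seed(2021)
chosen = random.sample(varieties, k=3)

# Stage 2: simple random sampling inside only those subgroups.
in_chosen = population["variety"].isin(chosen)
cluster_sample = (
    population[in_chosen]
    .groupby("variety")
    .sample(n=5, random_state=2021)
)

print(chosen)
print(cluster_sample["variety"].value_counts())
```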

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Comparing sampling methods
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Review of sampling techniques - setup
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]

subset_condition = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[subset_condition]

coffee_ratings_top.shape

(880, 8)

SAMPLING IN PYTHON
Review of simple random sampling
coffee_ratings_srs = coffee_ratings_top.sample(frac=1/3, random_state=2021)

coffee_ratings_srs.shape

(293, 8)

SAMPLING IN PYTHON
Review of stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=1/3, random_state=2021)

coffee_ratings_strat.shape

(293, 8)

SAMPLING IN PYTHON
Review of cluster sampling
import random
top_countries_samp = random.sample(top_counted_countries, k=2)
top_condition = coffee_ratings_top['country_of_origin'].isin(top_countries_samp)
coffee_ratings_cluster = coffee_ratings_top[top_condition].copy()
coffee_ratings_cluster['country_of_origin'] = coffee_ratings_cluster['country_of_origin']\
.cat.remove_unused_categories()

coffee_ratings_clust = coffee_ratings_cluster.groupby("country_of_origin")\
.sample(n=len(coffee_ratings_top) // 6)

coffee_ratings_clust.shape

(292, 8)

SAMPLING IN PYTHON
Calculating mean cup points
Population:

coffee_ratings_top['total_cup_points'].mean()

81.94700000000002

Simple random sample:

coffee_ratings_srs['total_cup_points'].mean()

81.95982935153583

Stratified sample:

coffee_ratings_strat['total_cup_points'].mean()

81.92566552901025

Cluster sample:

coffee_ratings_clust['total_cup_points'].mean()

82.03246575342466

SAMPLING IN PYTHON
Mean cup points by country: simple random
Population:

coffee_ratings_top.groupby("country_of_origin")\
['total_cup_points'].mean()

Simple random sample:

coffee_ratings_srs.groupby("country_of_origin")\
['total_cup_points'].mean()

country_of_origin       Population  Simple random sample
Brazil                  82.405909   82.414878
Colombia                83.106557   82.925536
Guatemala               81.846575   82.045385
Mexico                  80.890085   81.100714
Taiwan                  82.001333   81.744333
United States (Hawaii)  81.820411   82.008000
Name: total_cup_points, dtype: float64

SAMPLING IN PYTHON
Mean cup points by country: stratified
Population:

coffee_ratings_top.groupby("country_of_origin")\
['total_cup_points'].mean()

Stratified sample:

coffee_ratings_strat.groupby("country_of_origin")\
['total_cup_points'].mean()

country_of_origin       Population  Stratified sample
Brazil                  82.405909   82.499773
Colombia                83.106557   83.288197
Guatemala               81.846575   81.727667
Mexico                  80.890085   80.994684
Taiwan                  82.001333   81.846800
United States (Hawaii)  81.820411   81.051667
Name: total_cup_points, dtype: float64

SAMPLING IN PYTHON
Mean cup points by country: cluster
Population:

coffee_ratings_top.groupby("country_of_origin")\
['total_cup_points'].mean()

country_of_origin
Brazil                  82.405909
Colombia                83.106557
Guatemala               81.846575
Mexico                  80.890085
Taiwan                  82.001333
United States (Hawaii)  81.820411
Name: total_cup_points, dtype: float64

Cluster sample:

coffee_ratings_clust.groupby("country_of_origin")\
['total_cup_points'].mean()

country_of_origin
Colombia  83.128904
Mexico    80.936027
Name: total_cup_points, dtype: float64

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Relative error of point estimates
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Sample size is number of rows
len(coffee_ratings.sample(n=300))

300

len(coffee_ratings.sample(frac=0.25))

334

SAMPLING IN PYTHON
Various sample sizes
coffee_ratings['total_cup_points'].mean()

82.15120328849028

coffee_ratings.sample(n=10)['total_cup_points'].mean()

83.027

coffee_ratings.sample(n=100)['total_cup_points'].mean()

82.4897

coffee_ratings.sample(n=1000)['total_cup_points'].mean()

82.1186

SAMPLING IN PYTHON
Relative errors
Population parameter:

population_mean = coffee_ratings['total_cup_points'].mean()

Point estimate:

sample_mean = coffee_ratings.sample(n=sample_size)['total_cup_points'].mean()

Relative error as a percentage:

rel_error_pct = 100 * abs(population_mean-sample_mean) / population_mean
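
As a worked example, the formula can be applied to two numbers already shown on the previous slide: the population mean 82.15120328849028 and the 10-row sample mean 83.027. The function name `relative_error_pct` is a hypothetical wrapper around the slide's expression.

```python
def relative_error_pct(population_value, sample_value):
    """Relative error of a point estimate, as a percentage of the parameter."""
    return 100 * abs(population_value - sample_value) / population_value


population_mean = 82.15120328849028  # from the "Various sample sizes" slide
sample_mean_n10 = 83.027             # mean of the n=10 sample on that slide

rel_error = relative_error_pct(population_mean, sample_mean_n10)
print(f"{rel_error:.3f}%")  # roughly 1.066%
```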

SAMPLING IN PYTHON
Relative error vs. sample size
import matplotlib.pyplot as plt
errors.plot(x="sample_size",
y="relative_error",
kind="line")
plt.show()

Properties:

Really noisy, particularly for small samples

Amplitude is initially steep, then flattens

Relative error decreases to zero (when the sample size = population size)
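
The slides plot an `errors` DataFrame without showing how it is built. One possible construction, on a synthetic population, is sketched below; the step of 100 between sample sizes and the use of `random_state=n` are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
# Hypothetical population standing in for coffee_ratings.
population = pd.DataFrame({"total_cup_points": rng.normal(82, 2.7, size=1338)})
population_mean = population["total_cup_points"].mean()

# Sample sizes from 10 up to the full population size.
sample_sizes = list(range(10, len(population), 100)) + [len(population)]

records = []
for n in sample_sizes:
    sample_mean = population.sample(n=n, random_state=n)["total_cup_points"].mean()
    rel_error = 100 * abs(population_mean - sample_mean) / population_mean
    records.append({"sample_size": n, "relative_error": rel_error})

errors = pd.DataFrame(records)
print(errors)
```

The last row uses every row of the population, so its relative error is zero up to floating-point rounding.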

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Creating a sampling distribution
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Same code, different answer
coffee_ratings.sample(n=30)['total_cup_points'].mean()

82.53066666666668

coffee_ratings.sample(n=30)['total_cup_points'].mean()

81.97566666666667

coffee_ratings.sample(n=30)['total_cup_points'].mean()

82.68

coffee_ratings.sample(n=30)['total_cup_points'].mean()

81.675

SAMPLING IN PYTHON
Same code, 1000 times
mean_cup_points_1000 = []
for i in range(1000):
    mean_cup_points_1000.append(
        coffee_ratings.sample(n=30)['total_cup_points'].mean()
    )
print(mean_cup_points_1000)

[82.11933333333333, 82.55300000000001, 82.07266666666668, 81.76966666666667,

...
82.74166666666666, 82.45033333333335, 81.77199999999999, 82.8163333333333]

SAMPLING IN PYTHON
Distribution of sample means for size 30
import matplotlib.pyplot as plt
plt.hist(mean_cup_points_1000, bins=30)
plt.show()

A sampling distribution is a distribution of replicates of point estimates.

SAMPLING IN PYTHON
Different sample sizes
[Figure: two histograms of sample means, one for sample size 6 and one for sample size 150.]
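
How the spread of the sampling distribution depends on sample size can be measured directly. This sketch draws from a synthetic population; `sample_means` is a hypothetical helper, and 1000 replicates is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical population of 1338 scores.
population = rng.normal(loc=82, scale=2.7, size=1338)


def sample_means(values, sample_size, n_replicates=1000):
    """Replicate the sample mean many times to approximate its distribution."""
    return np.array([
        rng.choice(values, size=sample_size, replace=False).mean()
        for _ in range(n_replicates)
    ])


means_6 = sample_means(population, sample_size=6)
means_150 = sample_means(population, sample_size=150)

# The larger sample size gives a much narrower distribution of means.
print(np.std(means_6, ddof=1), np.std(means_150, ddof=1))
```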

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Approximate sampling distributions
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
4 dice
dice = expand_grid(
    {'die1': [1, 2, 3, 4, 5, 6],
     'die2': [1, 2, 3, 4, 5, 6],
     'die3': [1, 2, 3, 4, 5, 6],
     'die4': [1, 2, 3, 4, 5, 6]
    }
)

      die1  die2  die3  die4
0        1     1     1     1
1        1     1     1     2
2        1     1     1     3
3        1     1     1     4
4        1     1     1     5
...    ...   ...   ...   ...
1291     6     6     6     2
1292     6     6     6     3
1293     6     6     6     4
1294     6     6     6     5
1295     6     6     6     6

[1296 rows x 4 columns]

SAMPLING IN PYTHON
Mean roll
dice['mean_roll'] = (dice['die1'] +
                     dice['die2'] +
                     dice['die3'] +
                     dice['die4']) / 4
print(dice)

      die1  die2  die3  die4  mean_roll
0        1     1     1     1       1.00
1        1     1     1     2       1.25
2        1     1     1     3       1.50
3        1     1     1     4       1.75
4        1     1     1     5       2.00
...    ...   ...   ...   ...        ...
1291     6     6     6     2       5.00
1292     6     6     6     3       5.25
1293     6     6     6     4       5.50
1294     6     6     6     5       5.75
1295     6     6     6     6       6.00

[1296 rows x 5 columns]

SAMPLING IN PYTHON
Exact sampling distribution
dice['mean_roll'] = dice['mean_roll'].astype('category')
dice['mean_roll'].value_counts(sort=False).plot(kind="bar")
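
`expand_grid` is not a pandas function, and the slides do not show its definition. One way such a helper could be written, offered here only as an assumption, is with `itertools.product`, which yields every combination of the input lists.

```python
from itertools import product

import pandas as pd


def expand_grid(data_dict):
    """Hypothetical helper: one row per combination of the input lists."""
    rows = product(*data_dict.values())
    return pd.DataFrame(rows, columns=list(data_dict.keys()))


faces = [1, 2, 3, 4, 5, 6]
dice = expand_grid({"die1": faces, "die2": faces, "die3": faces, "die4": faces})

# Mean of the four dice for every possible outcome.
dice["mean_roll"] = dice[["die1", "die2", "die3", "die4"]].mean(axis=1)

print(dice.shape)                # (1296, 5)
print(dice["mean_roll"].mean())  # 3.5
```

Because every one of the 6**4 outcomes appears exactly once, the counts of `mean_roll` give the exact sampling distribution.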

SAMPLING IN PYTHON
The number of outcomes increases fast
import pandas as pd
import matplotlib.pyplot as plt

n_dice = list(range(1, 101))
n_outcomes = []
for n in n_dice:
    n_outcomes.append(6**n)

outcomes = pd.DataFrame(
    {"n_dice": n_dice,
     "n_outcomes": n_outcomes})

outcomes.plot(x="n_dice",
              y="n_outcomes",
              kind="scatter")
plt.show()

SAMPLING IN PYTHON
Simulating the mean of four dice rolls
import numpy as np

np.random.choice(list(range(1, 7)), size=4, replace=True).mean()

SAMPLING IN PYTHON
Simulating the mean of four dice rolls
import numpy as np
sample_means_1000 = []
for i in range(1000):
    sample_means_1000.append(
        np.random.choice(list(range(1, 7)), size=4, replace=True).mean()
    )
print(sample_means_1000)

[3.25, 3.25, 1.75, 2.0, 2.0, 1.0, 1.0, 2.75, 2.75, 2.5, 3.0, 2.0, 2.75,
...
1.25, 2.0, 2.5, 2.5, 3.75, 1.5, 1.75, 2.25, 2.0, 1.5, 3.25, 3.0, 3.5]

SAMPLING IN PYTHON
Approximate sampling distribution
plt.hist(sample_means_1000, bins=20)

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Standard errors and the Central Limit Theorem
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Sampling distribution of mean cup points
[Figure: four histograms of the sampling distribution of mean cup points, for sample sizes 5, 20, 80, and 320.]

SAMPLING IN PYTHON
Consequences of the central limit theorem

Averages of independent samples have approximately normal distributions.

As the sample size increases,

The distribution of the averages gets closer to being normally distributed

The width of the sampling distribution gets narrower

SAMPLING IN PYTHON
Population & sampling distribution means
coffee_ratings['total_cup_points'].mean()

82.15120328849028

Use np.mean() on each approximate sampling distribution:

Sample size  Mean sample mean
5            82.18420719999999
20           82.1558634
80           82.14510154999999
320          82.154017925

SAMPLING IN PYTHON
Population & sampling distribution standard deviations
coffee_ratings['total_cup_points'].std(ddof=0)

2.685858187306438

Sample size  Std dev sample mean
5            1.1886358227738543
20           0.5940321141669805
80           0.2934024263916487
320          0.13095083089190876

Specify ddof=0 when calling .std() on populations

Specify ddof=1 when calling np.std() on samples or sampling distributions

SAMPLING IN PYTHON
Population standard deviation over square root sample size
Sample size  Std dev sample mean   Calculation                     Result
5            1.1886358227738543    2.685858187306438 / sqrt(5)     1.201
20           0.5940321141669805    2.685858187306438 / sqrt(20)    0.601
80           0.2934024263916487    2.685858187306438 / sqrt(80)    0.300
320          0.13095083089190876   2.685858187306438 / sqrt(320)   0.150

SAMPLING IN PYTHON
Standard error
Standard deviation of the sampling distribution
Important tool in understanding sampling variability
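
The relationship between the standard error and the population standard deviation divided by the square root of the sample size can be checked by simulation. Two assumptions are made in this sketch: the population is synthetic, and samples are drawn with replacement so that the draws are independent, which is the setting where the formula is expected to hold most closely.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical population of 1338 scores.
population = rng.normal(loc=82, scale=2.7, size=1338)
pop_sd = population.std(ddof=0)


def simulated_standard_error(sample_size, n_replicates=2000):
    """Std dev of replicated sample means (independent draws, an assumption)."""
    means = [
        rng.choice(population, size=sample_size, replace=True).mean()
        for _ in range(n_replicates)
    ]
    return np.std(means, ddof=1)


results = {}
for n in (5, 20, 80, 320):
    results[n] = (simulated_standard_error(n), pop_sd / np.sqrt(n))
    print(n, round(results[n][0], 3), round(results[n][1], 3))
```

Each printed pair should agree to within a few percent.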

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Introduction to bootstrapping
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
With or without
Sampling without replacement: Sampling with replacement ("resampling"):

SAMPLING IN PYTHON
Simple random sampling without replacement
Population: Sample:

SAMPLING IN PYTHON
Simple random sampling with replacement
Population: Resample:

SAMPLING IN PYTHON
Why sample with replacement?
coffee_ratings : a sample of a larger population of all coffees

Each coffee in our sample represents many different hypothetical population coffees

Sampling with replacement is a proxy

SAMPLING IN PYTHON
Coffee data preparation
coffee_focus = coffee_ratings[["variety", "country_of_origin", "flavor"]]
coffee_focus = coffee_focus.reset_index()

index variety country_of_origin flavor

0 0 None Ethiopia 8.83
1 1 Other Ethiopia 8.67
2 2 Bourbon Guatemala 8.50
3 3 None Ethiopia 8.58
4 4 Other Ethiopia 8.50
... ... ... ... ...
1333 1333 None Ecuador 7.58
1334 1334 None Ecuador 7.67
1335 1335 None United States 7.33
1336 1336 None India 6.83
1337 1337 None Vietnam 6.67

[1338 rows x 4 columns]

SAMPLING IN PYTHON
Resampling with .sample()
coffee_resamp = coffee_focus.sample(frac=1, replace=True)

index variety country_of_origin flavor

1140 1140 Bourbon Guatemala 7.25
57 57 Bourbon Guatemala 8.00
1152 1152 Bourbon Mexico 7.08
621 621 Caturra Thailand 7.50
44 44 SL28 Kenya 8.08
... ... ... ... ...
996 996 Typica Mexico 7.33
1090 1090 Bourbon Guatemala 7.33
918 918 Other Guatemala 7.42
249 249 Caturra Colombia 7.67
467 467 Caturra Colombia 7.50

[1338 rows x 4 columns]

SAMPLING IN PYTHON
Repeated coffees
coffee_resamp["index"].value_counts()

658     5
167     4
363     4
357     4
1047    4
       ..
771     1
770     1
766     1
764     1
0       1
Name: index, Length: 868, dtype: int64

SAMPLING IN PYTHON
Missing coffees
num_unique_coffees = len(coffee_resamp.drop_duplicates(subset="index"))

868

len(coffee_ratings) - num_unique_coffees

470
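
A share of missing rows like this (470 of 1338, about 35%) is typical for a resample of the same size as the data. The chance that a given row is never drawn in n draws with replacement is (1 - 1/n)**n, which is close to 1/e, or about 0.368, for large n. The sketch below checks this on row numbers alone; the seed is arbitrary.

```python
import numpy as np

n = 1338
rng = np.random.default_rng(seed=0)

# Draw n row numbers with replacement, as a full-size resample does.
resampled_ids = rng.integers(low=0, high=n, size=n)

num_unique = len(np.unique(resampled_ids))
num_missing = n - num_unique

expected_missing_fraction = (1 - 1 / n) ** n

print(f"missing in this resample: {num_missing / n:.3f}")
print(f"expected fraction:        {expected_missing_fraction:.3f}")
```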

SAMPLING IN PYTHON
Bootstrapping
The opposite of sampling from a population

Sampling: going from a population to a smaller sample

Bootstrapping: building up a theoretical population from the sample

Bootstrapping use case:

Develop understanding of sampling variability using a single sample

SAMPLING IN PYTHON
Bootstrapping process
1. Make a resample of the same size as the original sample
2. Calculate the statistic of interest for this bootstrap sample

3. Repeat steps 1 and 2 many times

The resulting statistics are bootstrap statistics, and they form a bootstrap distribution
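
The three steps can be wrapped in a small function. This is a minimal sketch: `bootstrap_statistics` is a hypothetical name, the `sample` DataFrame is synthetic, and deriving each resample's `random_state` from a base seed is only a convenience for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
# Hypothetical sample of 500 flavor scores.
sample = pd.DataFrame({"flavor": rng.normal(7.5, 0.35, size=500)})


def bootstrap_statistics(df, column, statistic=np.mean, n_resamples=1000, seed=0):
    """Return the bootstrap distribution of a statistic for one column."""
    stats = []
    for i in range(n_resamples):  # step 3: repeat many times
        # Step 1: resample with replacement, same size as the original.
        resample = df.sample(frac=1, replace=True, random_state=seed + i)
        # Step 2: calculate the statistic of interest on the resample.
        stats.append(statistic(resample[column]))
    return np.array(stats)


bootstrap_distn = bootstrap_statistics(sample, "flavor")
print(len(bootstrap_distn), bootstrap_distn.mean())
```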

SAMPLING IN PYTHON
Bootstrapping coffee mean flavor
import numpy as np
mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )

SAMPLING IN PYTHON
Bootstrap distribution histogram
import matplotlib.pyplot as plt
plt.hist(mean_flavors_1000)
plt.show()

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Comparing sampling and bootstrap distributions
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Coffee focused subset
coffee_sample = coffee_ratings[["variety", "country_of_origin", "flavor"]]\
.reset_index().sample(n=500)

index variety country_of_origin flavor

132 132 Other Costa Rica 7.58
51 51 None United States (Hawaii) 8.17
42 42 Yellow Bourbon Brazil 7.92
569 569 Bourbon Guatemala 7.67
.. ... ... ... ...
643 643 Catuai Costa Rica 7.42
356 356 Caturra Colombia 7.58
494 494 None Indonesia 7.58
169 169 None Brazil 7.81

[500 rows x 4 columns]

SAMPLING IN PYTHON
The bootstrap of mean coffee flavors
import numpy as np
mean_flavors_5000 = []
for i in range(5000):
    mean_flavors_5000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )
bootstrap_distn = mean_flavors_5000

SAMPLING IN PYTHON
Mean flavor bootstrap distribution
import matplotlib.pyplot as plt
plt.hist(bootstrap_distn, bins=15)
plt.show()

SAMPLING IN PYTHON
Sample, bootstrap distribution, population means
Sample mean:

coffee_sample['flavor'].mean()

7.5132200000000005

Estimated population mean:

np.mean(bootstrap_distn)

7.513357731999999

True population mean:

coffee_ratings['flavor'].mean()

7.526046337817639

SAMPLING IN PYTHON
Interpreting the means
Bootstrap distribution mean:

Usually close to the sample mean

May not be a good estimate of the population mean

Bootstrapping cannot correct biases from sampling

SAMPLING IN PYTHON
Sample sd vs. bootstrap distribution sd
Sample standard deviation:

coffee_sample['flavor'].std()

0.3540883911928703

Estimated population standard deviation?

np.std(bootstrap_distn, ddof=1)

0.015768474367958217

SAMPLING IN PYTHON
Sample, bootstrap dist'n, pop'n standard deviations
Sample standard deviation:

coffee_sample['flavor'].std()

0.3540883911928703

Estimated population standard deviation:

standard_error = np.std(bootstrap_distn, ddof=1)

Standard error is the standard deviation of the statistic of interest

standard_error * np.sqrt(500)

0.3525938058821761

Standard error times square root of sample size estimates the population standard deviation

True standard deviation:

coffee_ratings['flavor'].std(ddof=0)

0.34125481224622645

SAMPLING IN PYTHON
Interpreting the standard errors
Estimated standard error → standard deviation of the bootstrap distribution for a sample statistic

Population std. dev ≈ Std. Error × √Sample size
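
The approximation can be tried on synthetic data, where the true spread is known. In this sketch the sample of 500 values and the 2000 resamples are assumptions, and the resamples are drawn in one vectorized call instead of a loop.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical sample of 500 flavor scores.
sample = rng.normal(loc=7.5, scale=0.35, size=500)

# 2000 resamples with replacement, one per row, then the mean of each row.
resamples = rng.choice(sample, size=(2000, len(sample)), replace=True)
boot_means = resamples.mean(axis=1)

standard_error = np.std(boot_means, ddof=1)
estimated_sd = standard_error * np.sqrt(len(sample))

print(f"standard error:          {standard_error:.4f}")
print(f"std. error * sqrt(500):  {estimated_sd:.4f}")
print(f"sample std. dev:         {np.std(sample, ddof=1):.4f}")
```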

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Confidence intervals
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Confidence intervals
"Values within one standard deviation of the mean" includes a large number of values from
each of these distributions

We'll define a related concept called a confidence interval

SAMPLING IN PYTHON
Predicting the weather
Rapid City, South Dakota in the United States has the least predictable weather

Our job is to predict the high temperature there tomorrow

SAMPLING IN PYTHON
Our weather prediction
Point estimate = 47°F (8.3°C)
Range of plausible high temperature values = 40 to 54°F (4.4 to 12.8°C)

SAMPLING IN PYTHON
We just reported a confidence interval!
40 to 54°F is a confidence interval
Sometimes written as 47 °F (40°F, 54°F) or 47°F [40°F, 54°F]

... or, 47 ± 7°F

7°F is the margin of error

SAMPLING IN PYTHON
Bootstrap distribution of mean flavor
import matplotlib.pyplot as plt
plt.hist(coffee_boot_distn, bins=15)
plt.show()

SAMPLING IN PYTHON
Mean of the resamples
import numpy as np
np.mean(coffee_boot_distn)

7.513452892

SAMPLING IN PYTHON
Mean plus or minus one standard deviation
np.mean(coffee_boot_distn)

7.513452892

np.mean(coffee_boot_distn) - np.std(coffee_boot_distn, ddof=1)

7.497385709174466

np.mean(coffee_boot_distn) + np.std(coffee_boot_distn, ddof=1)

7.529520074825534

SAMPLING IN PYTHON
Quantile method for confidence intervals
np.quantile(coffee_boot_distn, 0.025)

7.4817195

np.quantile(coffee_boot_distn, 0.975)

7.5448805

SAMPLING IN PYTHON
Inverse cumulative distribution function
PDF: The bell curve

CDF: integrate to get area under bell curve

Inv. CDF: flip x and y axes

Implemented in Python with

from scipy.stats import norm

norm.ppf(quantile, loc=0, scale=1)

SAMPLING IN PYTHON
Standard error method for confidence interval
point_estimate = np.mean(coffee_boot_distn)

7.513452892

std_error = np.std(coffee_boot_distn, ddof=1)

0.016067182825533724

from scipy.stats import norm

lower = norm.ppf(0.025, loc=point_estimate, scale=std_error)
upper = norm.ppf(0.975, loc=point_estimate, scale=std_error)
print((lower, upper))

(7.481961792328933, 7.544943991671067)
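
The quantile method and the standard error method can be run side by side on the same bootstrap distribution. This sketch uses a synthetic sample and builds a hypothetical `boot_distn` in one vectorized call; the two intervals should nearly coincide when the bootstrap distribution is close to normal.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=0)
# Hypothetical sample of 500 flavor scores.
sample = rng.normal(loc=7.5, scale=0.35, size=500)

# Bootstrap distribution of the mean: 5000 resamples with replacement.
boot_distn = rng.choice(sample, size=(5000, len(sample)), replace=True).mean(axis=1)

# Quantile method: middle 95% of the bootstrap distribution.
quantile_ci = (np.quantile(boot_distn, 0.025), np.quantile(boot_distn, 0.975))

# Standard error method: inverse CDF of a normal distribution.
point_estimate = np.mean(boot_distn)
std_error = np.std(boot_distn, ddof=1)
std_error_ci = (
    norm.ppf(0.025, loc=point_estimate, scale=std_error),
    norm.ppf(0.975, loc=point_estimate, scale=std_error),
)

print(quantile_ci)
print(std_error_ci)
```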

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Congratulations!
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Recap
Chapter 1

Sampling basics
Selection bias
Pseudo-random numbers

Chapter 2

Simple random sampling
Systematic sampling
Stratified sampling
Cluster sampling

Chapter 3

Sample size and population parameters
Creating sampling distributions
Approximate vs. actual sampling dist'ns
Central limit theorem

Chapter 4

Bootstrapping from a single sample
Standard error
Confidence intervals

The most important things

The standard deviation of a bootstrap distribution is a good approximation of the standard error of the statistic

Bootstrap distributions can usually be assumed to be approximately normal when calculating confidence intervals
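
The first point can be checked numerically. The sketch below uses a synthetic sample (an assumption, not the coffee data) and compares the bootstrap estimate of the standard error of the mean with the textbook formula, sample standard deviation divided by the square root of the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample standing in for real data
sample = rng.normal(loc=7.5, scale=0.35, size=500)

# Bootstrap distribution of the sample mean
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(2000)
])

# Standard error estimated from the bootstrap distribution
se_bootstrap = np.std(boot_means, ddof=1)

# Textbook formula for the standard error of a mean, for comparison
se_formula = np.std(sample, ddof=1) / np.sqrt(len(sample))

print(se_bootstrap, se_formula)
```

The two values should be close. The bootstrap version has the advantage of working for statistics that have no simple standard error formula.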

What's next?
Experimental Design in Python and Customer Analytics and A/B Testing in Python
Hypothesis Testing in Python

Foundations of Probability in Python and Bayesian Data Analysis in Python

Happy learning!