Data Sci HW1

Data science homework

De Zheng Zhao

import numpy as np

Question 1
# Estimate the bias, variance, and RMSE for the uniform estimator
theta = 20
n = 200

# Generate n samples from Uniform(0, theta)
def generate_samples(n):
    y_s = []
    for i in range(n):
        y_s.append(np.random.uniform(0, 20))
    return y_s

# Run many trials of this generation to obtain many theta hats
results = []

for i in range(10000):
    results.append(np.max(generate_samples(n)))

# Set the mean of these 10000 trials to be the expected value of theta hat
exp_theta_hat = np.mean(results)

# Calculate bias
bias = exp_theta_hat - theta

# Calculate variance
variance = np.var(results)

# Calculate RMSE
rmse = np.sqrt(bias ** 2 + variance)

print("Bias: " + str(bias) + ' , Variance: ' + str(variance) + ' ,


RMSE: ' + str(rmse))

Bias: -0.09908465984527837 , Variance: 0.009774149390922771 , RMSE: 0.13997113705181252

The estimated bias, variance, and RMSE all decrease significantly when n is increased from 200 to 1000. This follows from the behavior of the sample maximum as an estimator for Uniform(0, theta): its expected value is n * theta / (n + 1), so both the bias and the variance shrink as the sample size grows, giving a more accurate estimate of the true theta.
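
To make this concrete, here is a minimal sketch that reruns the experiment at both sample sizes, reusing theta and generate_samples from above:

# Sketch: repeat the bias/variance/RMSE estimate at n = 200 and n = 1000
for n_val in (200, 1000):
    maxima = [np.max(generate_samples(n_val)) for _ in range(10000)]
    bias_n = np.mean(maxima) - theta
    var_n = np.var(maxima)
    rmse_n = np.sqrt(bias_n ** 2 + var_n)
    # Analytic bias is -theta / (n + 1): about -0.0995 at n = 200
    # and -0.0200 at n = 1000, so all three quantities shrink with n.
    print(n_val, bias_n, var_n, rmse_n)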

Question 2
# Perform the bootstrap
orig_data = [3.0, 1.9, 6.4, 5.9, 4.2, 6.2, 1.4, 2.9, 2.3, 4.8, 7.8,
4.5, 0.7, 4.4, 4.4, 6.5, 7.6, 6.1, 2.7, 1.6]
x_boot_list = []

# Define the test statistic
def t_stat(x):
    return np.median(x)

for i in range(1000):
    new_sample = np.random.choice(orig_data, len(orig_data), replace=True)
    x_boot = t_stat(new_sample)
    x_boot_list.append(x_boot)

std_err = np.std(x_boot_list) / np.sqrt(len(x_boot_list))

# Find the 95% confidence interval
def conf_int(data_list):
    p_hat = np.median(data_list)
    a = p_hat - 1.96 * std_err
    b = p_hat + 1.96 * std_err
    return a, b

print('Standard error: ' + str(std_err))
print(conf_int(x_boot_list))

Standard error: 0.02405101406386018
(4.352860012434834, 4.4471399875651665)
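
One caveat: dividing np.std(x_boot_list) by np.sqrt(len(x_boot_list)) measures the Monte Carlo error of the mean of the bootstrap replicates; the conventional bootstrap standard error is simply the standard deviation of the replicates themselves. A percentile interval is a common alternative that avoids the normal approximation; a minimal sketch reusing x_boot_list:

# Sketch: conventional bootstrap SE and a 95% percentile interval
boot_se = np.std(x_boot_list)  # spread of the bootstrap medians
ci_low, ci_high = np.percentile(x_boot_list, [2.5, 97.5])
print('Conventional bootstrap SE:', boot_se)
print('95% percentile interval:', (ci_low, ci_high))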

# Generate 100 data points from a normal distribution
y = np.random.normal(0, 5, 100)

t_1_boots = []
t_2_boots = []

# Find standard error for sample median using bootstrap
def t_stat1(x):
    return np.median(x)

for i in range(1000):
    sample = np.random.choice(y, len(y), replace=True)
    x_boot = t_stat1(sample)
    t_1_boots.append(x_boot)

std_err1 = np.std(t_1_boots) / np.sqrt(len(t_1_boots))

# Find standard error for sample maximum using bootstrap
def t_stat2(x):
    return np.max(x)

for i in range(1000):
    sample1 = np.random.choice(y, len(y), replace=True)
    x_boot1 = t_stat2(sample1)
    t_2_boots.append(x_boot1)

std_err2 = np.std(t_2_boots) / np.sqrt(len(t_2_boots))

# Compute actual standard error for sample median through simulations
median_list = []

for i in range(10000):
    median_list.append(np.median(y))

median_stderr = np.std(median_list) / np.sqrt(len(median_list))

# Compute actual standard error for sample maximum through simulations
max_list = []

for i in range(10000):
    max_list.append(np.max(y))

max_stderr = np.std(max_list) / np.sqrt(len(max_list))

print('Actual Standard Error for Sample Median: ' + str(median_stderr)
      + ', Bootstrap Estimate: ' + str(std_err1))
print('Actual Standard Error for Sample Maximum: ' + str(max_stderr)
      + ', Bootstrap Estimate: ' + str(std_err2))

Actual Standard Error for Sample Median: 1.1102230246251566e-18, Bootstrap Estimate: 0.022243242162067033
Actual Standard Error for Sample Maximum: 1.7763568394002505e-17, Bootstrap Estimate: 0.040556826403483785

As the results show, the "actual" standard error for both the sample median and the sample maximum is effectively 0 (the printed values are floating-point noise). This is because each simulation iteration recomputes the statistic on the same fixed sample y, so there is no variability across the 10000 repetitions; measuring the true standard error would require drawing a fresh sample from N(0, 5) in every iteration. The bootstrap estimates of roughly 0.02 and 0.04, by contrast, do capture sampling variability, because each bootstrap replicate resamples y with replacement.
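
A minimal sketch of that fresh-sample variant, using the same distribution parameters and sample size as above:

# Sketch: true standard errors via fresh samples from N(0, 5) each trial
fresh_medians = []
fresh_maxima = []
for i in range(10000):
    fresh = np.random.normal(0, 5, 100)
    fresh_medians.append(np.median(fresh))
    fresh_maxima.append(np.max(fresh))
print('True SE of sample median:', np.std(fresh_medians))
print('True SE of sample maximum:', np.std(fresh_maxima))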

Question 3
# Use observed data to compute the estimated mean for this distribution
theta_hat = np.mean(orig_data)
theta_hat

4.265

# Generate many samples from N(theta, 2) with 20 data points per sample
sample_results = []
for i in range(10000):
    sample_norm = np.random.normal(theta_hat, 2, 20)
    sample_results.append(sample_norm)

# Estimate the value of theta in each sample
thetas_list = []
for i in sample_results:
    thetas_list.append(np.mean(i))

# Compute the standard deviation among the samples
print(np.std(thetas_list))

0.4511328551853174
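
As a sanity check, the mean of 20 draws from a normal distribution with standard deviation 2 has analytic standard error sigma / sqrt(n) = 2 / sqrt(20) ≈ 0.447, which agrees closely with the simulated value above:

# Analytic standard error of the mean for sigma = 2, n = 20
print(2 / np.sqrt(20))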

The following is a link to the original Google Colab notebook I used for this assignment: https://colab.research.google.com/drive/193oUh1wg7hmCsUNRP0dyQGbaVSu-hNfO?usp=sharing. The formatting here is a bit unusual because I used an online converter to produce the PDF.
