cs447 - Tool Using Simulation To Understand Uncertainty
Using Simulation to Understand Uncertainty
Collecting multiple samples to evaluate the uncertainty around your results can be expensive
and time consuming. In data science, simulation offers a cost-effective method to understand
uncertainty. Essentially, simulation uses computers to mimic the process of drawing many different
samples from a population. Use this tool to help you run a simulation and understand the
uncertainty associated with your results.
Running a Simulation
When you run a simulation, you start by asking the question: What would I see in my data if there
were nothing interesting going on in the population? By starting with this question and using the
simulation to examine the variation inherent in your data, you estimate the chance that you would
have seen an interesting result in your sample just due to randomness. If that chance is small,
that is, if you would rarely see a result like yours when nothing unexpected is happening, you can
be more confident that what you see in your sample is a real signal and not just due to chance.
However, if a sample statistic as large as yours is often observed due to randomness alone, you
should not be very certain that the conclusions from your sample generalize to the larger population.
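The chance described above can be estimated directly from simulated samples. The chunk below is a minimal sketch, assuming a yes/no outcome with a baseline rate of 0.5, a sample of 10, and a hypothetical observed result of 8 successes; none of these values come from real data, and the names observed_successes, sim_counts, and chance are made up for illustration.

```r
set.seed(1)               # arbitrary seed, only so the run can be repeated

n = 10                    # sample size (assumed)
p_null = 0.5              # baseline rate if nothing interesting is going on (assumed)
observed_successes = 8    # hypothetical observed result: 8 successes out of 10
nsim = 10000              # number of simulated samples

# Simulate nsim samples under the baseline and count the successes in each one
sim_counts = replicate(nsim, sum(sample(c(1, 0), size = n, replace = TRUE,
                                        prob = c(p_null, 1 - p_null))))

# Share of simulated samples with at least as many successes as observed
chance = mean(sim_counts >= observed_successes)
chance
```

A small value of chance means a result like the hypothetical one rarely arises from randomness alone; a large value means it is unremarkable under the baseline.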
Step 1: Use the sample() function to simulate a random sample of the same size as your data set.
Set the sample function so that it matches what you would expect based on current knowledge
(nothing interesting is going on in the population). Calculate the sample statistic on this simulated
sample.
Step 2: Use a for loop to repeat Step 1 a large number of times. Calculate the sample statistic
of each simulated sample, and store these sample statistics in a vector.
Step 3: Use the vector of sample statistics from Step 2 to draw a histogram of the null distribution
of the sample statistic. Use that histogram to see how the sample statistic varies from sample to
sample even under the baseline assumption that nothing interesting was going on. Calculate the
mean and standard deviation of this histogram.
Step 1: In the code chunk below, we have set the probability that the drug works to 0.5, created a
vector result that stores the outcome of ten simulated patients, and calculated the sample statistic
— success rate of the drug — in the variable p_sim.
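A chunk matching this description might look like the sketch below. The coding of outcomes as 1 (drug worked) and 0 (drug did not work), the use of sample() with a prob argument, and the seed are assumptions, so the value it prints depends on the seed and need not equal any particular success rate.

```r
set.seed(447)   # arbitrary seed (an assumption); other seeds give other results

p = 0.5         # probability that the drug works under the baseline
n = 10          # number of simulated patients

# Simulate n patients: 1 = drug worked, 0 = drug did not work (assumed coding)
result = sample(c(1, 0), size = n, replace = TRUE, prob = c(p, 1 - p))

# Sample statistic: success rate of the drug in this simulated sample
p_sim = mean(result)
p_sim
```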
You can run this code chunk to see that the simulated success rate is 60%. The difference from
50% is just due to randomness, since we explicitly set the population success rate at 0.5.
## [1] 0.6
Step 2: Use a for loop to simulate a large number (nsim = 100000) of random samples (each with
size 10) from the population assuming the drug works 50% of the time. Calculate the sample statistic
of each random sample, and store these sample statistics in a vector store_p.
# Set up:
nsim = 100000 # Number of iterations
store_p = rep(0, nsim) # Vector in which to store sample statistics
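One way the loop described in Step 2 could be written is sketched below. It repeats the set-up so that it runs on its own; the 1/0 coding of outcomes and the seed are assumptions.

```r
set.seed(447)            # arbitrary seed (an assumption), for reproducibility

nsim = 100000            # Number of iterations
store_p = rep(0, nsim)   # Vector in which to store sample statistics

for (i in 1:nsim) {
  # Simulate 10 patients, each with a 50% chance that the drug works (1) or not (0)
  result = sample(c(1, 0), size = 10, replace = TRUE, prob = c(0.5, 0.5))
  # Store the success rate of this simulated sample
  store_p[i] = mean(result)
}

head(store_p)            # First few simulated success rates
```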
Step 3: Use a histogram to see how the sample success rate of the drug varied from sample to
sample, even when you explicitly set its true success rate in the population at 50%. When you draw a
null distribution with your data, make sure to check that it is centered around the value specified by
your null hypothesis.
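The histogram and the centering check could be sketched as follows. To keep the chunk self-contained it regenerates the simulated success rates with replicate() and a smaller nsim than in Step 2; the bin edges, seed, and plot title are illustrative choices, not part of the original tool.

```r
set.seed(447)   # arbitrary seed (an assumption)

nsim = 10000    # fewer iterations than in Step 2, to keep this sketch quick
store_p = replicate(nsim, mean(sample(c(1, 0), size = 10, replace = TRUE,
                                      prob = c(0.5, 0.5))))

# Density histogram with one bin per possible success rate (0, 0.1, ..., 1)
hist(store_p, freq = FALSE, breaks = seq(-0.05, 1.05, by = 0.1),
     xlab = "Success Rate", ylab = "Density",
     main = "Null distribution of the success rate")

# Dashed vertical line at the value specified by the null hypothesis
abline(v = 0.5, lty = 2)

# The mean of the simulated success rates should be close to 0.5
mean(store_p)
```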
[Figure: histogram of the null distribution. x-axis: Success Rate; y-axis: Density]
The mean of this histogram tells you what you should expect to see if there is nothing interesting
going on in the population. As expected, since we set a 50% population success rate in our
simulation, we see that sample success rates across many random samples are concentrated
around 50%.
The standard deviation of this histogram gives you a sense of the variability you should expect
to see in the sample statistic. This tells you how much the sample statistic (success rate) would
vary from sample to sample on average if the drug only had a 50% chance of working on each
patient. When this standard deviation is small, a large value of the sample statistic would provide
strong evidence that what you find in your sample is a real signal and not just due to randomness.
However, if this standard deviation is large, even if you see a large sample statistic, take it with a
grain of salt before generalizing your finding to the population.
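To get a feel for what a small or large standard deviation looks like, the sketch below computes the standard deviation of the simulated success rates for the sample of 10 patients and, purely as an assumed comparison, for a hypothetical sample of 100 patients. The helper sim_rates and the second sample size are illustrative and not part of the original tool.

```r
set.seed(447)   # arbitrary seed (an assumption)

nsim = 10000    # number of simulated samples per sample size

# Hypothetical helper: simulate nsim success rates for samples of size n,
# with a 50% chance that the drug works on each patient
sim_rates = function(n) {
  replicate(nsim, mean(sample(c(1, 0), size = n, replace = TRUE,
                              prob = c(0.5, 0.5))))
}

sd_10 = sd(sim_rates(10))     # variability with 10 patients per sample
sd_100 = sd(sim_rates(100))   # variability with 100 patients per sample (assumed)

c(n10 = sd_10, n100 = sd_100)
```

For a 50% success rate, the theoretical standard deviation of the sample success rate is sqrt(0.5 * 0.5 / n), about 0.158 for n = 10 and 0.05 for n = 100, so the simulated values should land close to these. With only 10 patients, a sample success rate of 60% or 70% is well within ordinary sample-to-sample variation.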