cs447 - Tool Using Simulation To Understand Uncertainty

The document discusses using simulation as a cost-effective method to understand uncertainty in data science. It provides a step-by-step guide on running a simulation in R, including generating random samples, calculating sample statistics, and visualizing the results through histograms. The document emphasizes interpreting the null distribution and the significance of mean and standard deviation in assessing the reliability of sample statistics.

TOOL

Using Simulation to
Understand Uncertainty
Collecting multiple samples to evaluate the uncertainty around your results can be expensive
and time-consuming. In data science, simulation offers a cost-effective method to understand
uncertainty. Essentially, simulation uses computers to mimic the process of drawing many different
samples from a population. Use this tool to help you run a simulation and understand the
uncertainty associated with your results.

Running a Simulation
When you run a simulation, you start by asking the question: What would I see in my data if
nothing interesting were going on in the population? By starting with this question and using the
simulation to examine the variation inherent in your data, you can estimate the chance that you
would have seen a result as interesting as yours just due to randomness. If that chance is small,
you can be more confident that what you see in your sample is a real signal and not just due to
chance. However, if a sample statistic as large as yours is often observed due to randomness alone,
you should not be very certain that the conclusions from your sample generalize to the larger
population.

Here is a step-by-step guide to performing a simulation in R:

Step 1: Use the sample() function to simulate a random sample of the same size as your data set.
Set the sample function so that it matches what you would expect based on current knowledge
(nothing interesting is going on in the population). Calculate the sample statistic on this simulated
sample.

Measuring Relationships and Uncertainty

© 2021 Cornell University 1
Cornell Bowers College of Computing and Information Science
Step 2: Use a for loop to repeat your simulation from Step 1 over and over. This will create many
new sampled data sets under the assumption that nothing unexpected is going on in the
population. Calculate the sample statistic with each iteration, and keep track of them in a vector.

Step 3: Use the vector of sample statistics from Step 2 to draw a histogram of the null distribution
of the sample statistic. Use that histogram to see how the sample statistic varies from sample to
sample even under the baseline assumption that nothing interesting was going on. Calculate the
mean and standard deviation of this histogram.
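
The three steps can be sketched as one reusable function. This is a minimal sketch, not part of the original tool: the function name simulate_null and its arguments (outcomes, n, prob, nsim) are hypothetical, the 20 observations and 0.3 baseline rate are made up for illustration, and it assumes the sample statistic is the proportion of draws equal to the first outcome.

```r
# Hypothetical helper (not from the original tool): simulate the null
# distribution of a sample proportion under a baseline assumption.
simulate_null = function(outcomes, n, prob, nsim = 10000) {
  stats = numeric(nsim)                  # Vector to store sample statistics
  for (i in seq_len(nsim)) {             # Step 2: repeat Step 1 nsim times
    # Step 1: one simulated sample of size n under the baseline assumption
    draw = sample(outcomes, n, replace = TRUE, prob = prob)
    # Sample statistic: proportion of draws equal to the first outcome
    stats[i] = mean(draw == outcomes[1])
  }
  stats
}

set.seed(1)
# Assumed illustration: 20 observations, baseline rate of 0.3 for "Yes"
null_stats = simulate_null(c("Yes", "No"), n = 20, prob = c(0.3, 0.7))

# Step 3: histogram, mean, and standard deviation of the null distribution
hist(null_stats, main = "Null Distribution", xlab = "Sample Proportion")
mean(null_stats)
sd(null_stats)
```

To use a different statistic, the line that computes stats[i] is the only part that would need to change.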

Using R With This Tool

The portions of this tool with a grey background are code text you can use to do the examples
included in this tool. You can also modify them to use with your own data. In these examples:
• Commands are the lines of code that don’t begin with a pound sign (#). Type these lines into R
to carry out the command.
• Commented text begins with one pound sign and explains what the code does.
• Code output begins with two pound signs.

Running a Simulation in R
Now we will follow these steps in an example. Suppose it is well-known that 50% of all cases of
the common cold are cured within a week without medicine. A new drug to treat the common cold was
tested on ten patients, and seven patients reported that their cold was gone within a week, so you
see a higher success rate (70%) than the baseline of 50%. How certain should you be that this is a
real signal, i.e., the drug helps cure the common cold, and that this result isn’t due to random
chance? If eight patients reported recovery within a week, you would probably be more certain, but
how much more? What if nine out of ten patients reported recovery?

Step 1: In the code chunk below, we have set the probability that the drug works to 0.5, created a
vector result that stores the outcome of ten simulated patients, and calculated the sample statistic
— success rate of the drug — in the variable p_sim.

You can run this code chunk to see that the simulated success rate is 60%, not 50%. This difference
is just due to randomness, since we explicitly set the population success rate at 0.5.

set.seed(1) # Set the seed for reproducibility

outcome = c("Worked", "Did not Work") # Vector of possible outcomes

result = sample(outcome, 10,         # Pull 10 samples from the outcome vector
                replace = TRUE,
                prob = c(0.5, 0.5))  # The probability of picking each outcome is 0.5

result # View the resulting vector:
## [1] "Did not Work" "Did not Work" "Worked" "Worked" "Did not Work"
## [6] "Worked" "Worked" "Worked" "Worked" "Did not Work"

p_sim = mean(result == "Worked") # calculate the proportion that "Worked"

p_sim # View the proportion:

## [1] 0.6

Step 2: Use a for loop to simulate a large number (nsim = 100000) of random samples (each with
size 10) from the population assuming the drug works 50% of the time. Calculate the sample statistic
of each random sample, and store these sample statistics in a vector store_p.

set.seed(1) # Set the seed for reproducibility

# Set up:
nsim = 100000 # Number of iterations
store_p = rep(0, nsim) # Vector in which to store sample statistics

# For loop to repeat Step 1 multiple times:

for (i in 1:nsim) {
  outcome = c("Worked", "Did not Work")
  result = sample(outcome, 10, replace = TRUE, prob = c(0.5, 0.5))
  p_sim = mean(result == "Worked")
  store_p[i] = p_sim
}
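
As an aside, the same loop can be written more compactly with replicate(), which evaluates an expression repeatedly and collects the results. This is an alternative sketch rather than the tool's own code; the name store_p_alt is introduced here only for illustration.

```r
set.seed(1)                       # Set the seed for reproducibility
nsim = 100000                     # Number of iterations
outcome = c("Worked", "Did not Work")

# replicate() evaluates the expression nsim times and returns the results
# as a vector, so no storage vector or loop index is needed
store_p_alt = replicate(nsim, {
  result = sample(outcome, 10, replace = TRUE, prob = c(0.5, 0.5))
  mean(result == "Worked")
})

length(store_p_alt)               # One sample statistic per simulated sample
mean(store_p_alt)                 # Should be close to 0.5
```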

Step 3: Use a histogram to see how the sample success rate of the drug varied from sample to
sample, even when you explicitly set its true success rate in the population at 50%. When you draw a
null distribution with your data, make sure to check that it is centered around the value specified by
your null hypothesis.

hist(store_p, breaks = seq(0, 1, 0.1), freq = FALSE, col = "black",
     main = "Success Rates in Simulated Samples (True Success Rate = 50%)",
     xlab = "Success Rate")

[Figure: histogram titled "Success Rates in Simulated Samples (True Success Rate = 50%)". x-axis: Success Rate, 0.0 to 1.0; y-axis: Density, 0.0 to 2.5.]
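
Because each simulated patient is an independent 50/50 draw, the number of successes in a sample of ten follows a Binomial(10, 0.5) distribution, so the shape of the simulated null distribution can be checked against exact probabilities from dbinom(). The sketch below is an addition to the tool: it uses rbinom() as a shortcut in place of the sample() loop, and the names store_k, sim_freq, and exact are illustrative.

```r
set.seed(1)
nsim = 100000

# Shortcut (assumption: same null model as the sample() loop): draw the
# number of successes among 10 patients directly
store_k = rbinom(nsim, size = 10, prob = 0.5)

# Simulated relative frequency of 0, 1, ..., 10 successes
sim_freq = as.vector(table(factor(store_k, levels = 0:10))) / nsim

# Exact null probabilities for comparison
exact = dbinom(0:10, size = 10, prob = 0.5)

# Side-by-side view, one row per possible success rate
round(data.frame(success_rate = (0:10) / 10,
                 simulated = sim_freq,
                 exact = exact), 3)
```

If the simulation is set up correctly, the two columns should agree closely and both should peak at a success rate of 0.5.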

Interpreting a Simulation
You can examine the histogram you made with the simulation to help you understand the null
distribution of the sample statistic.

The mean of this histogram tells you what you should expect to see if there is nothing interesting
going on in the population. As expected, since we set a 50% population success rate in our
simulation, we see that sample success rates across many random samples are concentrated
around 50%.

The standard deviation of this histogram gives you a sense of the variability you should expect
to see in the sample statistic. This tells you how much the sample statistic (success rate) would
vary from sample to sample on average if the drug only had a 50% chance of working on each
patient. When this standard deviation is small, a large value of the sample statistic would provide
strong evidence that what you find in your sample is a real signal and not just due to randomness.
However, if this standard deviation is large, even if you see a large sample statistic, take it with a
grain of salt before generalizing your finding to the population.
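
To make this interpretation concrete for the cold-drug example, the sketch below summarizes the null distribution and estimates how often a success rate of at least 70%, 80%, or 90% appears by chance alone. It is an illustration rather than part of the original tool: it re-creates the simulation with a smaller nsim so that the block runs on its own, and it stores success counts in a hypothetical vector store_k so that the comparisons use whole numbers.

```r
set.seed(1)
nsim = 10000                       # Smaller than the tool's 100000, for speed
outcome = c("Worked", "Did not Work")
store_k = rep(0, nsim)             # Number of successes in each sample

for (i in 1:nsim) {
  result = sample(outcome, 10, replace = TRUE, prob = c(0.5, 0.5))
  store_k[i] = sum(result == "Worked")
}
store_p = store_k / 10             # Sample success rates

mean(store_p)                      # Center of the null distribution
sd(store_p)                        # Sample-to-sample variability

# Share of simulated samples with at least 7, 8, or 9 successes
tail_sim = c(mean(store_k >= 7), mean(store_k >= 8), mean(store_k >= 9))

# Exact binomial tail probabilities for comparison
tail_exact = pbinom(c(6, 7, 8), size = 10, prob = 0.5, lower.tail = FALSE)

tail_sim
tail_exact
```

Under the 50% baseline, the exact tail probabilities are about 0.172, 0.055, and 0.011. Seven recoveries out of ten would therefore occur by chance in roughly one sample in six, while nine out of ten would be rare.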
