BIOB20 Notes
Example 1:
A die is rolled 57 times and 15 sixes are obtained. Is the die fair?
Null hypothesis is that the die is fair
Alternative hypothesis is that the die is not fair
Test Statistic:
● When rolling a fair die, the prob of rolling a 6 is ⅙
● Since the die is rolled 57 times and under the assumption the die is fair (null hypothesis), the expected #
of sixes after 57 rolls is calculated using the formula for the expected value of a binomial distribution.
● The expected value E(X) of a binomial distribution is: E(X) = n x p
○ n is the number of trials
○ p is the probability of success on a single trial.
● E(X) = 57 x ⅙ = 9.5
○ If the die is fair, we expect to see about 9.5 sixes after 57 rolls.
● The test statistic is the deviation between the observed and expected counts: 15 - 9.5 = 5.5.
Decision Rule:
● Then, a histogram is made of the simulated test statistics.
● Then calculate the 95th percentile (95% quantile) of the distribution.
○ Tells us what test statistic value corresponds to the most extreme 5% of the outcomes under the
assumption that the die is fair.
○ In this case, the critical value is 5.5.
● If the observed test statistic is greater than the critical value, you reject the null hypothesis.
○ Suggests that the observed outcome is unusual enough (in the extreme 5% of cases) that it's unlikely to have occurred by random chance if the die were fair.
● If the observed test statistic is less than or equal to the critical value, you accept (fail to reject) the null hypothesis.
○ Suggests that the outcome is not unusual.
● Conclusion: the observed test statistic (5.5) is not greater than the critical value (5.5), so we accept the null hypothesis that the die is fair.
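A minimal simulation sketch of how the null distribution and critical value could be obtained (the exact simulation code is not in the notes):
> nulld = replicate(n=10000, rbinom(n=1, size=57, prob=1/6) - 9.5) # deviation from expected under a fair die
> hist(nulld, xlab="statistic")
> quantile(nulld, 0.95) # critical value; the notes report 5.5
> (15 - 9.5) > quantile(nulld, 0.95) # TRUE would mean reject the null hypothesis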
Example #2:
A driving school has 109 students; 98 of them passed the exam, and the local government requires an 80% success rate. Will this driving school have better than 80% success in the future?
Null Hypothesis: The school's success rate is 80%, meaning there is no evidence the school has a better success
rate than required.
Alternative Hypothesis (H₁): The school's success rate is better than 80%.
Test Statistic:
● Under the assumption that the success rate is 80%, the E(X) = 109 x 0.80 = 87.2
● Deviation = Observed Value - Expected Value
○ Deviation = 98-87.2 = 10.8
● Since the observed test statistic (98) is greater than the critical value (83.2), we reject the null hypothesis. This suggests that the school has a significantly better success rate than 80%.
● If the observed test statistic is less than or equal to the critical value, you fail to reject the null hypothesis;
can’t conclude that the school has a better success rate than 80%.
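A hedged simulation sketch of how a critical value could be obtained for this one-sided test (the simulation itself is not shown in the notes):
> nulld = replicate(n=10000, rbinom(n=1, size=109, prob=0.80)) # number of passes under H0
> quantile(nulld, 0.95) # one-sided 5% critical value
> 98 > quantile(nulld, 0.95) # should be TRUE, consistent with rejecting the null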
VIDEO #3 + Tutorial 3
The dbinom Function:
● dbinom calculates the prob of getting exactly x successes in size independent trials, where each trial has a success probability of prob.
● > dbinom (x, size, prob)
○ x - the # of successes (eg. the number of mutations we are interested in)
○ size - the total # of trials (eg. time intervals in our mutation example)
○ prob - the prob of success (eg. a mutation) in each trial.
Example 1 - Using Mutation Rates to Model the Binomial and Poisson Distributions:
● Goal: to determine the prob of no mutations over a period where, on avg, one mutation occurs.
● Approach: use the binomial distribution to model mutation events, dividing the time interval into smaller
sub-intervals. Then, take the limit as the intervals become infinitely small, which leads to the Poisson
distribution.
Step 1 is the binomial distribution setup:
● The binomial distribution models the # of successes (mutations) in a fixed # of independent trials (time
intervals), with a fixed prob of success per trial.
○ > dbinom(size=2, prob=1/2, x=0)
■ size = 2: splits the time into 2 intervals
■ prob = 1/2 is the chance of mutation in each interval
■ x = 0 represents the event of no mutations
○ Output: 0.25 is the prob of no mutations when the time is divided into 2 intervals
○ If we were to split into more subdivisions, the probability stabilizes.
■ > dbinom(size=4, prob=1/4, x=0)
■ > dbinom(size=256, prob=1/256, x=0)
■ > dbinom(size=100000000, prob=1/100000000, x=0)
■ The prob converges to approx 0.3678794.
● As the # of divisions increases (eg. size becomes larger and prob becomes smaller), the probability of no mutations converges to e^(-1) ≈ 0.3679, which is exactly the Poisson probability of zero events when lambda = 1.
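A quick check of this limit (a small sketch, not taken from the notes):
> dbinom(x=0, size=1000000, prob=1/1000000) # approx 0.3678794
> exp(-1) # 0.3678794
> dpois(x=0, lambda=1) # 0.3678794, the Poisson limit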
Thus, the guess that makes the results (7 heads and 3 tails) most likely is p=0.7. This is your maximum likelihood estimate because it best explains the results you got.
Log-Likelihood
● > sum(dpois(x=obs, lambda=lambda_0, log=TRUE)) or > log(prod(dpois(x=obs, lambda=lambda_0)))
● Essentially this returns the logarithm of each prob instead of the actual prob, and sums them.
● This is because multiplying many small probabilities can result in very small numbers (eg. 1e-44) which
are hard to work with.
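A small sketch comparing the two approaches (obs and lambda_0 are placeholder values assumed here):
> obs = rpois(n=100, lambda=3) # simulated data, for illustration only
> lambda_0 = 3
> sum(dpois(x=obs, lambda=lambda_0, log=TRUE)) # numerically stable log-likelihood
> log(prod(dpois(x=obs, lambda=lambda_0))) # same value here, but prod() can underflow to 0 for larger samples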
VIDEO #4
Continuous Variables:
● Some random variables are continuous instead of discrete (unlike the values produced by rbinom and rpois)
● > runif() - generates random numbers from a uniform distribution.
○ generates a random sample of numbers between 0 and 1.
○ Numbers are drawn from the uniform distribution, meaning that every value between 0 and 1 is
equally likely.
○ Eg. runif(5, min=0, max=1)
● > runif (n=10)
○ > 0.8390957 0.2427499 0.5665379 0.1225692 0.5382546 0.1827380 0.1452402 0.1456641
0.7780869 0.1627826
○ Prints 10 random numbers from the uniform distribution
● Making a histogram of these shows that all values occur with roughly equal frequency.
● We also don't get values below 0 or above 1.
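A quick check (sketch):
> hist(runif(n=10000), breaks=20) # roughly flat bars, all between 0 and 1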
Libraries:
● Functions are available in libraries
● > install.packages("circular")
● > library(circular)
● > x = runif (n=100, max = 2*pi)
● > plot(circular(x))
Example 1:
In a national exam, a teacher has three students who take the exam and score rankings of 0.734, 0.859 and 0.971. Based on this, do we give the teacher a promotion? These scores are drawn from a uniform distribution between 0 and 1, which means every score within this range is equally likely. We are using these scores to decide if the teacher deserves a promotion based on how well these students perform relative to what you would expect under the uniform distribution.
Null Hypothesis: The students' scores are average, meaning they could be from any teacher, and the teacher
doesn’t necessarily deserve a promotion.
● Generally, the null hypothesis is the hypothesis you don't like and are setting up in the hope of rejecting it.
● Plus, there is only one way to be average.
Alternative Hypothesis: The students' scores are high, suggesting the teacher might be better than average and
might deserve a promotion.
Test Statistic
● Take the average of the three grades; could also use the max or the min
● Under the null hypothesis (uniform distribution), the expected avg score is 0.5
● > mean(c(0.734, 0.859, 0.971))
● > 0.8546667
● We will calculate the average score of these three students and compare it to the expected average
under the null hypothesis.
Decision Rule:
● > runif(n=3,min=0,max=1)
○ Generates 3 random #s between 0 and 1 (to represent student scores under the H0)
● > mean(runif(n=3,min=0,max=1))
○ Calculates the avg of these random numbers
○ Simulates the test statistic under the assumption the null hypothesis is true.
● > nulld = replicate(n=10000, mean(runif(n=3,min=0,max=1)))
○ The null distribution assuming the null hypothesis is true; shows us what kind of avg scores we would expect from a typical teacher if their students' scores came from a uniform distribution.
○ Replicate simulates the avg score for 10,000 sets of 3 students under the null hypothesis.
● > hist(nulld, xlab = "statistic")
● > quantile (nulld, 0.99)
○ > 99%
○ > 0.8748196
○ calculates the 99th percentile of the null distribution, which means 99% of the scores from the
null distribution are less than this value.
○ This value will serve as the cutoff for deciding whether to promote the teacher.
○ If the observed average is above this 99% cutoff, we will reject the null hypothesis and suggest
that the teacher’s students performed unusually well.
● > abline(v=0.8748196, col = "red")
○ This line represents the threshold; if a teacher’s average score exceeds this line, there is less than
a 1% chance that their score could be due to random variation.
● > mean(c(0.734, 0.859, 0.971))
○ > 0.8546667
● > abline(v=0.8546667, col = "blue")
● If the blue line (teacher's average score) is to the right of the red line (cutoff for the 99th percentile), the teacher's students performed exceptionally well, and we reject the null hypothesis, meaning we would promote the teacher.
● If the blue line is to the left of the red line (as in this case), the teacher's students' performance is not significantly better than expected under the null hypothesis, so we would not promote the teacher.
Example 2:
We are trying to arrange nurses' hospital shifts (the shift times are stored in a vector a).
Since there are 24 hours in a day, we can use the circular library.
> plot(circular(a, template = "clock24", units = "hours"), stack = TRUE)
Null Hypothesis (H0): The shift times are uniformly distributed across 24 hours.
Alternative Hypothesis (H1): The shift times are not uniformly distributed (there are clusters).
>runif(n=5)
● generates 5 random numbers from a uniform distribution.
>sort(runif(n=5))
● sorts the random numbers.
>diff(sort(runif(n=5)))
● Computes the difference between the numbers
>sum(diff(sort(runif(n=5)))^2)
● Squaring the differences gives more weight to large gaps, and the sum of the squared gaps gives a measure of how unevenly the points are spread (clustering produces a few large gaps).
>nulld = replicate(n=10000, sum(diff(sort(runif(n=1000,min=0,max=24)))^2))
● You simulate 10,000 datasets of 1000 uniformly distributed shift times over a 24-hour period.
>hist(nulld)
>quantile (nulld,0.99)
>47.92697
● The 99th percentile gives the value that 99% of the simulated values are below. This is the cutoff for
determining if the observed shift distribution is unusually clustered.
>a%%24
● a %% 24 ensures that all shift times are wrapped around a 24-hour clock.
>sum(diff(sort(a%%24))^2)
● sort orders the shifts from earliest to latest
● diff calculates the differences between consecutive (sorted) shift times
● squaring the differences gives more weight to large gaps
● sum will give a total measure of how spread out the shift times are within the 24 hour period.
● * this value tells us how “spread out” or “clustered” the shift times are for the actual data compared to the uniform distribution.
>47.02074
>abline(v=47.02074)
>mean(nulld >= 47.02074)
● Creates a logical vector where each entry is either T/F depending on whether the simulated value in nulld
is greater than or equal to the observed value (47.02074).
● The mean calculates the proportion of TRUE values in the vector
● This proportion represents the p-value, which is the probability of obtaining a sum of differences as large or larger than the observed value (47.02074) under the null hypothesis.
> 0.60719
● A p-value of 0.60719 means that 60.7% of the time, randomly generated uniform shift distributions will have a sum of differences greater than or equal to 47.02074.
Decision:
● Since 0.60719 is much larger than the typical significance threshold of 1% (0.01) or 5% (0.05), we fail to
reject the null hypothesis.
● This means that the observed shift times do not show significant clustering, and we conclude that the
shift times are uniformly distributed.
● Accept the null hypothesis.
Standard Deviation:
● SD represents the amount of variability among the numbers in the sample.
● > x = rnorm(n=5)
○ > sd(x)
○ [1] 1.16349
● If the numbers are close to one another, the SD is close to 0; if the numbers are far away, the SD is high
● SD is linked to variance
○ > sqrt(var(x))
○ [1] 1.16349
● If you multiply your numbers by 100, the SD is also multiplied by 100, meaning the SD scales linearly with the data
○ > x * 100
○ > sd(x*100)
○ [1] 116.349
● But if you add 100 to your numbers, the SD doesn't change; adding a constant doesn't change the SD.
○ > x + 100
○ > sd(x+100)
○ [1] 1.16349
Functions:
● You can use curly brackets to define functions in R that don't already exist.
● Let's say span means the max - min, but this function doesn't exist in R, so:
○ > span = function(x){max(x)-min(x)}
○ > span(x)
○ [1] 1.779923
● After making a function, you can use it again
○ > span(rnorm(n=5))
○ [1] 0.8498488
Recycling Rule
● In R, the recycling rule automatically repeats (recycles) shorter vectors to match the length of longer ones
during element-wise operations.
● If the length of the longer vector is not a multiple of the shorter one, a warning is issued
● > x = rpois(n=2, lambda = 1)
● > y = rpois(n=4, lambda = 1)
● > x+y
● [1] 3 1 3 2
Test Statistic:
● Remember the test statistic is a number computed from the data set, and a good potential test statistic could be the z-score.
● > obs = c(3.52, 1.94)
● z = (mean(obs) - hypothesized_mean) / sd(obs)
○ Z-score (also called standard score) indicates how many standard deviations away your sample
mean is from the hypothesized mean (1.2345).
○ Helps standardize different data points so they can be compared on a common scale, even if they
come from different distributions.
● What does the Z-Score mean?
○ z = 0: The value is exactly at the mean.
○ z > 0: The value is above the mean (right of the mean).
○ z < 0: The value is below the mean (left of the mean).
● In the context of hypothesis testing:
○ A small z-score (close to 0) means that your sample mean is close to the hypothesized mean.
○ A large z-score (positive or negative) indicates that your sample mean is far away from the
hypothesized mean, suggesting the null hypothesis might not hold.
● > statistic = function(x){(mean(x)-1.2345)/sd(x)}
Decision Rule:
● > quantile (nulld, 0.005)
○ 0.5%
○ -45.64871
● > quantile (nulld, 0.995)
○ 99.5%
○ 45.1567
● > statistic (obs)
○ 1.313009
● If the test statistic lies outside these quantiles, we would reject the null hypothesis.
● 1.31 falls between -45.6 and 45.2, so we accept the null hypothesis.
Gamma Distribution:
● The gamma distribution is a continuous probability distribution commonly used to model waiting times
until a certain number of events occur.
● The distribution only takes positive values, making it best for modeling time-to-event data.
● It is especially useful when dealing with time-to-event data, such as the time it takes for a bus to arrive.
● rgamma (n, shape, scale)
○ Shape (alpha) = determines the # of events you are waiting for
○ Scale (beta) = the avg time for one event to occur
● Shape Parameter: defines how many events need to occur before stopping the timer
○ When shape = 1, the gamma distribution reduces to an Exponential distribution, modeling the
waiting time for a single event.
■ > hist(rgamma(n=10000, shape=1, scale=1), breaks = 80)
■ The distribution has a steep drop-off to the right, indicating that shorter waiting times are
more likely.
○ As the shape parameter increases (e.g., shape = 2), the histogram starts to shift to the right.
■ > hist(rgamma(n=10000, shape=2, scale=1), breaks = 80)
■ The distribution's peak moves to the right, making longer waiting times more probable.
■ The shape becomes smoother with less steep decline.
■ Looks more and more like a normal distribution.
● Scale Parameter:
○ affects the stretching of the distribution along the x-axis
○ With larger scale values, the distribution spreads out, and the average waiting time increases.
○ > hist(rgamma(n=10000, shape=1, scale=100), breaks = 80)
■ The histogram is stretched, indicating that events take longer on average.
■ The overall shape of the distribution remains the same; only the x-axis scale changes.
● Mean and Variance:
○ Mean: α×β
○ Variance: α × β²
For the Gamma distribution, the mean of the observed data serves as the basis for the Maximum Likelihood Estimate (MLE) of the scale parameter when the shape parameter is known. This is because the mean of the Gamma distribution is shape x scale, so solving for scale using the sample mean (scale = mean / shape) gives the MLE.
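A quick numeric check (a sketch with arbitrary parameter values):
> x = rgamma(n=10000, shape=3, scale=2)
> mean(x) / 3 # MLE of the scale when shape = 3 is known; should be close to the true value 2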
GGPLOT:
> url = "https://raw.githubusercontent.co
m/gui11aume/BIOB20/main/tutorial_06/a.txt"
> a = read.delim(url)
> head(a)
x y
1 -1.31409690 -1.3584203
2 -0.47806638 -0.2116580
● ggplot() uses a layering approach: start with ggplot(),
add layers like geom_point(), customize with theme()
and labs().
○ > ggplot(a, aes(x=x, y=y)) + geom_point() #specifies scatterplot
● theme_classic() is a built-in ggplot2 theme that gives the plot a classic, clean look with no grid lines and
a simple white background.
○ > ggplot(a, aes(x=x, y=y)) + geom_point() + theme_classic()
● labs() is used to set custom labels for the x-axis and y-axis.
○ > ggplot(a, aes(x=x, y=y)) + geom_point() + theme_classic() + labs(x="variable x (au)",
y="variable y (au)")
● panel.grid.major and panel.grid.minor control the appearance of the major and minor grid lines,
respectively.
● element_line() specifies the line's properties, such as size (thickness), linetype (e.g., "solid", "dashed"),
and colour.
○ > ggplot(a, aes(x=x, y=y)) + geom_point() + theme_classic() + labs(x="variable x (au)",
y="variable y (au)") + theme(panel.grid.major=element_line(size=.3, linetype="solid",
colour="grey"), panel.grid.minor = element_line(size=.3, linetype="solid", colour="grey"))
● geom_abline() adds a straight line to the plot. The intercept and slope arguments specify where the line
crosses the y-axis (intercept = 0) and its slope (slope = 1).
○ > ggplot(a, aes(x=x, y=y)) + geom_point() + theme_classic() + labs(x="variable x (au)",
y="variable y (au)") + theme(panel.grid.major=element_line(size=.3, linetype="solid",
colour="grey"), panel.grid.minor = element_line(size=.3, linetype="solid", colour="pink")) +
geom_abline(intercept=0, slope = 1)
GEOM TYPES:
● Histogram: to visualize the distribution of a single continuous variable
○ > ggplot(data, aes(x = variable)) + geom_histogram(bins = 30, fill = "blue", color = "black", alpha
= 0.7) + labs(title = "Histogram", x = "Variable", y = "Count")
● Scatter Plot: to visualize the relationship between two continuous variables
○ > ggplot(data, aes(x = variable1, y = variable2)) + geom_point(color = "red", size = 3) + labs(title
= "Scatter Plot", x = "Variable 1", y = "Variable 2")
● Line Plot: used for time series data to show trends over time
○ > ggplot(data, aes(x = time, y = value)) + geom_line(color = "blue", size = 1) + labs(title = "Line
Plot", x = "Time", y = "Value")
● Bar Plot: displaying the counts/frequency of categorical variables
○ > ggplot(data, aes(x = category)) + geom_bar(fill = "green", color = "black") + labs(title = "Bar
Plot", x = "Category", y = "Count")
● Density Plot: shows a smoothed version of a histogram
○ > ggplot(data, aes(x = variable)) + geom_density(fill = "purple", alpha = 0.5) + labs(title =
"Density Plot", x = "Variable", y = "Density")
For example, when using ggplot2 for plotting, it is necessary to have data in a data frame format where each
column represents a variable. In this case:
● The Index column can be used for the x-axis (e.g., to plot the order of data).
● The Value column can be used for the y-axis (e.g., to plot the actual values in the dataset).
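For instance (a sketch with simulated values; Index and Value are the column names described above):
> library(ggplot2)
> df = data.frame(Index = 1:100, Value = rnorm(100))
> ggplot(df, aes(x=Index, y=Value)) + geom_point()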
VIDEO 7 NOTES
For Loops:
● For loops are used to repeat a block of code for a certain number of iterations.
● > for (i in 1:5) {print(i)}
○ The for loop goes through each number in the sequence 1:5.
■ i takes on the values of 1, 2, 3, 4, 5
■ Inside the loop, print(i), prints the current value of i
○ > [1] 1 [1] 2 [1] 3 [1] 4 [1] 5
● > for (i in 1:5) {print(i^2)}
○ This time, instead of printing i, the loop prints i^2 (the square of i)
○ Eg. when i is 2, it prints 2^2 = 4 and so on…..
○ > [1] 1 [1] 4 [1] 9 [1] 16 [1] 25
● > x = rnorm (n=5)
○ > 0.71423218 1.32068976 -0.84250403 -0.05279274 -1.30105252
○ > for (i in 1:5) {print(x[i])}
■ The loop prints each value of x[i] for i from 1 to 5.
■ > [1] 0.7142 [1] 1.320 [1] -0.842 [1] -0.0527 [1] -1.3010
● Conditional Statements Inside a For Loop (If Statements):
○ > for (i in 1:5) { if (i > 2) {print(i)}}
■ Using an if statement to only print values of i greater than 2.
■ > [1] 3 [1] 4 [1] 5
● General syntax on for loops
○ for ( i in sequence) { #code to repeat for each value of i }
○ sequence: A vector of values for i (e.g., 1:5 means that i will take on values from 1 to 5).
○ Inside the curly braces {}, you put the code that you want to repeat for each value of i.
TUTORIAL 7 NOTES
Example #1 - The Effect Size:
Does blood pressure (x) go up or down after the intervention? Is there a difference before and after the treatment?
● > url = "https://raw.githubusercontent.com/gui11aume/BIOB20/main/tutorial_07/a.txt”
Null: there is no difference -> before = after
Alternative: there is a difference.
Test Statistic:
● The formula statistic <- function(x, y) { (mean(x)-mean(y)) / (sd(x)+sd(y))} calculates the standardized
difference between the two means, accounting for the variability (standard deviations) in both groups.
● Difference between the means of two groups (before & after) divided by the sum of their SDs.
● When the null hypothesis is true (no difference between means), this statistic will be close to 0. When
the alternative hypothesis is true (there is a difference), the statistic will be either very positive or very
negative, indicating that the difference is significant.
● This is called standardization: it removes the influence of the spread (variability) in the data, and after standardization the test statistic follows a normal distribution if the sample sizes are large enough, thanks to the CLT.
○ Allows us to apply normal distribution theory to make inferences about whether the difference in
means is statistically significant.
● > nulld <- replicate (n=100000, statistic(rnorm(30), rnorm(30)))
● > nulld2 <- replicate(n=100000, statistic(runif(30), runif(30)))
○ Can visualize both by:
○ > par(mfrow = c(1, 2))
○ hist(nulld, breaks=50)
○ hist(nulld2, breaks=50)
○ Regardless of the underlying distribution (normal or
uniform), the CLT (Central Limit Theorem) kicks in and the
test statistic becomes approximately normal when the
sample size is large enough (in this case, 30).
Decision Rule:
● Compute the statistic for the real data:
○ > statistic(a$x[1:30], a$x[31:60])
○ > 0.1232936
● > hist(nulld, breaks=50)
● > abline(v=0.1232936, col="blue", lwd=4)
● The rejection region is on both sides, 0.5% on each side. Thus you can take the absolute value of the null distribution to make the test two-sided.
● > quantile(abs(nulld), 0.99) # 1% significance level
○ > 0.3471285
● > abline(v=0.3471284, col="red", lwd=4)
● When the null distribution is taken in absolute value, the rejection region (1%) is entirely on the right (the sum of the two 0.5% tails).
● Accept the null because the observed statistic is below the threshold (0.1233 < 0.3471).
VIDEO #8 NOTES
Ranks in R:
● Ranking is used to determine the relative positions of values within a vector.
● > x = rnorm(n=5)
○ sort (x) ⇨ -0.6991376 -0.6768828 0.3030927 0.9424099 1.2891228
○ rank (x) ⇨ 4 5 2 1 3
■ Lowest value gets rank 1, and highest value gets rank 5.
● > y = rnorm (n=5) + 30
● > rank (c (x, y))
○ > 4 5 2 1 3 6 10 9 7 8
○ X is centered around 0 and y is shifted by 30, thus elements of x occupy the lower ranks (first 5)
and elements of y occupy the higher ranks (second 5).
● When vectors with smaller shifts (e.g., +5 or +2) are combined, the separation of ranks is less distinct,
leading to potential overlap in ranks.
○ > rank(c(rnorm(n=5), rnorm(n=5)+30)) ⇨ 1 5 2 4 3 9 10 8 6 7
○ > rank(c(rnorm(n=5), rnorm(n=5)+5)) ⇨ 2 3 5 4 1 8 7 10 6 9
○ > rank(c(rnorm(n=5), rnorm(n=5)+2)) ⇨ 1 3 4 6 2 7 9 8 10 5
■ The pattern breaks: rank 6 appears in the first half and rank 5 in the second half.
● When two vectors from the same distribution are combined, the ranks are generally more randomly
distributed.
○ > rank(c(rnorm(n=5), rnorm(n=5))) ⇨ 7 4 8 10 2 5 6 9 1 3
● But increasing the gap between the means (eg. adding 30 to one sample) separates the ranks, with the first sample occupying the lower ranks and the second sample occupying the higher ranks.
T-Tests:
Power Curves:
● The power curve is a plot of the probability of rejecting the null hypothesis (rejection rate) for varying
mean differences.
● Create a sequence of mean differences:
○ > mean = seq(from=0, to=3, by=0.1)
○ > rejection_rate = rep(NA, length(mean)) # initializes the vector
■ Output here would just be NA NA NA…..
● Loop through each mean difference, calculating the rejection rate:
○ > for (i in 1:length(mean)){ rejection_rate [i] <- mean(replicate(n=100000, t.test(rnorm(n=10,
mean=mean[i]), rnorm(n=10))$p.value) < 0.01)}
○ > plot(x=mean, y=rejection_rate, type= "l")
● Plot shows the prob of rejecting the null hypothesis at diff mean
differences.
● The curve starts at 1% (our significance level) when the mean is 0 and
rises to 1 as the mean difference increases.
○ Red line at 0.01 shows that the rejection rate aligns with the significance level when the null
hypothesis is true.
○ > abline(h=.01, col=2)
● Changing significance level to 5%.
○ > for (i in 1:length(mean)){ rejection_rate [i] <-
mean(replicate(n=100000, t.test(rnorm(n=10, mean=mean[i]),
rnorm(n=10))$p.value) < 0.05)}
○ > lines(x=mean, y=rejection_rate, col = 4) (adds new line to plot)
○ > abline (h=0.05, col=2)
○ The blue curve (5% level) lies above the black curve (1% level),
showing a higher probability of rejecting the null hypothesis
throughout.
● Thus, increasing the significance level from 1% to 5% increases the prob of rejecting the null
○ This gives higher power when the alternative is true, but also a higher false-positive rate when the null is true.
For Loops in R:
● > for (variable in sequence) {# Code to repeat}
○ variable: Think of this as a placeholder that changes every time the loop runs.
○ sequence: This is a list of values that the loop goes through, one by one.
○ The code block inside {} runs for each value in the sequence.
● > for ( i in 1:3) { print (i) }
○ i is the placeholder. It starts at 1 and changes to 2, then to 3.
○ 1:3 means the sequence is 1, 2, 3.
○ First run: i = 1, so R prints 1.
○ Second run: i = 2, so R prints 2.
○ Third run: i = 3, so R prints 3.
● General Rule for For Loops
○ Set up a placeholder (variable) that changes each time.
○ Decide the sequence of values you want to go through.
○ Write the code inside {} that you want to repeat.
○ Each run of the loop uses the next value in the sequence.
● words <- c("Hi", "Bye", "Thanks")
○ for (word in words) { print(word)}
○ R prints "Hi", "Bye", and "Thanks".
TUTORIAL #8 NOTES
Bayesian Problem:
● When sample sizes are too small, the t-test is not suitable (the CLT approximation behind it does not hold).
● > sz = rpois(n=1, lambda=12.3)
● > rbinom(n=1, size=sz, prob=0.5)
○ > 8
● What is the prob that sz is equal to 10?
○ > dpois (lambda=12.3, x=10)
○ [1] 0.09941821
○ The prob is 9.9%.
● If the output now becomes 11 instead of 8, how does the prob that sz = 10 change?
○ This introduces a Bayesian problem, which involves updating probabilities after observing new
data.
Bayesian Update
● The shift from 9.9% (initial probability) to 12.7% (after observation) is an example of a Bayesian update.
● This approach adjusts the probability based on observed data, rather than being a typical statistical
estimation or test.
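A rough sketch of how such an update could be computed (the tutorial's exact calculation is not reproduced here; the observed count 8 and the grid of candidate sz values are used for illustration):
> lambda = 12.3; x_obs = 8
> k = 0:100 # candidate values of sz
> prior = dpois(k, lambda=lambda) # prior P(sz = k)
> lik = dbinom(x_obs, size=k, prob=0.5) # likelihood P(X = x_obs | sz = k)
> posterior = prior * lik / sum(prior * lik) # Bayes' rule
> posterior[k == 10] # updated probability that sz = 10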
Using the effect size for the statistic is problematic because the sample sizes are too small; a common rule of thumb is that at least 30 values are needed for the CLT to approximate a normal distribution.
● Thus, we can use relative positions via ranks where y should have larger values compared to x.
● > rank(c(x, y))
● If x & y are really far apart, x will always have the first 3 ranks and y will have the other 5 ranks but of the
magic # is 0, the ranks will be randomly mixed due to x and y having the exact same distribution.
Compute U Statistic:
● Can compute the difference of the mean ranks: mean(c(2,3,1)) - mean(c(6,5,7,8,4)) ⇨ -4 (the u-statistic)
● > statistic <- function(x, y) { ranks_x <- rank(c(x, y))[1:length(x)]; ranks_y <- rank(c(y, x))[1:length(y)]; mean(ranks_x) - mean(ranks_y) }
> wilcox.test(x,y)
● W = 572, p-value = 0.03655 alternative hypothesis: true location shift is not equal to 0
● The p-value is approximately 0.03655, which is less than 0.05, suggesting that we reject the null
hypothesis at the 5% significance level.
● However, at the 1% significance level, we still fail to reject the null hypothesis.
Statistical Power:
● Statistical power is the probability of correctly rejecting the null hypothesis when it is false.
● The power is calculated by simulating the tests many times (1 million times) and counting how often the
null hypothesis is rejected at a specific significance level (e.g., 0.01).
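A minimal power-simulation sketch (the sample size of 8 and shift of 1 are illustrative assumptions, and fewer replicates are used than the 1 million mentioned above):
> reject = replicate(n=10000, wilcox.test(rnorm(n=8), rnorm(n=8) + 1)$p.value < 0.01)
> mean(reject) # estimated power at the 1% significance level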
VIDEO #9 NOTES
Logical Operations:
● Logical Comparisons: evaluate conditions and return TRUE, FALSE, or NA for each element of a vector.
○ > x <- rpois(n = 20, lambda = 1827)
○ x > 1830; output: TRUE or FALSE for each value in x
● Subsetting Using Logical Conditions:
○ Vector [condition]
○ x [x > 1830]; returns only the values of x where the condition is TRUE.
● Modulo Operation (%%); finds the remainder when dividing by a number
○ x %% 2 # Remainders when dividing each value in x by 2
■ 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 1
○ x %% 2 == 0 # TRUE for even numbers, FALSE for odd numbers
■ TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE…
○ x[x %% 2 == 0]; subset even numbers
■ 1758 1816 1870 1798 1844 1806 1828 1884 1866 1796 1788 1848..
● Combining Logical Conditions:
○ & (AND): Both conditions must be TRUE.
○ | (OR): At least one condition must be TRUE.
○ > x[(x %% 2 == 0) & (x %% 3 == 1)]
■ #’s that are divisible by 2 AND leave a remainder of 1 when divided by 3.
○ > x[(x %% 2 == 0) | (x %% 3 == 1)]
■ #’s that are divisible by 2 OR leave a remainder of 1 when divided by 3.
Frequency Tables:
● x = rbinom(n = 1000, prob = 0.5, size = 13)
● table(x) - Counts how many times each value occurs in x (a frequency table).
● plot(table(x)) - A bar plot showing the frequencies of each value in x.
● Using data(esoph), we can make contingency tables:
○ table(esoph$alcgp); frequency table of alcohol consumption
● table(esoph$alcgp, esoph$tobgp); Cross-tabulate alcgp and tobgp.
● Extract specific rows or columns from the contingency table:
○ y = table(esoph$alcgp, esoph$tobgp)
○ y["120+", ] # All columns for "120+" alcohol consumption
○ y[, "30+"] # All rows for "30+" tobacco consumption
TUTORIAL #9 NOTES
● Shuffle only the second column (o, blood type) in the patients data frame.
● Find the count of patients from Nuevo Leon (n1 = 1) with blood type O (o = 1) after shuffling.
● set.seed(123)
● patients$o <- sample(orig$o) # Shuffle the `o` column
● table(patients)
● table(patients)[2,2] ⇨ 323.
● This question involves simulating a null distribution by shuffling data 10,000 times and determining
whether the observed value lies within the rejection region.
● Shuffle the second column (o, blood type) of the patients data frame 10,000 times using a for loop.
● Record the number of patients from Nuevo Leon (NL) (n1 = 1) with blood type O (o = 1) in each shuffled
iteration.
● nulld <- rep(NA, 10000)
● for ( i in 1:10000) {patients$o = sample(patients$o); nulld[i] =
table(patients)[2,2]}
○ sample(patients$o) shuffles the o column
○ nulld[i] stores the count of NL ppl with type O.
● hist(nulld, main = "Null Distribution", xlab = "Count of NL with
Blood Type O", col = "lightblue")
○ The histogram shows the range of counts we expect if
blood type and location were independent.
● Find rejection region
○ quantile(nulld, 0.005) ⇨ abline(v = 304, col = "red")
○ quantile(nulld, 0.995) ⇨ abline(v = 351, col = "red")
● abline(v = 346, col = "blue")
● The observed value (346) lies between the two cutoffs, outside the rejection region, so we accept the null hypothesis. This suggests that the observed relationship between location and blood type could occur by random chance if they were independent.
● This question explores generating a null distribution for a statistic using a binomial model with
replacement and transforming it into a standardized statistic.
The goal is to compare this null distribution to a Gaussian
(standard normal) distribution.
● First generate the statistic, simulate 10,000 values from this
transformed null distribution. Then compare the density of this
null distribution to the Gaussian distribution.
● nulld <- (rbinom(n = 10000, size = N, prob = pNL * pO) - pNL * pO *
N) / sqrt(pNL * pO * (1 - pNL * pO) * N)
● plot(density(nulld), main = "Comparison of Null Distribution and
Gaussian", xlab = "Standardized Statistic", col = "blue", lwd = 2)
● gauss <- rnorm(100000)
● lines(density(gauss), col = "red", lwd = 2)
● legend("topright", legend = c("Null Distribution", "Gaussian
Distribution"), col = c("blue", "red"), lwd=2)
● The null distribution and the Gaussian distribution are nearly
identical, confirming that the Gaussian approximation works well in
this case.
● This question explains how to compute the chi-square test statistic manually for a 2×2 contingency table.
This statistic evaluates whether two categorical variables (e.g., location and blood type) are independent.
● The outer() function calculates the expected values under the null hypothesis of independence:
○ expected = outer(rowSums(btO), colSums(btO)) / N
● For each cell, compute the chi-square contribution:
○ (btO - expected)^2 / expected
● Sum all the contributions to get the chi-square statistic:
○ sum((btO - expected)^2 / expected)
○ 4.106456
● By default, chisq.test() applies Yates' continuity correction to the chi-square statistic. This adjusts the
statistic to account for small sample sizes in 2×2 tables
● chisq.test(btO) #statistic = 3.8893
○ statistic is different from the manually computed value in Q10 due to the continuity correction.
● To match the manually computed statistic, set the correct parameter to FALSE to disable Yates' continuity
correction:
○ chisq.test(btO, correct = FALSE) #statistic = 4.1065
○ This value matches the manually computed statistic from Question 10
● The rmultinom() function can simulate random contingency tables based on the probabilities in the expected table.
○ P <- expected / N
○ set.seed(123)
○ mat <- matrix(rmultinom(n = 1, size = N, prob = as.vector(P)), ncol = 2)
○ This simulates a contingency table with the same probabilities as the
expected values.
○ chisq.test(mat, correct=FALSE) # output statistic = 0.18666
● This task involves simulating the null distribution of the Chi-Square statistic by resampling the
contingency table 10,000 times and comparing it to the theoretical Chi-Square distribution with 1 degree
of freedom.
● nulld <- rep(NA, 10000) # Initialize an empty vector
● for (i in 1:10000) {mat <- matrix(rmultinom(n = 1, size = N, prob = P), ncol = 2); nulld[i] <- chisq.test(mat,
correct = FALSE)$statistic}
● plot(density(nulld), main = "Null Distribution of Chi-Square Statistics", lwd = 4, xlab = "Chi-Square
Statistic")
○ Plot the density of the simulated null distribution
● Overlay the theoretical Chi-Square distribution
○ chi2 <- rgamma(n = 10000, shape = 1 / 2, scale = 2) # Chi-square with df=1
○ lines(density(chi2), col = "red", lwd = 3)
Q14:
rl <- "https://raw.githubusercontent.com/gui11aume/BIOB20/main/tutorial_09/ABO.txt"
> ABO <- as.matrix(read.delim(url, row.names = 1))
> row_totals <- rowSums(ABO)
> col_totals <- colSums(ABO)
> grand_total <- sum(ABO)
> expected <- outer(row_totals, col_totals) / grand_total
> set.seed(123)
> nulld <- rep(NA, 10000)
> for (i in 1:10000) {
+ # Generate a resampled contingency table
+ simulated_table <- matrix(rmultinom(n = 1, size = grand_total, prob = as.vector(expected / grand_total)),
+ nrow = nrow(ABO), ncol = ncol(ABO))
+
+ # Compute the Chi-Square statistic
+ nulld[i] <- sum((simulated_table - expected)^2 / expected)
+}
plot(density(nulld), main = "Null Distribution of Chi-Square Statistic",
+ xlab = "Chi-Square Statistic", lwd = 4, col = "blue", xlim = c(0, max(nulld)),
+ ylim = c(0, max(density(nulld)$y) + 0.13))
> chi2 <- rgamma(n = 10000, shape = 3 / 2, scale = 2)
> lines(density(chi2), col = "red", lwd = 3)
> legend("topright", legend = c("Simulated Null Distribution", "Theoretical Chi-Square (df=3)"),
+ col = c("blue", "red"), lwd = 3)
Visualizing P-Values:
● The plot shows the relationship between hypothesized means (μ) and p-values:
● Blue line at p=0.05
○ Values of μ with p>0.05 are "accepted" at the 5% significance level. These correspond to means
compatible with the data.
○ These means form a confidence interval around
the sample mean.
● Red line at p=0.01
○ A stricter threshold narrows the range of accepted
μ, reflecting greater certainty.
● The key takeaway is that the p-value tells us whether the
data is compatible with H0, not whether H0 is "true."
> min(which(pvals > 0.05))
[1] 112
> mu[112]
[1] 15.77778
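A sketch of how such a curve could be produced (the data vector x and the grid of means are placeholders, not the actual values from the lecture):
> x = rnorm(n=10, mean=16) # placeholder data for illustration
> mu = seq(from=10, to=20, length.out=500) # grid of hypothesized means
> pvals = sapply(mu, function(m) t.test(x, mu=m)$p.value)
> plot(x=mu, y=pvals, type="l")
> abline(h=0.05, col="blue"); abline(h=0.01, col="red")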
Making Predictions:
● If x = 1, what is y?
● Plug into the equation: y = a + b*x = 0.1086 + 2.8948 * 1 = 3.0034.
○ best guess is 3
● If you are predicting something outside the range of observed values (eg. x = 7), it is called extrapolation
● Predicting within the range is called interpolation
● confint (lm(y~x))
○ 0.1086 + 2.4846760 * 1 = 2.593276
○ 0.1086 + 3.3049405 * 1 = 3.41354
● Anything between 2.59 and 3.41 is a good estimate
for x = 1.
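A hedged sketch using predict() instead of the hand calculation (assuming a data frame a with columns x and y; the dataset used in this section is not shown in the notes, and predict()'s confidence interval for the mean response will not exactly match the interval obtained by plugging in the confint() bounds):
> fit = lm(y ~ x, data = a)
> coef(fit) # intercept and slope (the notes report 0.1086 and 2.8948)
> predict(fit, newdata = data.frame(x = 1), interval = "confidence")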