BIOB20 Notes

TUTORIAL #2

Making Random Samples Another Way:


● If we have the grades of an exam, we can simulate a sample like:
○ > sample (x=100, replace = TRUE)
● How do we determine what it takes to be in the top 10%?
●​ Method #1
○ Using > sort (x), sort from lowest to highest and then look at position 91 for the cutoff (93) such
that there are 10 grades equal or above 93 and 90 grades equal or below 93.
●​ Method #2
○ Using > quantile (x, 0.9); > 90%; > 93
○ The 0.9 quantile is the cutoff such that 10% of the values are higher and 90% are lower than 93.
●​ Using hist(x), we can make a histogram of the distribution.
○ And add a line to the graph using > abline(v=93, col="red")
○ If we want to mark the cutoff for the top 10%, we add the line at position 93; having a grade to the
right of this line puts you in the top 10%.
● > rbinom (n=100, size=196, prob=1/2)
○ A shorter, more efficient way to sample
○ Basically simulates 100 trials of the # of vaccinated people in a group of 196.
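
A consolidated sketch of the two methods above (assuming x holds the 100 simulated grades, as in the first bullet):
> x = sample(x=100, replace=TRUE)          # 100 grades drawn with replacement from 1:100
> sort(x)[91]                              # Method 1: the value at position 91 of the sorted grades
> quantile(x, 0.9)                         # Method 2: the 0.9 quantile
> hist(x)
> abline(v=quantile(x, 0.9), col="red")    # mark the top-10% cutoff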

Example 1:
A die is rolled 57 times and 15 sixes are obtained; is the die fair?
Null hypothesis is that the die is fair
Alternative hypothesis is that the die is not fair

So first we start off with generating a distribution of sample deviations


● sample = rbinom (n=100000, size=57, prob=1/6)
○ This simulates 100,000 trials of rolling a fair die 57 times, and returns the # of sixes observed in
each trial.

Test Statistic:
● When rolling a fair die, the prob of rolling a 6 is 1/6
●​ Since the die is rolled 57 times and under the assumption the die is fair (null hypothesis), the expected #
of sixes after 57 rolls is calculated using the formula for the expected value of a binomial distribution.
● The expected value E(X) of a binomial distribution is: E(X) = n × p
○ n is the number of trials
○ p is the probability of success on a single trial.
● E(X) = 57 × 1/6 = 9.5
○ If the die is fair, we expect to see about 9.5 sixes after 57 rolls.

Measuring Deviation from the Expected Value


●​ The observed # of sixes from the actual experiment is 15, which might be higher or lower than what we’d
expect if the die were fair.
● To determine if this observed value is unusually high or low, we compare it to the expected value (9.5)
●​ This is done by calculating the difference:
○​ Deviation = Observed Value - Expected Value
○​ 15 - 9.5 = 5.5
●​ We also use the absolute difference.
●​ statistic = abs(sample-9.5)

Decision Rule:
● Then, a histogram is made of the simulated test statistics.
● Then calculate the 95th percentile (the 0.95 quantile) of the distribution.
○ Tells us what test statistic value corresponds to the most extreme 5% of outcomes under the
assumption that the die is fair.
○​ In this case, the critical value is 5.5.
●​ If the observed test statistic is greater than the critical value, you reject the null hypothesis.
○ Suggests that the observed outcome is unusual enough (in the extreme 5% of cases) that it's
unlikely to have occurred by random chance if the die were fair.
●​ If the observed test statistic is less than or equal the critical value, you accept the null hypothesis.
○​ Suggests that the outcome is not unusual.
●​ Conclusion: both the observed test statistic and the critical value are 5.5 so we accept the null hypothesis
that the die is fair.
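
Pulling the steps above together, a minimal sketch of the whole test (variable names follow the notes):
> sample = rbinom(n=100000, size=57, prob=1/6)   # simulated counts of sixes under H0
> statistic = abs(sample - 9.5)                  # simulated deviations from the expected 9.5
> hist(statistic)
> quantile(statistic, 0.95)                      # critical value (5.5 in the notes)
> abs(15 - 9.5)                                  # observed deviation: 5.5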

Example #2:
A driving school has 109 students; 98 passed the exam, and the local government requires an 80% success
rate. Will this driving school have better than 80% success in the future?
Null Hypothesis: The school's success rate is 80%, meaning there is no evidence the school has a better success
rate than required.
Alternative Hypothesis (H₁): The school's success rate is better than 80%.

> sample = rbinom (n=100000, size=109, prob =0.80)


●​ This simulates 100,000 trials of 109 students with an 80% success rate.

Test Statistic:
●​ Under the assumption that the success rate is 80%, the E(X) = 109 x 0.80 = 87.2
●​ Deviation = Observed Value - Expected Value
○​ Deviation = 98-87.2 = 10.8

> statistic = abs(sample-10.8)


> hist (statistic)
> quantile (statistic, 0.95)
> 95%
> 83.2
> abline (v=83.2, col="red")

● The observed test statistic, abs(98 - 10.8) = 87.2, is greater than the critical value (83.2), so we reject the null hypothesis. This
suggests that the school has a significantly better success rate than 80%.
● If the observed test statistic had been less than or equal to the critical value, we would fail to reject the null hypothesis;
we could not conclude that the school has a better success rate than 80%.
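
An alternative check (a sketch, not in the tutorial): compute the one-sided p-value, i.e. the probability under the null of seeing 98 or more passes out of 109, and reject if it is below 5%.
> mean(sample >= 98)                     # simulation-based one-sided p-value, using sample from above
> 1 - pbinom(97, size=109, prob=0.80)    # the same probability computed exactly, without simulation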

How Magic is It?


● If we have a magic coin (not fair) & we throw it 100 times and obtain 22 heads and 78 tails, how magic is it?
●​ For a normal coin you could say > rbinom(n=10, size=100, prob =0.5)
○​ > 51 50 47 46 63 42 52 44 47 54
●​ How likely are we to obtain 22?
○​ > rbinom (n=10, size=100, prob=0.5) == 22
○​ > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
○ The == 22 comparison transforms the simulated counts into TRUE/FALSE answers
○ All FALSE means that none of the simulated counts equalled 22 with a normal coin
● But if we change the probability: > sum(rbinom(n=1000000, size=100, prob=0.4) == 22)
○ > 63
○ > sum(rbinom(n=1000000, size=100, prob=0.3) == 22)
○ > 19020
○ > sum(rbinom(n=1000000, size=100, prob=0.2) == 22)
○ > 84632
○ > sum(rbinom(n=1000000, size=100, prob=0.1) == 22)
○ > 212
○ Thus, among the hypotheses tested, the likelihood is maximized at prob = 0.2 (confirmed exactly below).
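
The same comparison can be done exactly with dbinom() instead of simulation (a sketch):
> p = c(0.4, 0.3, 0.2, 0.1)
> dbinom(x=22, size=100, prob=p)                  # exact probabilities of 22 heads under each guess
> p[which.max(dbinom(x=22, size=100, prob=p))]    # 0.2, the best of the candidates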

VIDEO #3 + Tutorial 3
The dbinom Function:
● dbinom calculates the prob of getting exactly x successes in size independent trials, where each trial has
a success probability of prob.
● > dbinom (x, size, prob)
○ x - the # of successes (eg. the number of mutations we are interested in)
○ size - the total # of trials (eg. time intervals in our mutation example)
○ prob - the prob of success (eg. a mutation) in each trial.

Example 1 - Using Mutation Rates to Model the Binomial and Poisson Distributions:
●​ Goal: to determine the prob of no mutations over a period where, on avg, one mutation occurs.
●​ Approach: use the binomial distribution to model mutation events, dividing the time interval into smaller
sub-intervals. Then, take the limit as the intervals become infinitely small, which leads to the Poisson
distribution.
Step 1 is the binomial distribution setup:
●​ The binomial distribution models the # of successes (mutations) in a fixed # of independent trials (time
intervals), which a fixed prob of success per trial.
○ > dbinom (size=2, prob=1/2, x=0)
■ size = 2: splits the time into 2 intervals
■ prob = 1/2 is the chance of mutation in each interval
■ x = 0 represents the event of no mutations
○ Output: 0.25 is the prob of no mutations when the time is divided into 2 intervals
○ If we split into more subdivisions, the probability stabilizes:
■ > dbinom (size=4, prob=1/4, x=0)
■ > dbinom (size=256, prob=1/256, x=0)
■ > dbinom (size=100000000, prob=1/100000000, x=0)
■ > the prob converges to approx 0.3678794
● As the # of divisions increases (size becomes larger and prob becomes smaller), the prob of no mutations converges to e^(-1) ≈ 0.3679.
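
A quick numerical check of this convergence (a sketch), comparing the binomial probabilities with exp(-1):
> sizes = c(2, 4, 256, 1e8)
> dbinom(x=0, size=sizes, prob=1/sizes)    # 0.25, then values approaching exp(-1)
> exp(-1)                                  # 0.3678794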

Step 2 involves the Poisson Distribution


● The binomial model approaches the Poisson distribution when the # of subdivisions becomes very large,
and the intervals are infinitesimally small.
● The Poisson distribution models the prob of a given number of events (like mutations) occurring in a
fixed time period or space, when these events happen at a constant avg rate and independently of each
other.
○ lambda = the avg # of events (mutations) expected to occur during the time period.
○​ > dpois (lambda = 1, x = 0)
■​ lambda = 1 means an avg of 1 mutation and x=0 represents the event of no mutations
■​ The result is 0.3678794, matching the binomial limit.
● Calculating the Prob of Exactly 1 Mutation:
○ > dpois(lambda=1, x=1)
○ > 0.3678794
■ This is the prob of observing exactly 1 mutation when the avg rate is 1.

Simulating Poisson Random Variables:


● rpois generates random numbers/samples from a Poisson distribution with a given avg rate (lambda).
● Each value generated by this function represents the # of events (mutations) that occur in a fixed time.
●​ > rpois (n, lambda)
○​ n = # of random values to generate
○​ lambda = avg number of events expected to occur
●​ Eg. > rpois (n=5, lambda=1)
○​ > 0 2 1 2 3
○​ This simulates 5 random counts of mutations, given the avg rate (lambda) of 1.
○ In this case, one period had 0 mutations, one period had 1 mutation, two periods had 2 mutations, and one period had 3 mutations.
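
A quick consistency check (a sketch): the empirical frequencies from a large rpois() sample should match the dpois() probabilities.
> sim = rpois(n=100000, lambda=1)
> table(sim) / length(sim)             # observed proportions of 0, 1, 2, ... mutations
> dpois(x=0:5, lambda=1)               # theoretical probabilities for 0 to 5 mutations
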
Visualizing the Poisson Distribution:
● > plot (x=seq(from=0, to=10), y=dpois(lambda=1, x=seq(from=0, to=10)), xlab="X", ylab="Pr",
panel.first=grid(), type = "h", lwd=5)
● plot () function is used to graph data
● > x = seq(from=0, to=10)
○ x = specifies the x-axis values for the plot
○ seq(from=0, to=10) creates a sequence of numbers from 0 to 10, which will be the x-values on the
graph. Represents the # of events (or mutations).
● > y = dpois(lambda=1, x=seq(from=0, to=10))
○​ y = specifies the y-axis values for the plot. These are the probabilities of observing x events
(mutations).
○ dpois(lambda=1, x=seq(from=0, to=10)): this uses the Poisson probability function to calculate the prob of
observing each value of x (from 0 to 10), given that the avg # of events is lambda = 1
● > xlab="X"
○ xlab = specifies the label for the x-axis.
○ "X" is the label
● > ylab = "Pr"
○ ylab = specifies the label for the y axis
○ "Pr" is the label for probability
● > panel.first=grid()
○ Adds a grid to the background of the plot
● > type = "h"
○ type: specifies the type of plot
○ "h": creates a vertical line for each data point to produce a histogram-like plot
●​ >lwd=5
○​ Line width
●​ > plot(x=ew$x[1:8], y=ew$y[1:8], xlab="x (no units)",
ylab = "y (no units)", pch=19, col="red", cex =2, xlim =
c(-3,3), ylim = c(-3,3), panel.first=grid())
● > pch = 19
○ pch selects the point symbol; different numbers give you different shapes for your points (19 is a
solid circle).
● > cex = 1
○ This argument specifies the sizing of the points; can make them bigger or smaller.
● > xlim = c() and ylim = c()
○ These arguments allow you to increase or decrease the limit/range of the x/y axis.
●​ To add another set of points to an existing plot, we use the function >points
○​ > points(x=ew$x[9:20],y=ew$y[9:20], cex =2,pch=19, col="blue")
Reading Data into R:
●​ Function read.delim() is used to read a tab-delimited text file into R.
○​ Needs to be put into a variable
○ > prot <- read.delim ("file_path_or_url")
●​ The function dim(), returns the dimensions of a data frame, which means the # of rows & columns
○​ > dim (prot)
○​ The output is in the form of (number of rows, number of columns)
○​ This is important to check if your data is correct
○​ > 19 3 (means the data frame has 19 rows and 3 columns)
●​ The function head(), returns the data from the first 6 rows; helpful for checking when the data set is
extremely long.
● is.vector(prot) checks if the object prot is a vector
● is.data.frame(prot) checks if the object prot is a data frame
○​ Returns TRUE or FALSE
●​ A data frame in R is a collection of vectors of equal length, where each column in the data frame is a
vector of the same size. Vectors are simpler, containing only one type of data (numeric, character, etc.).
●​ To access the columns of a data frame, we use the $ sign.
○​ prot$protein
○​ This prints all the values in the protein column of the prot data frame.
● To access specific elements in a data frame, use indexing.
○ Single elements:
■ prot$time[1] - returns the first element in the time column
○ Multiple elements:
■ prot$time[c(3,10,22,11)] - returns elements at positions 3, 10, 22 & 11 from the time column
○ Specific rows and columns
■ prot[1,2]: accesses the element in the first row and second column
●​ Subsetting data
○​ You can extract specific parts (subsets) of the data frame using square brackets [ ].
○​ The first index refers to rows, and the second index refers to columns.
○​ prot[1:3, 2:3] - Returns the values in rows 1-3 and columns 2-3
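
A small self-contained illustration of these operations, using a made-up data frame rather than the prot file:
> df = data.frame(protein = c("A","B","C"), time = c(1.2, 3.4, 5.6), level = c(10, 20, 30))
> dim(df)              # 3 3: 3 rows and 3 columns
> head(df)             # prints the first rows
> is.data.frame(df)    # TRUE
> df$time              # the time column as a vector
> df$time[c(1, 3)]     # first and third elements of that column
> df[1:2, 2:3]         # rows 1-2, columns 2-3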

Maximum Likelihood Estimation:


●​ Maximum Likelihood Estimation (MLE) is a way to find the best guess for something by choosing the
value that makes the observed data most likely to happen.
○​ MLE is like finding the number that "fits" your data the best.

Example 1 - Using Coins to Model MLE:


You have a coin, but don't know if it's a normal coin (50/50). You flip the coin 10 times and get 7 heads and 3
tails. Now you want to guess how likely the coin is to land on heads each time you flip it.
●​ MLE helps to figure out the BEST GUESS for the chance of getting heads based on the 10 flips.

Now we try different guesses for p (prob of getting heads)


●​ Guess 1: p=0.5 (50% heads, like a normal coin)
○​ If the coin had a 50% chance of landing on heads, you’d expect about 5 heads out of 10 flips.
○​ But you got 7 heads—this makes 50% seem not quite right.
●​ Guess 2: p=0.7(70% chance of heads)
○​ If the coin had a 70% chance of landing on heads, you’d expect more heads, like 7 heads out of 10
flips.
○​ Since you actually got 7 heads, this guess matches the result better.
●​ Guess 3: p=0.8 (80% chance of heads)
○​ If the coin had an 80% chance of landing on heads, you’d expect 8 heads.
○​ But you only got 7, so this guess seems a bit too high.

Thus, the guess that makes the results (7 heads and 3 tails) most likely is p = 0.7. This is your maximum
likelihood estimate because it best explains the results you got.
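
The same reasoning in code (a sketch): evaluate the likelihood of 7 heads in 10 flips over a grid of candidate values of p and pick the maximum.
> p = seq(from=0, to=1, by=0.01)
> lik = dbinom(x=7, size=10, prob=p)
> p[which.max(lik)]                    # 0.7, the maximum likelihood estimate
> plot(x=p, y=lik, type="l", xlab="p", ylab="likelihood")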

Example 2 - Determining Secret Lambda


MLE is a method used to estimate the parameters of a statistical model. In this case, we are trying to estimate
the parameter lambda of a Poisson distribution. The idea is to find the value of lambda that makes the observed
data most likely.

First, we set up a random value for lambda


●​ > secret_lambda = runif (min=2, max=15, n=1)
○​ > runif generates a random value for 𝛌 between 2 and 15.
○​ This will be the true value of 𝛌 but we don’t know it.

Then, generate Poisson-distributed observations:


●​ > obs = rpois(lambda=secret_lambda, n=12)
●​ > obs
●​ The function rpois() generates 12 random numbers (observations) from a Poisson distribution using the
hidden secret_lambda value.
●​ These values will be used to estimate 𝛌

Initial guess for 𝛌


●​ > lambda_0 = 2
● We start by guessing that 𝛌 = 2 (because it's the minimum of the range) as a possible value. This guess can be adjusted based on
how well it fits the data.

Calculate the Likelihood


●​ > dpois(lambda = lambda_0, x = obs)
●​ This calculates the prob of observing each value in obs using the guess that lambda = 2.
● This outputs a list of probabilities, one for each observed value.
● Looking at all the individual probabilities makes it difficult to compare guesses, so we calculate the
OVERALL likelihood by multiplying the probabilities.
○​ > prod(dpois(lambda=lambda_0, x=obs))
●​ These probabilities are what MLE is all about: trying to find the λ that maximizes these probabilities.
●​ The result is a very small number, indicating that it’s unlikely λ=2 produced these observations.

Log-Likelihood
● > sum(dpois(lambda=lambda_0, x=obs, log=TRUE)) or > log(prod(dpois(lambda=lambda_0, x=obs)))
● Essentially this works with the logarithm of each prob instead of the actual prob: summing the logs gives the log of the overall likelihood.
●​ This is because multiplying many small probabilities can result in very small numbers (eg. 1e-44) which
are hard to work with.

Testing different values of λ


●​ > log(prod(dpois(lambda=lambda_0+1, x=obs)))
●​ > log(prod(dpois(lambda=lambda_0+2, x=obs)))
● Then we calculate the log-likelihood for different guesses of λ (e.g., λ=3,4,5,…) and see how the
log-likelihood changes.
●​ The goal is to find the value of λ that gives the highest log-likelihood. As you increase λ you observe that
the log-likelihood becomes less negative (better) until you reach the most likely value.

Demonstration:

> secret_lambda = runif(min=2, max=15, n=1)
> secret_lambda
[1] 11.32385
> obs = rpois(lambda=secret_lambda, n=12)
> obs
[1] 12 15 15 9 6 8 12 11 14 8 20 11
> lambda0 = 2
> log(prod(dpois(lambda=lambda0, x=obs)))
[1] -165.161
> log(prod(dpois(lambda=lambda0+1, x=obs)))
[1] -119.9904
> log(prod(dpois(lambda=lambda0+2, x=obs)))
[1] -91.4272
> log(prod(dpois(lambda=lambda0+3, x=obs)))
[1] -71.96396
> log(prod(dpois(lambda=lambda0+4, x=obs)))
[1] -58.25662
> log(prod(dpois(lambda=lambda0+5, x=obs)))
[1] -48.52138
> log(prod(dpois(lambda=lambda0+6, x=obs)))
[1] -41.69345
> log(prod(dpois(lambda=lambda0+7, x=obs)))
[1] -37.08604
> log(prod(dpois(lambda=lambda0+8, x=obs)))
[1] -34.23021
> log(prod(dpois(lambda=lambda0+9, x=obs)))
[1] -32.79148
> log(prod(dpois(lambda=lambda0+10, x=obs)))
[1] -32.52287
> log(prod(dpois(lambda=lambda0+11, x=obs)))
[1] -33.23685
> log(prod(dpois(lambda=lambda0+12, x=obs)))
[1] -34.78763
> log(prod(dpois(lambda=lambda0+13, x=obs)))
[1] -37.05963

The log-likelihood is highest at lambda0 + 10 = 12, which is close to the secret lambda (about 11.3) printed above.

Example 3 - Determining Secret Lambda More Precisely


What if the best lambda lies between two whole numbers, like 9.4 or 7.8? We can adjust the code to test
lambda values in smaller steps, which narrows down the estimate more accurately and finds the lambda
that best fits the data.

Create a Sequence of Possible λ To Test


●​ > lambdaval = seq(from=2, to=15, by=0.1)
○​ 2.0,2.1,2.2,2.3,…,14.9,15
●​ >lambdagrid = replicate(lambdaval, n=12)
○​ This creates a grid/matrix where each row is one of the possible λ values from lambdaval, and
each column corresponds to the 12 observations you made.

Log-Likelihood for All λ Values


●​ >dpois(lambda=lambdagrid, x=obs, log=TRUE)
●​ This again computes the log-prob for each value allowing us to handle very small numbers more easily
● > rowSums(dpois(lambda=lambdagrid, x=obs, log=TRUE))
○ rowSums() adds up the log-likelihood values for each row of lambdagrid.

Plot the Log-Likelihoods:


● > lambdagraph = rowSums(dpois(lambda=lambdagrid, x=obs, log=TRUE))
●​ > plot (x=lambdaval, y=lambdagraph)
●​ The higher the point on the graph, the more likely that λ value is to have produced the observed data.
● Example: If your plot shows that the curve peaks around λ=7.5, this means the value of λ that best
explains your data is around 7.5.
● This value should be close to the secret_lambda you generated earlier.
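
The best λ on the grid can also be read off numerically instead of eyeballing the plot (a sketch, using lambdaval and lambdagraph from above):
> lambdaval[which.max(lambdagraph)]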

Demonstration Continued……

> lambdaval=seq(from=2, to=15, by=0.1)


> lambdagrid = replicate(lambdaval, n=12)
> lambdagraph = rowSums(dpois(lambda=lambdagrid,
x=obs, log=TRUE))
> plot(x=lambdaval, y=lambdagraph)

Again, the curve peaks between 11 and 12 (at the sample mean of obs), close to the secret_lambda generated above.

VIDEO #4

Continuous Variables:
● Some random variables are continuous instead of discrete (unlike rbinom and rpois, which return whole numbers)
●​ > runif() - generates random numbers from a uniform distribution.
○​ generates a random sample of numbers between 0 and 1.
○​ Numbers are drawn from the uniform distribution, meaning that every value between 0 and 1 is
equally likely.
○​ Eg. runif(5, min=0, max=1)
●​ > runif (n=10)
○​ > 0.8390957 0.2427499 0.5665379 0.1225692 0.5382546 0.1827380 0.1452402 0.1456641
0.7780869 0.1627826
○​ Prints 10 random numbers from the uniform distribution
● Making a histogram of these shows all values occurring with similar frequency.
● We also don't get values below 0 or above 1.

Other Methods to Select Elements from a Vector:


●​ > x = runif(n=5)
●​ > x[1] - returns the first element
●​ > x[c(TRUE, FALSE, FALSE, TRUE, FALSE)] - returns the first and fourth element
●​ > x >= 0.5 - prints which values are greater than or equal to 0.5 via TRUE/FALSE
●​ > x[x>=0.5] - prints the values greater than or equal to 0.5

Libraries:
●​ Functions are available in libraries
● > install.packages ("circular")
●​ > library(circular)
●​ > x = runif (n=100, max = 2*pi)
●​ > plot(circular(x))

TUTORIAL #4 - THE UNIFORM DISTRIBUTION

Example 1:
In a national exam, a teacher has three students who take the exam and score rankings of 0.734, 0.859 and
0.971. Based on this, do we give the teacher a promotion? These scores are drawn from a uniform
distribution between 0 and 1, which means every score within this range is equally likely. We are using these
scores to decide if the teacher deserves a promotion based on how well these students perform relative to
what you would expect under the uniform distribution.

Null Hypothesis: The students' scores are average, meaning they could be from any teacher, and the teacher
doesn’t necessarily deserve a promotion.
● Generally, the null hypothesis is the hypothesis you don't like and are setting up in the hope of
rejecting it.
● Plus, there is only one way to be average.

Alternative Hypothesis: The students' scores are high, suggesting the teacher might be better than average and
might deserve a promotion.

Test Statistic
●​ Take the average of the three grades; could also use the max or the min
●​ Under the null hypothesis (uniform distribution), the expected avg score is 0.5
● > mean(c(0.734, 0.859, 0.971))
●​ > 0.8546667
●​ We will calculate the average score of these three students and compare it to the expected average
under the null hypothesis.

Decision Rule:
●​ > runif(n=3,min=0,max=1)
○​ Generates 3 random #s between 0 and 1 (to represent student scores under the H0)
●​ > mean(runif(n=3,min=0,max=1))
○ Calculates the avg of these random numbers
○​ Simulates the test statistic under the assumption the null hypothesis is true.
●​ > nulld = replicate(n=10000, mean(runif(n=3,min=0,max=1)))
○ The null distribution assuming the null hypothesis is true; shows us what kind of avg scores we
would expect from a typical teacher if their students' scores came from a uniform distribution.
○ replicate simulates the avg score for 10,000 sets of 3 students under the null hypothesis.
● > hist(nulld, xlab = "statistic")
●​ > quantile (nulld, 0.99)
○​ > 99%
○​ > 0.8748196
○​ calculates the 99th percentile of the null distribution, which means 99% of the scores from the
null distribution are less than this value.
○​ This value will serve as the cutoff for deciding whether to promote the teacher.
○​ If the observed average is above this 99% cutoff, we will reject the null hypothesis and suggest
that the teacher’s students performed unusually well.
● > abline (v=0.8748196, col = "red")
○​ This line represents the threshold; if a teacher’s average score exceeds this line, there is less than
a 1% chance that their score could be due to random variation.
● > mean(c(0.734, 0.859, 0.971))
○ > 0.8546667
● > abline (v=0.8546667, col = "blue")
●​ If the blue line (teacher’s average score) is to the
right of the red line (cutoff for the 99th percentile),
the teacher's students performed exceptionally
well, and we reject the null hypothesis, meaning
we would promote the teacher.
●​ If the blue line is to the left of the red line (as in
this case), the teacher’s students' performance is
not significantly better than expected under the
null hypothesis, so we would not promote the teacher.
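
The same decision can be expressed as a p-value (a sketch, using nulld from above):
> mean(nulld >= mean(c(0.734, 0.859, 0.971)))
This is the proportion of simulated class averages at least as high as the observed 0.8547; a value above 0.01 (the 1% level used above) means we do not promote the teacher (fail to reject H0).
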
Example 2:
Trying to arrange nurses' hospital shifts.

url <- "https://raw.githubusercontent.com/gui11aume/BIOB20/main/tutorial_04/a.txt"


a = scan(url)
● scan() is used when the data is a plain list of numbers; read.delim() is used when you have a table of data.
● A vector read with scan() has no dimensions (dim(a) returns NULL); it has a length
○ length(a)

Since there are 24 hours in a day, we can use the circular library.
> plot (circular(a, template = "clock24", units = "hours"), stack = TRUE)

Null Hypothesis (H0): The shift times are uniformly distributed across 24 hours.
Alternative Hypothesis (H1): The shift times are not uniformly distributed (there are clusters).

So we need to take the distance between consecutive points.

>runif(n=5)
●​ generates 5 random numbers from a uniform distribution.
>sort(runif(n=5))
●​ sorts the random numbers.
>diff(sort(runif(n=5)))
● Computes the differences between the consecutive sorted numbers
>sum(diff(sort(runif(n=5)))^2)
● Squaring the differences emphasizes large gaps, and summing them gives a measure of the
total "spread" (or clustering) of the points.
>nulld = replicate(n=10000, sum(diff(sort(runif(n=1000,min=0,max=24)))^2))
● You simulate 10,000 datasets of 1,000 uniformly distributed shift times over a 24-hour period.
>hist(nulld)
>quantile (nulld,0.99)
>47.92697
●​ The 99th percentile gives the value that 99% of the simulated values are below. This is the cutoff for
determining if the observed shift distribution is unusually clustered.
>a%%24
●​ a %% 24 ensures that all shift times are wrapped around a 24-hour clock.
>sum(diff(sort(a%%24))^2)
● sort() orders the shift times from earliest to latest
● diff() calculates the gaps between consecutive shift times
● squaring the gaps emphasizes the large ones,
● sum() gives a total measure of how spread out the shift times are within the 24 hour period.
● * this value tells us how "spread out" or "clustered" the shift times are for the actual data compared to
the uniform distribution.
>47.02074
>abline(v=47.02074)
>mean(nulld >= 47.02074)
●​ Creates a logical vector where each entry is either T/F depending on whether the simulated value in nulld
is greater than or equal to the observed value (47.02074).
●​ The mean calculates the proportion of TRUE values in the vector
● This proportion represents the p-value, which is the probability of obtaining a sum of differences as
large or larger than the observed value (47.02074) under the null hypothesis.
> 0.60719
●​ A p-value of 0.60719, means that 60.7% of the time, the randomly generated uniform shift distributions
will have a sum of differences greater than or equal to 47.02074.

Decision:
●​ Since 0.60719 is much larger than the typical significance threshold of 1% (0.01) or 5% (0.05), we fail to
reject the null hypothesis.
●​ This means that the observed shift times do not show significant clustering, and we conclude that the
shift times are uniformly distributed.
●​ Accept the null hypothesis.

VIDEO #5 - NORMAL DISTRIBUTION


●​ rnorm (n, mean = 0, sd = 1), generates n random numbers from a normal distribution with specified
mean and SD (default mean =0, SD = 1).
●​ the normal distribution is continous; the numbers are not integers
●​ > mean = 10
●​ > sd = 50
●​ > rnorm (n=5, mean=mean, sd=sd)
○​ > -18.023782 -1.508874 87.935416 13.525420 16.464387
● > rnorm(n=5) * sd + mean
○ > -18.023782 -1.508874 87.935416 13.525420 16.464387
○ We obtain the same #'s (when the random seed is reset to the same value before each call); this shows that the sd parameter just multiplies a
standard normal variable and the mean just adds a constant.

Standard Deviation:
●​ SD represents the amount of variability between the numbers of the sample.
●​ > x = rnorm(n=5)
○​ > sd(x)
○​ [1] 1.16349
●​ If the numbers are close to one another, the SD is close to 0; if the numbers are far away, the SD is high
●​ SD is linked to variance
○​ > sqrt(var(x))
○​ [1] 1.16349
● If you multiply your numbers by 100, the SD is also multiplied by 100, meaning the SD scales linearly with the sample
○​ > x * 100
○​ > sd(x*100)
○​ [1] 116.349
● But if you add 100 to your numbers, nothing happens: adding a constant doesn't change the SD.
○​ > x + 100
○​ > sd(x+100)
○​ [1] 1.16349

Functions:
● You can define your own functions in R using function() and curly brackets.
● Say we want span to mean max - min; this function doesn't exist in R, so:
○​ > span = function(x){max(x)-min(x)}
○​ > span(x)
○​ [1] 1.779923
●​ After making a function, you can use it again
○​ > span(rnorm(n=5))
○​ [1] 0.8498488

Reading Data in R Continued:


●​ > x = rnorm(n=5)
○​ > x
○​ [1] -1.6866933 0.8377870 0.1533731 -1.1381369 1.2538149
●​ You can use square brackets to call out specific elements but you can also use negatives.
●​ > x[-1]
○​ Calls out every number but the first element
○​ > 0.8377870 0.1533731 -1.1381369 1.2538149
●​ > x[-c(1,2)]
○​ Calls out the whole sample but the first two
○​ > 0.1533731 -1.1381369 1.2538149

Recycling Rule
●​ In R, the recycling rule automatically repeats (recycles) shorter vectors to match the length of longer ones
during element-wise operations.
●​ If the length of the longer vector is not a multiple of the shorter one, a warning is issued
●​ > x = rpois(n=2, lambda = 1)
●​ > y = rpois(n=4, lambda = 1)
●​ > x+y
●​ [1] 3 1 3 2
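
When the longer length is not a multiple of the shorter one, the operation still runs but R emits a warning (a sketch):
> x = rpois(n=3, lambda=1)
> y = rpois(n=5, lambda=1)
> x + y       # runs, but warns: longer object length is not a multiple of shorter object length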

TUT #5 - NORMAL DISTRIBUTION

Understanding the Normal Distribution


●​ The normal distribution is a continuous probability distribution that is symmetric about the mean, with a
bell-shaped curve. Its key characteristics:
○​ Mean (μ): The center of the distribution, where the data clusters.
○​ Standard Deviation (σ): Measures the spread of the data around the mean. A small σ means the
data is close to the mean, and a large σ means the data is more spread out.
●​ rnorm(n, mean, sd)

Example #1 - Hypothesis Testing with Normal Distribution


Let's say you are trying to guess a secret mean from a normally distributed dataset, and the only two
numbers we are given are 3.52 and 1.94.
Null Hypothesis: the null hypothesis assumes the secret mean is equal to a specific value (1.2345)
●​ H₀: The secret mean is 1.2345.
Alternative hypothesis: The alternative hypothesis states the secret mean is not equal to 1.2345.
●​ H₁: The secret mean is not 1.2345.

Test Statistic:
● Remember, the test statistic is a quantity we can compute from the data; a good candidate here is
the z-score.
●​ > obs = c(3.52, 1.94)
●​ z = (mean(obs) - hypothesized_mean) / sd(obs)
○​ Z-score (also called standard score) indicates how many standard deviations away your sample
mean is from the hypothesized mean (1.2345).
○​ Helps standardize different data points so they can be compared on a common scale, even if they
come from different distributions.
●​ What does the Z-Score mean?
○​ z = 0: The value is exactly at the mean.
○​ z > 0: The value is above the mean (right of the mean).
○​ z < 0: The value is below the mean (left of the mean).
●​ In the context of hypothesis testing:
○​ A small z-score (close to 0) means that your sample mean is close to the hypothesized mean.
○​ A large z-score (positive or negative) indicates that your sample mean is far away from the
hypothesized mean, suggesting the null hypothesis might not hold.
●​ > statistic = function(x){(mean(x)-1.2345)/sd(x)}

Then we simulate the null distribution:


●​ The null distribution is made by generating random values assuming the null hypothesis is true.
●​ > nulld = replicate(n=1000000, statistic (rnorm(n=2, mean = 1.2345, sd=1)))
●​ > hist (nulld, breaks=50)

Decision Rule:
●​ > quantile (nulld, 0.005)
○​ 0.5%
○​ -45.64871
●​ > quantile (nulld, 0.995)
○​ 99.5%
○​ 45.1567
●​ > statistic (obs)
○​ 1.313009
●​ If the test statistic lies outside these quantiles, we would reject the null hypothesis.
● 1.31 falls between -45.6 and 45.2, so we accept (fail to reject) the null hypothesis.
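
Equivalently, a two-sided p-value can be computed from the same null distribution (a sketch, using nulld, obs and statistic from above):
> mean(abs(nulld) >= abs(statistic(obs)))
A value above 0.01 leads to the same conclusion as the quantile comparison.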

Properties of Standard Deviation:


● (1) SD doesn't depend on the mean
○ When you add or subtract a constant from all the data points, the SD doesn't change.
○ This is because SD measures the SPREAD of the data around the mean, not the actual values
themselves.
○ Shifting the entire dataset up or down doesn't affect how spread out the data is, only the
location of its center.
○​ Example
■​ obs = c(3.52, 1.94)
■​ sd(obs) # Gives 1.131371
■​ sd(obs + 1e7) # Gives the same value 1.131371
■​ Even though you're adding a large constant (1e7) to each data point, the standard
deviation stays the same because the spread (the difference between values) remains
unchanged.
●​ (2) SD changes with multiplication
○​ When you multiply all the data points by a constant, the SD is also multiplied by that constant.
○​ This is because multiplying spreads the data out by that factor, increasing the range.
○​ Example
■​ sd(obs * 2) # 2.262742
■​ Multiplying all data points in obs by 2 doubles the SD. This makes sense because the data
becomes twice as spread out.
●​ (3) Standardizing the Mean:
○ To compare data from different distributions or different scales, we can standardize the data via the z-score.
○ In hypothesis testing, we are interested in knowing how far the observed mean is from the
hypothesized mean. By dividing by the SD, we get a standardized metric that allows us to:
■ Ignore the scale of the data (in case the data is very large or small)
■ Focus on relative deviations from the mean
TUT #6 - GAMMA DISTRIBUTION

Gamma Distribution:
●​ The gamma distribution is a continuous probability distribution commonly used to model waiting times
until a certain number of events occur.
●​ The distribution only takes positive values, making it best for modeling time-to-event data.
●​ It is especially useful when dealing with time-to-event data, such as the time it takes for a bus to arrive.
●​ rgamma (n, shape, scale)
○​ Shape (alpha) = determines the # of events you are waiting for
○​ Scale (beta) = the avg time for one event to occur
●​ Shape Parameter: defines how many events need to occur before stopping the timer
○​ When shape = 1, the gamma distribution reduces to an Exponential distribution, modeling the
waiting time for a single event.
■​ > hist(rgamma(n=10000, shape=1, scale=1), breaks = 80)
■​ The distribution has a steep drop-off to the right, indicating that shorter waiting times are
more likely.
○​ As the shape parameter increases (e.g., shape = 2), the histogram starts to shift to the right.
■​ > hist(rgamma(n=10000, shape=2, scale=1), breaks = 80)
■​ The distribution's peak moves to the right, making longer waiting times more probable.
■​ The shape becomes smoother with less steep decline.
■​ Looks more and more like a normal distribution.
●​ Scale Parameter:
○​ affects the stretching of the distribution along the x-axis
○​ With larger scale values, the distribution spreads out, and the average waiting time increases.
○​ > hist(rgamma(n=10000, shape=1, scale=100), breaks = 80)
■​ The histogram is stretched, indicating that events take longer on average.
■​ The overall shape of the distribution remains the same; only the x-axis scale changes.
● Mean and Variance:
○ Mean: α × β
○ Variance: α × β²

Example 1 - MLE for Parameter Estimation:


You measure the time (in minutes) it takes for the bus to come: 7.0, 15.7 and 3.1 minutes.
●​ MLE is used to find parameter estimates (shape and scale) that maximize the probability of observing
given data.
●​ For example, with waiting times obs = c(7.0, 15.7, 3.1), the average time (mean) can be a good starting
estimate for the scale.

> guess = mean(obs) #initial guess for the scale parameter


Calculating Likelihood for Different Parameters
●​ dgamma () function computes the probability density function (PDF) of the Gamma distribution for
given values.
●​ >dgamma(x, shape, scale = 1)
●​ Returns the height of the density function at the given x values. For continuous distributions like the
Gamma, this represents the relative likelihood of the random variable being near the specified value.

> sum(dgamma(x=obs, shape =1, scale=guess, log=TRUE))


●​ Changing the scale slightly (e.g., guess + 0.1 or guess - 0.1) helps determine if the initial guess was
optimal.

For the Gamma distribution with a known shape parameter, the Maximum Likelihood Estimate (MLE) of the scale is the sample mean
divided by the shape (so for shape = 1 it is simply the sample mean). This is because the mean of the Gamma
distribution is scale × shape, and solving for scale using the sample mean gives the MLE.
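
A sketch of the grid search this implies: evaluate the log-likelihood of the bus waiting times over a range of scale values with the shape fixed at 1, and pick the maximum.
> obs = c(7.0, 15.7, 3.1)
> scalegrid = seq(from=1, to=20, by=0.1)
> loglik = sapply(scalegrid, function(s) sum(dgamma(x=obs, shape=1, scale=s, log=TRUE)))
> scalegrid[which.max(loglik)]         # lands at (or right next to) mean(obs)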

Example 2 - MLE for Parameter Estimation:


There are two power generators. How often do the power generators break? Failures are observed after 45.5 days, 129.7
days, 18.0 days and 40.3 days.

> obs = c(45.5, 129.7, 18.0, 40.3)


> guess = mean(obs)/2 #bc we have two generators
> sum(dgamma(x=obs, shape=2, scale=guess, log = TRUE)) #shape 2 bc two generators
[1] -19.72026
> sum(dgamma(x=obs, shape=2, scale=guess+0.1, log = TRUE))
[1] -19.72031
> sum(dgamma(x=obs, shape=2, scale=guess-0.1, log = TRUE))
[1] -19.72031

GGPLOT:
> url = "https://raw.githubusercontent.com/gui11aume/BIOB20/main/tutorial_06/a.txt"
> a = read.delim(url)
> head(a)
x y
1 -1.31409690 -1.3584203
2 -0.47806638 -0.2116580
●​ ggplot() uses a layering approach: start with ggplot(),
add layers like geom_point(), customize with theme()
and labs().
○​ > ggplot(a, aes(x=x, y=y)) + geom_point() #specifies scatterplot
●​ theme_classic() is a built-in ggplot2 theme that gives the plot a classic, clean look with no grid lines and
a simple white background.
○​ > ggplot(a, aes(x=x, y=y)) + geom_point() + theme_classic()
●​ labs() is used to set custom labels for the x-axis and y-axis.
○​ > ggplot(a, aes(x=x, y=y)) + geom_point() + theme_classic() + labs(x="variable x (au)",
y="variable y (au)")
●​ panel.grid.major and panel.grid.minor control the appearance of the major and minor grid lines,
respectively.
●​ element_line() specifies the line's properties, such as size (thickness), linetype (e.g., "solid", "dashed"),
and colour.
○​ > ggplot(a, aes(x=x, y=y)) + geom_point() + theme_classic() + labs(x="variable x (au)",
y="variable y (au)") + theme(panel.grid.major=element_line(size=.3, linetype="solid",
colour="grey"), panel.grid.minor = element_line(size=.3, linetype="solid", colour="grey"))
●​ geom_abline() adds a straight line to the plot. The intercept and slope arguments specify where the line
crosses the y-axis (intercept = 0) and its slope (slope = 1).
○​ > ggplot(a, aes(x=x, y=y)) + geom_point() + theme_classic() + labs(x="variable x (au)",
y="variable y (au)") + theme(panel.grid.major=element_line(size=.3, linetype="solid",
colour="grey"), panel.grid.minor = element_line(size=.3, linetype="solid", colour="pink")) +
geom_abline(intercept=0, slope = 1)

GEOM TYPES:
●​ Histogram: to visualize the distribution of a single continuous variable
○​ > ggplot(data, aes(x = variable)) + geom_histogram(bins = 30, fill = "blue", color = "black", alpha
= 0.7) + labs(title = "Histogram", x = "Variable", y = "Count")
●​ Scatter Plot: to visualize the relationship between two continuous variables
○​ > ggplot(data, aes(x = variable1, y = variable2)) + geom_point(color = "red", size = 3) + labs(title
= "Scatter Plot", x = "Variable 1", y = "Variable 2")
●​ Line Plot: used for time series data to show trends over time
○​ > ggplot(data, aes(x = time, y = value)) + geom_line(color = "blue", size = 1) + labs(title = "Line
Plot", x = "Time", y = "Value")
●​ Bar Plot: displaying the counts/frequency of categorical variables
○​ > ggplot(data, aes(x = category)) + geom_bar(fill = "green", color = "black") + labs(title = "Bar
Plot", x = "Category", y = "Count")
●​ Density Plot: shows a smoothed version of a histogram
○​ > ggplot(data, aes(x = variable)) + geom_density(fill = "purple", alpha = 0.5) + labs(title =
"Density Plot", x = "Variable", y = "Density")

Making My Own Data:


●​ > x = rgamma(n=10000, shape=2)
○​ > dat = data.frame(variable = x)
○​ > head(dat)
○ Now I make a ggplot: ggplot(dat, aes(x=variable)) + geom_histogram(bins=35)
●​ > x = rnorm(100)
○​ > y = x + rnorm(100)
○​ > dat = data.frame(variable1=x, variable2=y)
○​ > head(dat)

url <- "https://raw.githubusercontent.com/gui11aume/BIOB20/main/tutorial_06/bil.txt"


> bil <- scan(url)
Read 23 items
> length(bil)
[1] 23

> df = data.frame(Index = 1:23, Value = bil)


> ggplot (df, aes(x=Index, y=Value)) + geom_point(color = "blue", size=3)
> ggplot (df, aes(x=Index, y=Value)) + geom_point(color = "blue", size=3) + labs(title = "Scatter Plot of Sample
Data", x="Index", y="Value of bil") + theme_minimal()

● df = data.frame(Index = 1:23, Value = bil) creates a data frame with two columns:
○ Index: the sequence of numbers from 1 to 23, representing the position of each value in the
original bil vector.
○ Value: the actual data values from the bil vector.
● bil is the vector with the sample data values.
● 1:23 creates a sequence of integers from 1 to 23: 1, 2, 3, ..., 23.
○ This vector represents the Index, or position of each element in the bil vector; it gives a
number to each value in bil according to its position in the list.
● Value = bil assigns the values from the bil vector to a new column called Value in the data frame.

For example, when using ggplot2 for plotting, it is necessary to have data in a data frame format where each
column represents a variable. In this case:

●​ The Index column can be used for the x-axis (e.g., to plot the order of data).
●​ The Value column can be used for the y-axis (e.g., to plot the actual values in the dataset).

VIDEO 7 NOTES

Central Limit Theorem (CLT):


● The CLT states that the distribution of the sample mean (the avg) of a large number of independent,
identically distributed random variables will approach a normal distribution, even if the original data is
not normally distributed.
○​ This holds true as long as the sample size is large enough.
●​ Exponential distribution:
○​ Highly asymmetric; does not look like a bell curve
○​ > x = replicate(n=10000, mean(rexp(n=3)))
○​ Generates 10,000 avg values of 3 random numbers; the histogram does not resemble a normal
distribution yet.
○​ > x = replicate(n=10000, mean(rexp(n=50)))
○​ The distribution of the sample means now appears very close to a normal distribution.
●​ In the example code, exponential distributions (which are skewed) require a larger sample size (e.g., 50)
to see the effect, whereas distributions that start symmetric (like uniform distributions) will require
smaller sample sizes to see the CLT in action.
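
A sketch visualizing this claim, putting the two histograms side by side:
> par(mfrow=c(1,2))                    # two plots side by side
> hist(replicate(n=10000, mean(rexp(n=3))), breaks=50, main="mean of 3")
> hist(replicate(n=10000, mean(rexp(n=50))), breaks=50, main="mean of 50")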

For Loops:
●​ For loops are used to repeat a block of code for a certain number of iterations.
●​ > for (i in 1:5) {print(i)}
○​ The for loop goes through each number in the sequence 1:5.
■​ i takes on the values of 1, 2, 3, 4, 5
■​ Inside the loop, print(i), prints the current value of i
○​ > [1] 1 [1] 2 [1] 3 [1] 4 [1] 5
●​ > for (i in 1:5) {print(i^2)}
○​ This time, instead of printing i, the loop prints i^2 (the square of i)
○​ Eg. when i is 2, it prints 2^2 = 4 and so on…..
○​ > [1] 1 [1] 4 [1] 9 [1] 16 [1] 25
●​ > x = rnorm (n=5)
○​ > 0.71423218 1.32068976 -0.84250403 -0.05279274 -1.30105252
○​ > for (i in 1:5) {print(x[i])}
■​ The loop prints each value of x[i] for i from 1 to 5.
■​ > [1] 0.7142 [1] 1.320 [1] -0.842 [1] -0.0527 [1] -1.3010
●​ Conditional Statements Inside a For Loop (If Statements):
○​ > for (i in 1:5) { if (i > 2) {print(i)}}
■​ Using an if statement to only print values of i greater than 2.
■​ > [1] 3 [1] 4 [1] 5
●​ General syntax on for loops
○​ for ( i in sequence) { #code to repeat for each value of i }
○​ sequence: A vector of values for i (e.g., 1:5 means that i will take on values from 1 to 5).
○​ Inside the curly braces {}, you put the code that you want to repeat for each value of i.

Example - Determining the Prob Distribution of sz when a Specific Condition on x is Met:


●​ > sz <- rpois(n = 1, lambda = 44)
○​ rpois ( ) generates a random value from a Poisson distribution with a mean (lambda) of 44.
●​ > x <- rbinom(n = 1, size = sz, prob = 0.3)
○​ rbinom() generates a random value from a binomial distribution, where:
○​ size = sz: Number of trials is equal to the value drawn from the Poisson distribution.
○​ prob = 0.3: Probability of success in each trial is 0.3.
●​ Simulate values for sz when x=15
○​ > values <- rep(NA, 1000000)
○ > for (i in 1:1000000) { sz <- rpois(n = 1, lambda = 44); x <- rbinom(n = 1, size = sz, prob = 0.3);
○ if (x == 15) {values[i] <- sz} }
○​ You simulate 1,000,000 iterations to find the value of sz that produces an x of 15.
○​ If the condition x == 15 is true, you store the value of sz in values. Otherwise, values remains NA.
●​ Check how often x=15 occurs:
○​ > sum(is.na(values))
■​ This line calculates the number of times that x was not equal to 15.
■​ In your output, you had 908,790 NA values, meaning that the condition x == 15 was met
approximately 9.12% of the time.
●​ Filter out NA values
○​ > v <- values[!is.na(values)]
○​ length(v) # Output: 91210
■​ values[!is.na(values)] extracts the non-NA values.
■​ You end up with 91,210 values of sz for which x was 15.
●​ Plot the distribution of sz, given x=15
○​ table(v) creates a frequency table of the values of sz.
○​ plot(table(v)) plots the distribution of sz for which x = 15.
●​ Plot the proportional distribution:
○​ > plot(table(v) / 91210)
○​ This plots the proportional frequency of sz, showing the distribution as a proportion of the total
number of times x was 15.

TUTORIAL 7 NOTES
Example #1 - The Effect Size:
Does blood pressure (x) go up or down after the intervention? Is there a difference before and after the
treatment?
●​ > url = "https://raw.githubusercontent.com/gui11aume/BIOB20/main/tutorial_07/a.txt”
Null: there is no difference -> before = after
Alternative: there is a difference.

Compute the Statistic:


● Can use the difference in means (i.e., mean of before vs mean of after)
○ However, if there is an outlier, it's going to skew the mean, so it won't reflect the majority of the data.
● So maybe we can use the median (the middle point), which is less sensitive to outliers, but the null
distribution is hard to compute with the raw mean or median difference.
● Instead, we use the effect size as the statistic.

Choosing what Distribution to Use:


● The problem doesn't tell us what distribution to use (runif, rbinom, rpois, rgamma, rexp etc.)
○ This depends on the context of the problem
○ If you expect all outcomes to be equally likely, you might use runif because it generates uniformly
distributed random #s between 0 and 1 that have an equal prob of being chosen.
○ If you are counting events, you might use rpois because it models the Poisson distribution used for
modeling count data (eg. # of events in a fixed period)
○​ If you are looking at binary outcomes, use rbinom; like flipping a coin where each outcome is
either “success” or “failure”
○​ If you are modeling waiting times or life spans, you can use rgamma to model data that is positive
and can range widely.
○ Modeling data for naturally occurring phenomena like height/test scores, you may use rnorm, which
models the normal distribution where the data is symmetrically distributed around a central mean.
○​ May use rexp for the exponential distribution that models waiting times or the time between
random events that happen continuously & independently at a constant rate.
■​ Eg. modeling how long it takes for a machine to fail (on average, once every 5 hours), you
could use rexp(n=100, rate=1/5) to simulate 100 random times between failures.
■​ Usually right skewed, meaning that smaller times are more frequent, and larger times are
less frequent, but possible.
● > hist (replicate (n=1000000, mean(runif(5))), breaks = 50)
○ The average of 5 numbers, repeated 1,000,000 times, looks like the normal distribution; this is the CLT at work.
○​ This happens because the probability of getting extremely high or low averages (like 0 or 1) from
5 random uniform numbers is very low.
●​ > hist(replicate(n=100000, mean(rexp(50))), breaks=50);
○​ However, as you increase the sample size (e.g., averaging over 50 or 389 values), the distribution
of those averages becomes closer to normal. This also illustrates the power of the CLT.
●​ The larger the sample size, the more the mean of any random sample (from any distribution) tends to
resemble a normal distribution.
●​ In practice, when you are calculating means, you don't always need a large sample to apply normal
distribution theory (as the sample size 30 is typically enough), but larger samples reduce variability and
make the results more reliable.

Test Statistic:

●​ The formula statistic <- function(x, y) { (mean(x)-mean(y)) / (sd(x)+sd(y))} calculates the standardized
difference between the two means, accounting for the variability (standard deviations) in both groups.
●​ Difference between the means of two groups (before & after) divided by the sum of their SDs.
●​ When the null hypothesis is true (no difference between means), this statistic will be close to 0. When
the alternative hypothesis is true (there is a difference), the statistic will be either very positive or very
negative, indicating that the difference is significant.
●​ This is called standardization by removing the influence of the spread (variability) in the data and after
standardization, the test statistic follows a normal distribution if the sample sizes are large enough,
thanks to the CLT.
○​ Allows us to apply normal distribution theory to make inferences about whether the difference in
means is statistically significant.
●​ > nulld <- replicate (n=100000, statistic(rnorm(30), rnorm(30)))
●​ > nulld2 <- replicate(n=100000, statistic(runif(30), runif(30)))
○​ Can visualize both by:
○​ > par(mfrow = c(1, 2))
○​ hist(nulld, breaks=50)
○​ hist(nulld2, breaks=50)
○​ Regardless of the underlying distribution (normal or
uniform), the CLT (Central Limit Theorem) kicks in and the
test statistic becomes approximately normal when the
sample size is large enough (in this case, 30).

Decision Rule:
●​ Compute the statistic for the real data:
○​ > statistic(a$x[1:30], a$x[31:60])
○​ > 0.1232936
●​ > hist(nulld, breaks=50)
●​ > abline(v=0.1232936, col="blue", lwd=4)
● The rejection region is on both sides, 0.5% on each side. Taking the absolute value of the null
distribution handles both sides at once (a two-sided test).
● > quantile (abs(nulld), 0.99) - 1% significance level
○ > 0.3471285
● > abline(v=0.3471285, col="red", lwd=4)
● On the absolute-value scale, the rejection region (1%) sits entirely on the right: it is the sum of the two 0.5% tails.
● Accept (fail to reject) the null because the statistic is below the threshold (0.123 < 0.347)

Compute the P-Value:


●​ Or we could compute the p-value which tells you the probability of obtaining a test statistic as extreme
as the observed one under the null hypothesis.
●​ Calculate the p-value as the proportion of null test statistics greater than your observed value:
●​ > sum(abs(nulld) > 0.123) / 100000
●​ > 0.34749
●​ The p-value is 0.34, meaning there's a 34% chance of getting a test statistic as extreme as 0.1232936
under the null hypothesis. Since this p-value is much larger than the 1% threshold, you accept the null
hypothesis (no significant difference).
●​ Or calculate the p value using mean ()
○​ > mean(abs(nulld) > 0.123)

Example 2 - Increase in temp in 2020 (East or West)


Was there an increase in temp in 2020?
●​ url <- "https://raw.githubusercontent.com/gui11aume/BIOB20/main/tutorial_07/b.txt
Null: no difference; east = west
Alt: there is a difference.
> statistic <- function(x, y) { abs(mean(x)-mean(y)) / (sd(x)+sd(y)) }
> nulld = replicate (n=10000, statistic(rnorm(30), rnorm(30)))
> statistic (b$x[1:30], b$x[31:60])
> hist(nulld, breaks=50)
> abline (v=0.02461167, col="blue", lwd = 3)
> mean(abs(nulld) >= 0.02461167)
[1] 0.8519
Accept null hypothesis

VIDEO #8 NOTES
Ranks in R:
● Ranking is used to determine the relative positions of values within a vector.
●​ > x = rnorm(n=5)
○​ sort (x) ⇨ -0.6991376 -0.6768828 0.3030927 0.9424099 1.2891228
○​ rank (x) ⇨ 4 5 2 1 3
■​ Lowest value gets rank 1, and highest value gets rank 5.
●​ > y = rnorm (n=5) + 30
●​ > rank (c (x, y))
○​ > 4 5 2 1 3 6 10 9 7 8
○​ X is centered around 0 and y is shifted by 30, thus elements of x occupy the lower ranks (first 5)
and elements of y occupy the higher ranks (second 5).
●​ When vectors with smaller shifts (e.g., +5 or +2) are combined, the separation of ranks is less distinct,
leading to potential overlap in ranks.
○​ > rank(c(rnorm(n=5), rnorm(n=5)+30)) ⇨ 1 5 2 4 3 9 10 8 6 7
○​ > rank(c(rnorm(n=5), rnorm(n=5)+5)) ⇨ 2 3 5 4 1 8 7 10 6 9
○​ > rank(c(rnorm(n=5), rnorm(n=5)+2)) ⇨ 1 3 4 6 2 7 9 8 10 5
■ The clean split breaks: rank 6 falls in the first half and rank 5 in the second half.
●​ When two vectors from the same distribution are combined, the ranks are generally more randomly
distributed.
○​ > rank(c(rnorm(n=5), rnorm(n=5))) ⇨ 7 4 8 10 2 5 6 9 1 3
●​ But increasing the gap between the means, the ranks separate (eg. adding 30 to one sample). First
sample occupying the lower ranks and the second sample occupying the higher ranks.

T-Tests:

● The t.test() function is used to compare the means of two samples.


●​ It tests the null hypothesis that the means of two normally distributed populations are equal.
●​ > t.test(rnorm(n=10), rnorm(n=10))
○​ This runs a t-test between 2 samples of size 10 drawn from a normal dist (mean = 0, sd = 1).
○​ > t = -0.6412, df = 16.965, p-value = 0.53
○​ > alternative hypothesis: true difference in means is not equal to 0
○​ > 95 percent confidence interval: -1.1489708 0.6134363
○​ > sample estimates: mean of x (-0.06043455) mean of y (0.20733270).
●​ If p-value < 0.05, reject the null hypothesis at the 5% significance level.
●​ If p-value ≥ 0.05, fail to reject the null hypothesis.

Simulating P-Values Under the Null Hypothesis:


●​ > t.test(rnorm(n=10), rnorm(n=10))$p.value ⇨ 0.02441476
●​ If the null hypothesis is true:
○​ > p.val = replicate(n=100000, t.test(rnorm(n=10),
rnorm(n=10))$p.value)
○​ Histogram should show a uniform distribution between 0 and 1
when the null hypothesis is true.
●​ If the null hypothesis is false:
○​ p.val = replicate(n=100000, t.test(rnorm(n=10, mean=1),
rnorm(n=10))$p.value)
○​ When the mean difference is 1, p-values are clustered closer to 0, indicating a
higher probability of rejecting the null hypothesis.
○​ This simulation demonstrates that when the null hypothesis is false, p-values
are more likely to be small, leading to rejection.
●​ > mean(replicate(n=100000, t.test(rnorm(n=10, mean=1), rnorm(n=10))$p.value) < 0.01)
○​ Proportion of p-values < 0.01 indicates the rejection rate at a 1% significance level.
○​ > 0.28251 ⇨ 28% rejection rate when the mean difference is 1.
●​ As the mean difference increases (eg, mean = 2), the rejection rate also increases, indicating increased
power (eg. rejection rate increases to 92%).

Power Curves:
●​ The power curve is a plot of the probability of rejecting the null hypothesis (rejection rate) for varying
mean differences.
●​ Create a sequence of mean differences:
○​ > mean = seq(from=0, to=3, by=0.1)
○ > rejection_rate = rep(NA, length(mean))   # initializes the vector
■​ Output here would just be NA NA NA…..
●​ Loop through each mean difference, calculating the rejection rate:
○​ > for (i in 1:length(mean)){ rejection_rate [i] <- mean(replicate(n=100000, t.test(rnorm(n=10,
mean=mean[i]), rnorm(n=10))$p.value) < 0.01)}
○​ > plot(x=mean, y=rejection_rate, type= "l")
●​ Plot shows the prob of rejecting the null hypothesis at diff mean
differences.
●​ The curve starts at 1% (our significance level) when the mean is 0 and
rises to 1 as the mean difference increases.
○​ Red line at 0.01 shows that the rejection rate aligns with the significance level when the null
hypothesis is true.
○​ > abline(h=.01, col=2)
●​ Changing significance level to 5%.
○​ > for (i in 1:length(mean)){ rejection_rate [i] <-
mean(replicate(n=100000, t.test(rnorm(n=10, mean=mean[i]),
rnorm(n=10))$p.value) < 0.05)}
○​ > lines(x=mean, y=rejection_rate, col = 4) (adds new line to plot)
○​ > abline (h=0.05, col=2)
○​ The blue curve (5% level) lies above the black curve (1% level),
showing a higher probability of rejecting the null hypothesis
throughout.
● Thus, increasing the significance level from 1% to 5% increases the prob of rejecting the null
○ But it also increases the probability of rejecting a true null (a false positive).

For Loops in R:
●​ > for (variable in sequence) {# Code to repeat}
○​ variable: Think of this as a placeholder that changes every time the loop runs.
○​ sequence: This is a list of values that the loop goes through, one by one.
○​ The code block inside {} runs for each value in the sequence.
●​ > for ( i in 1:3) { print (i) }
○​ i is the placeholder. It starts at 1 and changes to 2, then to 3.
○​ 1:3 means the sequence is 1, 2, 3.
○​ First run: i = 1, so R prints 1.
○​ Second run: i = 2, so R prints 2.
○​ Third run: i = 3, so R prints 3.
●​ General Rule for For Loops
○​ Set up a placeholder (variable) that changes each time.
○​ Decide the sequence of values you want to go through.
○​ Write the code inside {} that you want to repeat.
○​ Each run of the loop uses the next value in the sequence.
●​ words <- c("Hi", "Bye", "Thanks")
○​ for (word in words) { print(word)}
○​ R prints "Hi", "Bye", and "Thanks".

TUTORIAL #8 NOTES

Bayesian Problem:
●​ When sample sizes are too small, the t-test is not suitable (the CLT normal approximation does not hold).
●​ > sz = rpois(n=1, lambda=12.3)
●​ > rbinom(n=1, size=sz, prob=0.5)
○​ > 8
●​ What is the prob that sz is equal to 10?
○​ > dpois (lambda=12.3, x=10)
○​ [1] 0.09941821
○​ The prob is 9.9%.
●​ If the output had been 11 instead of 8, would the prob that sz = 10 change?
○​ This introduces a Bayesian problem, which involves updating probabilities after observing new
data.

What is the prob that sz = 10, given that we observe 6?


●​ > sizes = rep(NA, 1000000)
○​ Output: NA NA NA NA NA NA…..
●​ > obs = rep(NA, 1000000)
○​ These initialize vectors that will store the results of the Poisson and binomial draws.
●​ > for (i in 1:1000000) {sz=rpois(n=1, lambda=12.3); x=rbinom (n=1, size=sz, prob=0.5); sizes[i]=sz;
obs[i]=x}
○​ The loop iterates 1 million times
○​ For each iteration
■​ A poisson random variable (sz) is generated
■​ A binomial random variable (x) is generated
■​ These values are stored in the respective vectors sizes and obs.
●​ > sum(obs==6) ⇨ 160472
●​ > sum(obs==6 & sizes == 10) ⇨ 20347
●​ > 20347/160472 ⇨ 0.1267947

Bayesian Update
●​ The shift from 9.9% (initial probability) to 12.7% (after observation) is an example of a Bayesian update.
●​ This approach adjusts the probability based on observed data, rather than being a typical statistical
estimation or test.
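The same posterior can be computed exactly with dpois() and dbinom() instead of simulating (a sketch using the same lambda = 12.3, prob = 0.5 and observed value 6 as above):
> sz_values = 0:60 # a range wide enough to cover plausible values of sz
> prior = dpois(sz_values, lambda = 12.3) # P(sz)
> likelihood = dbinom(6, size = sz_values, prob = 0.5) # P(obs = 6 | sz)
> posterior = prior * likelihood / sum(prior * likelihood) # Bayes' rule
> posterior[sz_values == 10] # ≈ 0.127, close to the simulated value above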

Example #1 - Trying to Guess SZ


You are trying to guess the value of sz from:
●​ > sz = rpois(n=1, lambda=12.3)
●​ We have to update the prob of sz equaling a specific value (eg. 10) after observing outcomes from a
binomial distribution.
●​ > rbinom (n=1, size=sz, prob=0.4) ⇨ 5

What is the prob that sz is equal to 10?


●​ Initial probability: 9.9% for sz = 10

> sizes = rep(NA, 1000000)


> obs = rep(NA, 1000000)
> for (i in 1:1000000) {sz=rpois(n=1, lambda=12.3); x=rbinom (n=1, size=sz, prob=0.4); sizes[i]=sz; obs[i]=x}
> sum(obs==5) ⇨ 175349
> #sum(is.na(obs)); check for NA indicating if anything went wrong;
> sum(obs==5 & sizes ==10) ⇨ 19982
> 19982/175349 ⇨ 0.1139556

Example #2 - Magic Number & U Statistic


> x = rexp (n=3)
> y = rexp (n=5) + magic_number

We have to determine if the magic number is 0 or non-zero


●​ Null: x and y come from the same exponential distribution
●​ Alt: y has a shifted exponential distribution compared to x.

Using the effect size (difference in means) as the statistic is problematic because the sample sizes are too small: the exponential distribution typically needs around 30 values before the CLT gives a good normal approximation.
●​ Thus, we can use relative positions via ranks: if the magic number is non-zero, y should tend to have larger values (higher ranks) than x.
●​ > rank(c(x, y))
●​ If x & y are really far apart, x will always take the first 3 ranks and y the other 5 ranks, but if the magic # is 0, the ranks will be randomly mixed because x and y have the exact same distribution.

Compute U Statistic:
●​ Can compute the difference in mean ranks: mean(c(2,3,1)) - mean(c(6,5,7,8,4)) ⇨ -4 (the u-statistic)
●​ > statistic <- function(x, y) { ranks_x <- rank(c(x, y))[1:length(x)]; ranks_y <- rank(c(y, x))[1:length(y)]; mean(ranks_x) - mean(ranks_y) }

Simulate the Null Distribution::


●​ Obs statistic = -4 (statistic (x,y))
●​ > nulld = replicate(n = 1000000, statistic(rexp(n=3), rexp(n=5)))
●​ > quantile (nulld, 0.995) ⇨ 99.5% ⇨ 4
●​ > quantile (nulld, 0.005) ⇨ 0.5% ⇨ -4
●​ > hist(nulld); abline(v=c(-4,4)) # draw the null distribution, then mark the cutoffs
●​ The observed statistic (-4) sits exactly on the cutoff line, so we accept the null hypothesis.
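For reference, this rank-based comparison is essentially what the Wilcoxon (Mann-Whitney) test performs; with samples this small it uses an exact null distribution, so no CLT assumption is needed. A sketch, assuming the same x and y as above:
> wilcox.test(x, y) # reports W and an exact p-value; compare with the simulated cutoffs above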

Example #3 - Magic Number


> x = rexp(n=7)
> y = rexp(n=5) + magic_number
> statistic (x, y) ⇨ - 6
> statistic <- function(x, y) { ranks_x <- rank(c(x, y))[1:length(x)]; ranks_y <- rank(c(y, x))[1:length(y)]; mean(ranks_x) - mean(ranks_y) }
> nulld = replicate (n = 1000000, statistic(rexp(n=7), rexp(n=5)))
> quantile (nulld, 0.995) ⇨ 99.5% ⇨ 4.971429
> quantile (nulld, 0.005) ⇨ 0.5% ⇨ -5.314286
> abline (v=c(-5.314286, 4.971429))
-6 is outside the cutoffs; thus we reject the null hypothesis.

How to Determine Whether to Use the T-Test or the Wilcoxon Test:


> x = rexp(n=35)
> y = rexp(n=45) + 0.2
●​ In this example the null hypothesis is false: x & y do not come from the same distribution (y is shifted by 0.2)

> t.test (x,y)


●​ t = -1.7834, df = 76.407, p-value = 0.07849
●​ alternative hypothesis: true difference in means is not equal to 0
●​ 95 percent confidence interval: -0.77891635 0.04293989
●​ sample estimates: mean of x (0.9253395), mean of y (1.2933278)
●​ P-value (0.07849), is greater than 0.05, indicating that we fail to reject the null hypothesis at the 5%
significance level.
●​ Since we know that the null hypothesis is false, this result suggests that the t-test is not rejecting the
null hypothesis correctly.

> wilcox.test(x,y)
●​ W = 572, p-value = 0.03655 alternative hypothesis: true location shift is not equal to 0
●​ The p-value is approximately 0.03655, which is less than 0.05, suggesting that we reject the null
hypothesis at the 5% significance level.
●​ However, at the 1% significance level, we still fail to reject the null hypothesis.

Comparison of t-test and Wilcoxon test


●​ t-test gives a p-value of 0.07849, while the Wilcoxon test gives a p-value of 0.03655.
●​ The Wilcoxon test is more likely to reject the null hypothesis in this case, indicating that it may be more
appropriate for non-normal data or data with shifts.
●​ The results suggest that the Wilcoxon test has higher statistical power than the t-test under these
conditions.

Statistical Power:
●​ Statistical power is the probability of correctly rejecting the null hypothesis when it is false.
●​ The power is calculated by simulating the tests many times (1 million times) and counting how often the
null hypothesis is rejected at a specific significance level (e.g., 0.01).

> pval_t <- replicate(1000000, t.test(rexp(n=35), rexp(n=45) + 0.2)$p.value)


> pval_u <- replicate(1000000, wilcox.test(rexp(n=35), rexp(n=45) + 0.2)$p.value)
> sum(pval_t < 0.01) # Example: 54 times
> sum(pval_u < 0.01) # Example: 142 times
The Wilcoxon test rejects the null hypothesis more frequently, suggesting it has greater statistical power in this
scenario.
●​ The choice between the t-test and Wilcoxon test depends on:
○​ Distribution of the data: if the data are normal, the t-test is appropriate; if not, the Wilcoxon test is more robust.
○​ Sample size: small sample sizes reduce the power of both tests, but the Wilcoxon test tends to perform better in non-normal conditions.

VIDEO #9 NOTES

Squaring a Normal Random Variable:


●​ When we square values from a standard normal distribution
(rnorm), the resulting distribution is related to the gamma
distribution with shape = 0.5 and scale = 2.
●​ > x <- rnorm(n = 100000)^2
●​ > plot(density(x), main = "Density of Squared Normals vs Gamma
Distribution")
●​ > y <- rgamma(n = 100000, shape = 0.5, scale = 2)
●​ > lines(density(y), col = "green")
●​ > legend("topright", legend = c("Squared Normal", "Gamma (0.5, 2)"), col = c("black", "green"), lwd = 2)
●​ Observation: the density of squared normals = the gamma distribution with shape = 0.5 and scale = 2.

Sum of Squared Normals:


●​ When you SUM several squared normal random variables (eg.
5), the resulting distribution no longer matches a gamma with
shape = 0.5.
○​ It instead coincides with a gamma distribution whose shape is the number of summed terms multiplied by 0.5, with scale 2.
●​ > x <- replicate(n = 100000, sum(rnorm(n = 5)^2))
●​ > plot(density(x), main = "Sum of Squared Normals vs Gamma
Distribution")
●​ > y <- rgamma(n = 100000, shape = 5 * 0.5, scale = 2)
●​ > lines(density(y), col = "red")
●​ Observation: The sum of squared normals matches the gamma distribution with shape = 5 × 0.5 = 2.5
and scale = 2.

Connection to Chi-Square Distribution:


●​ The sum of squared normal random variables is also equivalent to a chi-square distribution with degrees
of freedom equal to the number of terms summed.
●​ In the example, summing 5 squared normals corresponds to a chi-square distribution with 5 degrees of
freedom.
●​ A chi-square distribution with df degrees of freedom is equivalent to a gamma distribution with:
○​ Shape = df / 2
○​ Scale = 2
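A quick numerical check of this equivalence (a sketch, assuming df = 5 as in the example above):
> dchisq(3, df = 5) # chi-square density at x = 3 with 5 degrees of freedom
> dgamma(3, shape = 5/2, scale = 2) # same value: gamma with shape = df/2 and scale = 2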

Logical Operations:
●​ Logical Comparisons: evaluate conditions and return TRUE, FALSE, or NA for each element of a vector.
○​ > x <- rpois(n = 20, lambda = 1827)
○​ x > 1830; output: TRUE or FALSE for each value in x
●​ Subsetting Using Logical Conditions:
○​ Vector [condition]
○​ x [x > 1830]; returns only the values of x where the condition is TRUE.
●​ Modulo Operation (%%); finds the remainder when dividing by a number
○​ x %% 2 # Remainders when dividing each value in x by 2
■​ 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 1
○​ x %% 2 == 0 # TRUE for even numbers, FALSE for odd numbers
■​ TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE…
○​ x[x %% 2 == 0]; subset even numbers
■​ 1758 1816 1870 1798 1844 1806 1828 1884 1866 1796 1788 1848..
●​ Combining Logical Conditions:
○​ & (AND): Both conditions must be TRUE.
○​ | (OR): At least one condition must be TRUE.
○​ > x[(x %% 2 == 0) & (x %% 3 == 1)]
■​ #’s that are divisible by 2 AND leave a remainder of 1 when divided by 3.
○​ > x[(x %% 2 == 0) | (x %% 3 == 1)]
■​ #’s that are divisible by 2 OR leave a remainder of 1 when divided by 3.

Handling Missing Values (NA):


●​ Functions like which() or is.na() to handle missing values.
●​ x <- c(1, 2, 3, 4, NA)
●​ x < 3 # Returns TRUE, FALSE, or NA
●​ x[x < 3] # Subset values less than 3 (includes NA)
●​ which (x < 3) # positions where the condition is TRUE (ignores NA)
●​ which() function: returns the indices (positions) of the elements satisfying a condition; useful for finding positions, not values.
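A minimal example of the difference, using the same small vector:
> x = c(1, 2, 3, 4, NA)
> which(x < 3) # 1 2 → positions where the condition is TRUE (NA is skipped)
> x[which(x < 3)] # 1 2 → values at those positions, with no NA in the result
> x[x < 3] # 1 2 NA → logical subsetting keeps NA because the comparison gives NA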

Frequency Tables:
●​ x = rbinom(n = 1000, prob = 0.5, size = 13)
●​ table(x) - Counts how many times each value occurs in x (a frequency table).
●​ plot(table(x)) - A bar plot showing the frequencies of each value in x.
●​ Using data (esoph), can make contingency tables;
○​ table(esoph$alcgp); frequency table of alcohol
consumption
●​ table(esoph$alcgp, esoph$tobgp); Cross-tabulate alcgp and tobgp.
●​ Extract specific rows or columns from the contingency table:
○​ y = table(esoph$alcgp, esoph$tobgp)
○​ y["120+", ] # All columns for "120+" alcohol consumption
○​ y[, "30+"] # All rows for "30+" tobacco consumption

Generate a Chi-Square Distribution:


●​ x <- replicate(n = 100000, sum(rnorm(n = 7)^2))
●​ plot(density(x))
●​ lines(density(rgamma(n = 100000, shape = 3.5, scale = 2)), col = 3)
●​ Simulates a Chi-Square distribution by summing squared normal
values (rnorm(n = 7)^2).
●​ Overlay the density of the theoretical gamma distribution (rgamma()).

TUTORIAL #9 NOTES

Hypothesis Testing for Independence:


●​ We test whether blood type and geographic location are
independent using probabilities.
●​ Independence: Two events, A and B, are independent if P (A ⋂ B) = P(A) x P(B)
○​ Basically two variables are independent if knowing variable A, tells you nothing about the
variable B (eg. knowing about the location like NL/SLP, tells me nothing about blood types).
■​ Knowing the prob of A does not change the prob of B.
○​ They are dependent if knowing one variable tells you something about the other.
●​ (1) Calculate Marginal Probabilities
○​ N <- sum(btO) ⇨ 1539
○​ pNL (prob of being from Nuevo Leon)
■​ pNL <- (575 + 346) / N
○​ pO (prob of having blood type O)
■​ pO <- (201 + 346) / N
●​ (2) Expected Joint Probability:
○​ Since the two variables are independent, we multiply the two probabilities
○​ pExpected <- pNL * pO
○​ [1] 0.2127011
○​ > #thus 21% is the expected probability.
●​ (3) Expected Counts:
○​ Multiply the joint probability by the total sample size (N = 1539) to get the expected count of people:
○​ pNL * pO * N
○​ 327.347 #327 people.
●​ Looking at the original contingency table the observed amount of people from NL with blood type O is
326 which is very close to the expected (327).
●​ If the observed count is close to the expected count, the variables are likely independent.
●​ (4) Chi-Squared Test for Independence
○​ The Chi-Squared Test formalizes this comparison to check independence.
○​ chi_sq_test <- chisq.test(btO)
○​ p-value: If p-value < 0.05, reject the null hypothesis (the variables are not independent).
○​ Expected Counts: Automatically calculated by chisq.test().
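The quantities computed by hand above can also be read directly from the chisq.test() result (a sketch, assuming btO is the 2×2 location × blood-type table used here):
> chi_sq_test <- chisq.test(btO)
> chi_sq_test$expected # expected counts under independence (about 327 for NL with type O)
> chi_sq_test$p.value # compare against the chosen significance level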

Hypothesis Testing Using Rbinom:


●​ Null Hypothesis: The two variables (location and blood type) are independent.
○​ Statistic: Number of people from Nuevo Leon with blood type O.
○​ Proxy for Null Distribution: Sample 10,000 values from a binomial distribution
●​ Simulating Null Distribution:
○​ Using pNL, pO, pNL * pO, and N
○​ nulld <- rbinom(n = 10000, size = 1539, prob = pNL * pO)
○​ hist(nulld, main = "Null Distribution", xlab = "Number of People", col = "lightblue")
●​ Setting the Rejection Region:
○​ At the 1% level, the rejection region is the lower and upper 0.5% of the null distribution.
○​ quantile(nulld, 0.005) ⇨ abline (v=286, col = “red”)
○​ quantile(nulld, 0.995) ⇨ abline (v=370, col = “red”)
●​ Add Observed value: abline (v=346, col = “blue’)
●​ Decision rule: observed value is within the bounds so we
accept the null hypothesis.
○​ The observed data is consistent with the null
hypothesis of independence between location (Nuevo
Leon) and blood type (O). Thus, we accept that the two
variables are likely independent at the 1% significance
level.
○​ i.e., the observed value is not in the rejection region.

Question #4 - Creating a Data Frame for Patients:


●​ This question is about understanding two ways of thinking about how the data could have been
collected: Fixed Sampling or Random Sampling.
●​ (1) Fixed Sampling:
○​ The total # of patients from each location (NL and SLP) is fixed and cannot change; and the total #
of patients with blood type (and not O) is fixed.
○​ Eg. exactly 921 patients from NL and 618 patients from SLP; we know the exact counts of those
with and without blood type O.
○​ If the totals are fixed, you are not sampling randomly from the population.
○​ the numbers are predetermined and constant, so your model must reflect this setup (no
randomness in totals).
●​ (2) Random Sampling:
○​ The totals for each group (location or blood type) can vary because they are random outcomes of
the sampling process.
○​ You randomly pick patients from the population, and the number of people from Nuevo Leon or
with blood type O could be different each time you sample.
○​ Eg. survey by randomly selecting 1,539 patients from a hospital. The number of patients from
Nuevo Leon or with blood type O isn’t predetermined—it depends on the randomness of the
sampling process.
○​ If the data collection was random, the totals are not fixed and follow a distribution (e.g.,
binomial).
■​ This introduces variability, so your model needs to account for randomness.
●​ Thus, the data suggests fixed sampling because the counts are already given and fixed.
○​ the data likely does not follow a binomial distribution, which assumes random sampling.
●​ The question is asking if the data can be modeled using fixed or random sampling.
●​ Creating a data frame (patients) with one row per patient
○​ The columns in the data frame will be:
■​ N1: indicates if a patient is from NL (1 for NL, 0 for SLP).
■​ o: indicates: if a patient has blood type O (1 for O, 0 for not O).
○​ n_NL <- 575 + 346 # Patients from Nuevo Leon
○​ n_SLP <- 417 + 201 # Patients from San Luis Potosi
○​ n1 <- c(rep(1, n_NL), rep(0, n_SLP)) # Repeat 1 for NL and 0 for SLP
○​ o <- c(rep(1, 346), rep(0, 575), rep(1, 201), rep(0, 417)) # Blood type O and Not O
○​ patients <- data.frame(n1, o)
○​ dim(patients) # Output: 1539 rows, 2 columns
●​ Then we verify the contingency table
○​ > table(patients)
○​ This matches the original btO data.
●​ Shuffle the o Column (Resampling):
○​ Resampling means randomly shuffling the values in a column while keeping the total counts the same; thus we shuffle the o column (blood type) to simulate what the data might look like if location (n1) and blood type (o) were independent.
○​ Shuffling the o column would break any existing relationship between location and blood type.
■​ The totals for each category remain the same, but the relationship between location and blood type is randomized.
○​ Use sample() to randomly reorder the values in the o column.
■​ This does not change the total number of 1s (blood type O) or 0s
(not blood type O); it only reassigns them randomly across patients.
■​ > patients$o <- sample(patients$o)
■​ > table (patients)
■​ Output: The row and column totals (margins) remain the same after resampling.
○​ By comparing the shuffled data to the original data, we can test whether the relationship
between location and blood type in the original data is meaningful or could happen by random
chance.
●​ Then confirm margins match original data
○​ After shuffling o column, we randomize the relationship between n1 and o but the row totals and
column totals should remain unchanged.
○​ > rowSums(table(patients)) # Total per location
■​ 0 (618) 1 (921)
●​ These sum to the total of 1539.
○​ > colSums(table(patients)) # Total per blood type
■​ 0 (992) 1 (547)
●​ Not O: 575 + 417 = 992
●​ O: 346 + 201 = 547
○​ Totals match the original btO data; this confirms that shuffling preserves the margins (it does not by itself establish independence).
●​ Total # of 1s in the patients data frame is 1468 (921 ones in n1 + 547 ones in o).

Question #5: Shuffling only Second Column:

●​ Shuffle only the second column (o, blood type) in the patients data frame.
●​ Find the count of patients from Nuevo Leon (n1 = 1) with blood type O (o = 1) after shuffling.
●​ set.seed(123)
●​ patients$o <- sample(orig$o) # Shuffle the `o` column (here `orig` is a saved copy of the original, unshuffled data frame)
●​ table(patients)
●​ table(patients)[2,2] ⇨ 323.

Question #6: Generating a Null Distribution Using Resampling:

●​ This question involves simulating a null distribution by shuffling data 10,000 times and determining
whether the observed value lies within the rejection region.
●​ Shuffle the second column (o, blood type) of the patients data frame 10,000 times using a for loop.
●​ Record the number of patients from Nuevo Leon (NL) (n1 = 1) with blood type O (o = 1) in each shuffled
iteration.
●​ nulld <- rep(NA, 10000)
●​ for ( i in 1:10000) {patients$o = sample(patients$o); nulld[i] =
table(patients)[2,2]}
○​ sample(patients$o) shuffles the o column
○​ nulld[i] stores the count of NL ppl with type O.
●​ hist(nulld, main = "Null Distribution", xlab = "Count of NL with
Blood Type O", col = "lightblue")
○​ The histogram shows the range of counts we expect if
blood type and location were independent.
●​ Find rejection region
○​ quantile(nulld, 0.005) ⇨ abline(v =304, col = "red")
○​ quantile(nulld, 0.995) ⇨ abline (v=351, col = “red”)
●​ abline(v = 346, col = "blue")
●​ The observed value lies within the bounds (outside the rejection region); accept the null hypothesis. This suggests that the observed relationship between location and blood type could occur by random chance if they were independent.

Question #7: Using the Hypergeometric Distribution:


●​ How to use the hypergeometric distribution to simulate a null distribution when sampling without replacement, in contrast to the binomial approach, which samples with replacement.
●​ Patients with blood type O are the white balls and the others are the black balls.
●​ Number of patients drawn from NL is the sample size (921); sampling without replacement.
●​ Compare the bounds obtained using the hypergeometric distribution to those from Question 6, where
sampling was done with replacement.
●​ (1) Hypergeometric Distribution (rhyper())
○​ nn = number of draws/simulations
○​ m = total number of "successes" in the population (patients with blood type O).
○​ n = total number of "failures" in the population (patients with other blood types).
○​ k = sample size (number of patients from Nuevo Leon).
●​ nulld <- rhyper(nn = 10000, m = 547 , n = 992, k = 921)
○​ m (201 + 346); n (575 + 417)
●​ hist(nulld, main = "Null Distribution (rhyper)", xlab = "Count of NL
with Blood Type O", col = "pink")
●​ Find rejection region
○​ quantile(nulld, 0.005) ⇨ abline(v =304, col = "red")
○​ quantile(nulld, 0.995) ⇨ abline (v=351, col = “red”)
●​ abline(v = 346, col = "blue")
●​ The observed value (346) lies within the bounds (outside the rejection region).
●​ Conclusion: Fail to reject the null hypothesis. The observed count
is consistent with the assumption of independence.
●​ Comparison with the binomial (with replacement) approach:
○​ The rbinom() null distribution (sampling with replacement) gave wider bounds (286, 370).
○​ The hypergeometric null samples without replacement, which reduces variability (a finite population correction), so its bounds (304, 351) are narrower and match those from the shuffling approach in Question 6 (shuffling also preserves the margins).
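A quick way to see the difference in spread (a sketch reusing N = 1539, pNL, pO and the counts from above):
> var(rbinom(n = 10000, size = 1539, prob = pNL * pO)) # with replacement: variance ≈ 258
> var(rhyper(nn = 10000, m = 547, n = 992, k = 921)) # without replacement: variance ≈ 85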

Question #8 - Comparing Null Distribution to a Gaussian Approximation:

●​ This question explores generating a null distribution for a statistic using a binomial model with
replacement and transforming it into a standardized statistic.
The goal is to compare this null distribution to a Gaussian
(standard normal) distribution.
●​ First generate the statistic, simulate 10,000 values from this
transformed null distribution. Then compare the density of this
null distribution to the Gaussian distribution.
●​ nulld <- (rbinom(n = 10000, size = N, prob = pNL * pO) - pNL * pO *
N) / sqrt(pNL * pO * (1 - pNL * pO) * N)
●​ plot(density(nulld), main = "Comparison of Null Distribution and
Gaussian", xlab = "Standardized Statistic", col = "blue", lwd = 2)
●​ gauss <- rnorm(100000)
●​ lines(density(gauss), col = "red", lwd = 2)
●​ legend("topright", legend = c("Null Distribution", "Gaussian
Distribution"), col = c("blue", "red"), lwd=2)
●​ The null distribution and the Gaussian distribution are nearly
identical, confirming that the Gaussian approximation works well in
this case.

Question #9 - Calculating the P-Value Using the Standard Normal Distribution:

●​ This question requires computing the observed statistic (z-score) from the binomial model and then determining the p-value for a two-sided test, using the standard normal distribution as the null distribution.
●​ (346 - pNL * pO * N)/sqrt(pNL * pO * (1 - pNL * pO) * N)
○​ [1] 1.161917
●​ Find rejection region
○​ quantile(nulld, 0.005) ⇨ abline(v =-2.513258, col = "red")
○​ quantile(nulld, 0.995) ⇨ abline (v=2.584613, col = “red”)
●​ abline(v = 1.161917, col = "blue")
○​ Within the rejection region, so fail to reject the null hypothesis.
●​ Compute the P-value:
○​ We have to do a two-sided test to check for extremeness on both sides of the null distribution.
○​ sum(nulld >= 1.161) / length(nulld) #output 0.1295; 12.95% of the values are in the right tail.
○​ We also need to account for the left tail:
■​ 2 * (sum(nulld >= 1.161) / length(nulld)) # output 0.259 (25.9% total p-value)
○​ Multiplying by 2 ensures we account for both tails of the distribution.
○​ p-value=0.259, which is much greater than 0.01, so we fail to reject the null hypothesis. The
observed value is consistent with the null model.
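The same two-sided p-value can be computed from the standard normal CDF without simulating (a quick check using the observed z from above):
> 2 * (1 - pnorm(1.161917)) # ≈ 0.245, in line with the simulated 0.259; still far above 0.01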

Question #10: Manual Calculation of the Chi-Square Test:

●​ This question explains how to compute the chi-square test statistic manually for a 2×2 contingency table.
This statistic evaluates whether two categorical variables (e.g., location and blood type) are independent.
●​ The outer() function calculates the expected values under the null hypothesis of independence:
○​ expected = outer(rowSums(btO), colSums(btO)) / N
●​ For each cell, compute the chi-square contribution:
○​ (btO - expected)^2 / expected
●​ Sum all the contributions to get the chi-square statistic:
○​ sum((btO - expected)^2 / expected)
○​ 4.106456

Question #11 & 12 - Using Chisq.test ( ) for Chi-Square Test:

●​ By default, chisq.test() applies Yates' continuity correction to the chi-square statistic. This adjusts the
statistic to account for small sample sizes in 2×2 tables
●​ chisq.test(btO) #statistic = 3.8893
○​ statistic is different from the manually computed value in Q10 due to the continuity correction.
●​ To match the manually computed statistic, set the correct parameter to FALSE to disable Yates' continuity
correction:
○​ chisq.test(btO, correct = FALSE) #statistic = 4.1065
○​ This value matches the manually computed statistic from Question 10
●​ The rmultinom() function can simulate random contingency tables based on the probabilities in the expected table.
○​ P <- expected / N
○​ set.seed(123)
○​ mat <- matrix(rmultinom(n = 1, size = N, prob = as.vector(P)), ncol = 2)
○​ This simulates a contingency table with the same probabilities as the
expected values.
○​ chisq.test(mat, correct = FALSE) # output statistic = 0.18666

Question 13: Generating a Null Distribution of Chi-Square Statistics:

●​ This task involves simulating the null distribution of the Chi-Square statistic by resampling the
contingency table 10,000 times and comparing it to the theoretical Chi-Square distribution with 1 degree
of freedom.
●​ nulld <- rep(NA, 10000) # Initialize an empty vector
●​ for (i in 1:10000) {mat <- matrix(rmultinom(n = 1, size = N, prob = P), ncol = 2); nulld[i] <- chisq.test(mat,
correct = FALSE)$statistic}
●​ plot(density(nulld), main = "Null Distribution of Chi-Square Statistics", lwd = 4, xlab = "Chi-Square
Statistic")
○​ Plot the density of the simulated null distribution
●​ Overlay the theoretical Chi-Square distribution
○​ chi2 <- rgamma(n = 10000, shape = 1 / 2, scale = 2) # Chi-square with df=1
○​ lines(density(chi2), col = "red", lwd = 3)
Question #14 - Simulating a Chi-Square Null Distribution for the ABO Table:

> url <- "https://raw.githubusercontent.com/gui11aume/BIOB20/main/tutorial_09/ABO.txt"
> ABO <- as.matrix(read.delim(url, row.names = 1))
> row_totals <- rowSums(ABO)
> col_totals <- colSums(ABO)
> grand_total <- sum(ABO)
> expected <- outer(row_totals, col_totals) / grand_total
> set.seed(123)
> nulld <- rep(NA, 10000)
> for (i in 1:10000) {
+ # Generate a resampled contingency table
+ simulated_table <- matrix(rmultinom(n = 1, size = grand_total, prob = as.vector(expected / grand_total)),
+ nrow = nrow(ABO), ncol = ncol(ABO))
+
+ # Compute the Chi-Square statistic
+ nulld[i] <- sum((simulated_table - expected)^2 / expected)
+}
plot(density(nulld), main = "Null Distribution of Chi-Square Statistic",
+ xlab = "Chi-Square Statistic", lwd = 4, col = "blue", xlim = c(0, max(nulld)),
+ ylim = c(0, max(density(nulld)$y) + 0.13))
> chi2 <- rgamma(n = 10000, shape = 3 / 2, scale = 2)
> lines(density(chi2), col = "red", lwd = 3)
> legend("topright", legend = c("Simulated Null Distribution", "Theoretical Chi-Square (df=3)"),
+ col = c("blue", "red"), lwd = 3)

VIDEO #10 NOTES

Understanding the One-Sample T-Test:


●​ The one-sample t-test evaluates whether the mean of your sample data is significantly different from a
hypothesized population mean (μ).
○​ Null hypothesis: the true pop mean equals μ0
○​ Alt hypothesis: the true pop mean does not equal μ0
●​ The t-test outputs:
○​ A t-statistic, which measures how far your sample mean is from the hypothesized mean in terms
of standard errors.
○​ A p-value, which tells you whether the observed data is likely under H0.
○​ A confidence interval, a range of values likely to contain the true mean with a given level of
confidence (e.g., 95%).
●​ t.test (x)
○​ H0: μ = 0
○​ Result: a very small p-value indicates strong evidence against H0​.
■​ Therefore, the sample mean is significantly different from 0.
●​ When shifting the sample (x - 18.27)
○​ t.test(x - 18.27)
○​ H0: μ = 18.27
●​ A large p-value suggests compatibility between the sample and the hypothesized mean of 18.27.
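Shifting the sample by hand is equivalent to using t.test()'s mu argument, which sets the hypothesized mean directly:
> t.test(x, mu = 18.27) # same t-statistic and p-value as t.test(x - 18.27)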

Exploring Multiple Hypothesized Means:


●​ Instead of testing one hypothesized mean, we explored a range of potential values for μ (from 15 to 22)
and calculated p-values for each:
●​ mu <- seq(from=15, to=22, length.out=1000)
●​ pvals <- rep(NA, 1000)
●​ for (i in 1:1000) {pvals[i] <- t.test(x - mu[i])$p.value}
●​ The result: A vector of p-values (pvals) corresponding to each hypothesized μ.
●​ By plotting μ against p-values, we can visually observe which hypothesized means are consistent with
the data at a given significance level.
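The plot can be drawn like this (a sketch; the blue and red thresholds are described in the next section):
> plot(x = mu, y = pvals, type = "l", xlab = "hypothesized mean", ylab = "p-value")
> abline(h = 0.05, col = "blue") # 5% threshold
> abline(h = 0.01, col = "red") # 1% threshold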

Visualizing P-Values:
●​ The plot shows the relationship between hypothesized means (μ) and p-values:
●​ Blue line at p=0.05
○​ Values of μ with p>0.05 are "accepted" at the 5% significance level. These correspond to means
compatible with the data.
○​ These means form a confidence interval around
the sample mean.
●​ Red line at p=0.01
○​ A stricter rejection threshold (reject only when p < 0.01) widens the range of accepted μ; it corresponds to the wider 99% confidence interval.
●​ The key takeaway is that the p-value tells us whether the
data is compatible with H0, not whether H0 is "true."

Key Results - From the Graph:


●​ Range of Accepted Means (at p = 0.05); [17.85, 19.60]
○​ This range aligns with the 95% confidence interval from the original t.test.
●​ Behavior of P-Values:
○​ P-values decrease as hypothesized means (μ) move further from the sample mean.
○​ At the edges of the confidence interval, p ≈ 0.05
●​ Impact of Thresholds
○​ Lowering the threshold to p=0.01 increases the range of accepted means.

Interpreting Confidence Intervals:


●​ The confidence interval output from a t-test can be understood in two ways:
○​ Statistical Interpretation: A range of values for μ where we fail to reject H0
○​ Practical Interpretation: A range of plausible population means compatible with the sample data.
●​ Confidence Interval vs Null Hypothesis
○​ The confidence interval and the range of μ values where p>0.05 are the same. This means:
○​ The confidence interval reflects the null hypotheses we would accept at a 5% significance level.

> min(which(pvals > 0.05)) ⇨ 112
> mu[112] ⇨ 15.77778
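The upper end of the accepted range can be read off the same way; together, the two endpoints recover the confidence interval directly from the vector of p-values:
> max(which(pvals > 0.05)) # index of the largest accepted mu
> mu[max(which(pvals > 0.05))] # upper end of the range of accepted means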

Video #11 Notes


MLE and Linear Regression:
●​ x <- rnorm(n = 100) # Generate 100 random values from a standard normal distribution
●​ y <- 2.3 * x + rnorm(n = 100) # Add noise to create a linear relationship
●​ Not perfectly linear bc of random noise added to y.
●​ The goal is to estimate the true slope (2.3) that was used to generate y.

Using Trial and Error Approach:


●​ y_ <- 1 * x ; slope = 1 does not match the data points well
●​ y_ = 2 * x ; slope = 2 fits better but not perfect
●​ To find the optimum, we focus on the vertical distances between predicted points and actual data.
○​ Y_ - y (difference in vertical distance)
●​ prod(dnorm(y_ - y)) ; compute the likelihood as the product of the individual densities
●​ But better to work with logs
○​ sum (dnorm(y_- y, log = TRUE)) ⇨ -148.2164
●​ Testing another slope (2.1)
○​ y_ = 2.1 * x
○​ sum (dnorm(y_- y, log = TRUE)) ⇨ -146.1897 #higher likelihood

Automating Slope Search with For Loop:


●​ bs <- seq(from = 1, to = 3, by = 0.1) # Test slopes between 1 and 3
●​ lls <- rep(NA, length(bs)) # Create a vector to store log-likelihoods
●​ for (i in seq_along(bs)) {y_ <- bs[i] * x; lls[i] <- sum(dnorm(y_ - y, log = TRUE))}
●​ plot(bs, lls, type = "l") # Plot log-likelihoods for different slopes
●​ which.max(lls) # Find the slope with the highest log-likelihood
○​ 14
●​ bs [14] ⇨ 2.3
●​ The best slope is around 2.3, matching the original value used to generate the data.
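The grid search can also be replaced by a direct optimization of the log-likelihood (a sketch assuming the same x and y, and unit-variance noise as implied by dnorm's defaults):
> loglik = function(b) sum(dnorm(b * x - y, log = TRUE)) # log-likelihood as a function of the slope
> optimize(loglik, interval = c(1, 3), maximum = TRUE)$maximum # ≈ 2.3, same answer as the grid search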

Using Lm for MLE:


●​ x <- rnorm (n=100)
●​ y = 2.3 * x + 4.5 + rnorm(n=100, sd = 1.9)
●​ Adding a constant (4.5) and Gaussian noise with an SD of 1.9 makes finding the MLE by hand a bit more complicated.
●​ model = lm(y ~ x) # Linear regression: y is a linear function of x
○​ Coefficients: Lists the intercept and slope estimates.
○​ These estimates are based on maximizing the likelihood of the observed data.
○​ Intercept (4.274) & slope (2.020).
●​ abline(model, col = 2) # Add the regression line in red (after plotting the data with plot(x, y))

TUTORIAL #11 NOTES

Semi Random Models:


●​ x = rnorm (n=25) and y = 2.7 * x + rnorm (n=25)
●​ This is a semi random model because it has a deterministic part (2.7x) and a random part (rnorm)
●​ A model is something that has a deterministic part (fully predictable) and non deterministic (two kinds)
○​ Can have true randomness (eg. rnorm is designed to create randomness)
○​ Or lack of precision (e.g. a device that is accurate within a certain range, beyond which there is a range of uncertainty).

Guessing the Value of the Slope (2.7, if we didn't know it):


●​ Y = a + bx (corresponds to a straight line); a is intercept, b is slope.
●​ We always plot the data first to check that the relationship looks linear; it could have a completely different shape (e.g. circular), in which case a straight-line model would not make sense.
●​ lm (y~x)
○​ Intercept: 0.1086 & x: 2.8948
○​ The true value of the slope is 2.7 and the intercept is 0 but these values are a close enough
approximation because in estimation problems, we never get the exact values.
●​ abline (lm(y~x), col = "red") ; corresponds to regression line
●​ abline(0, 2.7, col = "blue")
○​ Difference between the truth (blue) and the estimation from the data (red). not perfect but also
not very far off.

Testing when B is Equal to 0:


●​ When b is equal to 0, the line is flat and y has nothing to do with x; thus we need to test whether the points really show a linear trend or whether the true line is flat.
●​ Null (b equal to 0); Alt (b is not equal to 0).
●​ anova(lm(y~x))
○​ strongly reject the null hypothesis; p value (4.019e-13)
●​ Then using confidence intervals to get a range of plausible values of a and b
○​ confint(lm(y~x))
○​ range from 2.48 to 3.30 for b
○​ 2.7 is in the range so the range is correct
●​ The 95% confidence interval "hits" the true value, so you can be confident that b lies within this range.
●​ Conceptually, this is like testing many null hypotheses (b = 1, 2, 3, ... and every value in between): the values that are not rejected (here, roughly those between 2.48 and 3.30) form the confidence interval, which corresponds to an infinite number of statistical tests.

Making Predictions:
●​ If x = 1, what is y?
●​ Plug into the equation: y = a + b·x = 0.1086 + 2.8948 × 1 = 3.0034.
○​ best guess is 3
●​ If you are predicting something outside range of observed values (eg. 7), it is called extrapolation
●​ Within range is interpolation
●​ confint (lm(y~x))
○​ 0.1086 + 2.4846760 * 1 = 2.593276
○​ 0.1086 + 3.3049405 * 1 = 3.41354
●​ Anything between 2.59 and 3.41 is a good estimate
for x = 1.
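For reference, predict() performs the same plug-in calculation and can return a properly computed interval (a sketch, assuming the fitted model lm(y ~ x) from above; its interval will differ somewhat from the rough bounds pieced together above):
> fit = lm(y ~ x)
> predict(fit, newdata = data.frame(x = 1), interval = "confidence") # point estimate ≈ 3 plus lower/upper bounds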

How to Get the Strength of the Relationship:


●​ If you plot x~y or y~x, you will get different estimates using lm.
●​ But the correlation coefficient, cor(x, y), is symmetric; it is a value between -1 and 1 measuring the strength of the linear dependence between x and y (e.g. -0.992736)
○​ Or can do lm(y~x) and then -1.5223 * sd(x)/sd(y) = -0.992736
●​ cor.test(x,y)
○​ p-value < 2.2e-16; reject the null hypothesis.

VIDEO #12 NOTES

i <- rbinom(n=1, size=4, prob=0.45)


j <- rbinom(n=1, size=5, prob=i/4) ⇨ 2
What is the prob that i is equal to 3, given that j is equal to 2?

Bayesian Problems Via Replace Method:


●​ Use a replace strategy by simulating a large number of trials to estimate the joint distribution of i and j.
○​ Then compute the conditional probability from the joint distribution
●​ mat <- matrix(0, nrow=5, ncol=6) # Create a 5x6 matrix for joint probabilities
●​ rownames(mat) <- 0:4 # Rows represent possible values of i (0 to 4)
●​ colnames(mat) <- 0:5 # Columns represent possible values of j (0 to 5)
●​ for (k in 1:10000) {
●​ i <- rbinom(n=1, size=4, prob=0.45) # Simulate i
●​ j <- rbinom(n=1, size=5, prob=i/4) # Simulate j given i
●​ mat[i + 1, j + 1] <- mat[i + 1, j + 1] + 1} # Increment the joint frequency
●​ i + 1 and j + 1 are used because R indexing starts at 1 (not 0).
●​ 170/(812+1172+170) = 0.07892293 # count of (i = 3, j = 2) divided by the total count of j = 2 (only i = 1, 2, 3 can produce j = 2)
> mat # display the joint frequency table of i (rows) and j (columns)
Alternative Method:
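One possible alternative (a sketch; not necessarily the method the original notes had in mind) is to compute the posterior exactly with dbinom() instead of simulating:
> prior = dbinom(0:4, size = 4, prob = 0.45) # P(i) for i = 0..4
> likelihood = dbinom(2, size = 5, prob = (0:4)/4) # P(j = 2 | i)
> posterior = prior * likelihood / sum(prior * likelihood) # Bayes' rule
> posterior[4] # P(i = 3 | j = 2) ≈ 0.083, in the same range as the simulated 0.079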
