BN2102 1-6 Notes
Contents
1. Estimation
Python
2. Hypothesis Testing
Python
3. ANOVA
Python
4. Power and Sample Size & Paired t-test
Part 1: Paired t-tests
Part 2: Power & sample size
Python
5. Nominal Data
Python
6. Survival Analysis
Python
1. Estimation
Normal Distribution
1. Using the full population as sample (census)
• Population mean, $\mu = \frac{\sum X_i}{N}$
• Population variance, $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$
• Population standard deviation, $\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$
Standardization
• transform the general $N(\mu, \sigma^2)$ to the standard $N(0, 1)$
• if $X \sim N(\mu, \sigma^2)$, then $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$
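The population formulas and the standardization step can be checked numerically. This is a minimal sketch; the array `x` is made-up data treated as a full population, so the divisor is N (`ddof=0`).

```python
import numpy as np

# Hypothetical population values, used only for illustration.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()            # population mean: sum(X_i) / N
sigma = x.std(ddof=0)    # population standard deviation: divide by N

z = (x - mu) / sigma     # standardized values

print(mu, sigma)                 # mean and standard deviation of x
print(z.mean(), z.std(ddof=0))   # standardized values have mean 0 and std 1
```

After standardization the values have mean 0 and standard deviation 1 whatever the original units were.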
***Using only a small random sample from the whole population
• Sample mean $\bar{X}$
• Sample variance $s^2$
Sufficiently large:
- if $n > M_N$, then the CLT applies regardless of the actual distribution of P
- if $n < M_N$, the approximation $\bar{X} \sim N(\mu, \sigma^2/n)$ is only valid if the population P is normal or at least symmetrical.
Traditionally, $M_N = 30$. Modern simulations revealed that if P is not too skewed, then $M_N \sim 10$ to $20$.
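The approximation $\bar{X} \sim N(\mu, \sigma^2/n)$ can be explored with a small simulation. The sketch below assumes a skewed exponential population (an arbitrary choice, not one from the notes) and a fixed random seed, and compares the spread of many sample means with $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

# Hypothetical skewed population: exponential with mean 1 and sigma 1.
sigma = 1.0
n = 30                           # sample size (the traditional M_N)
n_repeats = 20000                # number of independent samples drawn

samples = rng.exponential(scale=1.0, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean())        # close to the population mean, 1
print(sample_means.std(ddof=1))   # close to sigma / sqrt(n)
print(sigma / np.sqrt(n))
```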
The t distribution can be used for any sample size when 𝜎 is unknown; when n is large, it gives essentially the same answer as the normal distribution.
Confidence interval
Increasing the sample size narrows the confidence interval for the mean, because its half-width is $t \cdot s/\sqrt{n}$.
**For two samples of different sizes drawn from the same population, the larger sample will generally give the narrower confidence interval for the mean (at the same confidence level).
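A short sketch of how the interval width shrinks with sample size. The helper name `ci_mean` and the simulated population are assumptions made for illustration; `np.std(..., ddof=1)` gives the sample standard deviation in one call.

```python
import numpy as np
from scipy import stats

def ci_mean(y, confidence=0.95):
    """Return (lower, upper) of the t-based confidence interval for the mean."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    x_bar = y.mean()
    s = y.std(ddof=1)                                   # sample standard deviation
    t_value = stats.t.ppf(1 - (1 - confidence) / 2, n - 1)
    half_width = t_value * s / np.sqrt(n)
    return x_bar - half_width, x_bar + half_width

rng = np.random.default_rng(1)
population = rng.normal(loc=50.0, scale=10.0, size=100000)  # hypothetical population

small = rng.choice(population, size=10, replace=False)
large = rng.choice(population, size=1000, replace=False)

lo_s, hi_s = ci_mean(small)
lo_l, hi_l = ci_mean(large)
print(hi_s - lo_s, hi_l - lo_l)   # width for n = 10 versus n = 1000
```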
Estimation
• Visualize data with boxplot
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats  # SciPy: library providing the t distribution
frac_data = np.loadtxt('tutorial1.dat')  # np.loadtxt('FILE NAME.dat') imports data (file must be in the same folder)
y = frac_data[:, 1]  # all rows of column index 1 (the second column of the array)
n = len(y)  # sample size (n) is the length of array y
plt.boxplot(y)  # boxplot syntax (important)
• 95% confidence interval
x_bar = np.mean(y)  # sample mean
# sample standard deviation
sum_sq = 0
for i in range(0, n):
    sum_sq = sum_sq + (y[i] - x_bar)**2
sample_std = np.sqrt(sum_sq/(n - 1))
# stats.t.ppf (percent point function) returns the value of t for a given area under the
# distribution curve: 0.975 = 1 - (0.05/2) is the area, n - 1 the degrees of freedom
# (for 99% confidence: area = 0.995)
t_value = stats.t.ppf(0.975, n - 1)
# upper and lower bound of the interval
upper_bound = x_bar + t_value*sample_std/np.sqrt(n)
lower_bound = x_bar - t_value*sample_std/np.sqrt(n)
2. Hypothesis Testing
σ is unknown. Hence, use s (the sample standard deviation) as an estimate for σ, together with the t distribution. The tests are called Student t tests.
1) Null hypothesis H0: mean of population ($\mu$) = a given value of interest ($\mu_0$)
Alt hypothesis H1: mean of population ($\mu$) ≠ a given value of interest ($\mu_0$)
2) Choose a significance level 𝛼 (𝛼 = 1 – confidence level). For a two-sided test, the area in each tail is 𝛼 / 2
3) Compute the sample mean ($\bar{X}$) and the sample standard deviation (s)
Main idea of hypothesis testing: We try to reject the NULL hypothesis H0 by showing that, if indeed H0 is true ($\mu = \mu_0$), then the probability of drawing a sample of size n characterized by $\bar{X}$ and s is very small (smaller than α).
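The steps above can be sketched for a single sample as follows. The measurements and the reference value `mu_0` are made up for illustration; `stats.ttest_1samp` is used only as a cross-check of the hand computation.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements and reference value mu_0.
x = np.array([101.2, 99.8, 102.5, 100.9, 103.1, 98.7, 101.8, 102.2])
mu_0 = 100.0
alpha = 0.05

n = len(x)
x_bar = x.mean()
s = x.std(ddof=1)                                     # sample standard deviation
t_stat = (x_bar - mu_0) / (s / np.sqrt(n))
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)            # two-sided critical value
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), n - 1))   # two-sided p-value

reject = abs(t_stat) > t_crit
print(t_stat, t_crit, p_value, reject)

# Cross-check with SciPy's built-in one-sample t test.
res = stats.ttest_1samp(x, mu_0)
print(res.statistic, res.pvalue)
```

Comparing `abs(t_stat)` with `t_crit` and comparing `p_value` with `alpha` always lead to the same decision.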
Errors:
***The NULL hypothesis (H0) is never confirmed by hypothesis testing. We either reject it, or
fail to reject it.
General rule: the NULL hypothesis is stated as the status quo/no effect/equality between two things.
One-sided test: We draw one sample X of n measurements from a population. Based on the collected data, we want to know whether the mean of the population $\mu$ is greater than a given value of interest $\mu_0$ or not, AND we are completely uninterested in a possible $\mu < \mu_0$ result.
A typical scenario: We select two groups of children. The first group X is made of 9 children whose father has heart disease. The second group Y is made of 15 children whose father is healthy. Is the average cholesterol level different in the two groups?
The approach: The idea is to use the same technique we used to test for a value of $\mu = \mu_0$, but instead test whether $\mu_X - \mu_Y = \mu_0$ (with $\mu_0 = 0$).
If we have two independent normally distributed random variables A and B, their sum will be
also normally distributed, with mean equals the sum of their means and variance equals the
sum of the variances. Same for the difference.
We are going to assume equal variances between X and Y and then apply the CLT to the difference of the two random variables. If $n_X$ and $n_Y$ are large enough, by the CLT, Z will also be normally distributed.
If the standard deviation (𝝈) is unknown → apply the Student t distribution
From previous lecture: If σ is unknown, we can use the sample variances as estimates,
provided that we assume the underlying populations are normal.
If we reject the NULL hypothesis, we can claim that the data support a difference in the mean
value of cholesterol levels between the population of children whose father is healthy as
compared with the children whose father has heart disease.
If we can’t reject the NULL hypothesis, we can only say that our data are unable to confirm a
difference between the two means. The p-value can be computed as before.
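A sketch of the two-group test under the equal-variance assumption. The cholesterol values are simulated, not real data, and the pooled-variance formula used here is the standard one; the notes do not show their exact expression, so treat it as an assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical cholesterol levels; group sizes 9 and 15 as in the scenario.
rng = np.random.default_rng(2)
x = rng.normal(loc=210.0, scale=30.0, size=9)    # father has heart disease
y = rng.normal(loc=190.0, scale=30.0, size=15)   # father is healthy

n_x, n_y = len(x), len(y)
s2_x, s2_y = x.var(ddof=1), y.var(ddof=1)

# Pooled variance (equal-variance assumption).
s2_pooled = ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)
dof = n_x + n_y - 2

# H0: mu_X - mu_Y = 0
t_stat = (x.mean() - y.mean()) / np.sqrt(s2_pooled * (1 / n_x + 1 / n_y))
t_crit = stats.t.ppf(0.975, dof)
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), dof))
print(t_stat, t_crit, p_value)

# Cross-check with SciPy's equal-variance two-sample t test.
res = stats.ttest_ind(x, y, equal_var=True)
print(res.statistic, res.pvalue)
```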
Python
healthy_data = np.loadtxt('Healthy.dat')
temp_inf_hea = healthy_data[:, 1]  # all rows of column index 1 (the second column)
n_hea = len(temp_inf_hea)  # its length is the sample size n_hea
x_bar_hea = np.mean(temp_inf_hea)  # sample mean
# sample standard deviation (*the same is to be performed on both data sets)
sum_sq = 0
for i in range(0, n_hea):
    sum_sq = sum_sq + (temp_inf_hea[i] - x_bar_hea)**2
s_hea = np.sqrt(sum_sq/(n_hea - 1))
# t statistic from the formula, testing against the value of interest 100
t_stat = (x_bar_hea - 100)/(s_hea/np.sqrt(n_hea))
t_crit = stats.t.ppf(0.975, n_hea - 1)  # two-sided critical value at the 95% confidence level
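Two-sided hypothesis test ($\mu \neq \mu_0$): reject the NULL hypothesis if abs(t_stat) > t_crit
One-sided hypothesis test ($\mu > \mu_0$, here $\mu_0 = 86$), where t_stat_os and t_crit_os are the statistic and critical value of the one-sided test:
if t_stat_os > t_crit_os:
    print('Reject NULL hypothesis. Data support mu > 86')
else:
    print('Unable to reject NULL hypothesis. Data failed to support mu > 86')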
Hypothesis testing on TWO DATA SETS, assuming unequal variance
3. ANOVA
The main question is: does diet affect cardiac output? Note that we have 4 groups to compare.
H0: Diet has no effect on cardiac output
H1: Diet has an effect on cardiac output
***The key difference from the t test (test the difference between two groups) is that we have
4 groups to compare.
However, if H0 is false instead, it would mean that one or more samples were drawn from a
different population with different mean cardiac outputs.
Key Idea:
F distribution
- x axis: values of F
- The highest probability is around 1, and probabilities get lower as we move towards the tail.
- Identify a threshold value $f_{crit}$. If $F > f_{crit}$ → reject the NULL hypothesis. Otherwise, unable to reject.
***As $\nu_2$ increases, the area under the tail becomes smaller (as n increases, sampling becomes more accurate and the chance of obtaining a large value of F gets smaller).
Steps for ANOVA
***If all samples were drawn from the same population, then the probability of having a value $F > f_{crit}$ just due to random sampling is the chosen significance level α (0.05 at the 95% confidence level).
ANOVA: Interpretation
We are only saying that the chances that those 4 samples were drawn from the same
population (with the same cardiac output) are small. Therefore, we reject this notion and say
that they were drawn from different populations (with different cardiac outputs)
Common Mistake
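ANOVA will not tell which one of the groups is responsible for the observed differences. To find which one is the culprit, we could perform many pairwise t tests. This approach appears reasonable and will yield p-values as well as results of hypothesis testing.
However, there is something wrong with it: each t test carries its own probability α of a type I error, so the chance of at least one false positive grows with the number of comparisons (compounding errors).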
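The size of the problem with many pairwise tests can be estimated with a few lines of arithmetic. This sketch assumes the tests are independent, which pairwise tests on shared groups are not exactly, so the first figure is only indicative. The Bonferroni adjustment shown is one common multiple comparison procedure.

```python
alpha = 0.05            # significance level of each individual t test
m = 4                   # number of groups
k = m * (m - 1) // 2    # number of pairwise comparisons among m groups

# Probability of at least one false positive if all k tests were independent.
familywise = 1 - (1 - alpha) ** k

# Bonferroni: run each test at alpha / k so the overall rate stays at or below alpha.
alpha_bonferroni = alpha / k
familywise_bonf = 1 - (1 - alpha_bonferroni) ** k

print(k, familywise, alpha_bonferroni, familywise_bonf)
```

With 4 groups there are 6 pairwise comparisons, and the chance of at least one false positive rises to roughly 26% instead of 5%.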
Correct approach
STEP 1: Perform ANOVA and test whether the diet has an effect on cardiac output. If we reject
the NULL hypothesis, go to STEP 2.
STEP 2: Perform pairwise t tests (control versus pasta, control versus fruits, control versus steak, etc.) while taking into account the problem of compounding errors, using multiple comparison procedures (post-hoc tests).
multiple comparison procedures:
ANOVA Test
F_ratio = s_sq_bet/s_sq_wit  # 1. Calculate F-ratio value
Dfn = 3  # 2. numerator dof, m - 1
Dfd = 4*(n_sub - 1)  # 3. denominator dof
F_crit = stats.f.ppf(0.95, Dfn, Dfd)  # 4. Calculate F-critical value
if F_ratio > F_crit:
    print('Reject the NULL hypothesis. '
          'Data support a difference in GEARS scores '
          'among the 4 groups')
else:
    print('Unable to reject the NULL hypothesis. '
          'Data failed to support any difference in GEARS '
          'scores among the 4 groups')
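The fragment above relies on `s_sq_bet`, `s_sq_wit` and `n_sub` computed earlier. A self-contained sketch of one way to obtain them is given below. It assumes equal group sizes and uses simulated data, and it cross-checks the result against `stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: m = 4 groups with n_sub = 7 subjects each.
rng = np.random.default_rng(3)
groups = [rng.normal(loc=mu, scale=1.0, size=7) for mu in (5.0, 5.2, 4.9, 6.5)]

m = len(groups)
n_sub = len(groups[0])

group_means = np.array([g.mean() for g in groups])
group_vars = np.array([g.var(ddof=1) for g in groups])

# Valid for equal group sizes only.
s_sq_wit = group_vars.mean()                  # within-groups variance estimate
s_sq_bet = n_sub * group_means.var(ddof=1)    # between-groups variance estimate

F_ratio = s_sq_bet / s_sq_wit
Dfn = m - 1
Dfd = m * (n_sub - 1)
F_crit = stats.f.ppf(0.95, Dfn, Dfd)
p_value = 1.0 - stats.f.cdf(F_ratio, Dfn, Dfd)
print(F_ratio, F_crit, p_value)

# Cross-check with SciPy's one-way ANOVA.
res = stats.f_oneway(*groups)
print(res.statistic, res.pvalue)
```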
4. Power and Sample Size & Paired t-test
Part 1: Paired t-tests (for conducting before and after experiments on subjects)
We select one sample of individuals and measure blood pressure before and after taking a
certain drug. Question is: does the drug have any effect on blood pressure?
H0: the drug has no effect on blood pressure
H1: the drug has an effect on blood pressure
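A paired t test reduces to a one-sample t test on the per-subject differences. The sketch below uses made-up blood pressure readings and cross-checks the hand computation with `stats.ttest_rel`.

```python
import numpy as np
from scipy import stats

# Hypothetical blood pressure readings for the same 8 subjects.
before = np.array([142.0, 138.0, 150.0, 145.0, 160.0, 155.0, 148.0, 152.0])
after = np.array([135.0, 136.0, 144.0, 146.0, 151.0, 149.0, 147.0, 143.0])

d = after - before                    # per-subject differences
n = len(d)
d_bar = d.mean()
s_d = d.std(ddof=1)                   # sample standard deviation of the differences

t_stat = d_bar / (s_d / np.sqrt(n))   # H0: mean difference = 0
t_crit = stats.t.ppf(0.975, n - 1)
reject = abs(t_stat) > t_crit
print(t_stat, t_crit, reject)

# Cross-check with SciPy's paired t test.
res = stats.ttest_rel(after, before)
print(res.statistic, res.pvalue)
```

Because each subject acts as their own control, between-subject variability drops out of the differences.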
Quantify type I errors: We reject the NULL hypothesis (“report an effect”), when in fact H0
is true. The probability of making type I errors is ≤ α.
Estimate Type II errors: occurs when H0 is false and H1 is true, but we failed to reject the
NULL hypothesis. (false negatives: “miss an effect when there is one”)
• Probability of making a type II error = β.
• The power of a test is defined as 1 – β
• the higher the power, the better. The term “power” is used because 1 − β is the probability
“not to miss an effect” when there is one and therefore is associated with the idea of how
“powerful” a test is in detecting an effect.
scenario with two groups: one taking placebo and one taking a drug that is supposed to alter
cardiac output.
In this case we would use this t statistic to test the hypothesis. If we want to estimate β, we
need to have a scenario where the NULL hypothesis is false.
The curve below assumes that, in reality, there is a difference of δ between the two means,
so the curve is shifted.
If, in fact, the real distribution is this one in red below, then the region of failing to reject the
NULL hypothesis corresponds to the section highlighted in green.
Therefore, the probability of making a type II error (β), that is, failing to reject H0 when H0 is false, is given by the area in green.
Ignore the area to the left of $-t_{crit}$ in the red curve because it is very small. The power of the test (1 – β) is the area in red.
• Increasing 𝜎 will decrease power: it is easier to miss an effect when variability is high.
Sample size determination: we try increasing values of n until the desired power is reached.
If we swap the control and treatment group sizes, it will NOT affect the final power of the test.
Decreasing the significance level 𝛼 (that is, raising the confidence level) increases t_crit and therefore decreases power.
Magnitude of the effect = δ; as δ increases, power increases.
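Sample size determination can be sketched as a simple search. The function names `power_two_groups` and `smallest_n` are hypothetical, equal group sizes are assumed, and the power formula is the same approximation used in these notes (the small lower rejection tail is ignored).

```python
import numpy as np
from scipy import stats

def power_two_groups(n, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided, two-group t test with n subjects per group."""
    dof = 2 * n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, dof)
    D = delta / (sigma * np.sqrt(1 / n + 1 / n))
    return 1.0 - stats.t.cdf(t_crit - D, dof)

def smallest_n(delta, sigma, target_power=0.8, alpha=0.05, n_max=10000):
    """Try increasing values of n until the target power is reached."""
    for n in range(2, n_max + 1):
        if power_two_groups(n, delta, sigma, alpha) >= target_power:
            return n
    return None   # target not reachable within n_max

n_needed = smallest_n(delta=10.0, sigma=10.0)
print(n_needed, power_two_groups(n_needed, 10.0, 10.0))
```

The same helper reproduces the trends listed above: power rises with δ and falls as σ grows.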
Python
Two Groups: calculating p-value (assuming equal variance)
# Plot of power versus delta (sigma, n_control, n_esrd and t_crit are defined beforehand)
all_deltas = np.arange(2, 30, 0.1)
all_powers = np.zeros(len(all_deltas))
for i in range(0, len(all_deltas)):
    D = all_deltas[i]/(sigma*np.sqrt(1/n_control + 1/n_esrd))
    t_star = t_crit - D
    all_powers[i] = 1.0 - stats.t.cdf(t_star, n_control + n_esrd - 2)
plt.figure(1)
plt.plot(all_deltas, all_powers, 'r-')
Plot of power versus sigma
all_sigmas = np.arange(0.1, 50, 0.1)
powers = np.zeros(len(all_sigmas))
n = 10  # assumed n_control = n_esrd = n = 10
delta = 10
t_crit_2 = stats.t.ppf(1 - 0.05/2, n + n - 2)
for i in range(0, len(all_sigmas)):
    D = delta/(all_sigmas[i]*np.sqrt(1/n + 1/n))
    t_star = t_crit_2 - D
    powers[i] = 1.0 - stats.t.cdf(t_star, n + n - 2)
plt.figure(2)
plt.plot(all_sigmas, powers, 'k-')
Paired t-test
5. Nominal Data
In this case, we would state the problem as whether the proportions of infected and non-infected individuals are the same for all the ethnicities.
Chi-square test
We first produce a contingency table based on the data given (example below):
From the contingency table, we analyze and find expected table with expected values:
NULL HYPOTHESIS: incidence of dengue is the same in all ethnic groups
***Look at the totals first: what is the overall fraction of the entire population that gets dengue? Then apply that fraction to the individual group populations to get the expected values.
The higher the $\chi^2$, the more unlikely it is that the incidence is the same in all groups (since the difference between observed and expected is larger).
The statistic $\chi^2$ follows the $\chi^2$ distribution.
Big value of $\chi^2$ = reject the NULL hypothesis
$\chi^2$ distribution
$k = (r-1)(c-1)$, where k = DOF, r = number of rows and c = number of columns of the contingency table
$\chi^2$ test
Choose a 95% confidence level → this determines a critical value $\chi^2_{crit}$ such that the area to its right is 0.05.
If $\chi^2 > \chi^2_{crit}$ → reject the NULL hypothesis.
If it is smaller, unable to reject the NULL hypothesis.
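The whole procedure, from contingency table to decision, can be sketched for a general r x c table. The counts below are invented for illustration, and no Yates correction is applied here since the table is larger than 2x2. `stats.chi2_contingency` serves as a cross-check.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = groups, columns = (infected, not infected).
observed = np.array([[30, 170],
                     [45, 155],
                     [20, 180]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (r, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, c)
grand_total = observed.sum()

# Expected counts if the overall proportions applied to every group.
expected = row_totals @ col_totals / grand_total

chi_sq_stat = ((observed - expected) ** 2 / expected).sum()
r, c = observed.shape
dof = (r - 1) * (c - 1)
chi_sq_crit = stats.chi2.ppf(0.95, dof)
p_value = 1.0 - stats.chi2.cdf(chi_sq_stat, dof)
print(chi_sq_stat, chi_sq_crit, p_value)

# Cross-check with SciPy (continuity correction switched off).
chi2_ref, p_ref, dof_ref, expected_ref = stats.chi2_contingency(observed, correction=False)
```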
$\chi^2$ test: interpretation (same as ANOVA)
• Must form 2x2 contingency tables and expected tables for each pairwise comparison and apply the Yates correction. Then DOF = 1; find the $\chi^2$ statistic for each pair.
The Yates correction for continuity makes the value of the $\chi^2$ statistic smaller, thus making rejection of the NULL hypothesis more difficult.
Python
χ² test
# Steps 1 to 3 can all be wrapped as a function
def compute_chi_sq_2_by_2(table):
    # 1. Compute the sum of all elements in each row and each column
    tot_row_1 = sum(table[0, :])
    tot_row_2 = sum(table[1, :])
    tot_col_1 = sum(table[:, 0])
    tot_col_2 = sum(table[:, 1])
    tot_totals = tot_row_1 + tot_row_2
    # 2. Calculate the elements of the expected matrix
    E11 = tot_row_1*tot_col_1/tot_totals
    E12 = tot_row_1*tot_col_2/tot_totals
    E21 = tot_row_2*tot_col_1/tot_totals
    E22 = tot_row_2*tot_col_2/tot_totals
    # 3. Compute the chi-square statistic (with Yates correction)
    chi_2_stat = ((abs(table[0, 0] - E11) - 0.5)**2/E11 +
                  (abs(table[0, 1] - E12) - 0.5)**2/E12 +
                  (abs(table[1, 0] - E21) - 0.5)**2/E21 +
                  (abs(table[1, 1] - E22) - 0.5)**2/E22)
    return chi_2_stat
# 4. Calculate chi-square crit: 3 = (4-1)*(2-1) = dof, 95% confidence
chi_sq_crit = stats.chi2.ppf(0.95, 3)
# 5. Test statements (chi_sq_stat is the statistic of the full 4x2 table)
if chi_sq_stat > chi_sq_crit:
    print('Reject the NULL hypothesis. '
          'Data support a difference in cancer distribution '
          'among the 4 blood types')
else:
    print('Failed to reject the NULL hypothesis. '
          'Data failed to support any difference in cancer distribution '
          'among the 4 blood types')
# 6. Compute p value
p_value = 1.0 - stats.chi2.cdf(chi_sq_stat, 3)
# Six pairwise comparisons (O vs A follows the same pattern as the others)
O_vs_A = data_table[[0, 1], :]
chi_sq_O_vs_A = compute_chi_sq_2_by_2(O_vs_A)
O_vs_B = data_table[[0, 2], :]
chi_sq_O_vs_B = compute_chi_sq_2_by_2(O_vs_B)
O_vs_AB = data_table[[0, 3], :]
chi_sq_O_vs_AB = compute_chi_sq_2_by_2(O_vs_AB)
A_vs_B = data_table[[1, 2], :]
chi_sq_A_vs_B = compute_chi_sq_2_by_2(A_vs_B)
A_vs_AB = data_table[[1, 3], :]
chi_sq_A_vs_AB = compute_chi_sq_2_by_2(A_vs_AB)
B_vs_AB = data_table[[2, 3], :]
chi_sq_B_vs_AB = compute_chi_sq_2_by_2(B_vs_AB)
chi_sq_crit_BONFE = stats.chi2.ppf(1 - 0.05/6, 1)  # Bonferroni: 0.05 shared among 6 comparisons, dof = 1
# *BONFERRONI TEST CONCLUSION
# We note that all comparisons involving group O will lead to rejection of the NULL
# hypothesis. This can be summarized as "Data support a difference between blood group O and
# all the other groups in pancreatic cancer incidence". Group O, according to the data
# collected, is affected less than other blood types by cancer.
6. Survival Analysis
Purpose: We run a clinical trial where we recruit a sample of patients who undergo a certain procedure (e.g. a surgery) or a treatment. We monitor them for a period of time and record, for each of them, the moment when another significant event happens (e.g. death).
Three KEY Features:
1. Clearly identifiable start time for each subject
2. Clearly identifiable end time for each subject
3. Sample selected randomly from a population of interest
Issues with running such trials:
1. Subjects start at different times
2. We do not know the time of significant event for all subjects because
a. Some are still alive at the end of observation period
b. The investigator fails to monitor all subjects throughout the study: lost to follow-up (uncontactable, unrelated death)
*Data for subjects whose time of significant event (e.g. death) is unknown are CENSORED DATA.
• Find the survival probabilities for all the times at which significant events happen to test subjects, starting with the earliest time
• P_aft = (alive past time $t_i$) / (alive just before time $t_i$) = $(n_i - d_i)/n_i$
• P_bef = (number of people alive up to time t) / (total number of people)
• $\hat{S}(t) = \hat{S}(t-1)\,\frac{n_i - d_i}{n_i}$
• Censored subjects are not counted as deaths; they are removed from the number at risk $n_i$ once they are censored
4. Computing the cumulative survival rate $\hat{S}(t)$
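A compact sketch of the cumulative survival computation described above. The function name `kaplan_meier` and the six-subject data set are assumptions made for illustration; subjects censored at a given time are still counted as at risk at that time.

```python
import numpy as np

def kaplan_meier(times, events):
    """Return (death_times, S_hat) for right-censored data.

    times  : time of death or censoring for each subject
    events : 1 if the subject died at that time, 0 if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    death_times = np.unique(times[events == 1])

    s_hat = []
    s = 1.0
    for t in death_times:
        n_i = np.sum(times >= t)                     # at risk just before t
        d_i = np.sum((times == t) & (events == 1))   # deaths at t
        s *= (n_i - d_i) / n_i                       # S(t) = S(t-1) * (n_i - d_i) / n_i
        s_hat.append(s)
    return death_times, np.array(s_hat)

# Hypothetical data: 6 subjects, two of them censored (event = 0).
times = [2, 3, 3, 5, 6, 8]
events = [1, 1, 0, 1, 0, 1]
death_times, s_hat = kaplan_meier(times, events)
print(death_times)   # times at which deaths occur: 2, 3, 5, 8
print(s_hat)         # 5/6, 2/3, 4/9, 0
```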
Test statistic, zs
We first choose one of the two curves and introduce the concept of expected number
of deaths at time i as the number of individuals that I would expect to die at time i if
the deaths were equally distributed between the two groups (i.e., the scenario
where H0 would be true)
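The idea of expected deaths leads to the log-rank statistic. The sketch below is one common form of it, using the hypergeometric variance and no continuity correction. The function name `log_rank` and the data are hypothetical, and the course's own `compare_survivals` may differ in detail.

```python
import numpy as np
from scipy import stats

def log_rank(times_1, events_1, times_2, events_2):
    """Return (u_L, s2_uL, z) for a log-rank comparison of two groups."""
    t1, e1 = np.asarray(times_1, float), np.asarray(events_1, int)
    t2, e2 = np.asarray(times_2, float), np.asarray(events_2, int)
    death_times = np.unique(np.concatenate([t1[e1 == 1], t2[e2 == 1]]))

    u_L, s2_uL = 0.0, 0.0
    for t in death_times:
        n1, n2 = np.sum(t1 >= t), np.sum(t2 >= t)   # at risk just before t
        d1 = np.sum((t1 == t) & (e1 == 1))          # deaths in group 1 at t
        d2 = np.sum((t2 == t) & (e2 == 1))          # deaths in group 2 at t
        n, d = n1 + n2, d1 + d2
        expected_1 = n1 * d / n                     # expected deaths in group 1 under H0
        u_L += d1 - expected_1
        if n > 1:
            s2_uL += n1 * n2 * d * (n - d) / (n ** 2 * (n - 1))
    z = u_L / np.sqrt(s2_uL)
    return u_L, s2_uL, z

# Hypothetical data: event = 1 is a death, event = 0 is censored.
times_new = [6, 7, 10, 15, 19, 25]
events_new = [1, 1, 0, 1, 1, 0]
times_old = [1, 2, 3, 5, 8, 9]
events_old = [1, 1, 1, 1, 1, 1]

u_L, s2_uL, z_stat = log_rank(times_new, events_new, times_old, events_old)
z_crit = stats.norm.ppf(0.975)
print(u_L, s2_uL, z_stat, abs(z_stat) > z_crit)
```

A negative `u_L` means the first group had fewer deaths than expected under H0. Swapping the two groups only flips the sign of the statistic.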
Rejected H0:
Data supported a statistically significant improvement in survival outcome for patients
without cluster formation compared with patients with cluster formation.
Python
Survival Functions
# This function takes in two arrays of the same length. Each element of the arrays
# corresponds to an individual. The first array contains the time, since the start of the
# study, of an event related to the individual. The event can be death or loss to follow-up.
# If the event is death, the corresponding position in the second array is 1. If the event
# is not death, the corresponding position in the second array is 0. The two arrays do not
# need to be sorted by time.
# argsort internally sorts the array and returns the original indices in sorted order
indices = np.argsort(unsorted_time_of_events)
time_of_events = np.zeros(len(indices))
type_of_events = np.zeros(len(indices))
for i in range(0, len(indices)):
    index = indices[i]
    time_of_events[i] = unsorted_time_of_events[index]
    if unsorted_type_of_events[index] == 1:
        type_of_events[i] = 1
N = len(time_of_events)
total_surviving = N
n_i = [N]
times_of_death = [0.0]
times_of_death_plot = [0.0]
S_hat = [1.0]
S_hat_plot = [1.0]
d_i = [0]
lost_i = [0]
i = 0
all_times = [0.0]
while i < N:
    time_of_interest = time_of_events[i]
    # determine number of events at this time
    n_events = np.count_nonzero(time_of_events == time_of_interest)
    deaths_at_i = 0
    lost_at_i = 0
    for j in range(i, i + n_events):
        if type_of_events[j] == 1:  # there was a death
            deaths_at_i = deaths_at_i + 1
        else:
            lost_at_i = lost_at_i + 1
    if deaths_at_i > 0:
        # repeat the previous value at the new time so the plot is a step function
        S_hat_plot.append(S_hat[-1])
        times_of_death_plot.append(time_of_interest)
        new_surv_frac = float(n_i[-1] - deaths_at_i)/n_i[-1]
        new_value = S_hat[-1]*new_surv_frac
        S_hat.append(new_value)
        S_hat_plot.append(new_value)
        times_of_death.append(time_of_interest)
        times_of_death_plot.append(time_of_interest)
    # assumed bookkeeping step: update the number at risk so that n_i[-1] above is correct
    total_surviving = total_surviving - deaths_at_i - lost_at_i
    n_i.append(total_surviving)
    d_i.append(deaths_at_i)
    lost_i.append(lost_at_i)
    all_times.append(time_of_interest)
    i = i + deaths_at_i + lost_at_i
# First, obtain, from the two, all the times where we need to do something
to_be_added = []
for i in range(0, len(survival_2[2])):
    position = np.where(np.isclose(survival_1[2], survival_2[2][i], 1e-4))
    if len(position[0]) == 0:  # if not there, we add it
        to_be_added.append(survival_2[2][i])
u_l = 0.0
s_2_l = 0.0
n_1_i = survival_1[3][0]
n_2_i = survival_2[3][0]
d_1_i = 0
d_2_i = 0
lost_1_i = 0
lost_2_i = 0
for i in range(1, len(all_times)):  # note we ignore t=0
    # check whether this time is in the first, the second, or both
    pos_2 = np.where(np.isclose(survival_2[2], all_times[i], 1e-4))
    pos_1 = np.where(np.isclose(survival_1[2], all_times[i], 1e-4))
    if len(pos_2[0]) > 0:
        d_2_i = survival_2[4][pos_2[0][0]]
        n_2_i = survival_2[3][pos_2[0][0] - 1]
        lost_2_i = survival_2[5][pos_2[0][0]]
    else:
        n_2_i = n_2_i - d_2_i - lost_2_i
        d_2_i = 0
        lost_2_i = 0
    if len(pos_1[0]) > 0:
        d_1_i = survival_1[4][pos_1[0][0]]
        n_1_i = survival_1[3][pos_1[0][0] - 1]
        lost_1_i = survival_1[5][pos_1[0][0]]
    else:
        n_1_i = n_1_i - d_1_i - lost_1_i
        d_1_i = 0
        lost_1_i = 0
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
plt.figure(1)
plt.plot(survival_existing[6], survival_existing[7], 'b-',
         survival_drug[6], survival_drug[7], 'r-')
# Compute median survival times for both drugs
t_deaths_existing = survival_existing[0]
s_hat_existing = survival_existing[1]
median_existing = 0
for i in range(0, len(t_deaths_existing)):
    if s_hat_existing[i] < 0.5:
        median_existing = t_deaths_existing[i]
        break
t_deaths_drug = survival_drug[0]
s_hat_drug = survival_drug[1]
median_drug = 0
for i in range(0, len(t_deaths_drug)):
    if s_hat_drug[i] < 0.5:
        median_drug = t_deaths_drug[i]
        break
print('Median survival time for existing drug is ', median_existing, ' months')
print('Median survival time for the new drug is ', median_drug, ' months')
# Log-rank test aimed at unveiling any statistically significant (95%) difference
# between the two drugs in terms of survival
result = compare_survivals(survival_existing, survival_drug)
u_L = result[0]
s_2_UL = result[1]
z_stat = u_L/np.sqrt(s_2_UL)
z_crit = stats.norm.ppf(0.975)  # 0.975 = 1 - 0.05/2
if abs(z_stat) > z_crit:
    print('Reject the NULL hypothesis. '
          'Data support a difference in survival outcome '
          'between the new and existing drug')
else:
    print('Unable to reject the NULL hypothesis. '
          'Data failed to support any difference in survival outcome '
          'between the new and the existing drug')
# After looking at the plot, I notice that the curve for the new drug is "higher" (meaning
# better survival). After rejecting the NULL hypothesis, I can state that data support that
# the new drug improves survival outcome as compared with the existing drug.