
BN2102 NOTES 1-6

Contents
1. Estimation
   Python
2. Hypothesis Testing
   Python
3. ANOVA
   Python
4. Power and Sample Size & Paired t-test
   Part 1: Paired t-tests
   Part 2: Power & sample size
   Python
5. Nominal Data
   Python
6. Survival Analysis
   Python
1. Estimation
Normal Distribution
1. Using the full population as the sample (census)
• Population mean: μ = (∑ X_i) / N
• Population variance: σ² = ∑(X_i − μ)² / N
• Population standard deviation: σ = √( ∑(X_i − μ)² / N )

Probability-density function (pdf):
a function such that the area under the curve between two points a and b equals the probability that the value of X falls between a and b. We say that X "follows" the given probability distribution, or alternatively that X "is distributed as" the given distribution.
• Area under the whole distribution curve = 1 (total probability)

Normal distribution N(μ, σ²)

The normal distribution curve is characterized by 2 parameters: the mean μ and the standard deviation σ.
pdf: f(x) = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²))

Standardization
• transform the general N(μ, σ²) to the standard N(0, 1)
• if X ~ N(μ, σ²), then Z = (X − μ)/σ ~ N(0, 1)
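A quick worked example (numbers chosen for illustration, not from the lecture): if X ~ N(100, 15²), then P(X > 115) = P(Z > (115 − 100)/15) = P(Z > 1) ≈ 0.159.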

***Using only a small random sample from the whole population as the sample
• Sample mean: X̄ = (∑ X_i) / n
• Sample variance: s² = ∑(X_i − X̄)² / (n − 1)
• Sample standard deviation: s = √( ∑(X_i − X̄)² / (n − 1) )
(a single observation shows no variability, hence we divide by n − 1 rather than n)
Comparing the variability of means of different random samples
I can pick many different random samples from my population, so what is the variability of the means of these different samples?
➔ We will do hypothesis testing on the means of random samples like these.

• Mean of the sample means (X̿): equals the population mean μ
• Standard deviation of the sample means (σ_X̄):
  o called the Standard Error of the Mean, SEM
  o estimates how well X̄ approximates the true value μ
  o smaller than the population standard deviation σ (extreme values cancel out when computing means)
*The value of the standard error of the mean (SEM) decreases as the sample size increases.

Central Limit Theorem


The CLT says that for a random sample X of sufficiently large size n:
• The sample mean will follow a normal distribution N, regardless of the distribution of the original population.
• The mean of N will equal the mean of the underlying population: X̄ = (∑_{i=1}^{n} X_i)/n → μ
• The standard deviation of N (the SEM) is σ_X̄ = σ/√n

- this allows us to estimate the population mean

The CLT is of crucial importance because it tells us the distribution that the sample mean (X̄) follows. This allows us to make inferences about the population mean based on only one sample of size n taken from a population P.

Sufficiently large:
- if n > MN, then the CLT applies regardless of the actual distribution of P
- if n < MN, the approximation X̄ ~ N(μ, σ²/n) is only valid if the population P is normal, or at least symmetrical.
Traditionally, MN = 30. Modern simulations reveal that if P is not too skewed, then MN ≈ 10–20.
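A minimal simulation sketch (my addition, not from the lecture; the exponential population and the parameters are illustrative) to see the CLT at work: means of repeated samples from a skewed population come out approximately normal, centred at μ with spread σ/√n.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)       # seeded generator for reproducibility
n = 30                               # sample size
# exponential population: skewed, with mu = 1 and sigma = 1
sample_means = rng.exponential(scale=1.0, size=(5000, n)).mean(axis=1)
print(np.mean(sample_means))         # close to mu = 1
print(np.std(sample_means))          # close to sigma/sqrt(n) = 1/sqrt(30), about 0.18
plt.hist(sample_means, bins=50);     # histogram is approximately bell-shaped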

Estimation of the population mean, with σ known
[Figure: standard normal distribution versus Student t distribution]
The Student t distribution, used when n is small, has a lower peak in the centre and larger (heavier) tails (the blue curve in the slides), reflecting the uncertainty of having a small sample. As n increases, the t distribution converges to the normal distribution.

The t distribution can be used for any sample size when σ is unknown; when n is large, the answer will be the same as with a normal distribution.

➔ Estimating μ (population mean), when σ is unknown (unknown population sd)

We use the sample standard deviation (s) as an estimate of the population standard deviation, together with the Student t distribution.
• Provided the underlying population is normally distributed
• This holds for small samples as well
Confidence interval
Increasing the sample size will decrease the width of the confidence interval for the mean.
**For two different samples drawn from the same population, if one sample is larger than the other, its confidence interval for the mean (at the same confidence level) is narrower than the other sample's.

Increasing the confidence level will increase the width of the confidence interval.
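Putting the pieces together (this formula is implied by the Python code below), the confidence interval for μ when σ is unknown is

X̄ ± t(1−α/2, n−1) · s/√n

The √n in the denominator is why the interval narrows as the sample size grows, and a higher confidence level means a larger t value and hence a wider interval.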


Python
NumPy
• import numpy as np  # imports the math library to be used
• np.fmod(a,b) returns a % b (the remainder)
Plotting
• import matplotlib.pyplot as plt  # imports the plotting library; numpy must be imported too

x = np.arange(a,b,c)              # x-axis values from a to b (excluded) in increments of c
y = np.zeros(len(x))              # y-axis array of zeros, same length as x
for i in range(0,len(y)):         # range runs from 0 to len(y)-1 (end exclusive)
    y[i] = np.sin(3*x[i]) + 0.2*x[i]   # np.sin comes from numpy
plt.plot(x,y,'k-');               # the plot instruction needs 3 things:
                                  # 1. array on the x axis (x)
                                  # 2. array on the y axis (y); x and y must have the same length
                                  # 3. 'k-': the letter is a colour (k black, b blue, r red); '-' draws a line

Estimation
• Visualize data with a boxplot

import numpy as np                        # scipy: import the library for the t distribution
import matplotlib.pyplot as plt
from scipy import stats

frac_data = np.loadtxt('tutorial1.dat');  # np.loadtxt('FILE NAME.dat') imports data (must be in same folder)
y = frac_data[:,1]                        # keeps only the second column of the array (all rows)
n = len(y)                                # sample size n = the length of array y
plt.boxplot(y);                           # boxplot syntax !!!! IMPORTANT !!!

• 95% confidence interval

x_bar = np.mean(y);                       # calculate sample mean
sum_sq = 0;                               # calculate standard deviation
for i in range(0,n):
    sum_sq = sum_sq + (y[i] - x_bar)**2
sample_std = np.sqrt(sum_sq/(n-1));
t_value = stats.t.ppf(0.975,n-1);         # stats.t.ppf (percent point function) returns the value
                                          # of t given an area under the distribution curve
                                          # 0.975 = 1 - (0.05/2); n-1 = degrees of freedom
                                          # for 99% confidence: area = 0.995
upper_bound = x_bar + t_value*sample_std/np.sqrt(n);   # upper bound of interval
lower_bound = x_bar - t_value*sample_std/np.sqrt(n);   # lower bound of interval
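As a cross-check (my addition; not part of the lecture code), scipy can return the same interval in one call, reusing the variables above:

ci_lower, ci_upper = stats.t.interval(0.95, n-1, loc=x_bar, scale=sample_std/np.sqrt(n))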
2. Hypothesis Testing
σ is unknown. Hence, we use s (the sample standard deviation) as an estimate for σ, together with the t distribution. These tests are called Student t tests.

Assumption: the underlying population is approximately normal.

Hypothesis testing on the mean

1) Null hypothesis H0: mean of population (μ) = a given value of interest (μ0)
   Alt hypothesis H1: mean of population (μ) ≠ a given value of interest (μ0)
2) Choose a significance level α (α = 1 − confidence level)
   The rejection area in each tail = α/2
3) Compute the sample mean (X̄) and sample standard deviation (s)

Main idea of hypothesis testing: we try to reject the NULL hypothesis H0 by showing that, if H0 were indeed true (μ = μ0), then the probability of drawing a sample of size n characterized by X̄ and s would be very small (smaller than α).

Errors:
• Type 1 error: we reject H0 when it is actually true
• Type 2 error: we fail to reject H0 when it is actually false
• Probability of a type 1 error is always ≤ α

***The NULL hypothesis (H0) is never confirmed by hypothesis testing. We either reject it, or
fail to reject it.
general rule: the NULL hypothesis is stated as the status quo/no effect/equality between
two things

Hypothesis testing on the mean: decision rules

Test statistic: t = (X̄ − μ0) / (s/√n), compared against t_crit = t(1−α/2, n−1)

|t| > t_crit, P-value < α: we reject the NULL hypothesis. The collected data support μ ≠ μ0. The p-value represents the probability of having drawn a sample of size n with sample mean X̄ and sample standard deviation s if, in fact, μ = μ0.

|t| < t_crit, P-value > α: we fail to reject the NULL hypothesis. The collected data failed to support μ ≠ μ0.
***This does not allow us to conclude that H0 (the NULL hypothesis) is true.

One-sided hypothesis testing on the mean (instead of ≠, choose < or >)

We draw one sample X of n measurements from a population. Based on the collected data, we want to know whether the mean of the population μ is greater than a given value of interest μ0 or not, AND we are completely uninterested in a possible μ < μ0 result.

1) We formulate two hypotheses and call them H0 and H1:
• The NULL hypothesis H0: μ = μ0
• The alternate hypothesis H1: μ > μ0
2) We choose a significance level α (α = 0.05 for 95% confidence).
3) We compute the sample mean X̄ and sample standard deviation s.
Hypothesis testing on the difference of two means (whether two means are different)

A typical scenario: we select two groups of children. The first group X is made of 9 children whose father has heart disease. The second group Y is made of 15 children whose father is healthy. Is the average cholesterol level different in the two groups?

• NULL hypothesis H0: no difference between the groups (μX = μY)
• alternate hypothesis H1: there is a difference between the two groups (μX ≠ μY)

The approach: the idea is to use the same technique we used to test for a value μ = μ0, but instead test whether μX − μY = μ0 (with μ0 = 0).

If we have two independent normally distributed random variables A and B, their sum will also be normally distributed, with mean equal to the sum of their means and variance equal to the sum of the variances. The same holds for the difference.

We are going to assume equal variances between X and Y and then apply the CLT to the difference of the two random variables. If nX and nY are large enough, by the CLT, Z will also be normally distributed.

If the standard deviation (σ) is unknown → apply the Student t distribution

From the previous lecture: if σ is unknown, we can use the sample variances as estimates, provided that we assume the underlying populations are normal. In this case, we can use the Student t distribution instead.

• the pooled variance s_p² is used as an estimate (a weighted average) for the underlying population variance, which is assumed to be the same for the populations from which X and Y were drawn (assuming equal variances):
  s_p² = [(nX − 1)s_X² + (nY − 1)s_Y²] / (nX + nY − 2)
  t = (X̄ − Ȳ) / (s_p · √(1/nX + 1/nY))

If the standard deviation (σ) is unknown, and the variances of the populations are unequal:

• instead of creating a pooled variance to estimate σ, we keep both sample variances s_X and s_Y:
  t = (X̄ − Ȳ) / √(s_X²/nX + s_Y²/nY)
• In this case, the tricky part is estimating the number of degrees of freedom of the t distribution (see the formula in the Python code below).

Back to our scenario: let's assume equal variances for a second. Once we compute the mean and standard deviation of our two samples, we compute the t statistic and check in which region it falls.

If we reject the NULL hypothesis, we can claim that the data support a difference in the mean
value of cholesterol levels between the population of children whose father is healthy as
compared with the children whose father has heart disease.

If we can’t reject the NULL hypothesis, we can only say that our data are unable to confirm a
difference between the two means. The p-value can be computed as before.
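As a sanity check (my addition, not from the lecture), scipy implements both versions of this test; x and y here stand for the two raw data arrays:

# equal_var=True gives the pooled-variance Student t test;
# equal_var=False gives the unequal-variance (Welch) test
t_stat, p_value = stats.ttest_ind(x, y, equal_var=True)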
Python

healthy_data = np.loadtxt('Healthy.dat');            # load the data
temp_inf_hea = healthy_data[:,1];                    # extract all rows of the second column;
n_hea = len(temp_inf_hea);                           # its length is the sample size n_hea
x_bar_hea = np.mean(temp_inf_hea);                   # sample mean
sum_sq = 0;                                          # standard deviation
for i in range(0,n_hea):                             # (*the same is to be performed on both data sets)
    sum_sq = sum_sq + (temp_inf_hea[i]-x_bar_hea)**2;
s_hea = np.sqrt(sum_sq/(n_hea-1));
t_stat = (x_bar_hea - 100)/(s_hea/np.sqrt(n_hea));   # obtain the t statistic using the formula
t_crit = stats.t.ppf(0.975,n_hea-1)                  # critical value for 95% confidence
Two-sided hypothesis test (H1: μ ≠ μ0)

if (np.abs(t_stat)>t_crit):   # if |t_stat| exceeds t_crit, reject the NULL
    print('Reject NULL hypothesis. Data support that mu differs from 100 microns');
else:                         # otherwise, fail to reject the NULL
    print('Unable to reject the NULL hypothesis. Data failed to support any difference between mu and 100 microns');
# saying "Data support mu = 100" is WRONG
One-sided hypothesis test (H1: μ > μ0 or μ < μ0)

if (t_stat_os>t_crit_os):
    print('Reject NULL hypothesis. Data support mu > 86');
else:
    print('Unable to reject NULL hypothesis. Data failed to support mu > 86')
Hypothesis testing on TWO DATA SETS, assuming unequal variance

t_stat_2 = (x_bar_hea-x_bar_gla) / np.sqrt(s_hea**2/n_hea+s_gla**2/n_gla);   # t statistic using the formula
dof = round(((s_gla**2/n_gla+s_hea**2/n_hea)**2) /
            ((s_gla**2/n_gla)**2/(n_gla-1) + (s_hea**2/n_hea)**2/(n_hea-1)));   # degrees of freedom using the formula
t_crit_2 = stats.t.ppf(0.975,dof);                   # t critical value
if (np.abs(t_stat_2) > t_crit_2):                    # if |t_stat| > t_crit, reject the NULL:
                                                     # there is a difference between the 2 groups
    print('Reject NULL hypothesis. Data support a difference in mean thickness between the two groups');
else:                                                # otherwise fail to reject; cannot confirm a difference
    print('Unable to reject the NULL hypothesis. Data failed to support any difference in mean thickness between the two groups');
p_value_2 = 2.0 * (1.0 - stats.t.cdf(np.abs(t_stat_2),dof));   # p-value of the two-data-set hypothesis test
3. ANOVA
Purpose: comparison of multiple test groups to determine whether a null hypothesis is rejected or not (hypothesis testing on multiple test groups, not just 2).

Scenario: We randomly select 4 groups of 10 people from a population of a small village of 200 healthy individuals. Each group is told to eat only a certain type of food: 10 are told to continue eating normally (control group), 10 are told to eat fruits only, 10 to eat pasta only and 10 to eat steaks only. After some time on these diets, we measure their cardiac output.

The main question is: does diet affect cardiac output? Note that we have 4 groups to compare.
H0: Diet has no effect on cardiac output
H1: Diet has effect on cardiac output

***The key difference from the t test (test the difference between two groups) is that we have
4 groups to compare.

The idea is that there will be variability among the samples.


If H0 is true, this variability is simply due to random sampling and does not reflect a real
effect of the diet on cardiac output.
In this case, all samples would come from the same population.

However, if H0 is false instead, it would mean that one or more samples were drawn from a
different population with different mean cardiac outputs.

Estimate the variance of the population in two ways:

1) From the variability within each group: the average of the 4 sample variances. In our scenario:
   S_wit² = (1/4)(S_con² + S_fru² + S_pas² + S_ste²)
2) From the variability between the samples: we look at the means of the samples. Recall that the SEM is σ_X̄ = σ/√n, hence σ² = n·σ_X̄². We estimate the population variance as
   S_bet² = n·s_X̄²

Key Idea: if H0 is true, both estimates target the same population variance, so their ratio F = S_bet² / S_wit² should be close to 1. If H0 is false, S_bet² is inflated and F grows.

F distribution
- x axis: values of F
- The highest probability is around 1, and probabilities get lower as we move towards the tail.
- Identify a threshold value f_crit.
  If F > f_crit → reject the NULL hypothesis; otherwise, we are unable to reject it.

The F distribution is characterized by 2 parameters, referred to as degrees of freedom:
- numerator degrees of freedom ν1 = m − 1
- denominator degrees of freedom ν2 = m(n − 1)
- m is the number of groups
- n is the size of our samples

***as ν2 increases, the area under the tail becomes smaller (as n increases, sampling is more accurate and the chance of obtaining a large value of F just by chance gets smaller)
Steps for ANOVA

***if all samples were drawn from the same population, then the probability of having a value
F > fcrit just due to random sampling is p
ANOVA: Interpretation

We are only saying that the chances that those 4 samples were drawn from the same
population (with the same cardiac output) are small. Therefore, we reject this notion and say
that they were drawn from different populations (with different cardiac outputs)

Common Mistake
ANOVA will not tell us which of the groups is responsible for the observed differences. To find which one is the culprit, we could perform many pairwise t tests. This approach appears reasonable and will yield p-values as well as results of hypothesis testing. However, there is something wrong with it: the type I errors compound across the many comparisons.

Correct approach
STEP 1: Perform ANOVA and test whether the diet has an effect on cardiac output. If we reject
the NULL hypothesis, go to STEP 2.

STEP 2: perform pairwise t tests (control versus pasta, control versus fruits, control versus steak, etc.), but taking into account the problem of compounding errors.
****To take into account the problem of compounding errors, use multiple comparison procedures, or post-hoc tests (e.g., the Bonferroni correction, which divides α by the number of comparisons).

Unequal sample size: [formulas from the lecture slides not reproduced here]

Python
Computing sum_sq within (wit) and between (bet) – equal sample size

s_sq_wit = (var_G1+var_G2+var_G3+var_SUT)/4.0;       # within: sum of variances divided by no. of groups
x_bar_x_bar = (x_bar_G1+x_bar_G2+x_bar_G3+x_bar_SUT)/4.0;   # between, step 1: find the mean of the means
s_x_bar_sq = ((x_bar_G1-x_bar_x_bar)**2 + \
              (x_bar_G2-x_bar_x_bar)**2 + \
              (x_bar_G3-x_bar_x_bar)**2 + \
              (x_bar_SUT-x_bar_x_bar)**2)/3.0;       # step 2: find the sum square of means
s_sq_bet = n_sub*s_x_bar_sq;                         # step 3: calculate the sum square between

ANOVA Test

F_ratio = s_sq_bet/s_sq_wit;           # 1. calculate the F-ratio value
Dfn = 3;                               # 2. numerator dof, m-1
Dfd = 4*(n_sub-1);                     # 3. denominator dof, m(n-1)
F_crit = stats.f.ppf(0.95,Dfn,Dfd);    # 4. calculate the F critical value
if (F_ratio>F_crit):
    print('Reject the NULL hypothesis. Data support a difference in GEARS scores among the 4 groups');
else:
    print('Unable to reject the NULL hypothesis. Data failed to support any difference in GEARS scores among the 4 groups');
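As a cross-check (my addition), scipy's one-way ANOVA returns the F ratio and the p-value directly; G1, G2, G3 and SUT stand for the raw data arrays of the four groups:

F_ratio, p_value = stats.f_oneway(G1, G2, G3, SUT)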
4. Power and Sample Size & Paired t-test
Part 1: Paired t-tests (for conducting before and after experiments on subjects)
We select one sample of individuals and measure blood pressure before and after taking a certain drug. The question is: does the drug have any effect on blood pressure?
H0: the drug has no effect. H1: the drug has an effect on blood pressure.

Paired t-tests are only applicable to designs like this:

Study A: I recruit 150 patients and measure their glucose levels. I then prescribe them a drug for 6 months. After 6 months I measure their blood glucose levels again.

Not these:
Study B: I recruit 150 patients. 75 of them take a placebo for 6 months and 75 take a drug for 6 months. After 6 months, I measure their blood glucose levels.
Study C: I recruit 150 patients. 50 of them take a placebo, 50 take a drug and 50 take nothing. After 6 months, I measure their blood glucose levels.
Part 2: Power & sample size

Quantify type I errors: we reject the NULL hypothesis ("report an effect") when in fact H0 is true. The probability of making a type I error is ≤ α.
Estimate type II errors: a type II error occurs when H0 is false and H1 is true, but we fail to reject the NULL hypothesis (a false negative: we "miss an effect" when there is one).
• Probability of making a type II error = β.
• The power of a test is defined as 1 − β.
• The higher the power, the better. The term "power" is used because 1 − β is the probability of "not missing an effect" when there is one, and is therefore associated with the idea of how "powerful" a test is in detecting an effect.

Consider a scenario with two groups: one taking a placebo and one taking a drug that is supposed to alter cardiac output. We would use the two-sample t statistic to test the hypothesis. If we want to estimate β, we need a scenario where the NULL hypothesis is false.
Assume that, in reality, there is a difference of δ between the two means, so the true distribution of the t statistic is shifted by an amount determined by δ. Under this shifted distribution, the region of failing to reject the NULL hypothesis (between −t_crit and t_crit) gives the probability of making a type II error (β), i.e. failing to reject H0 when H0 is false. The area to the left of −t_crit under the shifted distribution is very small and can be ignored. The power of the test (1 − β) is the remaining area beyond t_crit.

Power of a Student t test – calculations

To perform power calculations, we need to know:
1. Characteristics of the true distribution (σ and δ) – usually unknown
2. The significance level α, which limits type I errors. This is used to determine t_crit (usually 0.05).
3. The sample sizes (chosen before the start of the study)
Effects of δ, σ and α on power
• Increasing δ increases the power of our test (a bigger difference is harder to miss).

• Increasing α → increased power
  o For a given δ, increasing α will shift t_crit towards the centre of the graph (enlarging the power area).
  o A bigger α and smaller t_crit → easier to reject the NULL hypothesis (the t_crit threshold is lower), therefore decreasing the chances of a type II error.
  o However, increasing α also increases the chances of making type I errors (a trade-off). Equivalently, increasing the confidence level (decreasing α) decreases power, because t_crit increases.

• The only way to increase power without compromising on α for a given δ is to INCREASE THE SAMPLE SIZE (n).

• Increasing σ will decrease power (an inverse effect) – it is easier to miss an effect when variability is high.

Sample size determination: we try increasing values of n until the desired power is reached (see the sketch below). Swapping the control and treatment group sizes will NOT affect the final power of the test.
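A minimal sketch of that search (my addition; it assumes equal group sizes n and α = 0.05, and the δ, σ and target power values are illustrative), reusing the power formula from the Python section below:

import numpy as np
from scipy import stats

delta, sigma, target_power = 10.0, 12.0, 0.80   # illustrative values
n = 2
power = 0.0
while power < target_power:
    n = n + 1
    dof = 2*n - 2                                # two groups of size n
    t_crit = stats.t.ppf(1 - 0.05/2, dof)
    D = delta/(sigma*np.sqrt(1.0/n + 1.0/n))
    power = 1.0 - stats.t.cdf(t_crit - D, dof)   # power = area beyond t_crit under the shifted distribution
print('smallest n:', n, 'power:', power)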

Python
Two Groups: calculating the p-value (assuming equal variance)

# n_control and n_esrd are the two sample sizes; std_control and std_esrd are the two
# standard deviations; mean_control and mean_esrd are the means of each group
s_p = np.sqrt(((n_control-1)*std_control**2 + (n_esrd-1)*std_esrd**2) / (n_control+n_esrd-2));
t_stat = (mean_control-mean_esrd)/(s_p*np.sqrt(1/n_control+1/n_esrd));   # t statistic assuming equal variance
p_value = 2.0*(1.0 - stats.t.cdf(np.abs(t_stat),n_control+n_esrd-2));

Compute the power of the test for a given delta (= 5)

delta = 5;                                        # given values of delta and sigma
sigma = s_p;
t_crit = stats.t.ppf(1-0.05/2, n_control+n_esrd-2);   # first value = area, second = degrees of freedom
D = delta/(sigma*np.sqrt(1/n_control+1/n_esrd));
t_star_slide = t_crit-D;                          # t* = refer to lecture slides!!
power = 1.0 - stats.t.cdf(t_star_slide, n_control+n_esrd-2);   # stats.t.cdf (cumulative distribution
                                                  # function) returns the area to the left of a given value

Plot of power versus delta

all_deltas = np.arange(2,30,0.1);
all_powers = np.zeros(len(all_deltas));
for i in range(0,len(all_deltas)):
    D = all_deltas[i]/(sigma*np.sqrt(1/n_control+1/n_esrd));
    t_star = t_crit-D;
    all_powers[i] = 1.0 - stats.t.cdf(t_star,n_control+n_esrd-2);

plt.figure(1);
plt.plot(all_deltas,all_powers,'r-');

Plot of power versus sigma

all_sigmas = np.arange(0.1,50,0.1);
powers = np.zeros(len(all_sigmas));
n = 10;            # assumed n_control = n_esrd = n = 10
delta = 10;
t_crit_2 = stats.t.ppf(1-0.05/2,n+n-2);
for i in range(0,len(all_sigmas)):
    D = delta/(all_sigmas[i]*np.sqrt(1/n+1/n));
    t_star = t_crit_2-D;
    powers[i] = 1.0 - stats.t.cdf(t_star,n+n-2);

plt.figure(2)
plt.plot(all_sigmas,powers,'k-');

Paired t-test

esrd_after = np.loadtxt('ESRD_after.dat');              # load the two data sets,
diff_array = esrd - esrd_after;                         # calculate a difference array
mean_diff = np.mean(diff_array);
sum_sq = 0;
for i in range(0,n_esrd):                               # calculate the sample mean and standard
    sum_sq = sum_sq + (diff_array[i] - mean_diff)**2    # deviation of the differences
std_diff = np.sqrt(sum_sq/(n_esrd-1));
t_stat_paired = mean_diff/(std_diff/np.sqrt(n_esrd));   # t statistic and critical value
t_crit_paired = stats.t.ppf(1-0.01/2,n_esrd-1);         # for paired tests
if (np.abs(t_stat_paired)>t_crit_paired):               # if |t_stat| > t_crit, reject the NULL hypothesis
    print('Reject the NULL hypothesis. Data support a mean difference in ORAIP concentration between before and after');
else:                                                   # otherwise, fail to reject the NULL hypothesis
    print('Unable to reject the NULL hypothesis. Data failed to support any difference in ORAIP concentration between before and after');
p_value_paired = 2.0*(1.0 - stats.t.cdf(np.abs(t_stat_paired), n_esrd-1));   # 2 times each tail-end area
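As a cross-check (my addition), scipy's built-in paired t-test on the same two arrays should give matching results:

t_stat, p_val = stats.ttest_rel(esrd, esrd_after)   # paired t-test; p_val is two-sided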
5. Nominal Data
Purpose: Analyze nominal data (categorical data)
Two types of data:
• Continuous data: variable that can take any real number as a value
• Nominal data: variables measured on a nominal scale (categorical). Eg: male/female
We randomly select a group of Singaporeans living in the area. We divide them according to
their ethnicity. We ask them if they contracted dengue fever in the last year. We want to know
whether ethnicity plays a role in the incidence of dengue fever.
H0: the incidence of dengue is the same in all ethnic groups.
H1: the incidence of dengue is different among ethnic groups.

In this case, we would state the problem as whether the proportions of infected and non-infected are the same for all the ethnicities.
Chi-square test
We first produce a contingency table based on the data given (example below):

From the contingency table, we derive the expected table with expected values.
NULL HYPOTHESIS: the incidence of dengue is the same in all ethnic groups.

***Look at the total values first: what proportion of the entire population gets dengue? Then apply that percentage/fraction to the individual group populations to get the expected values.

Or use this practical tip to find expected values: Expected value of a cell = (row total × column total) / grand total.

*if all groups have equal sizes and equal proportions, the X² value is 0! Transposing/switching rows and columns of a contingency table doesn't change the X² statistic.
Determine the X² value:
X² = Σ (Observed − Expected)² / Expected, summed over all cells

The higher the X², the more unlikely it is that the incidence is the same in all groups (since the difference between observed and expected is higher).
The ratio X² follows the X² distribution.
A big value of X² → reject the NULL hypothesis.

X² distribution
k = (r − 1)(c − 1), where r and c are the numbers of rows and columns
k = DOF

X² test
Choose a 95% confidence level → this determines a critical value X²_crit such that the area to its right is 0.05.
If X² > X²_crit → reject the NULL hypothesis.
If it is smaller, we are unable to reject the NULL hypothesis.
X² test: interpretation (same as ANOVA)

If we reject the NULL hypothesis: perform multiple pairwise comparisons (same idea as ANOVA), using Bonferroni-corrected X² tests.
• We must form 2x2 contingency tables and expected tables for each pairwise comparison and apply the Yates correction. Then DOF = 1; find the X² statistic for each pair.

How to report X² test results:
"Using a X² test, we found that the distribution of ABO blood types in patients with pancreatic cancer was (not) statistically significantly different from the normal distribution in the US population (P < 0.001)."

Special case of 2x2 contingency tables (k = 1):
The Yates Correction for Continuity subtracts 0.5 from each |O − E| before squaring: X² = Σ (|O − E| − 0.5)²/E. It makes the value of the χ² statistic smaller, thus making rejection of the NULL hypothesis more difficult.
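As a cross-check (my addition, not from the lecture code), scipy builds the expected table, computes the dof and, for 2x2 tables, applies the Yates correction by default:

chi2, p, dof, expected = stats.chi2_contingency(table)   # correction=True (Yates) is the default for 2x2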
Python
X² test

tot_row_1 = sum(table[0,:]);            # 1. compute the sums of all elements
tot_row_2 = sum(table[1,:]);            #    in each row and column
tot_col_1 = sum(table[:,0]);
tot_col_2 = sum(table[:,1]);
tot_totals = tot_row_1 + tot_row_2;     #    grand total
E11 = tot_row_1*tot_col_1/tot_totals;   # 2. calculate the elements of the expected matrix
E12 = tot_row_1*tot_col_2/tot_totals;
E21 = tot_row_2*tot_col_1/tot_totals;
E22 = tot_row_2*tot_col_2/tot_totals;
chi_sq_stat = (                         # 3. compute the chi-square statistic (with the Yates correction: -0.5)
    (abs(table[0,0]-E11)-0.5)**2/E11 +
    (abs(table[0,1]-E12)-0.5)**2/E12 +
    (abs(table[1,0]-E21)-0.5)**2/E21 +
    (abs(table[1,1]-E22)-0.5)**2/E22);
# *steps 1-3 can be wrapped in a function:
# def compute_chi_sq_2_by_2(table): ... return chi_sq_stat
chi_sq_crit = stats.chi2.ppf(0.95,3);   # 4. calculate chi-square crit
                                        #    3 = (4-1)*(2-1) = dof, 95% confidence
if (chi_sq_stat>chi_sq_crit):           # 5. test statements
    print('Reject the NULL hypothesis. Data support a difference in cancer distribution among the 4 blood types');
else:
    print('Failed to reject the NULL hypothesis. Data failed to support any difference in cancer distribution among the 4 blood types');
p_value = 1.0 - stats.chi2.cdf(chi_sq_stat,3);   # 6. compute the p-value

Bonferroni with the compute_chi_sq_2_by_2(table) function

O_vs_A = data_table[[0,1],:];   # note: all columns, only two rows
chi_sq_O_vs_A = compute_chi_sq_2_by_2(O_vs_A);

O_vs_B = data_table[[0,2],:];
chi_sq_O_vs_B = compute_chi_sq_2_by_2(O_vs_B);

O_vs_AB = data_table[[0,3],:];
chi_sq_O_vs_AB = compute_chi_sq_2_by_2(O_vs_AB);

A_vs_B = data_table[[1,2],:];
chi_sq_A_vs_B = compute_chi_sq_2_by_2(A_vs_B);

A_vs_AB = data_table[[1,3],:];
chi_sq_A_vs_AB = compute_chi_sq_2_by_2(A_vs_AB);

B_vs_AB = data_table[[2,3],:];
chi_sq_B_vs_AB = compute_chi_sq_2_by_2(B_vs_AB);

chi_sq_crit_BONFE = stats.chi2.ppf(1-0.05/6,1);   # Bonferroni: alpha divided by the 6 comparisons, dof = 1

*BONFERRONI TEST CONCLUSION
# We note that all comparisons involving group O lead to rejection of the NULL hypothesis. This can be
# summarized as "Data support a difference between blood group O and all the other groups in pancreatic
# cancer incidence". Group O, according to the data collected, is affected less than the other blood
# types by cancer.
6. Survival Analysis
Purpose: we run a clinical trial where we recruit a sample of patients who undergo a certain procedure (e.g., a surgery) or a treatment. We monitor them for a period of time and record, for each of them, the moment when another significant event happens (e.g., death).
Three KEY features:
1. A clearly identifiable start time for each subject
2. A clearly identifiable end time for each subject
3. A sample selected randomly from a population of interest
Issues with running such trials:
1. Subjects start at different times
2. We do not know the time of the significant event for all subjects, because
   a. some are still alive at the end of the observation period
   b. the investigator fails to monitor all subjects throughout the study: lost to follow-up (uncontactable, unrelated death)
*Data relative to subjects for which we do not know the time of the significant event (e.g. death) are CENSORED DATA.

It is still OK to run the analysis if:

• not all patients are dead before the end of the trial
• some patients join after other patients, i.e. not all enrol at the same time
• we are unable to report whether all subjects are dead or alive by the end of the trial
• there is censored data in the data set
• some subjects die of unrelated causes

Estimating the survival function S(t)

The survival function is the probability of an individual in the population surviving past time t.
Steps to obtain the survival function Ŝ(t):
1. Reset the start time (measure each subject's time from their own enrolment)

2. Order by time of death or last available information

3. Estimate the survival probabilities

• Find the survival probabilities for all the times where significant events happen to test subjects, starting with the earliest time
• P_aft = (alive past time t_i) / (alive just before time t_i) = (n_i − d_i)/n_i
• P_bef = (number of people alive up to time t) / (total number of people)
• Ŝ(t_i) = Ŝ(t_{i−1}) · (n_i − d_i)/n_i
• Drop any censored observations from these calculations (they leave the risk set but add no deaths)

4. Compute the cumulative survival rate Ŝ(t)

- Exclude the times of censored observations

5. Plot the Kaplan-Meier survival curve Ŝ(t)
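A tiny worked example (numbers invented for illustration): with 5 subjects, one death at t = 2, one censoring at t = 3 and one death at t = 4, we get Ŝ(2) = 1 × (5 − 1)/5 = 0.8; the censored subject leaves the risk set without changing Ŝ; at t = 4 only 3 remain at risk, so Ŝ(4) = 0.8 × (3 − 1)/3 ≈ 0.53. The curve stays flat between death times, which produces the characteristic staircase plot.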


Comparison of two survival curves: Log Rank Test
Null hypothesis: no difference between the 2 survival curves
Alt hypothesis: there is a difference between the 2 survival curves

Test statistic z_s

We first choose one of the two curves and introduce the concept of the expected number of deaths at time i: the number of individuals that we would expect to die at time i if the deaths were equally distributed between the two groups (i.e., the scenario where H0 is true).

- U_L is then given by the summation of the differences between actual and expected deaths: U_L = Σ_i (O_i − E_i). The test statistic z_s = U_L/√(s_L²) is compared against the standard normal distribution.
Reporting of results from log rank test
Failed to reject H0:
Data failed to support any statistically significant difference in survival outcome
between cancer patients with and without the cluster formation (p = 0.066, log-rank
test).

Rejected H0:
Data supported a statistically significant improvement in survival outcome for patients
without cluster formation compared with patients with cluster formation.

Retrospective, prospective and cross-sectional data collections

For the log-rank test, the times of death must be pooled across both groups. So the times of death recorded in this example are: 1, 3, 5 (shared, so counted once only), 15, 28, 30.

Python
Survival Functions

# This function takes in two arrays of the same length. Each element of the arrays corresponds to an
# individual. The first array contains the time, since the start of the study, of an event related to
# the individual. The event can be death or loss to follow-up. If the event is death, the corresponding
# position in the second array is 1. If the event is not death, the corresponding position in the second
# array is 0. The two arrays do not need to be sorted by time.
#
# Given these two arrays, this function returns a list containing:
# At position 0: an array with all the times of death in chronological order (first element is zero)
# At position 1: an array with the values of the Kaplan-Meier survival curve S_hat (first element is 1)
# At position 2: an array with all the times of events (deaths or otherwise)
# At position 3: an array with the total number of individuals still alive just before the
#                corresponding time in the array at position 2
# At position 4: an array with the total number of individuals who die at the corresponding time in
#                the array at position 2
# At position 5: an array with the total number of individuals who are lost to follow-up at the
#                corresponding time in the array at position 2
# At position 6: an array that can be used for plotting time versus survival in the typical
#                "staircase" plots. Time is at this position
# At position 7: an array that can be used for plotting time versus survival in the typical
#                "staircase" plots. Survival is at this position
def compute_survival(unsorted_time_of_events, unsorted_type_of_events):
    # argsort internally sorts the array and gives the original indices in sorted order
    indices = np.argsort(unsorted_time_of_events);
    time_of_events = np.zeros(len(indices));
    type_of_events = np.zeros(len(indices));
    for i in range(0,len(indices)):
        index = indices[i];
        time_of_events[i] = unsorted_time_of_events[index];
        if (unsorted_type_of_events[index]==1):
            type_of_events[i] = 1;

    N = len(time_of_events)
    total_surviving = N;
    n_i = [N];
    times_of_death = [0.0];
    times_of_death_plot = [0.0];
    S_hat = [1.0];
    S_hat_plot = [1.0];
    d_i = [0];
    lost_i = [0];
    i = 0;
    all_times = [0.0];
    while (i<N):
        time_of_interest = time_of_events[i]
        # determine the number of events at this time
        n_events = np.count_nonzero(time_of_events == time_of_interest)
        deaths_at_i = 0;
        lost_at_i = 0;
        for j in range(i,i+n_events):
            if (type_of_events[j]==1):  # there was a death
                deaths_at_i = deaths_at_i+1;
            else:
                lost_at_i = lost_at_i+1;
        if (deaths_at_i>0):
            S_hat_plot.append(S_hat[-1]);
            times_of_death_plot.append(time_of_interest);
            new_surv_frac = float(n_i[-1]-deaths_at_i)/n_i[-1];
            new_value = S_hat[-1]*new_surv_frac
            S_hat.append(new_value);
            S_hat_plot.append(new_value);
            times_of_death.append(time_of_interest);
            times_of_death_plot.append(time_of_interest);

        all_times.append(time_of_interest);

        i = i+deaths_at_i+lost_at_i;

        total_surviving = total_surviving - deaths_at_i - lost_at_i;
        n_i.append(total_surviving);
        d_i.append(deaths_at_i);
        lost_i.append(lost_at_i);
    return [times_of_death, S_hat, all_times, n_i, d_i, lost_i, times_of_death_plot, S_hat_plot];
# This function takes in two survival curves in the format returned by the compute_survival function.
# It goes through them and calculates the values of u_L and s_L^2 that are necessary to perform the
# log-rank test. u_L and s_L^2 are returned in a list (u_L at position 0, s_L^2 at position 1).
def compare_survivals(survival_1, survival_2):
    # First, obtain, from the two curves, all the times where we need to do something
    to_be_added = [];
    for i in range(0, len(survival_2[2])):
        position = np.where(np.isclose(survival_1[2],survival_2[2][i],1e-4));
        if (len(position[0])==0):  # if not there, we add it
            to_be_added.append(survival_2[2][i]);

    all_times = survival_1[2] + to_be_added;
    all_times.sort();

    u_l = 0.0;
    s_2_l = 0.0;
    n_1_i = survival_1[3][0];
    n_2_i = survival_2[3][0];
    d_1_i = 0;
    d_2_i = 0;
    lost_1_i = 0;
    lost_2_i = 0;
    for i in range(1,len(all_times)):  # note: we ignore t=0
        # check whether this time is in the first curve, the second, or both
        pos_2 = np.where(np.isclose(survival_2[2],all_times[i],1e-4))
        pos_1 = np.where(np.isclose(survival_1[2],all_times[i],1e-4))
        if len(pos_2[0])>0:
            d_2_i = survival_2[4][pos_2[0][0]];
            n_2_i = survival_2[3][pos_2[0][0]-1];
            lost_2_i = survival_2[5][pos_2[0][0]];
        else:
            n_2_i = n_2_i - d_2_i - lost_2_i;
            d_2_i = 0;
            lost_2_i = 0;

        if len(pos_1[0])>0:
            d_1_i = survival_1[4][pos_1[0][0]];
            n_1_i = survival_1[3][pos_1[0][0]-1];
            lost_1_i = survival_1[5][pos_1[0][0]];
        else:
            n_1_i = n_1_i - d_1_i - lost_1_i;
            d_1_i = 0;
            lost_1_i = 0;

        if (d_2_i>0 or d_1_i>0):  # u_L is computed only when there is a death
            d_total_i = d_2_i + d_1_i;
            n_total_i = n_2_i + n_1_i;
            f_i = float(d_total_i)/n_total_i;  # float is key, otherwise it may do an integer division
            e_i = n_2_i*f_i;
            o_minus_e = d_2_i - e_i;
            u_l = u_l + o_minus_e;
            if (n_total_i>1):
                s_2_l = s_2_l + (float(n_1_i)*n_2_i*d_total_i*(n_total_i-d_total_i))/(n_total_i*n_total_i*(n_total_i-1));

    return [u_l, s_2_l];

Survival Analysis Tasks

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from SurvivalFunctions import compute_survival, compare_survivals

# Plot the Kaplan-Meier curves for the existing and new drugs
data_existing = np.loadtxt('existing.dat');                 # data for the existing drug
times_existing = data_existing[:,2] - data_existing[:,1];   # time of event since enrolment
types_existing = data_existing[:,3];
survival_existing = compute_survival(times_existing, types_existing);

data_drug = np.loadtxt('new_drug.dat');                     # data for the new drug
times_drug = data_drug[:,2] - data_drug[:,1];
types_drug = data_drug[:,3];
survival_drug = compute_survival(times_drug, types_drug);

plt.figure(1);
plt.plot(survival_existing[6],survival_existing[7],'b-',
         survival_drug[6],survival_drug[7],'r-');
# Compute the median survival times for both drugs (the first death time where S_hat drops below 0.5)
t_deaths_existing = survival_existing[0];
s_hat_existing = survival_existing[1];
median_existing = 0;
for i in range(0,len(t_deaths_existing)):
    if (s_hat_existing[i]<0.5):
        median_existing = t_deaths_existing[i];
        break;

t_deaths_drug = survival_drug[0];
s_hat_drug = survival_drug[1];
median_drug = 0;
for i in range(0,len(t_deaths_drug)):
    if (s_hat_drug[i]<0.5):
        median_drug = t_deaths_drug[i];
        break;
print('Median survival time for existing drug is ',median_existing,' months');
print('Median survival time for the new drug is ',median_drug,' months');
# Log-rank test aimed at unveiling any statistically significant (95%) difference between the
# two drugs in terms of survival
result = compare_survivals(survival_existing, survival_drug);
u_L = result[0];
s_2_UL = result[1];
z_stat = u_L/np.sqrt(s_2_UL);
z_crit = stats.norm.ppf(0.975);   # 0.975 = 1-0.05/2
if (abs(z_stat)>z_crit):
    print('Reject the NULL hypothesis. Data support a difference in survival outcome between the new and existing drug');
else:
    print('Unable to reject the NULL hypothesis. Data failed to support any difference in survival outcome between the new and the existing drug');
# After looking at the plot, I notice that the curve for the new drug is "higher" (meaning better
# survival). After rejecting the NULL hypothesis, I can state that the data support that the new drug
# improves survival outcome as compared with the existing drug.

# After rejecting the NULL hypothesis, we know p < 0.05
p_value = 2.0*(1.0 - stats.norm.cdf(abs(z_stat)));
