Statistics Notes
Semester 2, 2020
Measures of Performance
• There are a variety of error rates and measures for evaluating a model’s performance.
                     Actual Positive (D+)   Actual Negative (D-)   Total
Test Positive (S+)   A                      B                      A + B
Test Negative (S-)   C                      D                      C + D
Total                A + C                  B + D                  A + B + C + D
o False negative rate: P( S- | D+ ) = C / (A + C) – had the disease but the test was negative.
o False positive rate: P( S+ | D- ) = B / (B + D) – had no disease but the test was positive.
o Sensitivity/Recall: P( S+ | D+ ) = A / (A + C) – probability of a positive test given the disease is present.
o Specificity: P( S- | D- ) = D / (B + D) – probability of a negative test given no disease.
o Precision/Positive predictive value: P( D+ | S+ ) = A / (A + B) – probability of disease given that the test was positive.
o Negative predictive value: P( D- | S- ) = D / (C + D) – probability of no disease given a negative test.
o Accuracy = (A + D) / (A + B + C + D) – overall proportion of correct results.
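As a quick sanity check, here is a minimal R sketch computing these measures from a hypothetical 2×2 table (the counts A, B, C, D are made up for illustration):

```r
# Minimal sketch: performance measures from a 2x2 table of hypothetical counts
A <- 90; B <- 10   # test positive: true positives, false positives
C <- 5;  D <- 95   # test negative: false negatives, true negatives

sensitivity <- A / (A + C)    # P(S+ | D+), also recall
specificity <- D / (B + D)    # P(S- | D-)
fnr         <- C / (A + C)    # false negative rate = 1 - sensitivity
fpr         <- B / (B + D)    # false positive rate = 1 - specificity
ppv         <- A / (A + B)    # precision / positive predictive value
npv         <- D / (C + D)    # negative predictive value
accuracy    <- (A + D) / (A + B + C + D)

round(c(sens = sensitivity, spec = specificity, FNR = fnr, FPR = fpr,
        PPV = ppv, NPV = npv, acc = accuracy), 3)
```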
Measures of Risk
• Prospective studies – forward-looking analysis involving initially identifying disease-free
individuals classified by presence or absence of a risk factor
o After a certain time period, we examine whether the presence of disease emerges
o E.g. examining the impact of sun exposure by analysing volleyball players’ skin across those
who wore sunscreen and those who didn’t
• Retrospective studies – backward-looking analysis involving identifying those with and without the
disease (outcome categories) and tracing backwards to determine whether the risk factor was
present/absent
o E.g. To examine the effectiveness of zinc oxide as skin protection, we analyse lifeguards who
did and didn’t develop skin cancer and ask them to recall if they used zinc oxide.
• Estimating probabilities from random samples:
o Consider two events A and B.
▪ If we can take a random sample from the whole population, we can estimate P(A)
using the observed sample proportion with attribute A
▪ If we take a random sample from the subpopulation defined by B, we can estimate
P(A|B) using the observed sample proportion of the subpopulation with attribute A
o Prospective Study:
▪ P(D+ | R+) & P(D- |R-) – can be estimated because we controlled on the risk
factor and examined to see whether the disease developed
▪ We can’t estimate – P ( R+ |D+) or P (R - | D-) because we did not take random
samples from the disease group
o Retrospective Study:
▪ P(R+ | D+) & P(R- |D-) – can be estimated because we controlled on the disease
developing and traced back to see if the risk factor was present
▪ We can’t estimate – P ( D+ |R+) or P (D- | R-) because we did not take random
samples from the risk factor sub-groups
▪ Therefore you cannot identify RELATIVE RISK (RR)!
• The result is significant if the constructed confidence interval doesn’t contain 1.
o Equivalently, on the log scale, check whether 0 lies inside the confidence interval for the log odds ratio (or log relative risk).
• Relative Risk:
• This is a stronger measure where you can directly say that R+ is [x] times more likely to have D+.
o Relative risk is defined as the ratio of two conditional probabilities of having the disease:
▪ RR = P(D+ | R+) / P(D+ | R-)
▪ RR = 1 means no difference
▪ RR < 1 implies the risk factor group is less likely to have the disease
▪ RR > 1 implies the risk factor group is more likely to have the disease
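A rough R sketch of estimating RR from a hypothetical prospective study, with a normal-approximation confidence interval on the log scale (the counts and the 95% level are assumptions for illustration):

```r
# Sketch: estimating relative risk from hypothetical prospective-study counts
dis_Rpos <- 30; no_Rpos <- 70   # R+ group: developed disease / did not
dis_Rneg <- 10; no_Rneg <- 90   # R- group: developed disease / did not

rr <- (dis_Rpos / (dis_Rpos + no_Rpos)) /
      (dis_Rneg / (dis_Rneg + no_Rneg))              # P(D+|R+) / P(D+|R-)
se_log <- sqrt(1/dis_Rpos - 1/(dis_Rpos + no_Rpos) +
               1/dis_Rneg - 1/(dis_Rneg + no_Rneg))   # SE of log(RR)
ci <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log) # back-transform to RR scale

rr; ci   # "significant" at the 5% level if the interval excludes 1
```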
• When we estimate, we compute the observed sample mean x̄, which is a realisation of the random estimator X̄; both are different from
the population mean μ, which is a fixed but unknown parameter.
• Standard error, or the standard deviation of the estimator, is then:
o SE = SD(X̄) = √Var(X̄) = σ/√n
o Represents the “likely size of the estimation error”.
• Estimation of standard error:
o Like the population mean, we never know the exact standard error. However, we can estimate it
from the sample variance:
o s² = (1/(n−1)) Σᵢ (Xᵢ − X̄)², and so the estimated standard error is ŜE = s/√n
• Rough Idea behind inference:
o 1. Compute the value of the estimate x̄
o 2. Compute the value of the estimated standard error s/√n
o 3. See if the discrepancy x̄ − μ0 is “large” compared to the standard error
▪ We should be explicit about which kinds of “discrepancies” we’re after:
• Positive, x̄ is significantly more – one-sided
• Negative, x̄ is significantly less – one-sided
• Both, x̄ is significantly different – two-sided
• To make the judgement call – how “large” is large enough for significance?:
o There are 2 approaches that are equivalent:
▪ 1. “t-test approach” – declare μ0 as not plausible if |x̄ − μ0| > c · s/√n for a
“suitably chosen” c
▪ 2. “Confidence interval approach” – determine that the set of plausible values for the
unknown μ is x̄ ± c · s/√n, for a “suitably chosen” c
• Thus, if the hypothesised value μ0 falls outside these plausible values, the discrepancy is “too large”
▪ Note: “c” effectively stands for the critical value
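A small R sketch showing that the two approaches agree, using made-up data and an assumed null value μ0 = 50 at the 5% level; t.test() is included for comparison:

```r
# Sketch: one-sample t-test "by hand" vs. t.test(), on simulated data
set.seed(1)
x   <- rnorm(30, mean = 52, sd = 5)
mu0 <- 50

xbar <- mean(x)
se   <- sd(x) / sqrt(length(x))
cval <- qt(0.975, df = length(x) - 1)   # "suitably chosen" c for alpha = 0.05

abs(xbar - mu0) > cval * se             # t-test approach: reject H0 if TRUE
xbar + c(-1, 1) * cval * se             # CI approach: reject H0 if mu0 falls outside

t.test(x, mu = mu0)                     # built-in equivalent (two-sided)
```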
• How to choose the constant “c” or critical value?
o Testing: Controlling the false alarm rate i.e. 𝛼
▪ Alternatively called the “significance level”, this is when we choose the probability
at which we incorrectly reject the null.
o Confidence Intervals: Controlling the coverage probability
▪ The coverage probability refers to the probability that the “true” value of the
unknown parameter lies inside the confidence interval.
▪ Effectively the complement of the false alarm rate: coverage probability = 1 − α
• As you increase your confidence % / decrease your false alarm rate – your confidence
intervals widen (to ensure you are correct more often).
• All this means is that instead of computing p-values, we can just compare the test statistic
with the distribution’s quantile at the specified 𝛼 (the critical value)
o Rejecting when the test statistic is more extreme than that quantile is equivalent to rejecting when the
p-value is below 𝛼
• It is also possible to rescale the quantiles back to the original scale of the data:
o I’d say this just increases interpretability but doesn’t functionally add a lot.
o Type I error – rejecting the null when it is actually true (false alarm).
o Type II error – failing to reject the null when it is actually false.
o Power = (1 − β) = P(reject H0 | H1 true)
• Why do we need a non-zero false alarm rate at all?
o Otherwise we would never reject the null, and the test would have no power.
• Power is affected by 3 key factors:
o 1. False alarm rate – 𝛼
▪ A larger 𝛼 (a less stringent significance level) increases power, making it easier to
reject the null – almost more blindly, though.
o 2. Measure of difference - 𝑑
▪ For a t-test, we typically use “Cohen’s d”
o 3. Sample size – 𝑛
▪ Larger sample sizes tend to increase power (most practical)
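A short R sketch of these three levers using the built-in power.t.test() (the values of n, d and α are arbitrary choices for illustration):

```r
# Sketch: how alpha, effect size d and sample size n affect power (hypothetical values)
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample")$power     # baseline (d = 0.5 since sd = 1)

power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.10,
             type = "one.sample")$power     # larger alpha -> more power

power.t.test(n = 60, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample")$power     # larger n -> more power
```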
Sign test
• Non-parametric replacement for one-sample t-test / paired t-test when normality assumption is
not met.
• It is effectively a binomial test of proportions:
o The rationale is that if the null were true, then there should be an equal amount of observations
scattered on either side of 𝜇 or 0 for the one-sample t-test and paired t-test respectively.
• Rough intuition:
o 1. We count the number of positive differences in the sample – ignoring any ties in the data.
o 2. We compare the likelihood of seeing that result using the binomial distribution
▪ Theoretically, under the null, there should be a 50% chance of being +ve or -ve
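A minimal R sketch of the sign test as a binomial test, assuming a made-up vector of paired differences:

```r
# Sketch of the sign test as a binomial test on made-up paired differences
diffs <- c(2.1, 0.4, -1.3, 3.0, 0.0, 1.7, 2.4, -0.6, 1.1, 0.9)

diffs <- diffs[diffs != 0]          # drop ties (zero differences)
n_pos <- sum(diffs > 0)             # number of positive differences

binom.test(n_pos, n = length(diffs), p = 0.5)   # under H0, P(+) = 0.5
```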
Normal Approximation to Wilcoxon Signed-Rank (WSR) test statistic
• With a large enough sample size, the WSR statistic follows a normal distribution (CLT effectively)
• This means we can convert the W+ statistic into another test statistic and compare to a standard
normal distribution instead of WSR distribution.
o W+ ~ N( n(n+1)/4 , n(n+1)(2n+1)/24 ) approximately
o T = (W+ − E(W+)) / √Var(W+) ~ N(0, 1)
o E(W+) = n(n+1)/4 and Var(W+) = n(n+1)(2n+1)/24
• In R, this means computing the p-value for the standardised statistic T with pnorm (see the sketch below).
• This is particularly useful when there are ties in the data! Because WSR is a DISCRETE
distribution – it actually can’t handle ties and so the normal distribution approximation is
used.
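A rough R sketch of the normal approximation, assuming made-up differences; wilcox.test() is shown only for comparison (with small samples and no ties it may use the exact distribution instead):

```r
# Sketch: normal approximation to the Wilcoxon signed-rank statistic W+ (made-up data)
d <- c(1.2, -0.4, 2.5, 0.8, -1.1, 3.0, 0.3, 2.2, -0.9, 1.6)

r      <- rank(abs(d))        # ranks of |differences|
w_plus <- sum(r[d > 0])       # W+ = sum of ranks of the positive differences
n      <- length(d)

mu_w  <- n * (n + 1) / 4
var_w <- n * (n + 1) * (2 * n + 1) / 24
T     <- (w_plus - mu_w) / sqrt(var_w)

2 * pnorm(-abs(T))            # two-sided p-value from N(0, 1)
wilcox.test(d)                # built-in version, for comparison
```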
Permutation testing
• Permutation applies to any test that involves a categorical variable.
o Randomly generate a distribution for the test statistic by randomly permuting the groups /
labels associated with each data-point
o Using the new permuted distribution, we compute the p-value based on how many t-values
permuted are equal or more extreme than what we saw.
o This removes all assumptions about the underlying distributions of the variables and is
non-parametric. We shuffle the categorical labels without replacement!
• Any test statistic can be used:
o T-test
o Wilcox
o Welch t-test
o Chi-Square
o Some new ones:
▪ Difference in medians
▪ A robust, MAD-scaled difference in medians (MAD is just a function)
• MAD stands for Median Absolute Deviation
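A minimal R sketch of a permutation test using the difference in means as the test statistic, on made-up two-group data:

```r
# Sketch of a permutation test for a difference in means between two groups
set.seed(42)
y     <- c(rnorm(15, 10), rnorm(15, 11.5))     # made-up responses
group <- rep(c("A", "B"), each = 15)           # categorical labels

obs <- mean(y[group == "B"]) - mean(y[group == "A"])   # observed test statistic

perm <- replicate(10000, {
  g <- sample(group)                           # shuffle labels without replacement
  mean(y[g == "B"]) - mean(y[g == "A"])
})

mean(abs(perm) >= abs(obs))                    # two-sided permutation p-value
```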
Bootstrapping
• Bootstrapping involves repeatedly resampling from the sample (with replacement) to extrapolate
to the whole population
o Implicitly assumes that the sample is representative and all proportions are reflective of the
true proportions
• A bootstrapped confidence interval requires a slight adjustment compared to the usual parametric construction – it is commonly taken from the quantiles of the bootstrap distribution
• Bootstrapping is useful when:
o Theoretical distribution of a statistic is complicated or unknown (e.g. coefficient of
variation, quantile regression parameter estimates, etc.)
o The sample size is too small to make sensible parametric inferences
• Advantages of bootstrapping:
o Frees us from making parametric assumptions to carry out inferences
o Provides answers when analytic solutions are impossible
o Verifies and checks the stability of results
o Asymptotically consistent
• Caveat: watch out for strong dependence between observations, which violates the independent-resampling assumption and will cause problems
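A small R sketch of a percentile bootstrap for the sample median, assuming a made-up skewed sample:

```r
# Sketch: bootstrapping the sample median and a percentile confidence interval
set.seed(7)
x <- rexp(40, rate = 0.2)                        # made-up skewed sample

boot_med <- replicate(10000, median(sample(x, replace = TRUE)))  # resample with replacement

quantile(boot_med, c(0.025, 0.975))              # 95% percentile bootstrap CI
sd(boot_med)                                     # bootstrap estimate of the standard error
```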
ANOVA
• ANOVA refers to the Analysis of Variance and is used for identifying differences in means across
multiple groups – effectively a generalisation of the two-sample t-test with equal variance
o The key assumptions of independence and equal variance are carried over
• ANOVA simply compares the variance between groups to the variance within the groups to
determine if the difference is due to a particular factor:
o If the variance between groups is much larger when compared to the variance that occurs
within that group, it is likely because the means are not the same.
o T = F = Treatment Mean Sq. / Residual Mean Sq. – the F statistic is a ratio of two independent (scaled) chi-square variables
▪ Treatment Mean Sq. compares the group means with the overall mean
▪ Residual Mean Sq. compares individual observations with their respective group mean
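A minimal R sketch of a one-way ANOVA on made-up data with three groups:

```r
# Sketch: one-way ANOVA on simulated data with three groups
set.seed(3)
dat <- data.frame(
  y     = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 10.5)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)

fit <- aov(y ~ group, data = dat)
summary(fit)    # F = Treatment Mean Sq / Residual Mean Sq, with its p-value
```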
ANOVA Contrasts
• We perform contrasts so we can find out exactly which groups are different - as ANOVA only tells
you that at least one group was different.
• Contrasts test for mean differences between specific groups, while making use of the whole dataset.
o It does this by using coefficient weights to zero-out the groups of no interest, but still
keeping the data relevant (this is what makes it more powerful than a t-test, which would
only examine data within the 2 groups of interest)
▪ Essentially a fancy two-sample t-test that has potential for a smaller standard error
• Mathematically, contrasts are a linear combination where the coefficients add to zero:
o Under our normal-with-equal-variances assumption – the contrast test statistic follows a
normal distribution
▪ This is based on the fact that a linear combination of normally distributed variables
itself is also normally distributed:
o Under the null hypothesis, all contrasts = 0 because the means are theoretically the same
• Confidence intervals for contrasts are obtained as estimate ± critical value × standard error (as usual)
• The difference between a Contrast and Post-hoc is that:
o For a contrast, you decide which groups you believe should be different based on
scientific theory / a hypothesis BEFORE examining the data.
o For a post-hoc, you essentially have no hypothesis and instead SEARCH FOR
DIFFERENCES (typically pair-wise) in the data.
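A rough R sketch of a pre-specified contrast (group A vs. the average of B and C), computed by hand under the equal-variance model; the data are the same made-up values as in the ANOVA sketch above:

```r
# Sketch: a pre-specified contrast computed by hand from the aov fit
set.seed(3)
dat <- data.frame(
  y     = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 10.5)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)
fit <- aov(y ~ group, data = dat)

coefs  <- c(1, -0.5, -0.5)                  # contrast coefficients, sum to zero
means  <- tapply(dat$y, dat$group, mean)
ns     <- tapply(dat$y, dat$group, length)
est    <- sum(coefs * means)                # estimated contrast
msr    <- sigma(fit)^2                      # residual mean square (pooled variance)
se     <- sqrt(msr * sum(coefs^2 / ns))     # SE uses the whole data set
df_res <- df.residual(fit)

2 * pt(-abs(est / se), df_res)              # two-sided p-value
est + c(-1, 1) * qt(0.975, df_res) * se     # 95% confidence interval
```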
ANOVA Post-hoc
• Here we examine all pair-wise differences as they are equally interesting.
• We can construct an individual 95% confidence interval for each pairwise difference:
• However, we can also examine the intervals simultaneously applying appropriate adjustments:
o Bonferroni method:
▪ If we have k confidence intervals and we wish to have simultaneous coverage
probability of at least 100(1 − 𝛼)% – then we need to construct each interval to
have an individual coverage probability of 100(1 − 𝛼/k)%
▪ Alternatively, this is also the same as multiplying each obtained p-value by k and
comparing it to plain 𝛼.
o Tukey method:
▪ Derived the exact multiplier needed for simultaneous confidence intervals for all
pairwise comparisons when the sample sizes are equal.
• When sample sizes are unequal, the Tukey–Kramer adjustment is slightly conservative,
but it still tends to yield narrower intervals than the Bonferroni method
o Scheffe’s method:
▪ Computed a particular multiplier
▪ Allows for unlimited data snooping
• All that happens here in R is that you use the `emmeans` package (or base R equivalents, as sketched below) and see which confidence
interval doesn’t cover 0 -> significant difference.
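A short R sketch of post-hoc pairwise comparisons using base R’s TukeyHSD() and a Bonferroni-adjusted pairwise.t.test(), again on the made-up ANOVA data (emmeans would give comparable output):

```r
# Sketch: post-hoc pairwise comparisons on the aov fit from the earlier sketch
set.seed(3)
dat <- data.frame(
  y     = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 10.5)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)
fit <- aov(y ~ group, data = dat)

TukeyHSD(fit, conf.level = 0.95)   # Tukey simultaneous CIs; significant if 0 is not covered

# Bonferroni: raw pairwise p-values multiplied by the number of comparisons
pairwise.t.test(dat$y, dat$group, p.adjust.method = "bonferroni")
```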
Source     Sum of Squares   df              Mean Square
Residual   Res Sum Sq.      (n − 1)(g − 1)  Res MS = Res Sum Sq. / [(n − 1)(g − 1)]
Total      Total Sum Sq.    ng − 1
Linear Regression
Yi = β0 + β1·xi + εi, for i = 1, 2, …, n
• This is essentially 𝑦 = 𝑚𝑥 + 𝑏:
o n is the number of observations (rows) in the data set
o β0 is the population intercept parameter
o β1 is the population slope parameter
o εi is the error term, typically assumed to follow N(0, σ²)
o Equivalently, Yi ~ N(β0 + β1·xi, σ²)
• The model is fit (i.e. the betas are estimated) by minimising the sum of squared residuals:
o Residual: Ri = yi − ŷi, where:
▪ yi is the actual value
▪ ŷi is the fitted value, i.e. ŷi = β̂0 + β̂1·xi
• 4 key assumptions:
o 1. Linearity – the relationship between Y and x is linear
▪ Check via Y and x plot
▪ Check via residuals plotted against x after model is fitted
o 2. Independence – all the errors are independent of each other
▪ Random and independent observations by experimental design
o 3. Homoskedasticity – all errors have constant variance: Var(εi) = σ² for all i = 1, …, n
▪ Residual plot does not exhibit any frown or smile shape – equally distributed
o 4. Normality – the errors follow a normal distribution
▪ Check that the points in the QQ plot of the residuals follow the straight line
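A minimal R sketch fitting a simple linear regression on simulated data and producing the standard diagnostic plots used to check these assumptions:

```r
# Sketch: fitting a simple linear regression and checking assumptions (simulated data)
set.seed(10)
x <- runif(50, 0, 10)
y <- 2 + 1.5 * x + rnorm(50, sd = 2)   # true beta0 = 2, beta1 = 1.5

fit <- lm(y ~ x)
summary(fit)                            # beta-hat estimates, SEs, t-tests, R^2

par(mfrow = c(1, 2))
plot(fit, which = 1)                    # residuals vs fitted: linearity + homoskedasticity
plot(fit, which = 2)                    # QQ plot of residuals: normality
```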
• Decomposing the error (R²):
o Somewhat confusingly defined as 1 − (unexplained variation / total variation about the mean), i.e. the proportion of variation in Y explained by the model.
• In matrix form: Y = Xβ + ε
• Factors affecting the precision of the fit:
o 1. Smaller σ² leads to a better fit and smaller variances for β̂ and ŷ0
o 2. Larger spread of x leads to more information about Y, which leads to a better fit and smaller variances for β̂ and ŷ0
o 3. Larger sample size leads to a better fit and smaller variances for β̂ and ŷ0
o 4. The closer x0 is to x̄, the smaller the variance of ŷ0
Logistic Regression
• The effect of using the logit link is that the regression predicts log-odds:
• To convert back to a probability, we invert the logit function: p = exp(η) / (1 + exp(η)), where η is the predicted log-odds
o Link scale = log-odds
o Response scale = probability
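A small R sketch showing the link vs. response scales for a logistic regression on simulated data (the coefficients are arbitrary):

```r
# Sketch: logistic regression, predicting on the link (log-odds) vs response (probability) scale
set.seed(11)
x <- rnorm(100)
p <- 1 / (1 + exp(-(-0.5 + 1.2 * x)))    # true probabilities via the inverse logit
y <- rbinom(100, size = 1, prob = p)

fit <- glm(y ~ x, family = binomial)     # logit link by default

newdat <- data.frame(x = c(-1, 0, 1))
eta    <- predict(fit, newdat, type = "link")       # log-odds
probs  <- predict(fit, newdat, type = "response")   # probabilities
all.equal(unname(probs), unname(1 / (1 + exp(-eta))))   # unravelling the logit by hand
```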
Nearest Neighbours
• Nearest neighbours is a classification technique, especially useful when:
o There is high class separation
o We have non-linear combinations of predictors influencing the response
▪ At these times, it is difficult/impossible to use logistic regression.
▪ Decision trees can be used, but can often be:
• Complicated
• Overfit
• Error prone
• K-nn is a non-parametric algorithm that votes on the class of new data by looking at the majority
class of the k-nearest neighbours.
o The original class assignments (labels) of the training points determine how new points are classified.
o Typically the Euclidean distance is used between the points
▪ The assumption is that points with similar proximity are similar classes
• The kNN classifier doesn’t pre-process the training data, so prediction can be rather slow
o The distance metric only makes sense for quantitative predictors, not categorical ones
o Considered a lazy learning algorithm
o Nothing is retained between predictions – any constructed structures and intermediate results are discarded
• Advantages:
o kNN is easy to understand
o Doesn’t require any pre-processing time
o Analytically tractable and simple implementation
o Performance improves as sample size grows
o Uses local information and is highly adaptive
o Can easily be parallelized
• Disadvantages:
o Predictions can be slow (the data is processed at prediction time) – computationally intensive
o Large storage requirements
o Usefulness depends on the geometry of the data
▪ Clustering and scale can have negative effect
• Cross-validation can be used to optimize the choice of k to minimize error
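A rough R sketch of kNN using the class package and the built-in iris data (both are assumptions – the notes don’t prescribe a particular implementation); predictors are scaled because kNN is distance-based:

```r
# Sketch: k-nearest-neighbours classification with the class package on iris
library(class)

set.seed(20)
idx   <- sample(nrow(iris), 100)               # random train/test split
train <- scale(iris[idx, 1:4])                 # scale predictors (kNN is scale-sensitive)
test  <- scale(iris[-idx, 1:4],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))

pred <- knn(train, test, cl = iris$Species[idx], k = 5)   # majority vote of 5 neighbours
mean(pred == iris$Species[-idx])                          # test-set accuracy
```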
Clustering
• Clustering is an unsupervised learning method of grouping together a set of objects into similar
classes
• K-means clustering is a process of assigning data points to the cluster whose centroid is closest
o To use this, you must specify the number of groups K you will cluster into (either by trying
several values of K and comparing the resulting error, or based on prior knowledge)
o 1. Randomly assign a number from 1 to K to each of the observations
o 2. Iterate until the clusters stop changing:
▪ A) For each K cluster, compute the cluster centroid.
▪ B) Re-assign each observation to the cluster whose centroid is closest, as measured by
Euclidean distance.
• Advantages of K-means include:
o Can be much faster than hierarchical clustering
o Nice theoretical framework that’s easy to interpret
o Can incorporate new data and reform clusters easily
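A minimal R sketch of K-means with K = 3 on the iris measurements; nstart repeats the random initial assignment to reduce its effect:

```r
# Sketch: K-means clustering with K = 3 (species labels used only to check the result)
set.seed(30)
x  <- scale(iris[, 1:4])                    # unsupervised: cluster on the features only
km <- kmeans(x, centers = 3, nstart = 25)   # nstart = repeated random initial assignments

km$centers                                  # the K cluster centroids
table(cluster = km$cluster, species = iris$Species)   # compare clusters to known species
```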
• Hierarchical clustering is a method that produces trees / dendrograms – avoiding the need to specify
the number of groups (K) in advance
o 1. Bottom up – agglomerative clustering
o 2. Top down – divisive clustering
• Advantages of Hierarchical clustering:
o Don’t need to know how many clusters you’re after
o Can cut hierarchy at any level to a get any number of clusters
o Easy to interpret hierarchy for particular applications
o Deterministic
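A short R sketch of agglomerative (bottom-up) hierarchical clustering with hclust(), cutting the dendrogram into 3 clusters:

```r
# Sketch: agglomerative hierarchical clustering and cutting the dendrogram
x  <- scale(iris[, 1:4])
hc <- hclust(dist(x), method = "complete")   # bottom-up (agglomerative) clustering

plot(hc, labels = FALSE)                     # dendrogram
cutree(hc, k = 3)                            # cut the hierarchy to get any number of clusters
```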