
HD DATA2002 NOTES

Semester 2, 2020

Module 1: Categorical Data


Module 2: Testing Means
Module 3: Multiple Factor Analysis (ANOVA)
Module 4: Learning and Prediction (ML)
Module 1: Categorical Data
Collecting Data
• Key term revision:
o Sample is part of a Population
o Parameter is a numerical fact about a population
▪ This is what we wish to know
o A parameter cannot be determined exactly; it can only be estimated
o A statistic is computed from a sample, and used to estimate the parameter
▪ This is what we do know
• The key issue when estimating is accuracy, since we don’t know the true parameter.
o We can’t just measure the whole population because:
▪ Difficult to organize
▪ Limited time, money and other resources
o And that’s why we collect samples
• Bias:
o Selection/Sampling Bias – the sample does not accurately reflect the population
o Non-response Bias – certain groups are under-represented because they elect not to
participate
o Measurement/Design Bias – the sampling method itself induces bias, e.g. the way questions are asked
• Randomised Controlled Double-Blind Study (Gold Standard):
o Strong evidence produced that can establish causation but difficult and rare:
▪ Investigators obtain a representative sample
▪ Sample randomly allocated into treatment group and control group
▪ Control group given a placebo, but neither the subjects nor investigators are aware
▪ Investigators later compare the responses of the 2 groups
• Observational Studies:
o Weaker evidence that can only establish association but often the only option (e.g.
smoking):
▪ Observe two groups of people that differ in a particular attribute you’re testing
▪ Compare the results
• Simpson’s Paradox:
o When a trend seen in individual groups disappears or reverses when the groups are pooled together
▪ Often due to another confounding variable that isn’t properly controlled for

Chi-square Goodness of Fit


• Essentially considers the difference between the observed counts and the expected counts if the null were true:
o Test statistic: X² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ, summed over the groups
o k = the number of groups
• The degrees of freedom = k − 1 because the last count is fixed by the total, adding no new information
• Each newly estimated parameter (‘q’ of them) also reduces the degrees of freedom, giving df = k − 1 − q
• Requires that no expected frequency is too small; as a rule of thumb, every expected count should be at least 5.
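As a minimal sketch (with made-up counts and null proportions), a goodness of fit test can be run in R with `chisq.test()`:

```r
# Hypothetical example: 100 observations across k = 3 groups,
# null hypothesis of proportions 0.5, 0.3, 0.2
observed <- c(55, 25, 20)
p_null   <- c(0.5, 0.3, 0.2)

chisq.test(observed, p = p_null)            # X-squared statistic on k - 1 = 2 df

# Check the expected-count rule of thumb (all should be >= 5)
chisq.test(observed, p = p_null)$expected
```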

Measures of Performance
• There are a variety of errors and measures for evaluating a model’s performance.
                      Actual Positive D+    Actual Negative D−
Test Positive S+      A                     B                     A+B
Test Negative S−      C                     D                     C+D
                      A+C                   B+D                   A+B+C+D

o False negative rate: P(S− | D+) = C / (A+C) – had the disease but the test was negative.
o False positive rate: P(S+ | D−) = B / (B+D) – had no disease but the test was positive.
o Sensitivity/Recall: P(S+ | D+) = A / (A+C) – probability of a positive test given the disease.
o Specificity: P(S− | D−) = D / (B+D) – probability of a negative test given no disease.
o Precision/Positive predictive value: P(D+ | S+) = A / (A+B) – probability of disease given a positive test.
o Negative predictive value: P(D− | S−) = D / (C+D) – probability of no disease given a negative test.
o Accuracy = (A+D) / (A+B+C+D) – overall proportion of correct results.
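A minimal sketch of computing these measures from a 2x2 table of hypothetical counts in R:

```r
# Hypothetical counts: rows = test result, columns = true disease status
A <- 40; B <- 10   # test positive
C <- 5;  D <- 45   # test negative

sensitivity <- A / (A + C)            # P(S+ | D+)
specificity <- D / (B + D)            # P(S- | D-)
fnr         <- C / (A + C)            # false negative rate
fpr         <- B / (B + D)            # false positive rate
ppv         <- A / (A + B)            # precision / positive predictive value
npv         <- D / (C + D)            # negative predictive value
accuracy    <- (A + D) / (A + B + C + D)

c(sensitivity = sensitivity, specificity = specificity,
  ppv = ppv, npv = npv, accuracy = accuracy)
```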

Measures of Risk
• Prospective studies – forward-looking analysis involving initially identifying disease-free
individuals classified by presence or absence of a risk factor
o After a certain time period, we examine whether the presence of disease emerges
o E.g. examining the impact of sun exposure by analysing volleyball players’ skin across those who wore sunscreen and those who didn’t
• Retrospective studies – backward-looking analysis involving identifying those with and without the disease (outcome categories) and tracing backwards to determine whether the risk factor was present/absent
o E.g. to examine the effectiveness of zinc oxide as skin protection, we analyse lifeguards who did and didn’t develop skin cancer and ask them to recall whether they used zinc oxide.
• Estimating population probabilities from a sample:
o Consider two events A and B.
▪ If we can take a random sample from the whole population, we can estimate P(A)
using the observed sample proportion with attribute A
▪ If we take a random sample from the subpopulation defined by B, we can estimate
P(A|B) using the observed sample proportion of the subpopulation with attribute A
o Prospective Study:
▪ P(D+ | R+) & P(D- |R-) – can be estimated because we controlled on the risk
factor and examined to see whether the disease developed
▪ We can’t estimate – P ( R+ |D+) or P (R - | D-) because we did not take random
samples from the disease group
o Retrospective Study:
▪ P(R+ | D+) & P(R- |D-) – can be estimated because we controlled on the disease
developing and traced back to see if the risk factor was present
▪ We can’t estimate – P ( D+ |R+) or P (D- | R-) because we did not take random
samples from the risk factor sub-groups
▪ Therefore you cannot identify RELATIVE RISK (RR)!
• The result is significant if the confidence interval constructed for RR or OR doesn’t contain 1.
o Equivalently, on the log scale, check whether 0 lies outside the log odds-ratio confidence interval.
• Relative Risk:
• This is a stronger measure where you can directly say that the R+ group is [x] times more likely to have D+.
o Relative risk is defined as the ratio of two conditional probabilities of having the disease:
▪ RR = P(D+|R+) / P(D+|R−)
▪ RR = 1 means no difference
▪ RR < 1 implies the risk factor group is less likely to have the disease
▪ RR > 1 implies the risk factor group is more likely to have the disease

                      Actual Positive D+    Actual Negative D−
Risk Factor R+        A                     B                     A+B
No Risk Factor R−     C                     D                     C+D
                      A+C                   B+D                   A+B+C+D

▪ RR = P(D+|R+) / P(D+|R−) = [A/(A+B)] / [C/(C+D)] = A(C+D) / (C(A+B))
• Odds Ratio:
o Odds are a ratio of probabilities – an alternative way of measuring the likelihood of an event
occurring
o If P(A) is the probability of an event occurring, then the odds of A occurring are:
▪ O(A) = P(A) / (1 − P(A))
▪ O(D+|R+) = P(D+|R+) / (1 − P(D+|R+)) = P(D+|R+) / P(D−|R+)
o Odds ratios can be used for both prospective and retrospective studies
▪ No matter how you define it, the odds ratio is calculated by:
▪ OR = (A × D) / (B × C)

o If OR = 1 then D and R are independent and there is no relationship between R and D


o If OR > 1 then R increases the risk of disease
o If OR < 1 then R decreases the risk of disease
• Standard errors and confidence intervals for odds ratios:
o OR sits on a skewed distribution over (0, ∞), where the neutral value is 1.
o log(OR) produces a more symmetric distribution centred at 0
▪ SE(log(OR)) = sqrt(1/a + 1/b + 1/c + 1/d)
▪ 95% confidence interval = log(OR) ± 1.96 × sqrt(1/a + 1/b + 1/c + 1/d)
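A minimal sketch in R, using hypothetical 2x2 counts a, b, c, d laid out as above:

```r
# Hypothetical 2x2 table: rows = risk factor (R+, R-), cols = disease (D+, D-)
a <- 30; b <- 70
c <- 10; d <- 90

OR     <- (a * d) / (b * c)
se_log <- sqrt(1/a + 1/b + 1/c + 1/d)

ci_log <- log(OR) + c(-1, 1) * 1.96 * se_log   # CI on the log-odds-ratio scale
exp(ci_log)                                    # back-transform to the OR scale
# Significant association if this interval does not contain 1
```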

Chi-square Homogeneity/Independence Test


• Homogeneity: samples are drawn from 2+ populations and we test whether the category proportions are the same in each. Independence: a single sample from 1 population is cross-classified by two variables and we test whether they are independent. The test statistic and mechanics are the same.

Testing in Small Samples


• When the expected cell count assumption is broken, we can no longer expect accurate results from
applying the Chi-square distribution. We have the following options:
o Fisher’s exact test – using the hypergeometric distribution, we compute the exact probability (conditional on the table margins) of the observed table and of all tables at least as extreme under the null.
o Yates’ correction – a continuity correction that attempts to improve the accuracy of the chi-square approximation for 2x2 tables
o Permutation / Monte Carlo simulation – recompute the test statistic over many random permutations of the data and use the distribution created from that.
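A minimal sketch in R on a hypothetical 2x2 table with small counts:

```r
tab <- matrix(c(3, 9,
                8, 2), nrow = 2, byrow = TRUE)

fisher.test(tab)                                   # exact (hypergeometric) test
chisq.test(tab, correct = TRUE)                    # chi-square with Yates' continuity correction
chisq.test(tab, simulate.p.value = TRUE, B = 1e4)  # Monte Carlo p-value
```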
Module 2: Testing means
T-tests
• Normal Distribution background:
o The sample mean from a normal sample itself is normally distributed
o The sample variance from a normal sample has a scaled chi-squared distribution
o The sample mean and variance from a normal sample are statistically independent
• One-sample t-test – testing a sample mean against a hypothesized value
• Two-sample t-test – testing the differences in mean across 2 independent samples
• Paired t-test – testing whether the mean difference across 2 paired samples is zero
• Welch t-test – relaxes the equal variance assumption that underpins the t-test
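A minimal sketch of the four variants in R, on made-up vectors x and y:

```r
set.seed(1)
x <- rnorm(20, mean = 10, sd = 2)
y <- rnorm(20, mean = 11, sd = 2)

t.test(x, mu = 10)                # one-sample t-test against a hypothesised mean
t.test(x, y, var.equal = TRUE)    # two-sample t-test (equal variances assumed)
t.test(x, y, paired = TRUE)       # paired t-test on the differences
t.test(x, y)                      # Welch t-test (default: unequal variances)
```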

Critical values, rejection regions and confidence intervals


• Random variable review:
o A random variable is a mathematical object which takes certain values with certain
probabilities.
o We have both discrete and continuous random variables
▪ Continuous variables however can always be “approximated” by a discrete one
o A simple discrete random variable X can be described as a single random draw from a “box”
containing tickets, each with numbers written on them:
▪ E(X) = μ (the average of the numbers in the box)
▪ Var(X) = σ² (the population variance of the numbers in the box)
o The expectation of a sum is always equal to the sum of the expectations:
▪ E(T) = E(X₁ + ⋯ + Xₙ) = E(X₁) + ⋯ + E(Xₙ) = μ + ⋯ + μ = nμ
o The variance of a sum is not always equal to the sum of the variances:
▪ Only if the Xᵢ’s are independent:
▪ Var(T) = Var(X₁ + ⋯ + Xₙ) = Var(X₁) + ⋯ + Var(Xₙ) = σ² + ⋯ + σ² = nσ²

• When we estimate, the observed sample mean x̄ is a realisation of the random variable X̄; it is different from the population mean μ, which is a fixed but unknown parameter.
• Standard error, the standard deviation of the estimator, is then:
o SE = SD(X̄) = √Var(X̄) = σ/√n
o Represents the “likely size of the estimation error”.
• Estimation of standard error:
o Like the population mean, we never know the exact standard error; however, we estimate it from the sample variance:
o s² = (1/(n−1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)², and so SÊ = s/√n
• Rough idea behind inference:
o 1. Compute the value of the estimate x̄
o 2. Compute the value of the estimated standard error s/√n
o 3. See if the discrepancy x̄ − μ₀ is “large” compared to the standard error
▪ We should be explicit about which kinds of “discrepancies” we’re after:
• Positive, x̄ is significantly more – one-sided
• Negative, x̄ is significantly less – one-sided
• Both, x̄ is significantly different – two-sided
• To make the judgement call – how “large” is large enough for significance?:
o There are 2 approaches that are equivalent:
▪ 1. “t-test approach” – declare μ₀ as not plausible if |x̄ − μ₀| > c · s/√n for a “suitably chosen” c
▪ 2. “Confidence interval approach” – determine that the set of plausible values for the unknown μ is x̄ ± c · s/√n, for a “suitably chosen” c
• Thus, if the hypothesised μ₀ sits outside these plausible values, the discrepancy is “too large”
▪ Note: “c” effectively stands for the critical value
• How to choose the constant “c” or critical value?
o Testing: Controlling the false alarm rate i.e. 𝛼
▪ Alternatively called the “significance level”, this is when we choose the probability
at which we incorrectly reject the null.
o Confidence Intervals: Controlling the coverage probability
▪ The coverage probability refers to the probability that the “true” value of the
unknown parameter lies inside the confidence interval.
▪ Effectively another way of describing the false alarm rate
• As you increase your confidence % / decrease your false alarm rate – your confidence
intervals widen (to ensure you are correct more often).
• All this means is that instead of computing p-values, we can just compare the test statistic with the quantile of the null distribution at the specified α
o Rejecting when the statistic is beyond that quantile is equivalent to rejecting when the p-value is below α

• It is also possible to rescale the quantiles back to the actual data-scale that the data exists in:
o I’d say this just increases the interpretability but doesn’t functionally add a lot.
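A minimal sketch in R of the two equivalent approaches for a one-sample, two-sided t-test (hypothetical data, μ₀ = 10, α = 0.05):

```r
set.seed(2)
x     <- rnorm(15, mean = 11, sd = 2)
mu0   <- 10
alpha <- 0.05

t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
c_crit <- qt(1 - alpha/2, df = length(x) - 1)          # critical value c

abs(t_stat) > c_crit                                   # reject via the rejection region?
2 * pt(-abs(t_stat), df = length(x) - 1) < alpha       # reject via the p-value? (same answer)
mean(x) + c(-1, 1) * c_crit * sd(x) / sqrt(length(x))  # interval rescaled to the data scale
```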

Sample size calculations and power


• Errors in hypothesis testing:

                              H0 true (innocent)    H0 false (guilty)
H0 not rejected (acquitted)   Correct decision      Type II error – β
H0 rejected (sentenced)       Type I error – α      Correct decision (1 − β)

o Type I error – was not meant to reject the null - but did.
o Type II error – was meant to reject the null – but didn’t.
o Power = (1 − 𝛽) = 𝑃(𝑟𝑒𝑗𝑒𝑐𝑡 𝐻0 |𝐻1 𝑡𝑟𝑢𝑒)
• Why do we need a false alarm rate at all?
o Otherwise, we will never reject tests and then they will have no power.
• Power is affected by 3 key factors:
o 1. False alarm rate – α
▪ A larger α (a less stringent rejection threshold) increases power, making it easier to reject the null – though also easier to reject it falsely.
o 2. Measure of difference - 𝑑
▪ For a t-test, we typically use “Cohen’s d”
o 3. Sample size – 𝑛
▪ Larger sample sizes tend to increase power (most practical)
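A minimal sketch with R’s built-in `power.t.test()`, using made-up values of α, effect size and sample size:

```r
# Power of a two-sample t-test with n = 30 per group and effect size d = 0.5
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)

# Sample size needed per group to reach 80% power for the same effect
power.t.test(power = 0.8, delta = 0.5, sd = 1, sig.level = 0.05)
```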

Sign test
• Non-parametric replacement for one-sample t-test / paired t-test when normality assumption is
not met.
• It is effectively a binomial test of proportions:
o The rationale is that, if the null were true, there should be an equal number of observations scattered on either side of μ₀ (for the one-sample t-test) or 0 (for the paired t-test).
• Rough intuition:
o 1. We count the number of positive differences in the sample – ignoring any ties in the data.
o 2. We compare the likelihood of seeing that result using the binomial distribution
▪ Theoretically, under the null, there should be a 50% chance of being +ve or -ve

• Ignores important information, for example the magnitude of the differences.
o Almost always use the Wilcoxon Signed-Rank test in these situations instead.
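A minimal sketch in R, treating the sign test as a binomial test on hypothetical paired data:

```r
before <- c(12, 15, 9, 14, 11, 13, 10, 16)
after  <- c(14, 15, 12, 13, 15, 17, 12, 18)

d     <- after - before
d     <- d[d != 0]                  # drop ties
n_pos <- sum(d > 0)                 # number of positive differences

binom.test(n_pos, length(d), p = 0.5, alternative = "two.sided")
```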

Wilcoxon Signed-Rank Test


• Introduces ranks, based on magnitude of difference, to the sign test.
• General process for ranking:
o 1. Arrange the data in ascending order
o 2. Rank the observation
o 3. Tied observations are given an average of the corresponding ranks
• Wilcoxon Signed rank process:
o 1. Calculate the absolute difference between the samples
o 2. Rank the absolute difference
o 3. Assign the sign to the rank based on the original sign of difference
o 4. Calculate appropriate test statistic depending on the test
▪ W+ for a 1-sided test
▪ Min(W+, W-) for a 2-sided test
o To do this in R, simply feed the data values into `wilcox.test()` (with `paired = TRUE` for paired data).
Sample Y   Sample X   Difference   |Difference|   Rank   Signed rank
85         83         2            2              1      1
69         78         -9           9              2      -2
81         70         11           11             4      4
112        72         40           40             6      6
77         67         10           10             3      3
86         68         18           18             5      5

• W+ = 1 + 4 + 6 + 3 + 5 = 19 | W− = 2
Normal Approximation to Wilcoxon Signed-Rank (WSR) test statistic

• With a large enough sample size, the WSR statistic follows a normal distribution (CLT effectively)
• This means we can convert the W+ statistic into another test statistic and compare to a standard
normal distribution instead of WSR distribution.
o W+ ~ N( n(n+1)/4 , n(n+1)(2n+1)/24 ) approximately
o T = (W+ − E(W+)) / √Var(W+) ~ N(0, 1)
o E(W+) = n(n+1)/4 and Var(W+) = n(n+1)(2n+1)/24
• In R, this means determining p-value with the new T with pnorm.
• This is particularly useful when there are ties in the data! Because WSR is a DISCRETE
distribution – it actually can’t handle ties and so the normal distribution approximation is
used.
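A minimal sketch in R using the paired data from the worked example above (the exact distribution is used by default; R switches to the normal approximation when there are ties or when `exact = FALSE`):

```r
y <- c(85, 69, 81, 112, 77, 86)
x <- c(83, 78, 70, 72, 67, 68)

wilcox.test(y, x, paired = TRUE)                 # exact signed-rank test (V = W+ = 19)
wilcox.test(y, x, paired = TRUE, exact = FALSE)  # normal approximation version
```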

Wilcoxon Rank-Sum Test


• Non-parametric alternative to the 2-sample t-test when the normality and symmetry assumptions are dropped.
o Simply requires the assumption that, under the null, the two samples come from the same distribution
• Process:
o 1. Rank ALL of the observations across both the sample groups.
o Ranks are summed across one of the samples
▪ Instead of the positive / negative side as in the Signed-rank test
o i.e. W = R₁ + R₂ + ⋯ + Rₙₓ, summing the ranks of sample x

     Yield   Method   Rank
1    32      A        8
2    29      A        5
3    35      A        9
4    28      A        4
5    27      B        3
6    31      B        7
7    26      B        2
8    25      B        1
9    30      B        6
• W_A = 26 | W_B = 19
• If the null is true, then W should be close to its expected value:
o E(W) = Proportion × Total rank sum = (n_x / N) × N(N+1)/2 = n_x(N+1)/2
• Feeding data through in R will simply find the p-value
• Like the Wilcoxon Signed-Rank Test, the test statistic W can be approximated by a normal
distribution.
• Application in R using the `pnorm` function.
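A minimal sketch in R using the yield data tabulated above:

```r
dat_yield <- data.frame(
  yield  = c(32, 29, 35, 28, 27, 31, 26, 25, 30),
  method = c("A", "A", "A", "A", "B", "B", "B", "B", "B")
)

# Note: R reports W as the rank sum of group A minus n_A(n_A + 1)/2 (Mann-Whitney form)
wilcox.test(yield ~ method, data = dat_yield)                 # exact rank-sum test
wilcox.test(yield ~ method, data = dat_yield, exact = FALSE)  # normal approximation
```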

Permutation testing
• Permutation applies to any test that involves a categorical variable.
o Randomly generate a distribution for the test statistic by randomly permuting the groups /
labels associated with each data-point
o Using the new permuted distribution, we compute the p-value as the proportion of permuted test statistics that are as or more extreme than the one we observed.
o This removes all assumptions about the underlying distributions of the variables and is non-parametric. We shuffle the categorical labels without replacement!
• Any test statistic can be used:
o T-test
o Wilcox
o Welch t-test
o Chi-Square
o Some new ones:
▪ Difference in medians
▪ Robust difference in medians, scaled by the MAD (the MAD is just another function of the data)
• MAD stands for Median Absolute Deviation
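A minimal sketch of a two-group permutation test in R, using the difference in means as the test statistic on made-up data:

```r
set.seed(3)
y     <- c(rnorm(12, 10), rnorm(12, 11))
group <- rep(c("A", "B"), each = 12)

obs_stat <- mean(y[group == "A"]) - mean(y[group == "B"])

perm_stats <- replicate(10000, {
  g <- sample(group)                       # shuffle labels without replacement
  mean(y[g == "A"]) - mean(y[g == "B"])
})

mean(abs(perm_stats) >= abs(obs_stat))     # two-sided permutation p-value
```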

Bootstrapping
• Bootstrapping involves repeatedly resample from the sample (with replacement) to extrapolate
to the whole population
o Implicitly assumes that the sample is representative and all proportions are reflective of the
true proportions
• A bootstrapped confidence interval is typically read off the quantiles of the bootstrap distribution (e.g. the 2.5% and 97.5% quantiles for a 95% interval), possibly with a slight adjustment for bias
• Bootstrapping is useful when:
o Theoretical distribution of a statistic is complicated or unknown (e.g. coefficient of
variation, quantile regression parameter estimates, etc.)
o The sample size is too small to make sensible parametric inferences
• Advantages of bootstrapping:
• Frees us from making parametric assumptions to carry out inferences
• Provides answers when analytic solutions are impossible
• Verifies and checks the stability of results
• Asymptotically consistent
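A minimal sketch of a percentile bootstrap confidence interval for the median in R, on made-up data:

```r
set.seed(4)
x <- rexp(40, rate = 0.5)                 # a skewed sample

boot_medians <- replicate(10000, median(sample(x, replace = TRUE)))

quantile(boot_medians, c(0.025, 0.975))   # 95% percentile bootstrap CI for the median
```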

Module 3: Multiple Factor (ANOVA)


Multiple Testing
• When we increase the number of statistical tests we do, we are bound to make more errors purely due to chance.
o E.g. if you do 10,000 tests (one per gene, say) at α = 0.05 and every null is true, you expect 10,000 × 0.05 = 500 false positives.

                  Actual H0 True    Actual H0 False
Test accepts H0   A                 B                  A+B
Test rejects H0   C                 D                  C+D
                  A+C               B+D                A+B+C+D

o False positive rate: E[ C / (A+C) ] – null is true, but is rejected.
o Family-wise error rate (FWER) – probability of at least one false positive, P(C ≥ 1)
o False discovery rate (FDR) – rate at which claims of significance are false: E[ C / (C+D) ]
• Controlling the FWER to minimise false positives:
o With m independent tests each at level α, FWER = P(C ≥ 1) = 1 − (1 − α)^m = probability of falsely rejecting at least 1 null
o The Bonferroni correction is a method to control the FWER by defining a new α:
▪ α* = α/m, where m is the number of tests
▪ e.g. m = 20: FWER = 1 − (1 − α*)^m = 1 − (1 − 0.05/20)^20 = 0.0488
▪ Pros: easy to calculate, conservative | Cons: may be “too” conservative
• Controlling the FDR to minimise false positives:
o Benjamini-Hochberg (BH) is a popular method to control the FDR at level α:
▪ 1. Calculate the p-values normally
▪ 2. Order the p-values from smallest to largest: p₁ ≤ p₂ ≤ … ≤ p_m
▪ 3. Find the largest j* such that p_{j*} ≤ (j*/m)α, where j is the rank of the p-value
▪ 4. Reject the nulls corresponding to the j* smallest p-values – even those whose own p-value moves above its (j/m)α threshold and back below.
▪ Pros: easy to calculate, less conservative vs. Bonferroni | Cons: more false positives, behaves strangely under dependence

• Watch out for strong dependence between observations which will cause problems with corrections
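A minimal sketch in R; `p.adjust()` implements both corrections on a vector of raw p-values (made up here):

```r
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.27, 0.60)

p.adjust(p_raw, method = "bonferroni")   # controls the FWER
p.adjust(p_raw, method = "BH")           # controls the FDR (Benjamini-Hochberg)

# Reject whichever adjusted p-values fall below alpha = 0.05
p.adjust(p_raw, method = "BH") < 0.05
```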

ANOVA
• ANOVA refers to the Analysis of Variance and is used for identifying differences in mean across
multiple groups – effectively a generalization of a two-sided t-test with equal variance
o Key assumption of independence and equal variance are carried over
• ANOVA simply compares the variance between groups to the variance within the groups to
determine if the difference is due to a particular factor:
o If the variance between groups is much larger when compared to the variance that occurs
within that group, it is likely because the means are not the same.
o T = F = Treatment Mean Sq. / Residual Mean Sq. – the F statistic is a ratio of 2 (independent, scaled) chi-square variables
▪ Treatment Mean Sq. compares the group means against the overall mean
▪ Residual Mean Sq. compares individual observations against their respective group means
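A minimal sketch of a one-way ANOVA in R on a hypothetical data frame `dat` with columns `y` and `group`:

```r
set.seed(5)
dat <- data.frame(
  y     = c(rnorm(10, 5), rnorm(10, 6), rnorm(10, 6.5)),
  group = rep(c("A", "B", "C"), each = 10)
)

fit <- aov(y ~ group, data = dat)
summary(fit)        # F = Treatment Mean Sq / Residual Mean Sq and its p-value
```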

ANOVA Contrasts
• We perform contrasts so we can find out exactly which groups are different - as ANOVA only tells
you that at least one group was different.
• Contrasts test for mean differences between specific groups, while making use of the whole dataset.
o It does this by using coefficient weights to zero-out the groups of no interest, but still
keeping the data relevant (this is what makes it more powerful than a t-test, which would
only examine data within the 2 groups of interest)
▪ Essentially a fancy two-sample t-test that has potential for a smaller standard error
• Mathematically, contrasts are a linear combination where the coefficients add to zero:
o Under our normal-with-equal-variances assumption – the contrast test statistic follows a
normal distribution
▪ This is based on the fact that a linear combination of normally distributed variables
itself is also normally distributed:
o Under the null hypothesis – all contrasts = 0 because the means are theoretically the same
• Confidence Intervals for contrasts are obtained by multiplying the critical value by the
standard error (as usual):
• The difference between a Contrast and Post-hoc is that:
o For a contrast, you decide which groups you believe should be different based on
scientific theory / a hypothesis BEFORE examining the data.
o For a post-hoc, you essentially have no hypothesis and instead SEARCH FOR
DIFFERENCES (typically pair-wise) in the data.
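As a minimal sketch (assuming the `emmeans` package and the hypothetical one-way fit `fit` from the ANOVA sketch above), a pre-specified contrast comparing group A with the average of groups B and C might look like:

```r
library(emmeans)

emm <- emmeans(fit, ~ group)

# Coefficients sum to zero: group A vs the average of groups B and C
ct <- contrast(emm, method = list("A vs avg(B,C)" = c(1, -0.5, -0.5)))
ct
confint(ct)   # significant if the interval excludes 0
```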

ANOVA Post-hoc
• Here we examine all pair-wise differences as they are equally interesting.
• We can construct individual 95% confidence intervals individually:
• However, we can also examine the intervals simultaneously applying appropriate adjustments:
o Bonferroni method:
▪ If we have k confidence intervals and we wish to have a simultaneous coverage probability of at least 100(1 − α)%, then we need to construct each interval to have an individual coverage probability of 100(1 − α/k)%
▪ Alternatively, this is the same as multiplying each obtained p-value by k and comparing it to the plain α.
o Tukey method:
▪ Derives the exact multiplier needed for simultaneous confidence intervals for all pairwise comparisons when the sample sizes are equal.
• When sample sizes are unequal, the Tukey intervals are conservative, but typically still narrower than those from the Bonferroni method
o Scheffé’s method:
▪ Uses a multiplier large enough to cover every possible contrast simultaneously
▪ Allows for unlimited data snooping
• All that happens here in R is that you use the `emmeans` package and see which confidence
interval doesn’t cover 0 -> significant difference.
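A minimal sketch of the post-hoc workflow (assuming the `emmeans` package and the hypothetical fitted one-way model `fit`):

```r
library(emmeans)

emm <- emmeans(fit, ~ group)

pairs(emm)            # all pairwise comparisons, Tukey-adjusted by default
confint(pairs(emm))   # intervals: any that exclude 0 indicate a significant difference
```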

ANOVA Back-ups (Failed Assumptions)


• Relaxing equal variance assumption:
o Consider all pairwise Welch-tests and apply Bonferroni correction for multiple testing
o To get an “ANOVA-esque” figure, we can take the smallest p-value out of all the tests and multiply it by k – where k is the number of pairwise tests undertaken.
▪ Take the most significantly different pair and correct its p-value – which in effect acts as a test of whether all the group means are the same
• Relaxing the normality assumption to say that observations come from the same distribution:
o Permutation
▪ We condition on the combined sample and resample the categorical labels without
replacement.
▪ The total number of allocations can be excessively large, so sometimes we just take a
sample of the total – just to get a sense of the proportions
o Ranks – Kruskal Wallis
▪ Replace each observation with its “global” rank
▪ Compute the F-ratio on the ranks
▪ You can also do a permutation using a Kruskal-Wallis statistic if it isn’t
sensible to use the chi-square distribution that it typically follows.
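A minimal sketch in R (re-using the hypothetical `dat` from the ANOVA sketch):

```r
kruskal.test(y ~ group, data = dat)             # rank-based alternative to one-way ANOVA

# Or compute the F-ratio on the globally ranked observations directly
summary(aov(rank(y) ~ group, data = dat))
```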

Two-way ANOVA (Blocking)


• We introduce another “treatment”/ “class” term that we are uninterested in, but it improves the
analysis:
o The Residual Mean square is reduced, as the block provides explanatory power.
▪ This can allow you to detect more sensitive relationships that you would’ve
otherwise missed.
• The two-way ANOVA model is fit according to the following model:
o Y_ij = μ + α_i + β_j + ε_ij
▪ μ = overall mean
▪ α_i = adjustment for treatment (class) i, for i = 1, 2, …, g
▪ β_j = adjustment for block j, for j = 1, 2, …, n
▪ n is the common sample (block) size
▪ ε_ij are iid N(0, σ²)
▪ Σ_i α_i = 0 and Σ_j β_j = 0
o Each Y_ij has a different expected mean, but there is an additive structure:
▪ ng different means explained by 1 + (g−1) + (n−1) = g + n − 1 free parameters
Source of Var.   Sum of squares    df            Mean Square                          F-ratio
Blocks           Block Sum Sq.     n − 1
Treatments       Trt Sum Sq.       g − 1         Trt MS = Trt Sum Sq / (g−1)          Trt MS / Res MS
Residual         Res Sum Sq.       (n−1)(g−1)    Res MS = Res Sum Sq / ((n−1)(g−1))
Total            Total Sum Sq.     ng − 1

• Friedman Test – is the back-up for two-way ANOVA using ranks.


o Each observation replaced by its within-block rank
o One-way ANOVA F-test performed on the ranks
▪ Can alternatively also use a permutation or chi-square approximation
o Equivalent statistic:
▪ Treatment sum of squares of the ranks / [ Total sum of squares of the ranks / (n(g−1)) ]
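A minimal sketch in R on a hypothetical blocked design with columns `y`, `treatment` and `block`:

```r
set.seed(6)
blk <- data.frame(
  treatment = rep(c("T1", "T2", "T3"), times = 6),
  block     = rep(paste0("B", 1:6), each = 3)
)
blk$y <- rnorm(18) + rep(c(0, 0.5, 1), times = 6) + rep(rnorm(6), each = 3)

summary(aov(y ~ treatment + block, data = blk))   # two-way ANOVA with blocking
friedman.test(y ~ treatment | block, data = blk)  # rank-based back-up (within-block ranks)
```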

Multi-way ANOVA (Interactions)


• A “significant interaction” occurs whenever the effect of one variable is not the same at all levels
of the other variable
• Looking into multiple treatment groups and also interactions among treatments.
o Effectively “advanced” blocking, where we now care about the additional treatment factor.
• Interaction plots show non-parallel (often crossing) lines when there is an interaction.
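A minimal sketch in R of fitting and testing an interaction between two hypothetical factors `a` and `b`:

```r
set.seed(7)
dat2 <- expand.grid(a = c("a1", "a2"), b = c("b1", "b2"), rep = 1:10)
dat2$y <- rnorm(40) + ifelse(dat2$a == "a2" & dat2$b == "b2", 1, 0)  # effect of a depends on b

summary(aov(y ~ a * b, data = dat2))         # a + b + a:b; the a:b row tests the interaction
interaction.plot(dat2$a, dat2$b, dat2$y)     # non-parallel lines suggest an interaction
```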
Module 4: Learning and Prediction
• Supervised learning – have knowledge of class labels and train model to predict class values
o Classification maps inputs to an output label (e.g. decision trees, nearest neighbour, logistic
regression, naïve bayes, SVM, neural networks and random forests)
o Regression – maps onto a continuous output
• Unsupervised learning – have no knowledge of output class or value but determines data patterns
and groupings

Simple Linear Regression


• Aims to predict an outcome variable Y using a single predictor variable x.

𝑌𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜖𝑖 𝑓𝑜𝑟 𝑖 = 1,2, … 𝑛

• This is essentially y = mx + b:
o n is the number of observations (rows) in the data set
o β₀ is the population intercept parameter
o β₁ is the population slope parameter
o ε_i is the error term, typically assumed to follow N(0, σ²), so that
Y_i ~ N(β₀ + β₁x_i, σ²)

• The model is fit (i.e. the betas are estimated) by minimising the sum of squared residuals:
o Residual: R_i = y_i − ŷ_i where:
▪ y_i is the actual value
▪ ŷ_i is the fitted value, i.e. ŷ_i = β̂₀ + β̂₁x_i
• 4 key assumptions:
o 1. Linearity – the relationship between Y and x is linear
▪ Check via Y and x plot
▪ Check via residuals plotted against x after model is fitted
o 2. Independence – all the errors are independent of each other
▪ Random and independent observations by experimental design
o 3. Homoskedasticity – all errors have constant variance 𝑉𝑎𝑟(𝜖𝑖 ) = 𝜎 2 𝑓𝑜𝑟 𝑎𝑙𝑙 𝑖 = 1, . 𝑛
▪ Residual plot does not exhibit any fan, frown or smile shape – the spread is roughly constant
o 4. Normality – the errors follow a normal distribution
▪ Points in the normal QQ plot lie close to the line
• Decomposing the error (R²):
o R² = 1 − (unexplained/residual variation) / (total variation about the mean), i.e. the proportion of the variation in Y explained by the model.
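A minimal sketch of fitting and checking a simple linear regression in R on made-up data:

```r
set.seed(8)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

fit_lm <- lm(y ~ x)
summary(fit_lm)             # estimates of beta0, beta1, and R-squared

par(mfrow = c(1, 2))
plot(fit_lm, which = 1)     # residuals vs fitted: checks linearity and homoskedasticity
plot(fit_lm, which = 2)     # normal QQ plot: checks normality of the errors
```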

Multiple Regression and Model Selection


• Interpreting model coefficients:
o β₀ is the expected value of Y when all explanatory factors are 0
o β_i is the expected change in Y for a 1 unit increase in the corresponding explanatory factor, holding all other factors constant
• Multiple regression is an extension of regression where you have multiple explanatory
factors:
• There are 3 general approaches to building up the model:
o 1. Backward variable selection – start with full model and remove least informative
o 2. Forward variable selection – start with the null model and add most informative
o 3. Exhaustive search – finds the best combination for every single model possible at every
single integer level of explanatory variables
o 1 and 2 can be done in a stepwise fashion.
• Akaike Information Criterion (AIC) is the most widely known and used model selection method:
o The smaller the AIC, the better the model.
o AIC = n log(Residual sum of squares / n) + 2p, where p = the number of parameters.
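A minimal sketch of AIC-based stepwise selection in R, assuming a hypothetical data frame `df` with response `y` and candidate predictors `x1`, `x2`, `x3`:

```r
set.seed(9)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 1 + 2 * df$x1 + rnorm(100)

full_model <- lm(y ~ x1 + x2 + x3, data = df)
step(full_model, direction = "backward")      # drops the least informative terms by AIC

null_model <- lm(y ~ 1, data = df)
step(null_model, scope = formula(full_model), direction = "forward")   # forward selection
```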

Prediction Intervals and Performance Assessment


• Confidence interval – is predicting for the AVERAGE day/event that has those x measures -
NARROWER
• Prediction interval – is predicting for a PARTICULAR DAY/event (e.g. tomorrow) that has
those x measures - WIDER
• Effect of variance on intervals, for the model Y = Xβ + ε:
o 1. Smaller σ² leads to a better fit and smaller variances for β̂ and ŷ₀
o 2. Larger spread of x leads to more information about Y, which leads to a better fit and smaller variances for β̂ and ŷ₀
o 3. Larger sample size leads to a better fit and smaller variances for β̂ and ŷ₀
o 4. The closer x₀ is to x̄, the smaller the variance of ŷ₀

• Performance is measured in-sample and out-of-sample:


o In-sample – re-testing on the data that was used to fit the model in the first place
▪ R² – largely for regressions
▪ Resubstitution error rate – the proportion of points misclassified when we predict back the data points used to fit the model: (1/n) Σᵢ₌₁ⁿ 1(y_i ≠ ŷ_i)
▪ Confusion matrix:
• Accuracy = 1 − resubstitution error rate
• Sensitivity
• Specificity
• Positive predictive value
• Negative predictive value
o Out-of-sample – measured through a train/test split, which can also be applied through a cross-validation procedure. The metrics measured here include:
▪ The CV error rate – the error when predicting each held-out ‘test’ fold, averaged across the k folds.
▪ Root mean square error: RMSE = sqrt( Σᵢ₌₁ⁿ (y_i − ŷ_i)² / n )
▪ Mean absolute error: MAE = Σᵢ₌₁ⁿ |y_i − ŷ_i| / n
▪ Considerations for choosing k:
o Larger k:
▪ More computationally intensive
▪ Can reduce bias in the predictions (each model is trained on more of the data)
o Smaller k:
▪ Faster to compute
▪ Might build models that aren’t sensible, since each is trained on less data
o Generally, a k between 5 and 10 is selected.
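A minimal sketch of k-fold cross-validation for a regression in base R (k = 5, using the hypothetical `df` data frame assumed above):

```r
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))    # randomly assign rows to folds

rmse <- numeric(k)
for (i in 1:k) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- lm(y ~ x1 + x2 + x3, data = train)
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$y - pred)^2))          # out-of-sample RMSE for this fold
}
mean(rmse)                                          # CV estimate of the RMSE
```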
Logistic Regression
• Logistic Regression is regression where the dependent variable Y is BINARY.
o We say the Y follows a Bernoulli distribution and we model it via a logistic function
• We use a logit function which acts as a link from the linear combination of predictors to the
probability of the outcome being 1.
logit(p) = log( p / (1 − p) )

• The effect of this – is that we’re predicting log-odds with the regression:
• To convert back to a probability, we unravel the logit function:
o Link = Log-odds
o Response = probability
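A minimal sketch of a logistic regression in R on made-up binary data:

```r
set.seed(10)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))     # Bernoulli response

fit_glm <- glm(y ~ x, family = binomial)
summary(fit_glm)                                 # coefficients are on the log-odds scale

predict(fit_glm, newdata = data.frame(x = 1), type = "link")       # predicted log-odds
predict(fit_glm, newdata = data.frame(x = 1), type = "response")   # back-transformed probability
```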

Decision Trees and Random Forests


• Decision trees are a series of if-else statements that rely on a yes/no answer:
• It works by splitting along the axis and trying to create the most homogenous groups.
o A new branch, by default, should decrease the error by at least 1%
• A problem with decision trees is that they can grow too complex too quickly and over-fit the data.
o Also, splits can only be made parallel to an axis, even when a better (e.g. oblique) boundary is available.
• Random forests overcome this issue through growing many trees where each one learns from
different sub-samples of data:
o 1. Choose the number of decision trees to grow, and the number of variables to consider in
each tree
o 2. Randomly select the rows of the data frame with replacement
o 3. Randomly select the appropriate number of variables from the data frame
o 4. Build a decision tree on the resulting data set
o 5. Repeat this procedure a number of times
o 6. A decision is made via majority voting rule based on what all the trees predict
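A minimal sketch (assuming the `rpart` and `randomForest` packages) of a single tree and a random forest on a hypothetical data frame with a factor response `class`:

```r
library(rpart)
library(randomForest)

set.seed(11)
train_df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train_df$class <- factor(ifelse(train_df$x1 + train_df$x2 + rnorm(200, sd = 0.5) > 0, "yes", "no"))

tree_fit <- rpart(class ~ ., data = train_df)              # single decision tree
rf_fit   <- randomForest(class ~ ., data = train_df,
                         ntree = 500, mtry = 1)            # many trees on bootstrap samples

predict(rf_fit, newdata = train_df[1:5, ])                 # majority vote across the trees
```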

Nearest Neighbours
• Nearest neighbours is a classification technique, especially useful when:
o There is high class separation
o We have non-linear combinations of predictors influencing the response
▪ At these times, it is difficult/impossible to use logistic regression.
▪ Decision trees can be used, but can often be:
• Complicated
• Overfit
• Error prone
• K-nn is a non-parametric algorithm that votes on the class of new data by looking at the majority
class of the k-nearest neighbours.
o The class labels of the existing training points determine how new points are classified, so the quality of those original assignments affects how it all plays out.
o Typically the Euclidean distance is used between the points
▪ The assumption is that points with similar proximity are similar classes
• The kNN classifier doesn’t pre-process the training data so it can complete rather slowly
o However, the distance metric only makes sense for quantitative predictors, not categorical.
o Considered a lazy learning algorithm
o Constructed algorithm is discarded along with any intermediate results
• Advantages:
o kNN is easy to understand
o Doesn’t require any pre-processing time
o Analytically tractable and simple implementation
o Performance improves as sample size grows
o Uses local information and is highly adaptive
o Can easily be parallelized
• Disadvantages:
o Predictions can be slow (the data is processed at prediction time) – computationally intensive
o Large storage requirements
o Usefulness depends on the geometry of the data
▪ Clustering and differences in scale can have a negative effect (predictors are usually standardised first)
• Cross-validation can be used to optimize the choice of k to minimize error
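A minimal sketch (assuming the `class` package) of k-nearest neighbours on scaled, hypothetical training and test sets:

```r
library(class)

set.seed(12)
train_x <- scale(matrix(rnorm(100 * 2), ncol = 2))    # standardise so distances are comparable
train_y <- factor(ifelse(train_x[, 1] + train_x[, 2] > 0, "yes", "no"))
test_x  <- scale(matrix(rnorm(20 * 2), ncol = 2))

knn(train = train_x, test = test_x, cl = train_y, k = 5)   # majority vote of the 5 nearest points
```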

Dimension Reduction (PCA)


• Principal Component Analysis (PCA) produces a low-dimensional representation of the data-set.
• It finds a linear combination of the existing variables that have maximal variance and are
mutually uncorrelated.
o It does this itself – and is unsupervised learning.
• The benefits:
o There may be too many predictors for a regression
o Can understand relationships that you did not know existed previously – similar to a
correlation matrix
o Data visualization is enabled – as it is typically difficult/impossible for 3+ axis
• The first PC accounts for as much variation as possible, in terms of the variance of the projections onto its direction
• The second PC accounts for as much of the remaining variation as possible, while being uncorrelated with the first
• The number of PC’s chosen aims to explain as much variation as possible while minimizing the
complexity
o This is visualized via scree-plot
• Interpreting the biplot looks at the relationship between the PCs and the explanatory variables
o Murder, Assault and Rape are closely associated with PC1 – with Florida being the most extreme on that dimension
o UrbanPop is associated with PC2, with California being high and Vermont being low.
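The variables named above match R’s built-in USArrests data, so a minimal sketch of the analysis might be:

```r
pca <- prcomp(USArrests, scale. = TRUE)   # scale the variables before finding the PCs

summary(pca)      # proportion of variance explained by each PC
screeplot(pca)    # scree plot: how many PCs are worth keeping
biplot(pca)       # states and variable loadings on PC1 vs PC2
```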

Clustering
• Clustering is an unsupervised learning method of grouping together a set of objects into similar
classes
• K-means clustering is a process of classifying data-points based on its closest cluster
o To use this, you must specify the K groups you will cluster into (either permuting to
find the lowest error, or based on prior knowledge)
o 1. Randomly assign a number from 1 to K to each of the observations
o 2. Iterate until the cluster assignments stop changing:
▪ A) For each of the K clusters, compute the cluster centroid.
▪ B) Re-assign each observation to the cluster whose centroid is closest, as decided by Euclidean distance.
• Advantages of K-means include:
o Can be much faster than hierarchical clustering
o Nice theoretical framework that’s easy to interpret
o Can incorporate new data and reform clusters easily
• Hierarchical clustering is a method that produces a tree / dendrogram – which avoids having to specify the number of groups (i.e. K) up front
o 1. Bottom up – agglomerative clustering
o 2. Top down – divisive clustering
• Advantages of Hierarchical clustering:
o Don’t need to know how many clusters you’re after
o Can cut hierarchy at any level to a get any number of clusters
o Easy to interpret hierarchy for particular applications
o Deterministic
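A minimal sketch of both approaches in R on made-up two-dimensional data:

```r
set.seed(13)
x <- rbind(matrix(rnorm(50, 0), ncol = 2),
           matrix(rnorm(50, 3), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 25)   # K-means with K = 2 and several random starts
km$cluster

hc <- hclust(dist(x))                       # agglomerative (bottom-up) hierarchical clustering
plot(hc)                                    # dendrogram
cutree(hc, k = 2)                           # cut the tree to obtain 2 clusters
```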
