Gea Cheatsheet
3 types of research questions:
1) Makes an estimate about the population
2) Tests a claim about the population
3) Compares 2 sub-populations/investigates a r/s between 2 variables in the population

Categorical variables:
1) Ordinal (using numbers to represent ordering)
2) Nominal (no intrinsic ordering)

Numerical variables:
1) Discrete
2) Continuous

Generalisability Criteria: (to generalise findings from sample to population)
1) Good sampling frame (equal to or larger than the population)
2) Probability-based sampling (to minimise selection bias)
3) Use large sample size (to reduce variability/random errors in sample)
4) Minimise non-response rate (to reduce non-response bias)

Types of Biases and Errors associated with different sampling methods
1) Selection bias (imperfect sampling frame, non-probability sampling)
2) Non-response bias (disinterested, inconvenient, sensitive info)

**Important: every unit in the sampling frame must have a known, non-zero probability of being selected (the probabilities need not be equal)

4 main types of probability sampling:
1) Simple Random Sampling (SRS)
● Units selected randomly without replacement with equal chance of being selected
● Diff samples selected from the same sampling frame using SRS would differ; any variability is due to chance
Pro: good rep of population
Con: subject to non-response; accessibility of information

2) Systematic sampling
Assuming a selection interval of k units
● SRS a starting point from the first k units, then select every k-th unit after it
Pro: simpler than SRS as there is no need to know the no. of sampling units
Con: potentially under-represents the population if the list is not random (potential selection bias)

3) Stratified Sampling
● Break down population into strata (each similar in nature but may vary in size), sample generated through doing SRS for each stratum (may need to take weighted average)
Pro: good representative sample for each stratum
Con: need info on sampling frame & stratum

4) Cluster Sampling
● Break down population into clusters, then SRS a fixed number of clusters
Pro: simpler, less time-consuming, less costly
Con: HIGH VARIABILITY due to dissimilar clusters/small number of clusters

Summary statistics:
Spread of data/measure of dispersion: S.D., variance, IQR
Central tendencies: mean, mode, median

Types of study design:
1) Experimental design
Blinding: subjects don't know whether they are in treatment or control grp
Double-blinding: same as above, but both subjects and assessors are blinded
Placebo effect: response observed when subjects receive a placebo treatment, but still show some positive effects
2) Observational studies (still have control & treatment grps)
Researchers don't directly manipulate one variable to cause an effect on the other
**Observational studies do not provide convincing evidence of a causal relationship; they can only provide evidence of association

Simpson's Paradox:
A phenomenon where a trend seen in more than half of the groups disappears/reverses when the groups are combined (direction of association reversed)
**Simpson's paradox implies a confounder is present, but a confounder does not always produce Simpson's paradox

Confounders:
A third variable associated with both the independent and dependent variables whose relationship we are investigating
**To show association b/w the suspected confounder and the IV/DV, test rate(IV or DV | suspected confounder) (always vary the one behind the bar!)

Effects of confounders on a study:
Must measure a variable to check if it is a confounder
= BUT need to collect data on lots of variables to identify THE one
= not feasible (costly & difficult to analyse)
= may still have hidden confounders affecting results of the study (only able to obtain a limited conclusion)

Data slicing: (used for observational studies ONLY)
● Sample size will not change because of slicing
● Simply categorising data using a 3rd variable (suspected confounder)
● Slicing to investigate presence of Simpson's paradox

To remove the effects of a confounder, use randomised assignment to remove the association b/w the confounder and either the dependent or independent variable
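As a rough illustration of data slicing and Simpson's paradox, the sketch below compares success rates for two treatments, first sliced by a third variable and then with the groups combined. The counts, the treatment labels "A"/"B" and the slicing variable (stone size) are illustrative assumptions, not part of these notes.

```python
# Illustrative counts only (an assumption, not from the cheatsheet):
# (successes, total) for each treatment, sliced by a suspected confounder.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    """Success rate within one group."""
    return successes / total

def sliced_rates(table):
    """rate(success | treatment, slice) for every slice."""
    return {t: {g: rate(*c) for g, c in groups.items()} for t, groups in table.items()}

def overall_rates(table):
    """rate(success | treatment) after combining the slices."""
    combined = {}
    for t, groups in table.items():
        successes = sum(c[0] for c in groups.values())
        total = sum(c[1] for c in groups.values())
        combined[t] = rate(successes, total)
    return combined

sliced = sliced_rates(data)
overall = overall_rates(data)

# Slicing only categorises the data, so the total sample size is unchanged.
total_units = sum(c[1] for groups in data.values() for c in groups.values())

for t in data:
    print(t, {g: round(r, 3) for g, r in sliced[t].items()}, "overall:", round(overall[t], 3))
```

With these counts, A has the higher rate inside every slice but the lower rate once the slices are combined, which is the reversal the paradox describes.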
Symmetry rule
Probability rules:
● 0 ≤ P(E) ≤ 1
● P(S) = 1 if S is the entire sample space
● Mutually exclusive events E and F: P(E ∪ F) = P(E) + P(F)
● Uniform probability: each outcome has probability 1/(size of sample space)
● Conditional probability: P(E | F) = P(E ∩ F) / P(F)
● Given that E and F are mutually exclusive events that make up the entire sample space, for any event G: P(G) = P(G | E) × P(E) + P(G | F) × P(F)
● Independent events A and B:
Definition 1: P(A) = P(A | B)
Definition 2: P(A ∩ B) = P(A) × P(B)

Data visualisation for One variable EDA (univariate data)
1) Histograms
2) Boxplots

Histograms
● Shape of graph: peaks (eg uni/bimodal), skewness
● Center: mean, median, mode
● Spread about central tendency: range, stdev
● Deviations from the pattern: outliers

Histograms vs Bar Graphs
1) Histograms show the distribution of a numerical variable across a number line while bar graphs make comparisons across categories of a variable
2) Ordering of histogram bars cannot be changed, unlike bar graphs
3) No gaps b/w bars for histograms, unlike bar graphs

Boxplots (features of a boxplot: min, Q1, median, Q3, max)

Using Linear Regression to predict one variable based on the other for NON-LINEAR models

Correlation Coefficient (r)
● A measure of linear association (direction and strength)
● Strength of association (by |r|): 0 to 0.3 (weak), 0.3 to 0.7 (moderate), 0.7 to 1.0 (strong)
● As r approaches -1 or 1, data points fall more closely to the regression line
*r=0 does not necessarily imply that there is no association b/w the 2 variables. r=0 or a small r value could be due to a non-linear r/s b/w variables

Properties of Correlation Coefficient (r)
r value is not affected by:
1) Interchanging the x and y axis (swapping the 2 variables)
2) Adding a constant to all values of a variable
3) Multiplying a positive number to all values of a variable
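The properties of r listed above can be checked numerically. The following is a minimal sketch, assuming the usual Pearson formula for r; the helper name `pearson_r` and the small data sets are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen only to keep the arithmetic simple.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r = pearson_r(xs, ys)
r_swapped = pearson_r(ys, xs)                    # 1) swap the two variables
r_shifted = pearson_r([x + 10 for x in xs], ys)  # 2) add a constant to one variable
r_scaled = pearson_r([3 * x for x in xs], ys)    # 3) multiply one variable by a positive number

# A perfect but non-linear (quadratic) relationship can still give r = 0.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

print(r, r_swapped, r_shifted, r_scaled, r_curve)
```

The first four values agree (up to floating-point rounding), while the quadratic example gives r = 0 even though y is fully determined by x.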
Limitations of Correlation Coefficient (r)
1) Correlation b/w 2 variables can only suggest a statistical relationship (can use the value of one variable to obtain the AVERAGE value of the other variable), NOT a causal relationship
2) r value does not measure non-linear association b/w 2 variables. r value could be very small in cases where association b/w 2 variables is non-linear
3) Outliers may increase OR decrease the strength of the correlation. Sometimes, removing outliers may have minimal effect on r value.

A point is considered an outlier if it is < Q1-1.5*IQR or > Q3+1.5*IQR

Confidence Intervals
Interpreting the confidence interval: (eg) 95% CI: 0.254 ± 0.0191
“We are 95% confident (=/= chance) that the population parameter we are looking for lies within the confidence interval”
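To make the "estimate ± margin" form of a confidence interval concrete, here is a small sketch for a sample proportion. It assumes the common normal-approximation formula p ± z·sqrt(p(1 − p)/n) with z = 1.96 for 95%. The sample (508 out of 2000) is hypothetical and was picked only because it gives numbers close to the 0.254 ± 0.0191 example; these notes do not state the sample size.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Return (sample proportion, margin of error) via the normal approximation.

    z = 1.96 corresponds to a 95% confidence level (an assumed default).
    """
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

# Hypothetical sample: 508 "yes" responses out of 2000 (assumed, not from the notes).
p_hat, margin = proportion_ci(508, 2000)
lower, upper = p_hat - margin, p_hat + margin

print(f"95% CI: {p_hat:.3f} ± {margin:.4f} -> ({lower:.4f}, {upper:.4f})")
```

A different random sample of the same size would give a different `p_hat`, and therefore a different interval.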
For each simple random sample you construct a confidence interval. The confidence interval differs from sample to sample as the sample proportion is different for each sample

Data visualisation for Two variable EDA (bivariate data)
1) Scatter plots (to get an idea of pattern formed b/w 2 variables)
2) Correlation Coefficient (to check for linear relationship)
3) Linear Regression (to fit a line to data for making predictions)

Scatter Plots
● Direction: positive, negative, neutral
● Form: linear vs non-linear
● Strength of linear r/s between 2 variables (RELATED to gradient of regression line but NOT EQUIVALENT)
● Deviations from the pattern: outliers

Linear Regression
● Using one variable to predict another
● Models the relationship b/w variables X and Y by a straight line Y=mX+b
Use least squares method to determine best fit line for our data.
(Least Squares Method)
Define the i-th residual of the observation: e_i = difference between the observed and predicted outcome.
Want to minimise e_1² + e_2² + ... + e_n² (the sum of squared residuals), where n = no. of data points

Properties of Regression Line
1) Regression line always passes through