Gea1000 Cheatsheet Finals
Sampling frame: source from which the sample (a portion of the population, which is the entire group we want to know about) is
drawn. May not cover population of interest and may contain ppl not in population of interest. To be generalisable: sampling
frame should cover (be equal to or larger than) the target population, large sample size, prob based sampling, min non response
Types of research qns: 1. Make estimate abt population (What is avg no. of hours students study each wk?) 2. Test a claim
about the population (Does the majority of students qualify for student loans?) 3. Compare 2 sub-populations/investigate a
relationship btwn 2 variables in the population (In sch X, do female students hv higher GPA than male students? OR does
drinking coffee help students pass the math exam)
Sampling methods- Probability sampling (via a known/random mechanism whr every unit in the population has non-zero
and known probability of being chosen. Eliminates biases associated with human selection)
1. Simple random sampling (ADV: sample tends to be gd representation of population. Disadv:Subject to non-
response, limited access to info as selected ppl may be located in diff geographical location)
2. Systematic sampling (ADV: simpler selection process than SRS. DISADV: May not be representative of population
if list is non-random)
3. Stratified sampling: population broken down into strata, where units within each stratum are similar in nature but size may
vary across strata; SRS is then applied within every stratum. ADV: Can get representative sample from every
stratum. DISADV: Need info about sampling frame & strata
4. Cluster sampling: break down population into clusters, then randomly choose a fixed number of clusters
and include all observations from those clusters. ADV: Less tedious,
costly and time-consuming than other methods. DISADV: High variability due to dissimilar clusters or
small no. of clusters
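The four probability sampling methods above could be sketched in Python roughly as follows. This is a minimal illustration, not part of the original notes: the function names, the toy frame of 100 numbered units, and the strata/cluster groupings are all assumptions.

```python
import random

def simple_random_sample(frame, n, rng):
    # Every unit in the frame has the same chance of being chosen.
    return rng.sample(frame, n)

def systematic_sample(frame, k, rng):
    # Random start among the first k units, then every k-th unit after it.
    start = rng.randrange(k)
    return frame[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    # SRS inside every stratum (strata: dict of name -> list of units).
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Randomly pick whole clusters, then keep ALL units in the chosen clusters.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
frame = list(range(1, 101))  # hypothetical sampling frame: units 1..100
strata = {"year1": list(range(1, 61)), "year2": list(range(61, 101))}  # hypothetical strata
clusters = {f"class{i}": list(range(i * 10 + 1, i * 10 + 11)) for i in range(10)}  # hypothetical clusters

srs = simple_random_sample(frame, 10, rng)
sys_s = systematic_sample(frame, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = cluster_sample(clusters, 2, rng)
print(srs, sys_s, strat, clus, sep="\n")
```

Note how the stratified sample takes units from every stratum, while the cluster sample takes everything from only a few clusters.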
Non-probability sampling methods: 1) Convenience sampling (select & NR Bias) 2)Volunteer sampling(Select bias)
Ordinal categorical -> Natural ordering, numbers used to represent ordering eg. happiness index 0-10. Nominal
categorical -> No intrinsic ordering, eg eye color
Discrete numerical -> Numbers have “gaps” eg population (gap of 1 person). Continuous numerical -> all possible
numbers in a given range eg. time, length
Mean properties : 1. Adding constant value to all data points changes mean by that constant. Multiplying all by
constant also multiplies the mean by the constant. Limitations: Does not tell the distribution over the total n, does not
tell about frequency of occurrence of values of the numerical variables.
Median: Middle value when data set arranged in ascending/descending order. PROPERTIES: Adding constant to all
points changes median by that value, same for multiplying. Median does not reveal the total value, frequency of
occurrence or distribution of data points (similar to mean).
Mode: Most frequently occurring value of a variable, can be used for both numerical and categorical variables. Not useful when
all values are unique
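A quick check of the mean, median and mode properties with Python's standard `statistics` module. The data set and the constants 5 and 3 are made up purely for illustration.

```python
import statistics

data = [2, 4, 4, 7, 13]              # hypothetical data set
shifted = [x + 5 for x in data]      # add a constant to every point
scaled = [x * 3 for x in data]       # multiply every point by a constant

mean, median = statistics.mean(data), statistics.median(data)

# Adding 5 shifts both mean and median by 5.
print(statistics.mean(shifted) - mean, statistics.median(shifted) - median)

# Multiplying by 3 multiplies both mean and median by 3.
print(statistics.mean(scaled) / mean, statistics.median(scaled) / median)

# Mode is the most frequent value; with all-unique values every value ties.
mode = statistics.mode(data)
all_tied = statistics.multimode([1, 2, 3])
print(mode, all_tied)
```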
SD = sqrt(variance)
Properties of SD: Always non-negative. Adding constant value to all data points doesn’t change SD. Multiplying
changes SD (SD multiplied by the absolute value of the constant)
2.Interquartile range (IQR) : Q3 – Q1. A small IQR means middle 50% of data have narrow spread. Big IQR=wide
spread. Properties of IQR: 1.Non-negative 2. Adding constant to all data points does not change the IQR 3. BUT
multiplying by constant also multiplies IQR by the absolute value of the constant amt.
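The SD and IQR properties can be checked numerically. A minimal sketch, assuming the sample SD (n - 1 divisor) and the default quartile convention of `statistics.quantiles`; the course's own quartile method may give slightly different Q1/Q3 values, but the shift/scale properties hold either way. The data and constants are hypothetical.

```python
import math
import statistics

def iqr(values):
    # Q3 - Q1 using the default ("exclusive") quartile method of statistics.quantiles.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

data = [1, 3, 4, 6, 8, 9, 12, 15]    # hypothetical data set
shifted = [x + 10 for x in data]     # add a constant
scaled = [x * -2 for x in data]      # multiply by a negative constant

sd, spread = statistics.stdev(data), iqr(data)

# Adding a constant leaves SD and IQR unchanged.
print(statistics.stdev(shifted), iqr(shifted))

# Multiplying by c multiplies SD and IQR by |c| (here |-2| = 2), so both stay non-negative.
print(statistics.stdev(scaled) / sd, iqr(scaled) / spread)
```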
2. rate(A|B) < rate(A|NB) ⇔ rate(B|A) < rate(B|NA) 3. rate(A|B) = rate(A|NB) ⇔ rate(B|A) = rate(B|NA)
Overall rate of A always between rate(A|B) and rate(A|NB). The closer rate(B) is to 100%, the closer rate(A) is to
rate(A|B). If rate(B)=50%, rate(A)=[rate(A|B) + rate(A|NB)] / 2.
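The rate rules above can be verified on a small 2x2 table. The counts below are invented for illustration; exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction

# Hypothetical 2x2 counts: rows A / not A, columns B / not B.
a_b, na_b = 45, 45      # A and B, not-A and B
a_nb, na_nb = 2, 8      # A and not-B, not-A and not-B
total = a_b + na_b + a_nb + na_nb

rate_a_given_b = Fraction(a_b, a_b + na_b)
rate_a_given_nb = Fraction(a_nb, a_nb + na_nb)
rate_b_given_a = Fraction(a_b, a_b + a_nb)
rate_b_given_na = Fraction(na_b, na_b + na_nb)
rate_b = Fraction(a_b + na_b, total)
rate_a = Fraction(a_b + a_nb, total)

# Symmetry rule: the comparison points the same way from both sides.
same_direction = (rate_a_given_b > rate_a_given_nb) == (rate_b_given_a > rate_b_given_na)

# Overall rate of A is a weighted average of the two conditional rates,
# with weights rate(B) and 1 - rate(B).
weighted = rate_b * rate_a_given_b + (1 - rate_b) * rate_a_given_nb

print(same_direction, rate_a, weighted)
```

Because rate(B) is 90% here, rate(A) lands much closer to rate(A|B) than to rate(A|NB).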
Simpson's paradox: the association in the majority of individual subgroups is opposite to the overall association. Eg overall, treatment Y
positively associated with success, but within both large and small stones, treatment X was positively associated with success.
Simpson's paradox implies presence of confounders BUT CONVERSE IS NOT TRUE. Slicing used to control confounders.
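A small numerical sketch of the kidney stone example. The counts are illustrative assumptions chosen to reproduce the reversal described in the notes (they are not taken from the notes themselves); slicing by stone size is what exposes the paradox.

```python
from fractions import Fraction

# Illustrative (assumed) counts: treatment -> stone size -> (successes, patients).
counts = {
    "X": {"small": (81, 87), "large": (192, 263)},
    "Y": {"small": (234, 270), "large": (55, 80)},
}

def subgroup_rate(treatment, size):
    successes, patients = counts[treatment][size]
    return Fraction(successes, patients)

def overall_rate(treatment):
    successes = sum(s for s, _ in counts[treatment].values())
    patients = sum(p for _, p in counts[treatment].values())
    return Fraction(successes, patients)

for size in ("small", "large"):
    # Within each slice, X has the higher success rate.
    print(size, float(subgroup_rate("X", size)), float(subgroup_rate("Y", size)))

# Combined, Y looks better: the subgroup trend reverses.
print("overall", float(overall_rate("X")), float(overall_rate("Y")))
```

The reversal arises because stone size is a confounder: in these counts X is used mostly on large stones (harder cases) and Y mostly on small stones.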
CHAP 3: 1.Histograms : Left skewed means peak is on the right (tail to the left). Vice versa for right skewed. Symmetric means peak in
middle. Symmetric: mean, mode and median very close to each other. Left skewed -> typically mean < median < mode.
Right skewed -> typically mean > median > mode. Higher variability = wider spread in data.
OUTLIER for box plot IF greater than Q3 + 1.5 x IQR or less than Q1 – 1.5 x IQR. Q1, Q3, Median, min, max shown
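The 1.5 x IQR rule can be sketched as below. The data set is hypothetical, and the quartiles use the default convention of `statistics.quantiles`, which is an assumption: other quartile conventions can shift the fences slightly.

```python
import statistics

def boxplot_outliers(values):
    # Flag points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < lower or x > upper], (lower, upper)

data = [1, 3, 4, 6, 8, 9, 12, 15, 40]   # hypothetical data with one extreme value
outliers, fences = boxplot_outliers(data)
print(outliers, fences)
```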
NORMAL DIST : N(x,y) means a normal distribution with mean x, and variance of y. (smaller y means thinner and taller bell
shape graph)
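To see why a smaller variance gives a thinner, taller bell, compare the peak heights of two normal curves. A minimal sketch: note that Python's `statistics.NormalDist` takes the standard deviation, not the variance, so the helper (a hypothetical name) converts N(mean, variance) notation first.

```python
import math
from statistics import NormalDist

def normal_from_variance(mean, variance):
    # N(mean, variance) in the notes' notation -> NormalDist(mu, sigma = sqrt(variance)).
    return NormalDist(mu=mean, sigma=math.sqrt(variance))

narrow = normal_from_variance(0, 1)   # N(0, 1)
wide = normal_from_variance(0, 4)     # N(0, 4)

# Height of the curve at the mean: the smaller variance gives the taller peak.
print(narrow.pdf(0), wide.pdf(0))
```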
The larger the sample size, the smaller the margin of error, the narrower
the confidence interval. 95% confidence interval means if we collect
many random samples and construct a confidence interval for each of
them, eg 100 simple random samples from population: approx. 95
confidence intervals would contain the population parameter.
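The repeated-sampling interpretation can be simulated. This sketch assumes the usual normal-approximation interval for a proportion, p_hat ± 1.96 x sqrt(p_hat(1 - p_hat)/n), which is not spelled out in the notes above; the true proportion 0.3, the sample sizes and the number of repetitions are arbitrary choices.

```python
import math
import random

Z_95 = 1.96  # z-value for a 95% confidence level

def margin_of_error(p_hat, n):
    return Z_95 * math.sqrt(p_hat * (1 - p_hat) / n)

def coverage(true_p, n, n_samples, rng):
    # Fraction of simulated samples whose interval contains the true proportion.
    hits = 0
    for _ in range(n_samples):
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        moe = margin_of_error(p_hat, n)
        if p_hat - moe <= true_p <= p_hat + moe:
            hits += 1
    return hits / n_samples

rng = random.Random(1)  # fixed seed so the sketch is repeatable
cov = coverage(true_p=0.3, n=400, n_samples=1000, rng=rng)
print(f"share of intervals containing the true proportion: {cov:.3f}")

# Quadrupling the sample size halves the margin of error.
print(margin_of_error(0.3, 400), margin_of_error(0.3, 1600))
```

The simulated share should land close to 0.95, though any single run will vary a little around it.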