GEA1000 Cheatsheet Finals

Quantitative reasoning with data (National University of Singapore)


Sampling frame: the source from which the sample is drawn. (Population: the entire group we want to know about. Sample: a
proportion of the population.) The frame may not cover the whole population of interest and may contain people who are not
in it. For results to be generalisable: sampling frame equal to or larger than the target population, large sample size,
probability-based sampling, minimal non-response.

Types of research questions: 1. Make an estimate about the population (What is the average number of hours students study
each week?) 2. Test a claim about the population (Does the majority of students qualify for student loans?) 3. Compare 2
sub-populations or investigate a relationship between 2 variables in the population (In school X, do female students have
a higher GPA than male students? OR Does drinking coffee help students pass the math exam?)

Sampling methods: Probability sampling (selection via a known random mechanism in which every unit in the population has a
known, non-zero probability of being chosen. Eliminates biases associated with human selection)

1. Simple random sampling (SRS). ADV: sample tends to be a good representation of the population. DISADV: subject to
non-response; limited access to information, as selected people may be located in different geographical locations
2. Systematic sampling. ADV: simpler selection process than SRS. DISADV: may not be representative of the population
if the list is non-random
3. Stratified sampling: the population is broken down into strata; units within each stratum are similar in nature, but
size may vary across strata. SRS is then carried out within every stratum. ADV: can get a representative sample from
every stratum. DISADV: need information about the sampling frame and the strata
4. Cluster sampling: the population is broken down into clusters, a fixed number of clusters is chosen at random, and
all observations from the chosen clusters are included. ADV: less tedious, costly and time-consuming than other
methods. DISADV: high variability due to dissimilar clusters or a small number of clusters
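
The four probability sampling methods above can be sketched in a few lines of Python. This is only an illustrative sketch: the population of 100 unit IDs, the strata names (`year1`, `year2`) and the cluster names (`hall0` to `hall9`) are made-up assumptions, not part of the notes.

```python
import random

def simple_random_sample(units, n, rng):
    """SRS: draw n units without replacement, each equally likely."""
    return rng.sample(units, n)

def systematic_sample(units, k, rng):
    """Pick a random start among the first k units, then take every k-th unit."""
    start = rng.randrange(k)
    return units[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Run an SRS inside every stratum (strata: dict of name -> list of units)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)                    # fixed seed so the sketch is reproducible
population = list(range(1, 101))          # hypothetical sampling frame of 100 unit IDs
strata = {"year1": population[:60], "year2": population[60:]}   # strata may differ in size
clusters = {f"hall{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 2, rng)
```

Note that the stratified sample always contains units from every stratum, while the cluster sample contains all units from only a few clusters, which is where its higher variability comes from.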

Non-probability sampling methods: 1) Convenience sampling (selection bias and non-response bias) 2) Volunteer sampling
(selection bias)

Ordinal categorical -> Natural ordering, numbers used to represent ordering eg. happiness index 0-10. Nominal
categorical -> No intrinsic ordering, eg eye color

Discrete numerical -> Numbers have “gaps” eg population (gap of 1 person). Continuous numerical -> all possible
numbers in a given range eg. time, length

Mean properties: 1. Adding a constant to all data points changes the mean by that constant. 2. Multiplying all data points
by a constant multiplies the mean by that constant. Limitations: the mean alone does not show how the total is distributed
over the n data points, nor the frequency of occurrence of the values of the numerical variable.

Median: Middle value when the data set is arranged in ascending/descending order. PROPERTIES: Adding a constant to all
data points changes the median by that constant; multiplying all data points by a constant multiplies the median by that
constant. Like the mean, the median does not reveal the total value, the frequency of occurrence or the distribution of
the data points.

Mode: The value that occurs most frequently; can be used for both numerical and categorical variables. Not useful when
all values are unique

Measures of spread: 1. Standard deviation (SD) and sample variance. SD = sqrt(variance)

Properties of SD: Always non-negative. Adding a constant to all data points does not change SD. Multiplying all data
points by a constant c multiplies SD by |c| (the absolute value of the constant)

2. Interquartile range (IQR): Q3 – Q1. A small IQR means the middle 50% of the data have a narrow spread; a big IQR means
a wide spread. Properties of IQR: 1. Non-negative 2. Adding a constant to all data points does not change the IQR 3. BUT
multiplying all data points by a constant multiplies the IQR by the absolute value of that constant.
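
The shift and scale properties of the mean, median, SD and IQR can be checked numerically. The sketch below uses made-up data and Python's `statistics` module; the quartile convention is Python's default (`exclusive`) method, which is an assumption and may differ slightly from the convention used in the course.

```python
import math
import statistics

def iqr(values):
    """IQR = Q3 - Q1, using Python's default 'exclusive' quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def summary(values):
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),    # sample SD (variance divides by n - 1)
        "iqr": iqr(values),
    }

data = [2, 4, 4, 5, 7, 9, 11]              # made-up data points
shift, c = 10, -3                          # constants to add and to multiply by

base = summary(data)
shifted = summary([x + shift for x in data])
scaled = summary([x * c for x in data])

# Adding a constant moves the mean and median, but leaves SD and IQR unchanged.
# Multiplying by c multiplies the mean and median by c, and SD and IQR by |c|.
for name in ("mean", "median", "sd", "iqr"):
    print(name, base[name], shifted[name], scaled[name])
```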

Experimental study: treatment assigned by the researcher; can provide evidence of a cause-and-effect relationship.
Observational study: treatment assigned by the subjects themselves, with no direct manipulation of the independent
variable; can provide evidence of association.

Marginal rate: rate(Y) = 350/1050, rate(success) = 831/1050

Conditional rate: rate(success | X) = 542/700, rate(X | success) = 542/831

Joint rate (NOT a conditional rate): rate(Y and fail) = 61/1050
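
The rates above can be recomputed from a 2x2 table of counts. The table itself is not shown in these notes, so the four cell counts below are inferred from the quoted fractions (for example, 700 - 542 = 158 failures under X); treat them as a reconstruction, not as the original table.

```python
from fractions import Fraction

# Cell counts inferred from the rates quoted in the notes (treatment, outcome) -> count.
counts = {
    ("X", "success"): 542, ("X", "fail"): 158,
    ("Y", "success"): 289, ("Y", "fail"): 61,
}
total = sum(counts.values())

def count_where(treatment=None, outcome=None):
    """Sum the cells matching the given treatment and/or outcome (None = any)."""
    return sum(n for (t, o), n in counts.items()
               if treatment in (None, t) and outcome in (None, o))

# Marginal rates: one variable only, out of everyone.
rate_y = Fraction(count_where(treatment="Y"), total)
rate_success = Fraction(count_where(outcome="success"), total)

# Conditional rates: the denominator is restricted to the given group.
rate_success_given_x = Fraction(count_where("X", "success"), count_where(treatment="X"))
rate_x_given_success = Fraction(count_where("X", "success"), count_where(outcome="success"))

# Joint rate: both conditions at once, out of everyone.
rate_y_and_fail = Fraction(count_where("Y", "fail"), total)
```

The only difference between a joint rate and a conditional rate is the denominator: everyone for the joint rate, the "given" group for the conditional rate.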


Association (not causation, since the data come from an observational study)

ABSENT if rate(A|B) = rate(A|NB), where NB means "not B".

PRESENT if rate(A|B) > rate(A|NB) (positive association) OR rate(A|B) < rate(A|NB) (negative association)

Symmetry rule: 1. rate(A|B) > rate(A|NB) ⇔ rate(B|A) > rate(B|NA)

2. rate(A|B) < rate(A|NB) ⇔ rate(B|A) < rate(B|NA)

3. rate(A|B) = rate(A|NB) ⇔ rate(B|A) = rate(B|NA)

The overall rate(A) always lies between rate(A|B) and rate(A|NB). The closer rate(B) is to 100%, the closer rate(A) is to
rate(A|B). If rate(B) = 50%, rate(A) = [rate(A|B) + rate(A|NB)] / 2.

If rate(A|B) = rate(A|NB), then rate(A)=rate(A|B)= rate(A|NB)
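
These rules follow from the overall rate being a weighted average of the two conditional rates, with weights rate(B) and 1 - rate(B). A minimal sketch, taking A = success, B = treatment X and NB = treatment Y, with rate(success | Y) = 289/350 inferred from the fractions quoted earlier (an assumption, since the table is not shown):

```python
from fractions import Fraction

def overall_rate(rate_b, rate_a_given_b, rate_a_given_nb):
    """rate(A) = rate(B) * rate(A|B) + (1 - rate(B)) * rate(A|NB)."""
    return rate_b * rate_a_given_b + (1 - rate_b) * rate_a_given_nb

rate_x = Fraction(700, 1050)          # rate(B)
succ_given_x = Fraction(542, 700)     # rate(A|B)
succ_given_y = Fraction(289, 350)     # rate(A|NB), inferred

overall = overall_rate(rate_x, succ_given_x, succ_given_y)

# With rate(B) = 50% the weights are equal, so rate(A) is the simple average.
half = overall_rate(Fraction(1, 2), succ_given_x, succ_given_y)

# With equal conditional rates, the overall rate equals both of them.
equal = overall_rate(rate_x, succ_given_x, succ_given_x)
```

Here rate(X) is about 67%, so the overall success rate sits closer to rate(success | X) than to rate(success | Y).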

Simpson's paradox: occurs when the association in the majority of the individual subgroups is opposite to the overall
association. E.g. overall, treatment Y is positively associated with success, but within both large stones and small
stones, treatment X is positively associated with success.

Simpson's paradox implies the presence of confounders, BUT THE CONVERSE IS NOT TRUE. Slicing is used to control for
confounders.
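
Slicing by stone size can be sketched as follows. The subgroup split below is a hypothetical illustration: the notes only give the overall totals (542/700 for X and, by inference, 289/350 for Y), and the large/small counts were chosen here simply so that they add up to those totals.

```python
from fractions import Fraction

# (successes, total) per treatment within each stone-size slice.
# HYPOTHETICAL split, chosen only to be consistent with the overall totals.
subgroups = {
    "large": {"X": (381, 526), "Y": (55, 80)},
    "small": {"X": (161, 174), "Y": (234, 270)},
}

def rate(successes, total):
    return Fraction(successes, total)

def overall(treatment):
    """Success rate of a treatment with the slices combined."""
    successes = sum(g[treatment][0] for g in subgroups.values())
    total = sum(g[treatment][1] for g in subgroups.values())
    return rate(successes, total)

# Slices in which X has the higher success rate.
x_better_in = [name for name, g in subgroups.items()
               if rate(*g["X"]) > rate(*g["Y"])]
x_better_overall = overall("X") > overall("Y")

# Paradox: X is better in the majority of slices, yet not better overall.
paradox = len(x_better_in) > len(subgroups) / 2 and not x_better_overall
```

In this illustration X is used mostly on large stones, which have lower success rates under either treatment, so stone size acts as the confounder.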

CHAP 3: 1. Histograms: Left skewed means the peak is on the right; vice versa for right skewed. Symmetric means the peak
is in the middle, with mean, median and mode very close to each other. Left skewed -> mean < median < mode.
Right skewed -> mean > median > mode. HIGHER variability -> wider range in data.

OUTLIER for box plot IF greater than Q3 + 1.5 x IQR or less than Q1 – 1.5 x IQR. The box plot shows Q1, Q3, median, min
and max.
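
A minimal sketch of the 1.5 x IQR outlier rule on made-up data. The quartiles come from Python's default (`exclusive`) method in `statistics.quantiles`, which is an assumption; other quartile conventions can give slightly different fences.

```python
import statistics

def boxplot_outliers(values):
    """Return the values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

data = [2, 4, 4, 5, 7, 9, 11, 30]     # made-up data with one extreme value
outliers = boxplot_outliers(data)
```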

Correlation coefficient r: not affected by interchanging the 2 variables, adding a number to all values of a variable, or
multiplying all values of a variable by a positive number. Correlation does not imply causation (an r value near -1 or 1
implies a strong statistical relationship, NOT a causal relationship). Outliers can increase or decrease the strength of
the correlation coefficient. |r| < 0.3 is weak, |r| between 0.3 and 0.7 is moderate, |r| > 0.7 is strong.

Regression line: Y = mX + b, where b is the y-intercept and m is the gradient, m = (SD of y / SD of x) * r.
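
The properties of r and the gradient formula can be checked on a small made-up data set. In this sketch r is computed as the average product of standard units (dividing by n - 1), and the intercept uses b = mean(y) - m * mean(x), a standard least-squares fact that the notes do not state explicitly.

```python
import math
import statistics

def correlation(xs, ys):
    """r = sum of products of standard units, divided by n - 1."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    total = sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

def regression_line(xs, ys):
    """Gradient m = r * (SD of y / SD of x); intercept b = mean(y) - m * mean(x)."""
    m = correlation(xs, ys) * statistics.stdev(ys) / statistics.stdev(xs)
    b = statistics.mean(ys) - m * statistics.mean(xs)
    return m, b

def strength(r):
    size = abs(r)
    if size < 0.3:
        return "weak"
    if size <= 0.7:
        return "moderate"
    return "strong"

xs = [1, 2, 3, 4, 5]                  # made-up data
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
m, b = regression_line(xs, ys)

r_swapped = correlation(ys, xs)                      # interchange the variables
r_shifted = correlation([x + 100 for x in xs], ys)   # add a number to all x
r_scaled = correlation([3 * x for x in xs], ys)      # multiply all x by a positive number
```

Multiplying one variable by a negative number is the case the rule excludes: it flips the sign of r while keeping its size.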

Mutually exclusive events E and F: P(E ∪ F) = P(E) + P(F).

Conditional probability: P(E|F) = P(E ∩ F)/P(F), defined only when P(F) > 0. P(E|F) = rate(E|F).

Independence: if A and B are independent, then P(A ∩ B) = P(A) x P(B)
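
These probability rules can be verified by enumerating a small sample space. The example below (two fair six-sided dice and the events chosen) is an illustrative assumption, not taken from the notes.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) = number of favourable outcomes / number of outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def cond_prob(event, given):
    """P(E|F) = P(E and F) / P(F); this sketch assumes P(F) > 0."""
    return prob(lambda o: event(o) and given(o)) / prob(given)

first_even = lambda o: o[0] % 2 == 0
sum_is_7 = lambda o: o[0] + o[1] == 7
sum_is_2 = lambda o: o[0] + o[1] == 2

# Mutually exclusive events: a sum cannot be both 2 and 7.
p_union = prob(lambda o: sum_is_2(o) or sum_is_7(o))

# Independence: P(A and B) = P(A) x P(B).
p_joint = prob(lambda o: first_even(o) and sum_is_7(o))
independent = p_joint == prob(first_even) * prob(sum_is_7)

p_even_given_7 = cond_prob(first_even, sum_is_7)
```

Because the two events are independent, conditioning on the sum being 7 leaves the probability of an even first die unchanged at 1/2.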

NORMAL DIST: N(x, y) means a normal distribution with mean x and variance y (a smaller y means a thinner and taller
bell-shaped curve)

CONFIDENCE INTERVAL: Sample statistic = population parameter + random error

The larger the sample size, the smaller the margin of error and the narrower the confidence interval. A 95% confidence
interval means that if we collect many random samples and construct a confidence interval from each of them, about 95%
of those intervals would contain the population parameter. E.g. with 100 simple random samples from the population,
approx. 95 of the confidence intervals would contain the population parameter.
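
The repeated-sampling interpretation can be simulated. The notes do not give a formula for the interval, so this sketch assumes the common normal-approximation interval for a proportion, p_hat ± 1.96 x sqrt(p_hat(1 - p_hat)/n); the true proportion of 0.4 and the sample sizes are made up.

```python
import math
import random

def proportion_ci(successes, n, z=1.96):
    """Approximate 95% CI for a proportion (normal approximation, an assumption)."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)   # margin of error
    return p_hat - margin, p_hat + margin

def coverage(true_p, n, n_samples, rng):
    """Fraction of simulated samples whose CI contains the true proportion."""
    hits = 0
    for _ in range(n_samples):
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = proportion_ci(successes, n)
        hits += low <= true_p <= high
    return hits / n_samples

rng = random.Random(42)                                # fixed seed for reproducibility
covered = coverage(true_p=0.4, n=500, n_samples=1000, rng=rng)

# Same sample proportion, larger sample: smaller margin of error, narrower interval.
low_small, high_small = proportion_ci(200, 500)
low_large, high_large = proportion_ci(800, 2000)
```

The simulated coverage should come out close to 0.95, though not exactly, since both the simulation and the approximation carry some error.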

HYPOTHESIS TEST: The null hypothesis takes a stance of no effect; the alternative hypothesis is what we test against the
null hypothesis. The two must be mutually exclusive. If p-value < significance level, reject the null hypothesis. If
p-value >= significance level, do not reject the null hypothesis: the test result is inconclusive. (The conclusion is
always about the null hypothesis; we never reject the alternative hypothesis.)
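
The decision rule can be sketched with an exact one-sided binomial test. The scenario is an assumption for illustration: testing whether a majority of students qualify for loans, with null hypothesis p = 0.5, alternative p > 0.5, a 5% significance level, and made-up sample results of 60 and 55 qualifying students out of 100.

```python
from math import comb

def binomial_p_value(successes, n, p_null):
    """One-sided p-value: P(X >= successes) when X ~ Binomial(n, p_null)."""
    return sum(comb(n, k) * p_null**k * (1 - p_null)**(n - k)
               for k in range(successes, n + 1))

def decide(p_value, significance_level=0.05):
    """Apply the rule: reject the null only if p-value < significance level."""
    if p_value < significance_level:
        return "reject null hypothesis"
    return "do not reject null hypothesis (inconclusive)"

p_60 = binomial_p_value(60, 100, 0.5)   # hypothetical sample: 60 of 100 qualify
p_55 = binomial_p_value(55, 100, 0.5)   # hypothetical sample: 55 of 100 qualify

decision_60 = decide(p_60)
decision_55 = decide(p_55)
```

With 60 out of 100 the p-value falls below 0.05, so the null is rejected; with 55 out of 100 it does not, and the result is inconclusive rather than evidence that the null is true.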
