FlowChart V20
Data Type | Definition
Nominal | Categorical data that has no particular order.
Ordinal | Categorical data that has a logical order. Example: Likert scale.
Numerical | Data that is grouped by numbers.
Discrete | Data is grouped in fixed steps. Not possible to have an answer between steps. Example: number of children in a family.
Continuous | Data may have any value along a given scale. Example: weight, time, speed.
Interval | Numerical data with arbitrary 0
Ratio | Numerical data with real 0

Measure | Population Parameter | Pronunciation | Sample Statistic | Pronunciation
Mean | μ | mu | x̄ | x bar
Standard Deviation | σ | sigma | s | s
Variance | σ² | sigma squared | s² | s squared
Proportion | π | pi | p | p
Slope | β1 | beta one | b1 | b one
Coefficient of Correlation | ρ | rho | r | r

Inequalities
Symbol | Read as | Common phrases
= | equal to | is
≠ | not equal to | different, not
≤ | less than or equal to | at most, at a maximum, no more than, value or less
≥ | greater than or equal to | at least, at a minimum, no less than, value or more
< | less than | under, fewer than, smaller
> | greater than | over, more than, larger, exceeds

Basic Vocab
Population | Entire group of interest.
Parameter | Some truth about the population.
Sample | Subset of population. A good sample is representative of population.
Statistic | Some truth about the sample.
Variable | What is measured or observed
Data | List of results from sample
Census | A sample where the entire population is measured or observed. Very rare.
Descriptive Statistic | Some result/truth from the sample
Inferential Statistic | Results/truth from the sample that are used to make a conclusion about the entire population.
Purpose of Question | Reveal something previously unknown.
Observational Experiment | Recording something observed. There is no intervention here. Correlations may be found, but they do not establish cause and effect.
Experimental Experiment | Some treatment is given to a group. Results are observed and recorded. This is how cause and effect can be established.

Sampling Methods
Good Sample | Sample well represents the population.
Simple Random Sample | Every member of the population is equally likely to be observed. A subset of them is randomly selected.
Systematic Sample | Every member of the population is lined up and every kth member is chosen.
Cluster Sample | Population is broken up into groups. The groups are randomly selected. Every member in the selected group is measured.
Stratified Sample | Some demographic is selected and the sample matches the ratio of the selected demographic.
Convenience Sample | Taking a sample only where it is easy to gather data.

Good Measurement | Directly answers the question and is free of bias
Survivorship Bias | Only measuring members who make it to the end of a study.
Recall Bias | Relying upon people's memory.
Funding Bias | Who is funding the study? Is it an impartial 3rd party or do they have an interest in making the study have a particular result?
Cause and Effect Bias | Correlation being misidentified as causation.
Selection Bias | People get the opportunity to opt in or out of the study.
Confirmation Bias | Only collecting data to support a conclusion.

Formula Syntax | Description | Example Usage
=SUM(range) | Adds all the numbers in a range. | =SUM(A1:A10)
=PRODUCT(range) | Multiplies all numbers in a range. | =PRODUCT(A1:A10)
=COUNTA(range) | Counts the number of non-empty cells in a range. | =COUNTA(A1:A10)
=COUNTIF(range, criteria) | Counts the number of cells that meet a specified condition. | =COUNTIF(A1:A10, ">10")
=ABS(number) | Returns the absolute value of a number. | =ABS(-5)
=ROUND(number, digits) | Rounds a number to a specified number of digits. | =ROUND(3.14159, 2)
=CEILING.MATH(number, significance) | Rounds a number up to the nearest multiple of a specified value. | =CEILING.MATH(7.3, 1)
=FLOOR.MATH(number, significance) | Rounds a number down to the nearest multiple of a specified value. | =FLOOR.MATH(7.3, 1)
=AVERAGE(range) | Calculates the average (mean) of numbers in a range. | =AVERAGE(A1:A10)
=MEDIAN(range) | Returns the median (middle value) of a range. | =MEDIAN(A1:A10)
=MODE.SNGL(range) | Returns the most frequently occurring value in a range. | =MODE.SNGL(A1:A10)
=STDEV.S(range) | Returns the sample standard deviation. | =STDEV.S(A1:A10)
=VAR.S(range) | Returns the sample variance of a dataset. | =VAR.S(A1:A10)
=MIN(range) | Finds the smallest value in a range. | =MIN(A1:A10)
=MAX(range) | Finds the largest value in a range. | =MAX(A1:A10)
=SKEW(range) | Finds the skewness of the data (- is left skew, + is right skew) | =SKEW(A1:A10)
=KURT(range) | Finds the excess kurtosis of the data (Mesokurtic ≈ 0, Leptokurtic > 0, Platykurtic < 0) | =KURT(A1:A10)
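For readers who want to cross-check spreadsheet results, the formulas in the table above have rough counterparts in Python's standard library. This is only a sketch: the `data` list is a made-up sample, and Python's `round` uses round-half-to-even, so it can differ from =ROUND on exact ties.

```python
import math
import statistics

# Hypothetical sample standing in for the cells A1:A10.
data = [2, 4, 4, 5, 7, 9, 11]

total = sum(data)                               # =SUM(range)
count = len(data)                               # =COUNTA(range)
count_over_4 = sum(1 for x in data if x > 4)    # =COUNTIF(range, ">4")
mean = statistics.mean(data)                    # =AVERAGE(range)
median = statistics.median(data)                # =MEDIAN(range)
mode = statistics.mode(data)                    # =MODE.SNGL(range)
sample_sd = statistics.stdev(data)              # =STDEV.S(range)
sample_var = statistics.variance(data)          # =VAR.S(range)
smallest, largest = min(data), max(data)        # =MIN(range), =MAX(range)
rounded_up = math.ceil(7.3)                     # =CEILING.MATH(7.3, 1)
rounded_down = math.floor(7.3)                  # =FLOOR.MATH(7.3, 1)

print(total, count, mean, median, mode, sample_var)
```

`statistics.stdev` and `statistics.variance` divide by n - 1, which matches the sample versions (STDEV.S, VAR.S) rather than the population versions.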
CLT Rules
Data: Numerical: Means
Sampling distribution of x̄ is approximately normal if either
1. the original dist. of x is normal
OR
2. n > 30

Key Variables | Calculate
μ = pop. mean | Given
μx̄ = mean of sampling dist wrt x̄ | = μ
n = sample size | Given or =COUNTA(range)
σ = pop. std. dev. | Given
σx̄ = SE = std. error OR std. dev of sampling dist wrt x̄ | = σ/sqrt(n)
x̄ = sample mean | Given or =AVERAGE(range)
z = Z score (# of SE from mean) | =STANDARDIZE(x̄, μ, SE)

Numerical: Means
Probability (Area under curve) | Z-Manipulation (Zscore, then solve for x̄)
Less than: Area left: P(x̄ < x̄crit) = ? | Cutoff for Bottom %: P(x̄ < ?) = prob
Greater than: Area right: P(x̄ > x̄crit) = ? | Cutoff for Top %: P(x̄ > ?) = prob
Between: Area between: P(x̄lower < x̄ < x̄upper) = ? | Cutoffs for Middle %: P(? < x̄ < ?) = prob
=norm.dist(x̄upper,μ,SE,TRUE) - norm.dist(x̄lower,μ,SE,TRUE) | x̄lower =norm.inv(0.5 - prob/2,μ,SE)
 | x̄upper =norm.inv(0.5 + prob/2,μ,SE)

Proportions (p, π)
Probability (Area under curve) | Z-Manipulation (Zscore, then solve for p, then solve for k)
Less than: Area left: P(p < pcrit) = ? | Cutoff for Bottom %: P(p < ?) = prob
Greater than: Area right: P(p > pcrit) = ? | Cutoff for Top %: P(p > ?) = prob
Between: Area between: P(plower < p < pupper) = ? | Cutoffs for Middle %: P(? < p < ?) = prob
=norm.dist(pupper,π,SE,TRUE) - norm.dist(plower,π,SE,TRUE) | plower =norm.inv(0.5 - prob/2,π,SE)
 | pupper =norm.inv(0.5 + prob/2,π,SE)
 | k = p * n [typically round up]
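The area and cutoff calculations above can be sketched in Python with `statistics.NormalDist`, where `cdf` plays the role of norm.dist(..., TRUE) and `inv_cdf` the role of norm.inv. The population values (μ = 100, σ = 15, n = 36) and the cutoffs are invented for illustration only.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical population and sample size (assumptions, not from the sheet).
mu, sigma, n = 100.0, 15.0, 36

se = sigma / sqrt(n)                  # SE = σ/sqrt(n)
sampling_dist = NormalDist(mu, se)    # sampling distribution of x̄

# Probability (area under curve)
p_less = sampling_dist.cdf(95.0)                                  # P(x̄ < 95)
p_greater = 1 - sampling_dist.cdf(105.0)                          # P(x̄ > 105)
p_between = sampling_dist.cdf(105.0) - sampling_dist.cdf(95.0)    # P(95 < x̄ < 105)

# Z-manipulation: cutoffs for the middle 90%
prob = 0.90
lower = sampling_dist.inv_cdf(0.5 - prob / 2)
upper = sampling_dist.inv_cdf(0.5 + prob / 2)

# Z score of x̄ = 95, the same quantity =STANDARDIZE(x̄, μ, SE) returns
z = (95.0 - mu) / se

print(round(p_less, 4), round(p_between, 4), round(lower, 2), round(upper, 2), z)
```

The same pattern applies to proportions by swapping in π for the mean and the proportion standard error for SE.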
Means [Numerical Data]

Pop. Standard Deviation σ Known
Key Variables | Calculate
CL = conf. level | Given or 1 - α
α = CI miss rate | Given or 1 - CL
n = sample size | Given or =COUNTA(range)
x̄ = sample mean | Given or =AVERAGE(range)
σ = pop. st. dev | Given
SE = standard error | = σ/sqrt(n)
z2tail = 2 tailed zcrit | =NORM.S.INV(1 - α/2)

2 Tailed | μ ∈ x̄ ± MoE | L: x̄ - MoE | U: x̄ + MoE
Lower Tail | μ ≥ x̄ - z1tail * SE | L: x̄ - z1tail * SE | U: INF
Upper Tail | | | U: x̄ + z1tail * SE

Pop. Standard Deviation σ Unknown (use s)
Key Variables | Calculate
CL = conf. level | Given or 1 - α
α = CI miss rate | Given or 1 - CL
n = sample size | Given or =COUNTA(range)
x̄ = sample mean | Given or =AVERAGE(range)
s = sample st. dev | Given or =STDEV.S(range)

2 Tailed | L: x̄ - MoE | U: x̄ + MoE
Lower Tail | L: x̄ - t1tail * SE
Upper Tail (1 Tailed <) | We are (Confidence Level)% confident that the true (population parameter) for (state population) is less than [Upper] (units).

As n increases, the width of the CI decreases.
As α increases, the width of the CI decreases.
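A minimal sketch of the two-tailed interval for the σ-known case follows, using `statistics.NormalDist().inv_cdf` in place of NORM.S.INV. The function name `z_confidence_interval` and the sample numbers are assumptions made for illustration; the second call shows the interval narrowing as n grows.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, cl=0.95):
    """Two-tailed CI for a mean when σ is known (hypothetical helper)."""
    alpha = 1 - cl                                   # α = 1 - CL
    se = sigma / sqrt(n)                             # SE = σ/sqrt(n)
    z_2tail = NormalDist().inv_cdf(1 - alpha / 2)    # =NORM.S.INV(1 - α/2)
    moe = z_2tail * se                               # margin of error
    return x_bar - moe, x_bar + moe

# Invented inputs: x̄ = 50, σ = 10
lower, upper = z_confidence_interval(50.0, 10.0, n=100, cl=0.95)
lower_big_n, upper_big_n = z_confidence_interval(50.0, 10.0, n=400, cl=0.95)

width = upper - lower
width_big_n = upper_big_n - lower_big_n
print(round(lower, 2), round(upper, 2), round(width_big_n / width, 2))
```

Quadrupling n halves the standard error, so the interval width is cut in half, which is the "as n increases, the width decreases" rule in numeric form.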
1 Sample Hypothesis Testing

Hypothesis Testing Steps
1. Identify Data Type
2. Determine Population and Parameter
3. State Hypothesis
4. State α
5. Determine Testing Method
6. Check Assumptions
7. Design Experiment and Collect Data
8. Calculate Test Statistics and pvalue
9. Reject/Fail to Reject H0
10. Make Conclusion

1. Identify Data Type
Decision: Is there a business decision we are trying to make?
Ask: What question do we ask subjects and how do they respond?
Data Type | Want to know | Symbol
Numerical | True Mean | μ
Categorical | True Proportion | π

2. Determine Population and Parameter
Population: Who we are trying to make a statement about
Parameter: Specific μ or π of interest for the group.
3. State Hypothesis
Item | Means | Matched Pairs | Proportions
Null Hypothesis | H0 : μ (≤,≥,=) μ0 | H0 : μd (≤,≥,=) μd0 | H0 : π (≤,≥,=) π0
Alt. Hypothesis | H1 : μ (<,>,≠) μ0 | H1 : μd (<,>,≠) μd0 | H1 : π (<,>,≠) π0
Variables | μ: true pop mean | μd: true pop mean difference | π: true pop proportion
 | μ0: hypothesized pop mean | μd0: theorized pop mean difference | π0: theorized pop proportion
Inequality (<,>,≠) in alt hypothesis is established in the scenario. The null hypothesis is always opposite of this symbol.
Two tailed test if alternative hypothesis is ≠. One tailed test if alternative hypothesis is either > or <.
Null Hypothesis The baseline assumption about the population, assumed to be true.
Alt. Hypothesis An alternative scenario that we test with sample data in contrast to the null. This is typically what we want to see.
4. Establish α
Directly stated OR α = 1 - CL
Decision | The H0 is Correct | The H0 is Incorrect
Fail to Reject H0 | 😀 | Type 2 Error: β
Reject H0 | Type 1 Error: α | 😀
Error | Definition | Layman's
α | Percent of the time that we incorrectly reject the null hypothesis | The null hypothesis is correct, but the data that we collect suggest that it is incorrect.
β | Percent of the time that we incorrectly fail to reject the null hypothesis | The null hypothesis is incorrect, but the data that we collect do not suggest that the null is incorrect.
5. Determine Testing Method
6. Check Assumptions
Check | Means | Matched Pairs | Proportions
CLT: Orig Dist Normal | Check to see if scenario states assumed normality OR… | Check to see if scenario states assumed normality OR… | Orig. distribution NEVER normal.
CLT: Sample Size | n ≥ 30 | nd ≥ 30 | n * π0 ≥ 15 & n * (1 - π0) ≥ 15
Good Sample? | Is the sample representative of the population?
7. Design Experiment and Gather Data Typically this is done for us in this class. We will do a few projects where we collect data.
8. Calculate Test Statistic, pvalue, and zcrit OR tcrit
Item | Means (know σ) | Means (know s) | Matched Pairs | Proportions
Test Statistic | z = zscore = (x̄ - μ0) / SE | t = tscore = (x̄ - μ0) / SE | t = tscore = (x̄d - 0) / SE | z = zscore = (p - π0) / SE
Variables | x̄ = sample mean | x̄ = sample mean | x̄d = sample diff mean | k = # with trait of interest
 | μ0 = hyp. true mean | μ0 = hyp. true mean | sd = sample diff stdev | n = sample size
 | σ = pop. stdev | n = sample size | nd = sample diff size | p = sample prop. = k/n
 | n = sample size | df = deg. of freedom | df = deg. of freedom | π0 = hyp. true prop.
 | | s = sample stdev | |
SE = std. error | = σ/sqrt(n) | = s/sqrt(n) | = sd/sqrt(nd) | =sqrt(π0*(1-π0)/n)
pvalue 2 tailed (≠) | = (1-norm.s.dist(abs(z),TRUE)) * 2 | = t.dist.2t(abs(t),df) | = t.dist.2t(abs(t),df) | = (1-norm.s.dist(abs(z),TRUE)) * 2
pvalue 1 tailed (<) | = norm.s.dist(z,TRUE) | = t.dist(t,df,TRUE) | = t.dist(t,df,TRUE) | = norm.s.dist(z,TRUE)
pvalue 1 tailed (>) | = 1 - norm.s.dist(z,TRUE) | = t.dist.rt(t,df) | = t.dist.rt(t,df) | = 1 - norm.s.dist(z,TRUE)
zcrit OR tcrit, 2 tail | =NORM.S.INV(1 - α/2) | =T.INV.2T(α, df) | =T.INV.2T(α, df) | =NORM.S.INV(1 - α/2)
zcrit OR tcrit, 1 tail | =NORM.S.INV(1 - α) | =T.INV(1 - α, df) | =T.INV(1 - α, df) | =NORM.S.INV(1 - α)
Vocab
Test Statistic: The number of standard errors the sample statistic is from the hypothesized population parameter.
pvalue: The probability of observing the sample statistic (or something more extreme) if the Null Hypothesis is true.
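As one worked instance of step 8, the proportions column can be sketched in Python with the standard library. The function name `proportion_z_test`, its `tail` argument, and the counts (60 successes out of 100 against π0 = 0.5) are illustrative assumptions. The t-based columns are not shown because the standard library has no t distribution.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(k, n, pi0, tail="two"):
    """One-sample z test for a proportion (hypothetical helper).

    tail: "two" for H1 ≠, "less" for H1 <, "greater" for H1 >.
    """
    p = k / n                           # sample proportion
    se = sqrt(pi0 * (1 - pi0) / n)      # SE under the null
    z = (p - pi0) / se                  # test statistic
    std_normal = NormalDist()           # plays the role of norm.s.dist
    if tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif tail == "less":
        p_value = std_normal.cdf(z)
    elif tail == "greater":
        p_value = 1 - std_normal.cdf(z)
    else:
        raise ValueError("tail must be 'two', 'less', or 'greater'")
    return z, p_value

# Invented data: 60 of 100 subjects have the trait, H0: π ≤ 0.5, H1: π > 0.5
z, p_value = proportion_z_test(k=60, n=100, pi0=0.5, tail="greater")
print(round(z, 3), round(p_value, 4))
```

Here the sample proportion sits about two standard errors above π0, so the one-tailed pvalue is roughly 0.023.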
9. Fail to Reject or Reject the H0
IF pvalue ≥ α: Fail to reject the Null Hypothesis and continue under the baseline assertion.
ALSO: sample statistic outside rejection region; null parameter inside CI; test statistic t/z smaller than critical t/z.
IF pvalue < α: Reject the Null Hypothesis H0 and conclude the Alternative Hypothesis H1.
ALSO: sample statistic inside rejection region; null parameter outside CI; test statistic t/z larger than critical t/z.
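The decision rule can be written as a small Python sketch that also checks the pvalue route and the critical-value route against each other for a two-tailed z test. The function names and the example z scores are assumptions for illustration only.

```python
from statistics import NormalDist

def decide_by_p_value(p_value, alpha):
    """Reject H0 only when pvalue < α."""
    return "Reject H0" if p_value < alpha else "Fail to reject H0"

def decide_by_critical_value(z, alpha):
    """Two-tailed rule: reject H0 when |z| exceeds the critical z."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return "Reject H0" if abs(z) > z_crit else "Fail to reject H0"

def two_tailed_p_value(z):
    """Two-tailed pvalue for a z test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
for z in (2.1, 1.5):  # invented test statistics
    by_p = decide_by_p_value(two_tailed_p_value(z), alpha)
    by_crit = decide_by_critical_value(z, alpha)
    print(z, by_p, by_crit)
```

Both routes give the same decision for a given z and α, which is why the sheet lists them as equivalent checks.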
10. Conclusion (and CI if necessary)

Fail to reject
Conclusion: We collected insufficient evidence (test statistic, pvalue, α) to reject the claim that (state H0 in words). We will continue under the assumption that the H0 is correct.
Confidence Interval Statement: Not Required. Since we failed to reject the null hypothesis, the CI would contain the null population parameter.

Reject
Conclusion: We collected sufficient evidence (test statistic, pvalue, α) to reject the claim that (state H0 in words) and instead we conclude (state H1 in words).
Confidence Interval Statement:
2 Tailed (≠) (2 Tailed CI): We are (Confidence Level)% confident that the true (population parameter) for (state population) is somewhere between [Lower, Upper] (units).
1 Tailed (<) (Upper Tail CI): We are (Confidence Level)% confident that the true (population parameter) for (state population) is less than [Upper] (units).
1 Tailed (>) (Lower Tail CI): We are (Confidence Level)% confident that the true (population parameter) for (state population) is greater than [Lower] (units).
2 Sample Hypothesis Testing

Hypothesis Testing Steps
1. Identify Data Type
2. Determine Population and Parameter
3. State Hypothesis
4. State α
5. Determine Testing Method
6. Check Assumptions
7. Design Experiment and Collect Data
8. Calculate Test Statistics and pvalue
9. Reject/Fail to Reject H0
10. Make Conclusion

1. Identify Data Type
Ask: What question do we ask subjects and how do they respond?
Data Type | Want to know | Symbol
Numerical | Difference of True Means | μ1 - μ2
Categorical | Difference of True Proportion | π1 - π2

2. Determine Population and Parameter
Populations: The 2 groups of interest (g1 and g2)
Parameter: True mean (μ) or true proportion (π) of interest to compare between g1 and g2.
3. State Hypothesis
Item | Means | Proportions
Null Hypothesis | H0 : μ1 - μ2 = 0 OR μ1 = μ2 | H0 : π1 - π2 = 0 OR π1 = π2
Alt. Hypothesis | H1 : μ1 - μ2 (<,>,≠) 0 OR μ1 (<,>,≠) μ2 | H1 : π1 - π2 (<,>,≠) 0 OR π1 (<,>,≠) π2
Variables | μ1: true pop mean of group 1 | π1: true pop proportion of group 1
 | μ2: true pop mean of group 2 | π2: true pop proportion of group 2
Inequality (<,>,≠) in alt hypothesis is established in the scenario.
Two tailed test if alternative hypothesis is ≠. One tailed test if alternative hypothesis is either > or <.
Null Hypothesis The baseline assumption about the population, assumed to be true. Typically assumed the groups have equal true parameters.
Alt. Hypothesis An alternative scenario that we test with sample data in contrast to the null. This is typically what we want to see.
Decision outcomes: Fail to Reject H0 | Reject H0 (😀 = correct decision; Type 2 Error: β)
Fail to Reject: Use equal variances t test
Fail to reject
Conclusion: We collected insufficient evidence (test statistic, pvalue, α) to reject the claim that (state H0 in words). We will continue under the assumption that the H0 is correct.
Confidence Interval Statement: Not Required. Since we failed to reject the null hypothesis, the CI would contain the null population parameter.

Reject
Conclusion: We collected sufficient evidence (test statistic, pvalue, α) to reject the claim that (state H0 in words) and instead we conclude (state H1 in words).
Confidence Interval Statement:
2 Tailed: We are (Confidence Level)% confident that the true difference of (population parameter) between the (state populations) is somewhere between [Lower, Upper] (units) with (state larger group) as larger.
1 Tailed <: We are (Confidence Level)% confident that the true (population parameter) for (state group 1) is at least [Upper] (units) less than that of (state group 2).
Decision outcomes: Fail to Reject H0 | Reject H0 (😀 = correct decision; Type 2 Error: β)

Check | Good | Bad
Independence
Centered | Residuals centered on 0 | Not centered on 0
Normality (QQ Plot Check) | Points lie on QQ plot line without major deviation. | Points deviate substantially from QQ plot line.

7. Design Experiment and Gather Data
8. Calculate Test Statistic and p value
How to Establish Cause and Effect
Treatment comes before the effect.
Found significant results (Reject H0, pvalue < α)
Utilized a true experiment. Eliminates other explanations

1 and 2+ Predictors
Test Statistic, Overall Model | F statistic and pvalue (if significant, check each predictor)
Test Statistic per Predictor | t statistic and pvalue
Point Estimates | b0, b1, … are the predictor coefficients.
 | b0 = model intercept
 | b1 = slope between x1 and y
 | (b2 …) = slope between x2 and y

Model Building
Equation to make predictions: y = b0 + b1*x1 + (b2*x2 + …)
Plug in values of x1, x2 … (for all predictors) to make predictions of y

Vocab
Test Stat: F statistic. The larger F gets, the rarer the observation is if H0 is true.
pvalue: The prob. of observing the sample statistic difference (or more extreme) if the H0 is true.
R2: The percent of variability in the data that the model explains. Better models explain more of the variability.
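For the one-predictor case, the point estimates, a prediction, and R2 can be sketched with ordinary least squares in plain Python. The helper name `fit_line` and the five data points are invented for illustration; the F and t statistics are left out because they need distributions the standard library does not provide.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                  # slope between x and y
    b0 = y_bar - b1 * x_bar         # model intercept
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_resid / ss_total   # share of variability explained
    return b0, b1, r_squared

# Hypothetical data set
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1, r_squared = fit_line(xs, ys)
prediction = b0 + b1 * 6            # plug in x = 6 to predict y
print(round(b0, 2), round(b1, 2), round(r_squared, 2), round(prediction, 2))
```

With these numbers the fitted line is roughly y = 2.2 + 0.6x, and the model explains about 60% of the variability in y.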
9. Fail to Reject or Reject the H0
IF pvalue ≥ α Fail to reject the Null Hypothesis (H0) and continue under the Null Hypothesis (H0).
IF pvalue < α Reject the Null Hypothesis (H0) and conclude the Alternative Hypothesis (H1)
10. Conclusion (and CI if necessary)

Fail to reject
Conclusion: We collected insufficient evidence (F =, pvalue =, α =) to reject the claim that (state H0 in words). We will continue under the assumption that the H0 is correct.
Confidence Interval Statement: Not Required. Since we failed to reject the null hypothesis, the CI would contain the null population parameter.

1 Predictor
Confidence Interval Statement: We are (Confidence Level)% confident that for 1 (x unit) increase in (x) that the (y) changes by somewhere between [Lower, Upper] (y units).