Lecture 2
Histograms are a good way of visualizing data
▪ With enough data points, histograms may indicate the potential distribution, multimodality, and skewness.
▪ Number of bins is important (see the sketch below).
➢ Too many bins might be very noisy
➢ Too few bins can mask out important features
[Figure: example histograms of simulated data (Frequency vs. value), drawn with different numbers of bins]
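The effect of bin choice can be illustrated with a small Python sketch (not from the lecture): it draws one simulated bimodal sample with too few, a moderate number of, and too many bins. The sample and the bin counts are assumptions chosen only to make the point.

```python
# Illustrative only: same data, three different bin counts.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# A made-up bimodal sample: two normal components, so two peaks should be visible
data = np.concatenate([rng.normal(0, 1, 5000), rng.normal(6, 1, 5000)])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 40, 400]):
    ax.hist(data, bins=bins)   # too few bins hide the two modes; too many look noisy
    ax.set_title(f"{bins} bins")
    ax.set_xlabel("value")
    ax.set_ylabel("Frequency")
plt.tight_layout()
plt.show()
```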
Frequency Distribution
▪ Normal distribution
▪ Skewed distribution
▪ Modality
Universality of the normal distribution! Basis of most statistical testing methods.
Standard Deviation (σ)
[Figure: normal distribution curve with the central 95% and 99% regions marked]
Z-scores
▪ Used to convert any normal distribution, N(µ, σ), to the standard normal distribution, N(0, 1), where
➢ Mean (µ) = 0
➢ Standard deviation (σ) = 1
z = (X − µ) / σ
▪ Allows the use of probability tables (p-values)
▪ Important z-score: ±1.96 (cuts off the outer 2.5% in each tail, leaving the central 95%, i.e. a 95% CI)
▪ https://www.mathsisfun.com/data/standard-normal-distribution-table.html
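A minimal Python sketch of the conversion and the table lookup, assuming scipy is available; the values µ = 100, σ = 15, and X = 120 are made-up examples, not from the lecture.

```python
from scipy.stats import norm   # standard normal N(0, 1)

mu, sigma = 100, 15            # assumed population mean and standard deviation (illustrative)
x = 120                        # an observed value

z = (x - mu) / sigma           # convert a value from N(mu, sigma) to the N(0, 1) scale
print(round(z, 2))             # 1.33

# The CDF plays the role of the probability table
print(norm.cdf(z))                         # P(Z <= z)
print(norm.cdf(1.96) - norm.cdf(-1.96))    # ~0.95: +/-1.96 encloses the central 95%
```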
Z-scores
▪ Example:
Every year, 50,000 runners compete in the Victoria Park Fun Run. They run 10 kilometres. The average finishing
time is 55 minutes, with a standard deviation of 10 minutes. Fred and Wilma completed the race in 61 and 51
minutes, respectively. Barney and Betty had finishing times with z-scores of -0.3 and 0.7, respectively.
List the runners in order, starting with the fastest runner and ending with the slowest runner.
Z-scores
▪ Example:
This problem can be solved by converting Fred and Wilma's raw scores into z-scores. To do this, we use the z-
score equation:
z = (X − X̄) / s
where z is the z-score, X is the runner's raw score, X̄ is the mean finishing time, and s is the standard deviation of finishing times.
Fred's z-score = (61 − 55) / 10 = 0.60
Wilma's z-score = (51 − 55) / 10 = −0.40
Based on z-scores, we can order the runners from fastest to slowest as follows: Wilma (z = -0.4), Barney (z = -0.3),
Fred (z = 0.6), and Betty (z = 0.7).
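The ordering can be checked with a few lines of Python; this is only a sketch using the numbers given above.

```python
mean, s = 55, 10                 # average finishing time (minutes) and standard deviation

# Fred and Wilma: raw times -> z-scores
z_fred = (61 - mean) / s         # 0.6
z_wilma = (51 - mean) / s        # -0.4

# Barney and Betty: z-scores -> raw times, X = mean + z * s
t_barney = mean + (-0.3) * s     # 52 minutes
t_betty = mean + 0.7 * s         # 62 minutes

# A smaller z-score (or time) means a faster runner, so sort ascending
runners = {"Wilma": z_wilma, "Barney": -0.3, "Fred": z_fred, "Betty": 0.7}
print(sorted(runners, key=runners.get))   # ['Wilma', 'Barney', 'Fred', 'Betty']
```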
Hypothesis Testing
▪ Null hypothesis (H0)
▪ Experimental hypothesis or alternative hypothesis (H1)
▪ The null hypothesis is the opposite of the experimental hypothesis
▪ Collect data and seek evidence against H0 as a way of bolstering H1
(deduction)
P-values
▪ The probability of obtaining a test statistic equal to or more extreme than the observed result, when H0 is true
▪ The p-value is used in the context of null hypothesis testing in order to quantify the idea of
statistical significance of evidence
▪ P-value will answer the question: What is the probability of the observed test statistic when H0
is true?
▪ Smaller P-values provide stronger evidence against H0
P-values
▪ Example: In the 1970s, 20–29 year old men in the U.K. had a mean body weight of 78kg.
Standard deviation was 18 kg. Test whether mean body weight in the population now differs.
▪ Null hypothesis H0: μ = 78 (“no difference”)
▪ The alternative hypothesis can be either
➢ H1: μ > 78 (one-sided test)
➢ H1 : μ ≠ 78 (two-sided test)
P-values: One sided test
▪ The critical value is either positive or negative, but not both.
P-values: Two sided test
▪ The critical value is the number that separates the rejection region (the “blue zone” in the figure) from the middle (±1.96 in this example)
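Where these critical values come from can be sketched with the standard normal quantile function (assuming scipy); α = 0.05 is just the conventional choice.

```python
from scipy.stats import norm

alpha = 0.05
print(norm.ppf(1 - alpha))       # ~1.64: one-sided critical value (upper tail only)
print(norm.ppf(1 - alpha / 2))   # ~1.96: two-sided critical value (+/- 1.96)
```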
P-values
▪ Example: In the 1970s, 20–29 year old men in the U.K. had a mean body weight of 78kg.
Standard deviation for the population was 18 kg.
▪ A sample was taken from 64 people, finding a mean weight of 80kg
z-score = (80-78) / (18/√64) = 0.89
Probability tables can then be used to ascertain the P-value: 0.19
Is this strong evidence for or against the null hypothesis, H0?
P-values
▪ Example: In the 1970s, 20–29 year old men in the U.K. had a mean body weight of 78kg.
Standard deviation for the population was 18 kg.
▪ Another sample was taken from 64 people, finding a mean weight of 83kg
z-score = (83-78) / (18/√64) = 2.22
Probability tables can then be used to ascertain the P-value: 0.01
Is this strong evidence for or against the null hypothesis, H0?
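Both samples can be reproduced with a short one-sided z-test sketch in Python (assuming scipy; the numbers are the ones given above).

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n = 78, 18, 64           # null mean, population SD, sample size

for xbar in (80, 83):                # the two sample means above
    z = (xbar - mu0) / (sigma / sqrt(n))
    p = norm.sf(z)                   # one-sided p-value, P(Z > z) under H0
    print(f"mean {xbar} kg: z = {z:.2f}, p = {p:.2f}")
# mean 80 kg: z = 0.89, p = 0.19  -> weak evidence against H0
# mean 83 kg: z = 2.22, p = 0.01  -> strong evidence against H0
```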
P-values
α-level:
▪ Set BEFORE we collect data and run statistics
▪ Defines how much of an error we are willing to make to say we made a difference
▪ If we’re wrong, it’s an α error or Type 1 error
P-value:
▪ Calculated AFTER we gather the data
▪ The calculated probability of a mistake by saying it works
▪ AKA: level of significance
▪ Describes the percent of the population/area under the curve (in the tail) that is beyond our statistic
P-value or α-level??
▪ Let α ≡ probability of erroneously rejecting H0
▪ Set α threshold (e.g., let α = .10, .05, or whatever)
▪ Reject H0 when P ≤ α
➢ Power: 1 − β = Pr(reject H0 | H0 false)
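A closing sketch of the decision rule and of power, assuming scipy and, purely for illustration, a true mean of 83 kg (this value is not from the slides).

```python
from math import sqrt
from scipy.stats import norm

alpha, mu0, sigma, n = 0.05, 78, 18, 64
se = sigma / sqrt(n)

# Decision rule: reject H0 when p <= alpha
p = norm.sf((80 - mu0) / se)              # first sample, mean 80 kg, one-sided test
print("reject H0" if p <= alpha else "fail to reject H0")   # fail to reject (p = 0.19)

# Power, 1 - beta: chance of rejecting H0 if the true mean really were 83 kg
z_crit = norm.ppf(1 - alpha)              # ~1.64
xbar_crit = mu0 + z_crit * se             # sample mean needed to reject H0
power = norm.sf((xbar_crit - 83) / se)    # ~0.72
print(round(power, 2))
```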