Stats 2 Notes
But remember that sample means are normally distributed! See Stats 1 Lecture 8 Notes, Central
Limit Theorem: if your population distribution is not normal, all you need is a sample size large
enough that the distribution of the sample mean is approximately normal. The same holds for a sum
of a series of observations.
Lecture 2: Confidence intervals:
Let’s begin with a research example:
Nitrate in the upper groundwater in sandy soil: Too much nitrate in groundwater can inhibit aquatic
life, like algae, from growing. The level of nitrate has been monitored in the NL in various types of land.
Population: Agricultural land on sandy soil in the Netherlands. (E.g., all agricultural land on
sandy soil divided into hectares.)
Sample: Random sample of hectares. (More advanced: stratified sample, several regions and n
samples per region; not exam material.)
Unit: 1 hectare
Response variable: Nitrate concentration in mg/litre
Statistics 2 – Page 1
Expected value E(ȳ) = µy of a sample mean
Random sample: y1, y2, …, yn. These are n independent observations from the same distribution with
expected value (population mean) µy and standard deviation σy.
Aim: estimate µy
Estimator: sample mean ȳ = (y1 + y2 + … + yn)/n
ȳ is an unbiased estimator for µy, because E(ȳ) = µy (the larger the sample, the closer we tend
to get to the unknown true value µy).
E(ȳ) = µȳ and variance σ²ȳ of a sample mean:
µȳ = (n·µy)/n = µy    σ²ȳ = (n·σ²y)/n² = σ²y/n    so σȳ = σy/√n
Implementation: y1, y2, …, yn → the outcome of ȳ is an estimate of µy
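The identities above can be checked with a quick simulation. This sketch (not from the notes) draws many samples of size n from a deliberately non-normal (exponential) distribution and verifies that E(ȳ) ≈ µy and σȳ ≈ σy/√n; the choices n = 50 and 20 000 repetitions are arbitrary:

```python
import math
import random
import statistics

random.seed(1)
n = 50                      # sample size (arbitrary choice)
reps = 20_000               # number of simulated samples
# Exponential(1): mu_y = 1, sigma_y = 1 -- deliberately skewed, not normal
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

emp_mean = statistics.fmean(means)  # should be close to mu_y = 1
emp_sd = statistics.stdev(means)    # should be close to sigma_y/sqrt(n) = 1/sqrt(50)
print(emp_mean, emp_sd)
```

The sample-mean distribution is also visibly bell-shaped here, which is the Central Limit Theorem recalled at the top of these notes.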
Exercise: Nitrate concentration
Suppose y is the nitrate concentration in the upper groundwater in sandy soil (mg/litre).
Suppose y is normally distributed with unknown μ and (known) σ = 30.3.
From a random sample with 41 observations, we get a sample mean of 55.7.
Construct a 90% confidence interval for µ.
n = 41, ȳ = 55.7
y = nitrate concentration in the upper groundwater in sandy soil (mg/l), y ~ N(µ, 30.3)
Limits of the 90% confidence interval for µ: ȳ ± z0.05 · σ/√n = 55.7 ± 1.645 · 30.3/√41 = 55.7 ± 7.78
90% confidence interval for µ : (47.92, 63.48)
We have constructed a 90% CI for µ, assuming that σ is exactly known. Is that realistic? In
reality, σ is hardly ever known; it is only really known when it is given in an exercise or when
historic research has provided it.
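The interval just derived can be reproduced in a few lines (a sketch; the z-value 1.645 is the one quoted above):

```python
import math

# Nitrate CI from the notes: n = 41, ybar = 55.7, sigma = 30.3 (known)
# z_{0.05} = 1.645 for a 90% interval (value taken from the notes)
n, ybar, sigma, z = 41, 55.7, 30.3, 1.645

margin = z * sigma / math.sqrt(n)
lower, upper = ybar - margin, ybar + margin
print(round(margin, 2), round(lower, 2), round(upper, 2))  # 7.78 47.92 63.48
```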
Unknown σ:
We estimate σ by using the sample standard deviation s, the square root of the sample variance s2
Example: From textbook page 286:
A company that manufactures coffee for use in commercial machines monitors the caffeine content in its
coffee. The company selects 50 samples of coffee every hour from its production line and determines the
caffeine content. From historical data, the caffeine content (mg) is known to have a normal distribution
with 𝝈=7.1 mg. During a 1-hour time period, the 50 samples yielded a mean caffeine content of 𝑦 =110
mg.
a. Identify the population about which inferences can be made from the sample data.
Population: All coffee produced during the hour in which the 50 samples were selected.
b. Calculate a 95% confidence interval for the mean caffeine content 𝜇 of the coffee produced during the
hour in which the 50 samples were selected.
Let y = caffeine content in mg. y ~ N(µ, 7.1), n = 50, ȳ = 110
ȳ ± z0.025 · σ/√n = 110 ± 1.960 · 7.1/√50 = 110 ± 1.968
Limits 95% Confidence interval for µ: (108.032, 111.968)
c. What would happen to the width of the confidence intervals if the level of confidence of each interval
is increased from 95% to 99%? It would become wider
(95%: zα/2 = z0.025 = 1.960; 99%: zα/2 = z0.005 = 2.576)
d. What would happen to the width of the confidence intervals if the number of samples per hour was
increased from 50 to 100? It would become narrower (divide by √100 instead of √50).
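Parts c and d can be made concrete by computing the CI half-width under each scenario (a sketch using σ = 7.1 from the example and the z-values quoted above):

```python
import math

sigma = 7.1  # known population sd from the coffee example

def half_width(z, n):
    # Half-width of the CI: z * sigma / sqrt(n)
    return z * sigma / math.sqrt(n)

w95_50 = half_width(1.960, 50)    # the interval computed in part b
w99_50 = half_width(2.576, 50)    # higher confidence -> wider
w95_100 = half_width(1.960, 100)  # larger n -> narrower
print(round(w95_50, 3), round(w99_50, 3), round(w95_100, 3))
```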
Because the company is sampling the coffee production process every hour, there are 720 confidence
intervals for the mean caffeine content 𝜇 constructed every month.
a. If the level of confidence remains at 95% for the 720 confidence intervals in a given month, how
many of the confidence intervals would you expect to fail to contain the value of 𝜇?
5% of the 720 confidence intervals: 36
b. If the number of samples is increased from 50 to 100 each hour, how many of the 95% confidence
intervals would you expect to fail to contain the value of 𝜇 in a given month?
5% of the 720 confidence intervals: 36
c. If the number of samples remains at 50 each hour but the level of confidence is increased from 95% to
99% for each of the intervals, how many of the 99% confidence intervals would you expect to fail to
contain the value of 𝜇 in a given month?
1% of the 720 confidence intervals : 7.2 ≈ 7
Lecture 3: Single sample t-test:
Test procedure in 8 steps:
Define the parameter(s) and specify sequentially:
1. Null hypothesis H0 and alternative hypothesis Ha
2. The test statistic as a formula (T.S.); fill in the allowed parts!
3. The probability distribution of the T.S. under H0
4. The behaviour of the T.S. under Ha (i.e. under Ha the T.S. tends to
"higher"/"lower"/"higher or lower" values than under H0)
5. The type of p-value (right-, left-, two-sided)
-----------------------------------------------------------------------------------------------------------------
6. The outcome of the T.S.
7. The appropriate p-value
8. The conclusion (also in non-statistical terms)
Note: determine steps 1–5 prior to the experiment! Only in steps 6–8 is the data used.
Significance test: Procedure to compare data with a hypothesis. In a test procedure, we use a probability
to show how well the hypothesis and the data match.
z-test for a population mean µ:
Steps 1–3: state the hypotheses, the test statistic, and its distribution under H0.
Steps 4 & 5: state the behaviour of the T.S. under Ha and the type of p-value.
Step 6: calculate the outcome of the test statistic based on the executed experiment.
Step 7: P-value: the probability of the T.S. obtaining the observed value or anything more
extreme (supporting Ha), assuming that H0 is true.
Step 8: draw the conclusion (also in non-statistical terms).
Confidence interval for two independent samples (σ²1 = σ²2) (equal variances):
Model assumptions: a random sample of size n1 from the N(µ1, σ1) population and an independent
random sample of size n2 from the N(µ2, σ2) population, with σ1 = σ2 = σ.
*To verify normality in the samples, we use a Q-Q plot.
100(1−α)% confidence interval for µ1 − µ2:
(ȳ1 − ȳ2) ± tα/2 · sp · √(1/n1 + 1/n2)
with the pooled variance s²p = ((n1 − 1)s²1 + (n2 − 1)s²2)/(n1 + n2 − 2)
(ȳ1 − ȳ2 is the estimator of the parameter µ1 − µ2; sp · √(1/n1 + 1/n2) is its standard error.)
s²p is the pooled variance estimator of σ² (i.e. the weighted average of the sample variances s²1 and s²2).
tα/2 from the t-distribution with n1 + n2 − 2 degrees of freedom (df).
t-test for two independent samples (σ²1 = σ²2) (equal variances): steps 1–3
Parameters: define µ1 and µ2. Don't forget to use information from the aim/question!
Step 1: H0: µ1 − µ2 = D0 (notation from the book; often D0 = 0)
Step 2: t = ((ȳ1 − ȳ2) − D0)/(sp · √(1/n1 + 1/n2)) with s²p = ((n1 − 1)s²1 + (n2 − 1)s²2)/(n1 + n2 − 2)
Step 3: under H0 the T.S. (t) follows a t-distribution with n1 + n2 − 2 degrees of freedom (df)
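A minimal sketch of the pooled procedure on made-up data (the samples below are invented, and the table value 2.101 for t0.025 at df = 18 is an assumption, not taken from the notes):

```python
import math
import statistics

# Two hypothetical independent samples, n1 = n2 = 10
y1 = [12.1, 13.4, 11.8, 12.9, 13.2, 12.5, 13.0, 12.2, 12.7, 13.1]
y2 = [11.0, 11.6, 10.8, 11.9, 11.2, 11.5, 10.9, 11.4, 11.7, 11.1]
n1, n2 = len(y1), len(y2)
s1, s2 = statistics.stdev(y1), statistics.stdev(y2)

# Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = math.sqrt(sp2)

se = sp * math.sqrt(1 / n1 + 1 / n2)    # standard error of ybar1 - ybar2
diff = statistics.fmean(y1) - statistics.fmean(y2)
t = (diff - 0) / se                     # D0 = 0
t_crit = 2.101                          # assumed t_{0.025} at df = 18
ci = (diff - t_crit * se, diff + t_crit * se)
print(round(t, 3), tuple(round(v, 3) for v in ci))
```

In practice Rcmdr computes t, df, and the p-value for you; this only mirrors steps 2 and 6.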
Confidence interval for two independent samples (σ²1 ≠ σ²2) (unequal variances) (Welch's test):
Model assumptions: a random sample of size n1 from the N(µ1, σ1) population and an independent
random sample of size n2 from the N(µ2, σ2) population, with σ1 ≠ σ2.
*To verify normality in the samples, we use a Q-Q plot.
100(1−α)% confidence interval for µ1 − µ2:
(ȳ1 − ȳ2) ± tα/2 · √(s²1/n1 + s²2/n2)
(ȳ1 − ȳ2 is the estimator of the parameter µ1 − µ2; √(s²1/n1 + s²2/n2) is its standard error.)
tα/2 from the t-distribution with degrees of freedom (df) = value seen in the Rcmdr output.
t-test for two independent samples (σ²1 ≠ σ²2) (unequal variances): steps 1–3 (Welch's test)
Parameters: define µ1 and µ2. Don't forget to use information from the aim/question!
Step 1: H0: µ1 − µ2 = D0 (notation from the book; often D0 = 0)
Step 2: t′ = ((ȳ1 − ȳ2) − D0)/√(s²1/n1 + s²2/n2) (the ′ on the t indicates that this is the test
statistic for unequal variances!)
Step 3: under H0 the T.S. (t′) follows approximately a t-distribution with degrees of freedom
(df) = value seen in the Rcmdr output (usually not a round number)
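The df that Rcmdr reports comes from the Welch–Satterthwaite approximation. A sketch on invented data (the formula for df is standard but is not spelled out in the notes, so treat it as background):

```python
import math
import statistics

# Hypothetical samples with visibly different spreads
y1 = [5.1, 6.2, 5.8, 6.5, 5.9, 6.1, 5.6, 6.3]
y2 = [4.0, 7.5, 3.2, 8.1, 5.0, 6.8, 2.9, 7.9]
n1, n2 = len(y1), len(y2)
v1 = statistics.variance(y1) / n1   # s1^2 / n1
v2 = statistics.variance(y2) / n2   # s2^2 / n2

# Welch's statistic t' with D0 = 0
t_prime = (statistics.fmean(y1) - statistics.fmean(y2)) / math.sqrt(v1 + v2)

# Welch-Satterthwaite df: usually not a round number
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
print(round(t_prime, 3), round(df, 2))
```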
Reading Rcmdr output:
Here we have a two-sided P-value; we know this because our Ha looks for ≠. This P-value accounts
for both tails. If you are given a two-sided P-value in the Rcmdr output but only need one side,
divide it in half (provided the observed effect is in the direction of Ha).
How do we know if the variances are equal? Levene's test! → Rcmdr
Levene's test is used to test for equality of variances. This test goes through the 8 steps of a
test procedure, starting with: H0: σ²1 = σ²2, Ha: σ²1 ≠ σ²2.
Clearly, we see an increased mean cortisol level in the test hamsters. However, is this significant?
Steps and answers:
1. H0: µA − µB = 0, Ha: µA − µB > 0
2. t = ((ȳA − ȳB) − D0)/(sp · √(1/nA + 1/nB)) = (ȳA − ȳB)/(sp · √(1/10 + 1/10))
For paired observations: suppose that the differences d = x − y are N(µd, σd) distributed. Then a
one-sample t-test can be performed for the mean difference µd = µx − µy.
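The reduction to a one-sample test can be sketched directly: compute the differences, then apply the one-sample t formula to them. The before/after scores below are invented, not the lecture's data:

```python
import math
import statistics

before = [6.0, 5.5, 7.0, 6.5, 5.0, 6.0, 7.5, 6.0]
after  = [7.0, 6.5, 7.5, 7.0, 6.0, 6.5, 8.0, 7.0]
d = [a - b for a, b in zip(after, before)]   # differences after - before

n = len(d)
dbar = statistics.fmean(d)                   # estimate of mu_d
sd = statistics.stdev(d)                     # sample sd of the differences
t = (dbar - 0) / (sd / math.sqrt(n))         # H0: mu_d = 0, df = n - 1
print(round(dbar, 3), round(t, 3))
```

Note that only the differences enter the test; the pairing is what makes this a one-sample problem.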
Lecture Example: t-test for paired population sample means
Medical training now includes communication training. Can we prove (with data) that communication
training is effective? | Units: Medical Students | Variable: Score of general impression of Dr’s before com.
training from patients and the score after. Calculate “by hand” the 95% confidence interval for the mean
difference between the scores for the general impression of the presentation and the student’s eye contact
with the people in the room, before and after the training. Use the Rcmdr output below:
Here we have pairs of variables (xi, yi); the formula for r uses the mean of x (x̄), the mean of
y (ȳ), and the empirical standard deviations of the x values and the y values respectively.
What kind of values can we get from this formula?
The x's represent a few data points. Take, for instance, a data point in the lower left quadrant:
its y value is smaller than ȳ and its x value is smaller than x̄. Looking at our equation, the x
part is then negative and so is the y part. Two negatives make a positive, and thus this point
makes a positive contribution to the correlation coefficient r. We can use this logic for all
other data points in the separate quadrants.
In the graph, the correlation coefficient is positive, as we see a positive trend in the
scatterplot. We could also say that there is a positive correlation between these two variables,
meaning that the larger x gets, the larger y tends to get.
Properties of r:
● r is sensitive to outliers
● −1 ≤ r ≤ 1
● Correlation between x and y = correlation between y and x
● Correlation is independent of scale (grams, kilograms, kilometres, etc.)
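Two of these properties, symmetry in x and y and independence of scale, can be checked numerically. The data below are made up; r is computed from its definition:

```python
import statistics

def corr(x, y):
    # Pearson correlation computed from the definition
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
r_xy = corr(x, y)
r_yx = corr(y, x)                          # symmetry: same value
r_scaled = corr([1000 * v for v in x], y)  # x in grams instead of kilograms
print(round(r_xy, 4), r_xy == r_yx, abs(r_xy - r_scaled) < 1e-9)
```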
Examples of correlation strengths: here we see the linear relationship between variables for
different sample sizes. For larger sample sizes, it can be easier to spot patterns within the
data and determine whether a correlation is present.
Example: 6.1 in LN
Suppose that beer consumption (the average consumption by 100 persons) has been measured in 3
countries on 5 days with different maximum temperatures. The surprising outcome is a negative
relationship. Make a sketch of 3 × 5 points showing what the dot diagram could look like,
assuming that there is a positive relationship between beer consumption and maximum temperature
during the day on an individual level. Show the maximum temperature on the x-axis, and beer
consumption on the y-axis. (Hint: think of Norway (15–25 ºC), France (25–35 ºC), and Iraq
(35–45 ºC).)
[Sketch: beer consumption (0–120) on the y-axis against max temp (0–50 ºC) on the x-axis;
red points = Norway, blue points = France, green points = Iraq]
In this hypothetical scatterplot, we can see that overall beer consumption goes down as
temperature rises; however, within the individual countries it actually goes up. This means we
have to be aware of what is going on inside groups of our data as well as of the overall trends
for the data as a whole.
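The sketch above can be reproduced numerically with invented values chosen to match the hint: within each country the relationship is perfectly positive, yet pooling the three countries gives a negative overall correlation.

```python
import statistics

# (temperatures, consumption) per country; values are invented
norway = ([15, 17.5, 20, 22.5, 25], [100, 102, 104, 106, 108])
france = ([25, 27.5, 30, 32.5, 35], [60, 62, 64, 66, 68])
iraq   = ([35, 37.5, 40, 42.5, 45], [20, 22, 24, 26, 28])

def corr(x, y):
    # Pearson correlation from the definition
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

temps = norway[0] + france[0] + iraq[0]
beer = norway[1] + france[1] + iraq[1]
print(round(corr(*norway), 2))      # 1.0 within Norway
print(round(corr(temps, beer), 2))  # -0.88 overall
```

The sign flip between the within-group and pooled correlations is the point of the exercise.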
Assumptions on ε:
● Error terms εi are deviations from the mean, so the εi's vary around 0
● All εi's have the same variance σ²
● εi's are normally distributed, so εi ~ N(0, σ)
● εi's are independent of each other
Assumptions differently formulated:
● observations y1, y2, ..., yn are independent and normally distributed with mean μ y = β0 + β1x and
standard deviation σ
Parameters of the model: β0, β1 and σ
Linear Regression – estimation of regression parameters
Adapting the line to the data by using the Least-Squares Method (LSM): minimize
∑(error)² = ∑(residual)² = ∑ε̂i² = ∑(yi − ŷi)² = ∑(yi − β̂0 − β̂1xi)²
*The "hat" symbol generally denotes an estimate, as opposed to the "true" value. Therefore, β̂ is
an estimate of β. A few symbols have their own conventions: the sample variance, for example, is
often written as s², not σ̂².
r2 in regression
● r2 (squared correlation coefficient) is the proportion of variation in 𝑦 explained by 𝑥 (Also called the
coefficient of determination).
● This measure indicates how successful the regression is in explaining the variation in the response
variable.
● r2 is the square of the correlation coefficient, therefore always between 0 and 1!
● r2 is also sensitive to outliers in the dataset
Estimator s2 for population variance σ2
Residual variance s² can be calculated by: s² = ∑(residual)²/(n − 2) = ∑(yi − ŷi)²/(n − 2)
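Putting the least-squares estimates and the residual variance together on a small invented dataset (a sketch; the closed-form slope formula b1 = Sxy/Sxx is the standard LSM solution):

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 4.1, 5.8, 8.3, 9.9, 12.1]

n = len(x)
mx, my = statistics.fmean(x), statistics.fmean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)

b1 = sxy / sxx          # slope estimate (beta1-hat)
b0 = my - b1 * mx       # intercept estimate (beta0-hat)

# Residuals y_i - yhat_i; they sum to zero for the LSM fit
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
s2 = sum(r ** 2 for r in residuals) / (n - 2)   # residual variance, df = n - 2
print(round(b1, 3), round(b0, 3), round(s2, 4))
```

Dividing by n − 2 rather than n reflects the two estimated parameters β0 and β1.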
Comments on regression:
● The regression procedure is sensitive to outliers.
● In regression it is allowed to select the values for x (y is random, but x is not).
● Extrapolation beyond the measuring range of the data is risky.
● Correlation is symmetric in x and y, but regression is not: in regression it matters which
variable is the response y and which the explanatory variable x. In regression we want to say
something about y, given the value of x.
Rcmdr output for Linear regression:
● Outliers that fall within the x value range, but not within the y value range, will not have a
great influence on the regression line (they will not overly affect the slope, only the position
of the line).
● Outliers that fall outside both the x value range and the y value range will have a great
influence on the regression line (they change both the slope and the position).
Lecture 8:
You must be able to calculate the value of the estimate ('prediction'):
ŷi = β̂0 + β̂1xi = μ̂y = prediction
You don't have to be able to calculate the standard error of the estimate.
ε̂i = residual = observation − prediction = yi − μ̂y
Confidence interval for the mean µy at x*
Estimated regression line: ŷ = β̂0 + β̂1x
Estimator: μ̂y = β̂0 + β̂1x*
100(1−α)% confidence interval for µy = β0 + β1x*: μ̂y ± tα/2 · SE(μ̂y)
Where:
- SE(μ̂y) is the standard error of the estimator μ̂y
- tα/2 is the table value at n − 2 df
SE(ŷ) > SE(μ̂y), because it is more difficult to predict the value of one individual than the mean.
Lecture 9: T-tests
Types of t-tests
● T-test for one sample and one variable
● T-test for two independent samples
● Paired t-test
● T-test for linear regression
T-test for one sample and one variable:
- One simple random sample
- One variable
Model assumptions:
- Simple random sample of size n from a population where the variable y is
normally distributed with unknown mean 𝜇 and unknown standard deviation 𝜎
Interested in (population) mean 𝜇
T-test for two independent samples
- Two independent simple random samples
- One variable
Model assumptions:
- Simple random sample of size n1 selected from N(𝜇1 , 𝜎1)-population,
independent simple random sample of size n2 selected from N(𝜇2 , 𝜎2)-
population
Interested in difference of population means 𝜇1 – 𝜇2
Paired t-test
- One simple random sample
- Two variables (x and y)
Model assumptions:
- Differences d = x – y are independent and N(𝜇d , 𝜎d) distributed
Interested in mean difference 𝜇d
T-test for linear regression
- One simple random sample
- Two variables (x and y)
The model for linear regression is:
𝑦i = 𝛽0 + 𝛽1𝑥i + 𝜀i
Model assumptions:
- Error terms 𝜀i are independent and 𝜀i ~ N(0, 𝜎) (independent of x)
- Differently formulated: observations y1, y2, …, yn are independent and normally distributed with mean
𝜇y = 𝛽0 + 𝛽1𝑥 and standard deviation 𝜎
Interested in linear relationship between x and y
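The regression t-test checks H0: β1 = 0 with t = β̂1/SE(β̂1) at n − 2 df. A sketch on invented data; the critical value 2.776 (df = 4, α = 0.05, two-sided) is an assumed table value, not from the notes:

```python
import math
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.9, 4.3, 5.6, 8.4, 9.7, 12.2]

n = len(x)
mx, my = statistics.fmean(x), statistics.fmean(y)
sxx = sum((a - mx) ** 2 for a in x)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
b0 = my - b1 * mx

# Residual sd, then the standard error of the slope
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s = math.sqrt(sse / (n - 2))        # residual standard deviation
se_b1 = s / math.sqrt(sxx)          # SE(beta1-hat)
t = b1 / se_b1                      # test statistic for H0: beta1 = 0
print(round(b1, 3), round(t, 2), t > 2.776)
```

Rcmdr reports the same t and its p-value in the coefficients table of the regression output.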