Stats 2 Notes

Lecture 1: Overview of Normal Distribution (See notes from Stats 1 – Lecture 6)

Are observations normally distributed?


In Statistics 1, we always assumed that measurements were normally distributed; in Statistics 2 this is no longer taken for granted, because normality is an assumption that must be checked.
To test normality, we use a Normal-Quantile Plot (Q-Q Plot):
Here we see an example of a Q-Q plot. In a small box you can see
the probability density function, and the data from it are used in the
Q-Q plot. If the data points fall within the barriers (dotted lines) and
stay relatively close to the straight line, our distribution is Normal.
Example: here we have a histogram that looks heavily skewed to
the right, leading us to believe we do not have normality. The Q-Q
plot confirms this:
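To see this in practice, here is a minimal Python sketch (my own illustration, not from the lecture; it assumes numpy, scipy and matplotlib are installed) that draws a Normal Q-Q plot for a right-skewed sample like the one above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=60)   # right-skewed sample, so not normal

# Q-Q plot: ordered data against theoretical normal quantiles;
# points far from the straight line indicate non-normality.
stats.probplot(y, dist="norm", plot=plt)
plt.show()
```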

But remember that sample means are normally distributed!! See the Stats 1 Lecture 8 notes on the Central Limit Theorem: if your distribution is not normal, the only thing you need is a sample size large enough that the distribution of the sample mean is approximately normal. The same holds for a sum of a series of observations.
Lecture 2: Confidence intervals:
Let’s begin with a research example:
Nitrate in the upper groundwater in sandy soil: Too much nitrate in groundwater can inhibit aquatic
life, like algae, from growing. The level of nitrate has been monitored in the NL in various types of land.
Population: Agricultural land on sandy soil in the Netherlands (e.g., all agricultural land on sandy soil divided into hectares).
Sample: Random sample of hectares. (More advanced: stratified sample, several regions and per region n samples; not exam material.)
Unit: 1 hectare.
Response variable: Nitrate concentration in mg/litre.

Visualisation results nitrate concentration:


In the histogram, we see that this looks fairly similar to a normal distribution; however, we can use a Normal Q-Q plot to better determine the distribution type.
In the Normal Q-Q plot we see that the middle points stay fairly in line with the centre line; there is only a slight deviation at the edges. Thus, we can say that we are working with a normal distribution.
European norm = 50 mg/l (average)
Outcome 41 observations:
sample mean = 55.7
Sample standard deviation = 30.3 → thus quite some variation in measurements.
Next, we must introduce a method that also shows the uncertainty of the estimated mean → confidence
interval

Expected value E(ȳ) = μy of a sample mean
Random sample: y1, y2, …, yn. These are n independent observations from the same distribution with expected value (population mean) μy and standard deviation σy.
Aim: estimate μy
Estimator: the sample mean ȳ = (y1 + y2 + … + yn)/n
ȳ is an unbiased estimator for μy, because E(ȳ) = μȳ = μy (the larger the sample, the closer we tend to get to the unknown true value μy).
Mean and variance of a sample mean:
μȳ = (n·μy)/n = μy    σȳ² = (n·σy²)/n² = σy²/n    so σȳ = σy/√n
Implementation: y1, y2, …, yn → the outcome of ȳ is an estimate of μy

Confidence interval for population mean (expected value):


The outcome of ȳ is a (point) estimate of μy, but how precise is this estimate?
A confidence interval (abbreviation: CI) provides information about the precision.
 A CI has the form: estimator ± error margin
The confidence coefficient 1 – α presents a degree of trust: e.g., 1 – α = 0.95 means that the procedure with which a CI is constructed leads to 95% correct statements (statements that the interval contains μy).
 Or: the statement that the CI contains μy is correct 95% of the time.
 Or: the probability that the CI contains the unknown parameter μy is 0.95.
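This "95% of the statements are correct" interpretation can be checked by simulation. A minimal sketch (my own illustration, with arbitrary values of μ, σ and n) that repeatedly constructs the known-σ interval and counts how often it contains μ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 50.0, 30.0, 41, 10_000
half = 1.96 * sigma / np.sqrt(n)          # error margin, sigma assumed known

hits = 0
for _ in range(reps):
    ybar = rng.normal(mu, sigma, n).mean()   # new sample, new interval
    hits += (ybar - half <= mu <= ybar + half)
print(hits / reps)                        # close to 0.95
```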

Value from table:


With the rule of thumb (μ − 2σ, μ + 2σ) → ≈ 95%. However, we can use a more precise method:
Let Z ~ N(0,1)
Table 2: 1 − α = 0.95 → right-tail p = 0.025 → bottom line (df = inf) → zα/2 = 1.960

Procedure 95% confidence interval:
ȳ ~ N(μ, σ/√n) → z = (ȳ − μ)/(σ/√n)

P(ȳ − 1.96·σ/√n ≤ μ ≤ ȳ + 1.96·σ/√n) = 0.95

The formula is derived in the steps below:
P(−1.96 ≤ z ≤ 1.96) = 0.95 → substitute z →
P(−1.96 ≤ (ȳ − μ)/(σ/√n) ≤ 1.96) = 0.95 → multiply by σ/√n →
P(−1.96·σ/√n ≤ ȳ − μ ≤ 1.96·σ/√n) = 0.95 → subtract ȳ →
P(−ȳ − 1.96·σ/√n ≤ −μ ≤ −ȳ + 1.96·σ/√n) = 0.95 → multiply by −1 →
P(ȳ + 1.96·σ/√n ≥ μ ≥ ȳ − 1.96·σ/√n) = 0.95 → rewrite!

Limits confidence interval:
ȳ ± zα/2 · σ/√n
1 − α = 0.95 → 95% confidence interval
Table 2 (O&L, p. 1088), right-tail probability = α/2 = 0.025 → df = inf. → zα/2 = 1.960

Exercise: Nitrate concentration
Suppose y is the nitrate concentration in the upper groundwater in sandy soil (mg/litre).
Suppose y is normally distributed with unknown μ and (known) σ = 30.3.
From a random sample with 41 observations, we get a sample mean of 55.7.
Construct a 90% confidence interval for μ.  n = 41, ȳ = 55.7
y = nitrate concentration in the upper groundwater in sandy soil (mg/l), y ~ N(μ, 30.3)
Limits 90% confidence interval for μ: ȳ ± z0.05 · σ/√n = 55.7 ± 1.645 · 30.3/√41 = 55.7 ± 7.78
90% confidence interval for μ: (47.92, 63.48)
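The same limits can be reproduced in Python (a sketch, assuming scipy is available; the numbers are the ones from this exercise):

```python
import math
from scipy import stats

n, ybar, sigma = 41, 55.7, 30.3
z = stats.norm.ppf(1 - 0.10 / 2)          # z_0.05 = 1.645
half = z * sigma / math.sqrt(n)           # error margin, approx. 7.78
print(ybar - half, ybar + half)           # approx. (47.92, 63.48)
```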
We have constructed a 90% CI for µ, assuming that σ is exactly known. Is that realistic? In reality σ is hardly ever known; it is only really known when it is given in an exercise or when historical research has been done.

Unknown σ:
We estimate σ by the sample standard deviation s, the square root of the sample variance s²:

s² = [(y1 − ȳ)² + … + (yn − ȳ)²]/(n − 1) = Σi=1..n (yi − ȳ)²/(n − 1)

We estimate the standard deviation of the mean, σȳ = σy/√n, by s/√n.

Standard error of a sample mean:


The standard deviation of the sample mean is a measure of the precision of the sample mean ȳ as an estimator for the population mean μ.
That is why we call it the standard error of the sample mean and write it as: SE(ȳ) = s/√n
NOTE: strictly speaking this is an estimate of the standard error, because we have replaced σ by s, but in practice we don't make a fuss about this.

Estimating σ with s: consequences for the CI (CI for µ with σ known/unknown):

We expect that estimating σ introduces extra uncertainty and therefore leads to a wider confidence interval.
For the estimation of σ the following applies: a larger sample gives a better (more precise) estimate of σ (the estimator s is consistent).
 The standard normal distribution is replaced by a t-distribution with a certain number of degrees of freedom (df).
(Table 2 (O&L, p. 1088): df = infinity → standard normal distribution.)

Difference between the normal distribution and the t-distribution:

If σ is known, you can use the standard normal distribution (SND, blue dashed line).
If we only have a certain number of observations n and we estimated the standard deviation, we cannot use the SND: small samples do not behave like the SND, they behave like a Student t-distribution, which is an SND-like curve corrected for the sample size. (Red: df = 1, which means n = 2; green: df = 5, so n = 6.) Under each of the curves the area remains 1, so for smaller n we get heavier tails.

Example: From textbook page 286:
A company that manufactures coffee for use in commercial machines monitors the caffeine content in its
coffee. The company selects 50 samples of coffee every hour from its production line and determines the
caffeine content. From historical data, the caffeine content (mg) is known to have a normal distribution
with 𝝈=7.1 mg. During a 1-hour time period, the 50 samples yielded a mean caffeine content of 𝑦 =110
mg.
a. Identify the population about which inferences can be made from the sample data.
Population: All coffee produced during the hour in which the 50 samples were selected.

b. Calculate a 95% confidence interval for the mean caffeine content 𝜇 of the coffee produced during the
hour in which the 50 samples were selected.
Let y = caffeine content in mg. y ~ N(µ, 7.1), n = 50, ȳ = 110
ȳ ± z0.025 · σ/√n = 110 ± 1.960 · 7.1/√50 = 110 ± 1.968
Limits 95% confidence interval for µ: (108.032, 111.968)

c. What would happen to the width of the confidence intervals if the level of confidence of each interval
is increased from 95% to 99%? It would become wider
(95%: zα/2 = z0.025 = 1.960; 99%: zα/2 = z0.005 = 2.576)

d. What would happen to the width of the confidence intervals if the number of samples per hour was
increased from 50 to 100? It would become narrower (divide by √100 instead of √50).
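Parts c and d can be made concrete by computing the interval width 2·zα/2·σ/√n for each scenario (a sketch using the values from this exercise):

```python
import math
from scipy import stats

sigma = 7.1
for conf, n in [(0.95, 50), (0.99, 50), (0.95, 100)]:
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    width = 2 * z * sigma / math.sqrt(n)
    print(conf, n, round(width, 3))   # 99% is wider, n = 100 is narrower
```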

Because the company is sampling the coffee production process every hour, there are 720 confidence
intervals for the mean caffeine content 𝜇 constructed every month.
a. If the level of confidence remains at 95% for the 720 confidence intervals in a given month, how
many of the confidence intervals would you expect to fail to contain the value of 𝜇?
5% of the 720 confidence intervals: 36

b. If the number of samples is increased from 50 to 100 each hour, how many of the 95% confidence
intervals would you expect to fail to contain the value of 𝜇 in a given month?
5% of the 720 confidence intervals: 36

c. If the number of samples remains at 50 each hour but the level of confidence is increased from 95% to
99% for each of the intervals, how many of the 99% confidence intervals would you expect to fail to
contain the value of 𝜇 in a given month?
1% of the 720 confidence intervals : 7.2 ≈ 7
Lecture 3: Single sample t-test:
Test procedure in 8 steps:
Define the parameter(s) and specify sequentially:
1. Null hypothesis H0 and alternative hypothesis Ha
2. The test statistic (T.S.) as a formula — fill in the allowed parts!
3. The probability distribution of the T.S. under H0
4. The behaviour of the T.S. under Ha (i.e. under Ha the T.S. tends to "higher"/"lower"/"higher or lower" values than under H0)
5. The type of p-value (right-, left-, two-sided)
(Determine steps 1-5 prior to the experiment! Only in steps 6-8 is the data used.)
-----------------------------------------------------------------------------------------------------------------
6. The outcome of the T.S.
7. The appropriate p-value
8. The conclusion (also in non-statistical terms)
Significance test: Procedure to compare data with a hypothesis. In a test procedure, we use a probability
to show how well the hypothesis and the data match.
z-test for a population mean µ:
Steps 1-5: set up as in the 8-step procedure above (see the slides).
Step 6: calculate the outcome of the test statistic based on the executed experiment.
Step 7: P-value: the probability that the T.S. attains the observed value or anything more extreme (supporting Ha), assuming that H0 is true.
Step 8: draw the conclusion (see the slides).

Test procedure for a pop mean µ with σ known/unknown:


Model assumptions t-test for a population mean:
Based on a random sample of size n from a population where the variable y is normally distributed with unknown mean µ and unknown standard deviation σ.
*We can use a Q-Q plot to check whether it is normally distributed.

Rejection Region (R.R.) and probability of a Type I and Type II error:

The rejection region is the set of all outcomes of the T.S. for which we reject H0.
The R.R. is linked to the Type I error (incorrectly rejecting H0), and thus α is also linked (α = the probability of making a Type I error).
*Note: the confidence interval (CI) is the interval of credible parameter values (for example for μ).
P(Type I error) = P(incorrectly reject H0) = P(T.S. in R.R. when H0 is true) ≤ α
P(Type II error) = P(incorrectly not reject H0) = P(T.S. not in R.R. when Ha is true)
Test procedure in 8 steps: 2 ways

Note: we have a two-sided P-value when Ha uses ≠, rather than < or >.

Lecture Example: t-test for a population mean µ

Captan (pesticide) content in apple sauce (EU limit for children: 0.15 mg/kg). Sample: 15 random jars.
Sample mean = 0.19, sample standard deviation = 0.06.
Parameter: µ = (population) mean Captan content in a randomly chosen jar of applesauce.
1. H0: µ = 0.15, Ha: µ > 0.15
2. t = (ȳ − µ0)/(s/√n) = (ȳ − 0.15)/(s/√15)
3. Distribution under H0 → tn−1 distribution = t14 distribution
4. Under Ha the T.S. tends to higher values than under H0.
5. Right-sided P-value
6. t = (ȳ − 0.15)/(s/√15) = (0.19 − 0.15)/(0.06/√15) = 2.6
7. Right-sided P-value = P(t14 ≥ 2.6) = 0.0105
8. P-value ≤ α = 0.05, so reject H0; Ha has been shown. It has been shown (at α = 0.05) that the EU limit for the mean Captan content in applesauce (0.15 mg/kg) has been exceeded.
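A quick check of steps 6-7 in Python (a sketch from the summary statistics above; note that the notes round t to 2.6, which gives the slightly smaller p-value 0.0105):

```python
import math
from scipy import stats

n, ybar, s, mu0 = 15, 0.19, 0.06, 0.15
t = (ybar - mu0) / (s / math.sqrt(n))   # approx. 2.58
p = stats.t.sf(t, df=n - 1)             # right-sided p-value, approx. 0.011
print(t, p)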

Lecture 4: 2 independent samples t-test


T-test steps: we use the same steps as the single-sample t-test, but the first 3 steps change.
Test statistic: t = (estimator − hypothesized value)/(standard error of the estimator)
*The estimators/standard errors (red in the slides) are the same for the T.S. and for the limits of the C.I.
Limits of the confidence interval for the parameter: estimator ± tα/2 · standard error of the estimator

Confidence interval for two independent samples (σ1² = σ2²) (equal variances):
Model assumptions: a random sample of size n1 from a N(µ1, σ1) population and an independent random sample of size n2 from a N(µ2, σ2) population, with σ1 = σ2 = σ.
*To verify normality in the samples, we use a Q-Q plot.
Parameter: µ1 − µ2. Estimator: ȳ1 − ȳ2.
100(1 − α)% confidence interval for µ1 − µ2: (ȳ1 − ȳ2) ± tα/2 · sp·√(1/n1 + 1/n2)
sp² is the pooled variance estimator of σ² (i.e. the weighted average of the sample variances s1² and s2²):
sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2)
tα/2 from the t-distribution with n1 + n2 − 2 degrees of freedom (df).

t-test for two independent samples (σ1² = σ2²) (equal variances): steps 1-3
Parameters: define µ1 and µ2. Don't forget to use information from the aim/question!
Step 1: H0: µ1 − µ2 = D0 (notation from the book; often D0 = 0)
Step 2: t = [(ȳ1 − ȳ2) − D0] / (sp·√(1/n1 + 1/n2)), with sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2)
Step 3: Under H0 the T.S. (t) follows a t-distribution with n1 + n2 − 2 degrees of freedom (df).
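Step 2 is easy to evaluate from summary statistics alone. A minimal sketch (the sample values used here are hypothetical, not from the lecture):

```python
import math

def pooled_t(y1bar, y2bar, s1, s2, n1, n2, d0=0.0):
    """Pooled-variance t statistic and its degrees of freedom."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)
    return (y1bar - y2bar - d0) / se, n1 + n2 - 2

t, df = pooled_t(12.1, 10.4, 2.0, 2.3, 10, 12)   # hypothetical numbers
print(t, df)
```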

Confidence interval for two independent samples (σ1² ≠ σ2²) (unequal variances) (Welch's test):
Model assumptions: a random sample of size n1 from a N(µ1, σ1) population and an independent random sample of size n2 from a N(µ2, σ2) population, with σ1 ≠ σ2.
*To verify normality in the samples, we use a Q-Q plot.
Parameter: µ1 − µ2. Estimator: ȳ1 − ȳ2.
100(1 − α)% confidence interval for µ1 − µ2: (ȳ1 − ȳ2) ± tα/2 · √(s1²/n1 + s2²/n2)
tα/2 from the t-distribution with degrees of freedom (df) = the value seen in the Rcmdr output.
t-test for two independent samples (σ1² ≠ σ2²) (unequal variances): steps 1-3 (Welch's test)
Parameters: define µ1 and µ2. Don't forget to use information from the aim/question!
Step 1: H0: µ1 − µ2 = D0 (notation from the book; often D0 = 0)
Step 2: t′ = [(ȳ1 − ȳ2) − D0] / √(s1²/n1 + s2²/n2)  (*the ′ on the t indicates that this is the test statistic for unequal variances!)
Step 3: Under H0 the T.S. (t′) follows approximately a t-distribution with degrees of freedom (df) = the value seen in the Rcmdr output (usually not a round number).
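In Python, Welch's test is the equal_var=False variant of scipy's two-sample t-test; the df it uses is the same non-round number Rcmdr reports. A sketch on hypothetical data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(10, 2, 12)    # hypothetical sample 1
b = rng.normal(9, 5, 15)     # hypothetical sample 2, larger spread

t, p = stats.ttest_ind(a, b, equal_var=False)   # Welch's t'
print(t, p)                  # p here is two-sided
```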
Reading Rcmdr output:
Here we have a two-sided P-value. We know this because our Ha uses ≠. This P-value accounts for both tails. If you are given a two-sided Rcmdr output and you only need one side, you divide this P-value in half.

Here we have a one-sided P-value. We know this because our Ha uses >. This P-value accounts for only one tail. If you are given a one-sided Rcmdr output and you need both sides, you double this P-value.
How do we know if variances are equal? Levene's test! → Rcmdr
Levene's test is used to test for equality of variances. This test goes through the 8 steps of a test procedure, starting with: H0: σ1² = σ2², Ha: σ1² ≠ σ2².
To the right we see the Rcmdr output and how to read it; α for Levene's test is always 0.05.
The unequal-variance t-test (t′) is also known as Welch's test!
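A sketch of Levene's test in Python (hypothetical data; scipy's levene plays the same role as the Rcmdr output here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0, 1, 30)     # hypothetical group with small spread
b = rng.normal(0, 3, 30)     # hypothetical group with large spread

stat, p = stats.levene(a, b)   # H0: equal variances
print(p)                       # p < 0.05 -> use Welch's test (t')
```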

Lecture Example: t-test for two independent population means

Hamster cortisol hormone level when exposed to the scent of a predator (weasel) for 28 days. Sample: 20 hamsters: 10 → control group (not exposed), 10 → test group (exposed). (Randomly assigned!)
Aim: can we prove that exposure to weasel scent increases the mean cortisol level? Results:
Clearly, we see an increased mean cortisol level in the test hamsters. However, is this significant?
Parameters: μA = (population) mean cortisol level (ng/ml) in hamsters that were exposed to weasel scent; μB = (population) mean cortisol level in hamsters that were not exposed to weasel scent.
1. H0: μA − μB = 0, Ha: μA − μB > 0
2. t = [(ȳA − ȳB) − D0] / (sp·√(1/nA + 1/nB)) = (ȳA − ȳB) / (sp·√(1/10 + 1/10))
3. Distribution under H0 → t has a t-distribution with df = nA + nB − 2 = 18.
4. Under Ha, t tends to higher values than under H0.
5. Right-sided P-value
6. t = 3.262/1.3566 = 2.405
7. Right-sided P-value = P(t18 ≥ 2.405) = 0.0136 (one-sided, in this case right-sided, P-value as in "Reading Rcmdr output" above)
8. 0.0136 < α = 0.05, so reject H0; Ha has been shown. It has been shown (at α = 0.05) that the mean cortisol level in hamsters exposed to weasel scent is higher than the mean cortisol level in hamsters that were not exposed to weasel scent.
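Steps 6-7 can be verified from the two numbers in the output (a sketch; 3.262 and 1.3566 are the difference in means and its standard error from the example above):

```python
from scipy import stats

t = 3.262 / 1.3566            # approx. 2.405
p = stats.t.sf(t, df=18)      # right-sided P(t18 >= 2.405), approx. 0.0136
print(t, p)
```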
Lecture 5: Paired t-test
Paired observations vs. 2 independent samples
Paired observations:
• 1 simple random sample
• 2 variables x and y
• Interested in the (population) mean difference between x and y
2 independent samples:
• 2 independent simple random samples
• 1 variable (x in the 1st sample, y in the 2nd sample)
• Interested in the difference between (population) mean x and (population) mean y
For paired observations: suppose that the differences d = x − y are N(µd, σd) distributed. Then a one-sample t-test can be performed for the mean difference µd = µx − µy, using the differences d = x − y.

General structure of the t-test (as seen in the independent-samples t-test):
T-test steps: we use the same steps as the single-sample t-test, but the first 3 steps change.
Test statistic: t = (estimator − hypothesized value)/(standard error of the estimator)
*The estimators/standard errors (red in the slides) are the same for the T.S. and for the limits of the C.I.
Limits of the confidence interval for the parameter: estimator ± tα/2 · standard error of the estimator

t-test for paired observations: steps 1-3
Parameters: define µd. Don't forget to use information from the aim/question!
Step 1: H0: µd = D0 (notation from the book; often D0 = 0)
Step 2: t = (d̄ − D0)/(sd/√n), where d̄ is the estimator, sd is the standard deviation of the differences, and sd/√n is the standard error.
Step 3: Under H0 the T.S. (t) follows a Student t-distribution with n − 1 degrees of freedom (df).
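A paired t-test on data reduces to exactly this one-sample test on the differences; scipy exposes it directly. A sketch with hypothetical before/after scores:

```python
import numpy as np
from scipy import stats

before = np.array([6.1, 5.8, 7.0, 6.4, 5.9])   # hypothetical scores
after  = np.array([6.8, 6.0, 7.3, 6.9, 6.2])

t, p = stats.ttest_rel(after, before)   # tests the mean of d = after - before
print(t, p)                             # p here is two-sided
```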

Confidence interval for paired observations
Model assumptions: a simple random sample of size n, with two variables x and y observed per unit; the differences d = x − y are N(µd, σd) distributed; interest is in the (population) mean difference µd; σd is unknown and estimated from the measured data.
*To verify normality in the sample, we use a Q-Q plot.
100(1 − α)% confidence interval for µd: d̄ ± tα/2 · sd/√n  (estimator d̄, standard error sd/√n)
tα/2 from the t-distribution with n − 1 degrees of freedom (df) [df = n − 1, right-tail p = α/2 in Table 2].

Reading Rcmdr output:
Here we have a one-sided P-value. We know this because our Ha uses >. This P-value accounts for only one tail. If you are given a one-sided Rcmdr output and you need both sides, you double this P-value.

Here we have a two-sided P-value. We know this because our Ha uses ≠. This P-value accounts for both tails. If you are given a two-sided Rcmdr output and you only need one side, you divide this P-value in half.

Type 1 and Type 2 error calculations:

Lecture Example: paired t-test
Medical training now includes communication training. Can we prove (with data) that communication training is effective? | Units: medical students | Variable: patients' score of the general impression of the doctor before the communication training, and the score after.
Calculate "by hand" the 95% confidence interval for the mean difference between the scores for the general impression of the presentation and the student's eye contact with the people in the room, before and after the training. Use the Rcmdr output below:

Here we use a one-sample t-test on the differences (diff) and a Q-Q plot for diff = after − before.
Let x = score after training and y = score before training, d = x − y.
Limits of the 95% confidence interval for µd:
d̄ ± t0.025 · sd/√n = 0.5385 ± 2.060 · 0.9470/√26 = 0.5385 ± 0.3826
Thus, the 95% C.I. for µd is (0.1559, 0.9210).
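Reproducing this interval in Python (a sketch using the summary values above):

```python
import math
from scipy import stats

n, dbar, sd = 26, 0.5385, 0.9470
t = stats.t.ppf(1 - 0.05 / 2, df=n - 1)   # t_{0.025, 25} = 2.060
half = t * sd / math.sqrt(n)              # approx. 0.3826
print(dbar - half, dbar + half)           # approx. (0.1559, 0.9210)
```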

Lecture 6: Correlations: (Linear relationships between two variables)


(Pearson) correlation coefficient r:
A correlation coefficient measures the strength of the linear relationship between two quantitative variables x and y:

r = 1/(n − 1) · Σ ((xi − x̄)/sx)·((yi − ȳ)/sy) = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² · Σ(yi − ȳ)²)

Here we have pairs of observations (xi, yi); we have the mean of x (x̄) and the mean of y (ȳ), and we use the empirical standard deviations of the x values and the y values respectively.
What kind of values can we get from this formula?
The ×'s in the scatterplot represent a few data points. Take, for instance, a data point in the lower-left quadrant: its y value is smaller than ȳ and its x value is smaller than x̄. Looking at our equation, the x factor is then negative and so is the y factor. Two negatives make a positive, and thus this point makes a positive contribution to the correlation coefficient. We can use this logic for the data points in the other quadrants.
In the graph, the correlation coefficient is positive, as we see a positive trend in the scatterplot. We could also say that there is a positive correlation between these two variables, meaning that the larger x gets, the larger y tends to get.
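The formula can be checked directly against numpy's built-in Pearson correlation (a sketch with hypothetical data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])

# Second form of the formula above:
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
print(r_manual, np.corrcoef(x, y)[0, 1])   # the two agree
```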

Properties of r:
• r is sensitive to outliers
• −1 ≤ r ≤ 1
• Correlation between x and y = correlation between y and x
• Correlation is independent of scale (grams, kilograms, kilometres, etc.)
• r = +1: perfect positive correlation (whenever x increases, y increases)
• r = −1: perfect negative correlation (whenever x increases, y decreases)
• r = 0: no correlation or linear relationship (y seems to be under no influence of x)

Examples of correlation strengths: here we see the linear relationship between variables for different sample sizes. For larger sample sizes it can be easier to spot patterns in the data and determine whether a correlation is present.

A few warnings with regard to correlation:

High correlation does not automatically imply a cause-and-effect relation.
Example: Nobel laureates and chocolate consumption.
Proper interpretation of the correlation coefficient requires both variables to be random (stochastic variables).

Here we see that there is a positive correlation between countries' chocolate consumption and their number of Nobel prizes won per capita. However, we cannot actually say that these two are related and that one actually affects the other.

Determine cause-and-effect relation:


We can determine a cause-and-effect relation by using an experimental study:
● we can use different settings of explanatory variable
● we can control other influences on response variable
In an experimental study the researcher purposely assigns different treatments to the experimental units
Experimental vs. observational study:
● Observational study: observe without any interference with the process
● Experimental study: actively manipulate certain variables to ascertain their effects.

Example: 6.1 in LN
Suppose that beer consumption (the average consumption by 100 persons) has been measured in 3 countries on 5 days with different maximum temperatures. The surprising outcome is a negative relationship. Make a sketch of 3 × 5 points showing what the dot diagram could look like, assuming that there is a positive relationship between beer consumption and maximum temperature during the day on an individual level.
Show the maximum temperature on the x-axis and beer consumption on the y-axis. (Hint: think of Norway (15-25 ºC), France (25-35 ºC), and Iraq (35-45 ºC).)
[Sketch: beer consumption vs. max temp; red points = Norway, blue points = France, green points = Iraq.]

In this hypothetical scatterplot we can see that overall beer consumption goes down with temperature; however, within the individual countries it actually goes up. This means we have to be aware of what is going on inside groups of our data as well as of the overall trends in the data as a whole.

Model for simple linear regression:


In the model for simple linear regression, it is assumed that for each value of the explanatory variable x the response variable y is normally distributed with mean μy, where μy is a linear function of x:
µy = β0 + β1x
β1 = (population) mean increase in y for a 1-unit increase in x
β0 = (population) mean value of y when x = 0 (in many cases β0 has no sensible interpretation)
Actual observed y's will vary around this mean. The model assumes that this variation, measured by the standard deviation σ, is the same for all values of x.
Model: a mathematical description of a linear relationship between 2 variables.
In the model:
● Suppose n pairs of observations (x1, y1), …, (xn, yn)
● The model for linear regression: yi = β0 + β1xi + εi (systematic part β0 + β1xi, random part εi)
Assumptions on ε:
● Error terms εi are deviations from the mean, so the εi's vary around 0
● All εi's have the same variance σ²
● εi's are normally distributed, so εi ~ N(0, σ)
● εi's are independent of each other
Assumptions differently formulated:
● observations y1, y2, ..., yn are independent and normally distributed with mean μy = β0 + β1x and standard deviation σ
Parameters of the model: β0, β1 and σ

Linear Regression – estimation of regression parameters
Fitting the line to the data with the Least-Squares Method (LSM):
Σ(error)² = Σ(residual)² = Σ ε̂i² = Σ(yi − ŷi)² = Σ(yi − β̂0 − β̂1xi)²
minimize → equation of the regression line

We use this method to find the line that best fits our data: to do this we minimize the squared distance of the points to the line. Say we have a line and we look at the distance of a point (observed yi) to the line; this gives an error value. We would like to make this error as small as possible, because we are looking for the best line to describe our relationship, and we want to do so for all data points simultaneously. In the equation we see that we square each error value (so that we only end up with positive values) and sum over all data points.
Estimation of regression parameters:
If we have a sample of observations, then the population regression line μy = β0 + β1x can be estimated by the least-squares regression line:
ŷ = β̂0 + β̂1x
where β̂1 = r·(sy/sx) and β̂0 = ȳ − β̂1x̄ are the estimators for β1 and β0. (The estimates can be read from the Rcmdr output. The lecturer has mentioned that they will not ask us to use these equations, but rather to interpret the results from Rcmdr output.)

*The "hat" symbol generally denotes an estimate, as opposed to the "true" value. Therefore, β̂ is an estimate of β. A few symbols have their own conventions: the sample variance, for example, is usually written as s², not σ̂².
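A sketch of the least-squares fit in Python (hypothetical data); scipy.stats.linregress returns the same kind of estimates that Rcmdr prints, and the slope indeed equals r·sy/sx:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])

res = stats.linregress(x, y)
print(res.intercept, res.slope)                     # beta0-hat, beta1-hat
print(res.rvalue * y.std(ddof=1) / x.std(ddof=1))   # equals the slope
print(res.rvalue ** 2)                              # r^2
```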
r² in regression
● r² (squared correlation coefficient) is the proportion of variation in y explained by x (also called the coefficient of determination).
● This measure indicates how successful the regression is in explaining the variation in the response variable.
● r² is the square of the correlation coefficient, and therefore always between 0 and 1!
● r² is also sensitive to outliers in the dataset.
Estimator s² for the population variance σ²
The residual variance s² can be calculated as: s² = Σ(residual)²/(n − 2) = Σ(yi − ŷi)²/(n − 2)

Comments on regression:
● The regression procedure is sensitive to outliers.
● In regression it is allowed to select the values of x (y is random, but x is not).
● Extrapolation beyond the measuring range of the data is risky.
● Correlation is symmetric in x and y, but regression is not. In regression it matters which variable is the response y and which is the explanatory variable x: in regression we want to say something about y, given the value of x.

Rcmdr output for Linear regression:

Example: 6.2 (LN, pages 26-28)


This exercise is based on data from Ruben Dijkhof's master thesis, about which you have watched a clip before this tutorial. The data of the Dutch province of Gelderland will now be analysed, in particular the distance to the National Ecological Network (x, in meters) and the price of agricultural land (y, in euros per hectare). The relationship between x and y is assumed to be linear: y = β0 + β1x + ε.
The Rcmdr output of a linear regression analysis is shown in the Lecture Notes on page 28.

Lecture 7: Outliers and Influential observations in linear regression


An outlier is a point that falls far from the other data points. If the parameter estimates change a great deal when a point is removed from the calculations, the point is said to be influential.

Checking the model assumptions


Rcmdr → Graphs → Basic diagnostic plots
Q-Q plot → check if the residuals (εi's) are normally distributed.
Residuals vs Fitted → check if all residuals (εi's) have the same variance. → If a pattern is visible, e.g. increasing variance with increasing predicted values, there is no constant variance.

Outliers and influential observations:


Outlier in the y direction but not the x direction vs. outlier in both the x and y direction:
Outliers that fall within the x value range, but not the y value range, will not have a great influence on the regression line (they will not overly affect the slope, only the position of the line).
Outliers that fall outside the x value range and the y value range will have a great influence on the regression line (a different slope and positioning).

It is important to be aware of outliers in your data set. This does not mean you must remove all outliers, but perhaps there is a reason why that observation is so different from the rest (wrong manual input, an external factor, etc.).
Standard error of the estimators of the regression parameters
The LSM gives the estimates β̂0 and β̂1 of β0 and β1.
The standard errors SE(β̂0) and SE(β̂1) represent the uncertainty of these estimates.

Confidence intervals for β0 and β1:
(1 − α) confidence interval for β0: β̂0 ± tα/2 · SE(β̂0)
(1 − α) confidence interval for β1: β̂1 ± tα/2 · SE(β̂1)
tα/2 from the t(n−2) distribution [df = n − 2, right-tail p = α/2]

Significance test for a regression coefficient βi, i = 0 or i = 1

Define βi: … (with i = 0 or i = 1)
Test:
1. H0: βi = d (d is a given constant and i = 0 or i = 1)
2. T.S.: t = (β̂i − d)/SE(β̂i)
3. Under H0 the T.S. (t) follows a t-distribution with n − 2 degrees of freedom.
Etc. (rest of the 8-step procedure)
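Given β̂i and SE(β̂i) from the Rcmdr output, steps 2-3 and the two-sided p-value are one line each in Python (a sketch with hypothetical output values):

```python
from scipy import stats

beta1_hat, se_beta1, n = 1.02, 0.11, 25   # hypothetical output values
t = (beta1_hat - 0) / se_beta1            # H0: beta1 = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-sided p-value
print(t, p)
```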

Lecture 8: Confidence and prediction intervals in regression
You must be able to calculate the value of the estimate ('prediction'):
β̂0 + β̂1xi = μ̂y = prediction
You don't have to be able to calculate the standard error of the estimate.
ε̂ = residual = observation − μ̂y

Confidence interval for the mean µy at x*
Estimated regression line: ŷ = β̂0 + β̂1x
If x = x*, then the mean value of y is given by μy = β0 + β1x*. This mean can be estimated from the sample by the estimator μ̂y = β̂0 + β̂1x*.
100(1 − α)% confidence interval for μy = β0 + β1x*: μ̂y ± tα/2 · SE(μ̂y)
Where:
- SE(μ̂y) is the standard error of the estimator μ̂y
- tα/2 is the table value at n − 2 df

Prediction interval for y when x = x*

100(1 − α)% prediction interval for a future observation of y at x = x*:
ŷ ± tα/2 · SE(ŷ)
Where:
- ŷ = β̂0 + β̂1x* = μ̂y
- tα/2 is the table value at n − 2 df
- SE(ŷ), the standard error for the prediction of an individual reaction, is larger than SE(μ̂y), the standard error of the estimator of the mean reaction μ̂y

SE(ŷ) > SE(μ̂y) because it is more difficult to predict the value of one individual than of the mean.
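For completeness (the note above says you need not compute these standard errors by hand), a sketch using the standard textbook formulas SE(μ̂y)² = s²(1/n + (x*−x̄)²/Sxx) and SE(ŷ)² = s²(1 + 1/n + (x*−x̄)²/Sxx), on hypothetical data:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])
res = stats.linregress(x, y)

n, xstar = len(x), 3.5
yhat = res.intercept + res.slope * xstar          # prediction at x*
s2 = np.sum((y - (res.intercept + res.slope * x)) ** 2) / (n - 2)
sxx = np.sum((x - x.mean()) ** 2)

se_mean = np.sqrt(s2 * (1 / n + (xstar - x.mean()) ** 2 / sxx))      # SE(mu-hat)
se_pred = np.sqrt(s2 * (1 + 1 / n + (xstar - x.mean()) ** 2 / sxx))  # SE(y-hat), larger
t = stats.t.ppf(0.975, df=n - 2)
print(yhat - t * se_mean, yhat + t * se_mean)     # CI for the mean at x*
print(yhat - t * se_pred, yhat + t * se_pred)     # prediction interval, wider
```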

Lecture 9: T-tests
Types of t-tests:
● t-test for one sample and one variable
● t-test for two independent samples
● Paired t-test
● t-test for linear regression
T-test for one sample and one variable:
- One simple random sample
- One variable
Model assumptions:
- Simple random sample of size n from a population where the variable y is
normally distributed with unknown mean 𝜇 and unknown standard deviation 𝜎
 Interested in (population) mean 𝜇
T-test for two independent samples
- Two independent simple random samples
- One variable
Model assumptions:
- Simple random sample of size n1 selected from N(𝜇1 , 𝜎1)-population,
independent simple random sample of size n2 selected from N(𝜇2 , 𝜎2)-
population
 Interested in difference of population means 𝜇1 – 𝜇2
Paired t-test
- One simple random sample
- Two variables (x and y)
Model assumptions:
- Differences d = x − y are independent and N(µd, σd) distributed
 Interested in the mean difference µd
T-test for linear regression
- One simple random sample
- Two variables (x and y)
The model for linear regression is:
 yi = β0 + β1xi + εi
Model assumptions:
- Error terms εi are independent and εi ~ N(0, σ) (independent of x)
- Differently formulated: observations y1, y2, …, yn are independent and normally distributed with mean
𝜇y = 𝛽0 + 𝛽1𝑥 and standard deviation 𝜎
 Interested in linear relationship between x and y
