6.3 SSK5210 Parametric Statistical Testing - Analysis of Variance LR and Correlation - 2
Parametric
STATISTICAL TESTING
The Analysis of Variance
Linear Regression & Correlation
Outline
• Scatter Plots and Correlation
• Regression
• Coefficient of Determination and Standard
Introduction
The purpose of this chapter is to answer these questions
statistically:
• Are two or more variables linearly related?
• If so, what is the strength of the relationship?
• What type of relationship exists?
• What kind of predictions can be made from the relationship?
Introduction
• Consider two variables: an independent variable and a dependent variable.
• Independent variable: the variable in regression that can be controlled or manipulated.
• For example, suppose a student is studying for an exam. The number of hours of study is the independent variable, designated as the x variable, since the student can regulate or control the number of hours he or she studies.
• Dependent variable: The variable in regression that cannot be controlled or manipulated.
• The grade the student received on the exam is the dependent variable, designated as
the y variable.
• The grade the student earns depends on the number of hours the student studied.
• The independent and dependent variables can be plotted on a graph called a scatter plot.
• independent variable: x is plotted on the horizontal axis
• dependent variable: y is plotted on the vertical axis.
Scatter Plot
• Deterministic model: y = α + βx
• Probabilistic model: y = deterministic model + random error
  y = α + βx + ε
A Simple Linear Model
• Since the bivariate measurements that we observe do not generally fall exactly on a straight line, we choose to use the probabilistic model:
• y = α + βx + ε
• E(y) = α + βx
• Points deviate from the line of means by an amount ε, where ε has a normal distribution with mean 0 and variance σ².
The Random Error
• The line of means, E(y) = α + βx, describes the average value of y for any fixed value of x.
• The population of measurements is generated as y deviates from the population line by ε. We estimate α and β using sample information.
The Method of Least Squares
• The equation of the best-fitting line is calculated using a set of n pairs (xi, yi).
• We choose our estimates a and b to estimate α and β so that the vertical distances of the points from the line are minimized.
Best-fitting line: ŷ = a + bx
Choose a and b to minimize
SSE = Σ(y − ŷ)² = Σ(y − a − bx)²
Least Squares Estimators
Calculate the sums of squares:
Sxx = Σx² − (Σx)²/n    Syy = Σy² − (Σy)²/n    Sxy = Σxy − (Σx)(Σy)/n
Best-fitting line: ŷ = a + bx, where
b = Sxy/Sxx and a = ȳ − b·x̄
Example
The table shows the math achievement test scores for a random sample
of n = 10 college freshmen, along with their final calculus grades.
Student            1   2   3   4   5   6   7   8   9  10
Math test, x      39  43  21  64  57  47  28  75  34  52
Calculus grade, y 65  78  52  82  92  89  73  98  56  75
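The sums of squares and least squares estimates for this data can be checked with a short script (a minimal sketch using only the Python standard library, following the formulas on the previous slide):

```python
# Least squares fit for the math-test / calculus-grade data above.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)

# Sums of squares: Sxx, Syy, Sxy.
Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
Syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = Sxy / Sxx                      # slope: b = Sxy / Sxx
a = sum(y) / n - b * sum(x) / n    # intercept: a = y-bar - b * x-bar

print(Sxx, Syy, Sxy)               # 2474.0 2056.0 1894.0
print(round(b, 4), round(a, 3))    # 0.7656 40.784
```

The fitted line ŷ = 40.784 + 0.7656x matches the Minitab output shown later in these slides.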
Source      df      SS        MS           F
Regression  1       SSR       SSR/1        MSR/MSE
Error       n − 2   SSE       SSE/(n − 2)
Total       n − 1   Total SS
The Calculus Problem
SSR = (Sxy)²/Sxx = 1894²/2474 = 1449.9741
SSE = Total SS − SSR = Syy − (Sxy)²/Sxx = 2056 − 1449.9741 = 606.0259
Source df SS MS F
Regression 1 1449.9741 1449.9741 19.14
Error 8 606.0259 75.7532
Total 9 2056.0000
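The entries in this ANOVA table follow directly from the sums of squares (a sketch, with Sxx, Syy, and Sxy taken from the earlier slides):

```python
# ANOVA decomposition for the calculus example.
Sxx, Syy, Sxy, n = 2474, 2056, 1894, 10

SSR = Sxy ** 2 / Sxx          # regression sum of squares, (Sxy)^2 / Sxx
SSE = Syy - SSR               # error sum of squares: Total SS - SSR
MSR = SSR / 1                 # regression mean square, df = 1
MSE = SSE / (n - 2)           # error mean square, df = n - 2
F = MSR / MSE

print(round(SSR, 4), round(SSE, 4))  # 1449.9741 606.0259
print(round(MSE, 4), round(F, 2))    # 75.7532 19.14
```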
Testing the Usefulness of the Model
•The first question to ask is whether the independent variable
x is of any use in predicting y.
•If it is not, then the value of y does not change, regardless of
the value of x. This implies that the slope of the line, β, is
zero.
H0: β = 0 versus Ha: β ≠ 0
Testing the Usefulness of the Model
• The test statistic is a function of b, our best estimate of β. Using MSE as the best estimate of the random variation σ², we obtain a t statistic.
Test statistic: t = (b − 0)/√(MSE/Sxx), which has a t distribution with df = n − 2,
or a confidence interval: b ± tα/2 √(MSE/Sxx)
The Calculus Problem
• Is there a significant relationship between the calculus grades and the test scores at the 5% level of significance?
H0: β = 0 versus Ha: β ≠ 0
t = (b − 0)/√(MSE/Sxx) = (.7656 − 0)/√(75.7532/2474) = 4.38
Reject H0 when |t| > 2.306. Since t = 4.38 falls into the rejection region, H0 is rejected: there is a significant linear relationship between the calculus grades and the test scores for the population of college freshmen.
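The t statistic can be verified numerically (a sketch; 2.306 is the critical value tα/2 with n − 2 = 8 df quoted on the slide):

```python
import math

# t test of H0: beta = 0 for the calculus example.
b, MSE, Sxx, n = 0.7656, 75.7532, 2474, 10

t = (b - 0) / math.sqrt(MSE / Sxx)
print(round(t, 2))            # 4.38

t_crit = 2.306                # t(.025) with n - 2 = 8 df
print(abs(t) > t_crit)        # True -> reject H0
```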
The F Test
• You can test the overall usefulness of the model using an F
test. If the model is useful, MSR will be large compared to
the unexplained variation, MSE.
Analysis of Variance
Source DF SS MS F P
Regression 1 1450.0 1450.0 19.14 0.002
Residual Error 8 606.0 75.8
Total 9 2056.0
Minitab Output
The least squares regression line, used to test H0: β = 0.

Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x

Predictor   Coef     SE Coef   T      P
Constant    40.784   8.507     4.79   0.001
x           0.7656   0.1750    4.38   0.002

Analysis of Variance
Source          DF   SS       MS       F       P
Regression      1    1450.0   1450.0   19.14   0.002
Residual Error  8    606.0    75.8
Total           9    2056.0

Note the regression coefficients a and b, the MSE, and that t² = F (4.38² ≈ 19.14).
Measuring the Strength of the Relationship
• If the independent variable x is useful in predicting y, you will want to know how well the model fits.
• The strength of the relationship between x and y can be measured using:
Correlation coefficient: r = Sxy/√(Sxx·Syy)
Coefficient of determination: r² = (Sxy)²/(Sxx·Syy) = SSR/Total SS
Measuring the Strength of the Relationship
• Since Total SS = SSR + SSE, r² measures
  • the proportion of the total variation in the responses that can be explained by using the independent variable x in the model;
  • the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean ȳ to estimate y.
• For the calculus problem, r² = SSR/Total SS = .705, or 70.5%. The model is working well!
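For the calculus data, r and r² follow directly from the sums of squares (a sketch using the values computed on the earlier slides):

```python
import math

# Correlation coefficient and coefficient of determination
# for the calculus example.
Sxx, Syy, Sxy = 2474, 2056, 1894

r = Sxy / math.sqrt(Sxx * Syy)
r2 = r ** 2                    # equals SSR / Total SS

print(round(r, 4))             # 0.8398
print(round(r2, 3))            # 0.705
```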
Correlation Analysis
• The strength of the relationship between x and y is measured using the coefficient of correlation:
Correlation coefficient: r = Sxy/√(Sxx·Syy)
[Scatter plot: weight (y, 150–210) versus height (x, 66–75) for college football players, with r = .8261. As the player's height increases, so does his weight.]
Some Correlation Patterns
[Two panels: r = 0, no correlation; r = .931, strong positive correlation]
• Use the Exploring Correlation applet to explore some correlation patterns.
To test H0: ρ = 0 versus Ha: ρ ≠ 0
Test statistic: t = r√((n − 2)/(1 − r²))
Reject H0 if t > tα/2 or t < −tα/2, with n − 2 df.
This test is exactly equivalent to the t-test for the slope, H0: β = 0.
Example
• Is there a significant positive correlation between weight and height in the population of all college football players?
H0: ρ = 0 versus Ha: ρ ≠ 0
r = .8261, n = 10
Test statistic: t = r√((n − 2)/(1 − r²)) = .8261√(8/(1 − .8261²)) = 4.15
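The correlation t test for the weight/height example can be checked the same way (a sketch; r and n are taken from the slide, and 2.306 is the two-tailed 5% critical value with 8 df):

```python
import math

# t test of H0: rho = 0 with r = .8261 and n = 10 players.
r, n = 0.8261, 10

t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))            # 4.15

t_crit = 2.306                # t(.025) with n - 2 = 8 df
print(abs(t) > t_crit)        # True -> a significant correlation
```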