6.3 SSK5210 Parametric Statistical Testing - Analysis of Variance LR and Correlation - 2

The document discusses linear regression and correlation analysis. It defines independent and dependent variables, and how to plot them on a scatter plot. It describes how to calculate the correlation coefficient to determine the strength of relationship between two variables. It explains how to perform linear regression to estimate the best-fit line using the method of least squares. It provides an example to illustrate how to calculate sums of squares, regression coefficients, and perform an analysis of variance (ANOVA) to test the significance of the regression model.

SSK5210

Parametric
STATISTICAL TESTING
The Analysis of Variance
Linear Regression & Correlation
Outline
• Scatter Plots and Correlation
• Regression
• Coefficient of Determination and Standard
Introduction
The purpose of this chapter is to answer these questions
statistically:
• Are two or more variables linearly related?
• If so, what is the strength of the relationship?
• What type of relationship exists?
• What kind of predictions can be made from the relationship?
Introduction
• Consider two variables: an independent variable and a dependent variable. For example, suppose we relate the number of hours a student studies to the grade he or she earns on an exam.
• Independent variable: the variable in regression that can be controlled or manipulated.
• Here, the number of hours of study is the independent variable, designated as the x variable. The student can regulate or control the number of hours he or she studies for the exam.
• Dependent variable: the variable in regression that cannot be controlled or manipulated.
• The grade the student received on the exam is the dependent variable, designated as the y variable. The grade the student earns depends on the number of hours the student studied.
• The independent and dependent variables can be plotted on a graph called a scatter plot:
• the independent variable x is plotted on the horizontal axis
• the dependent variable y is plotted on the vertical axis.
Scatter Plot
• A scatter plot is a graph of the ordered pairs (x, y) consisting of the independent variable x and the dependent variable y.
Correlation Coefficient
• The correlation coefficient is used to determine the strength of the linear relationship between two variables.
• The population correlation coefficient denoted by the Greek
letter ρ (rho) is the correlation computed by using all
possible pairs of data values (x, y) taken from a population.
• The linear correlation coefficient computed from the sample
data measures the strength and direction of a linear
relationship between two quantitative variables.
• The symbol for the sample correlation coefficient is r.
Example
• Let y be a student’s college achievement, measured by
his/her GPA. This might be a function of several variables:
• x1 = rank in high school class
• x2 = high school’s overall rating
• x3 = high school GPA
• x4 = SAT scores
• We want to predict y using knowledge of x1, x2, x3 and x4.
Example
• Let y be the monthly sales revenue for a company. This
might be a function of several variables:
• x1 = advertising expenditure
• x2 = time of year
• x3 = state of economy
• x4 = size of inventory
• We want to predict y using knowledge of x1, x2, x3 and x4.
Some Questions
• Which of the independent variables are useful and which are not?
• How could we create a prediction equation to allow us to predict y
using knowledge of x1, x2, x3 etc?
• How good is this prediction?

We start with the simplest case, in which the response y is a function of a single independent variable, x.
A Simple Linear Model
• Equation of a line to describe the relationship between y and x for a
sample of n pairs, (x, y).
• If we want to describe the relationship between y and x for the whole
population, there are two models we can choose

•Deterministic Model: y = α + βx
•Probabilistic Model:
y = deterministic model + random error
y = α + βx + ε
A Simple Linear Model
• Since the bivariate measurements that we observe do not generally fall exactly on a straight line, we choose to use the probabilistic model:
• y = α + βx + ε
• E(y) = α + βx
• Points deviate from the line of means by an amount ε, where ε has a normal distribution with mean 0 and variance σ².
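A short simulation makes the probabilistic model concrete. A minimal Python sketch, assuming illustrative parameter values (α = 40, β = 0.8, σ = 8 and the range of x are not from the slides):

```python
import random

random.seed(0)  # reproducible draws
alpha, beta, sigma = 40.0, 0.8, 8.0  # illustrative parameters (assumed)

# Generate points that deviate from the line of means E(y) = alpha + beta*x
# by a normal random error with mean 0 and standard deviation sigma.
xs = [random.uniform(20, 80) for _ in range(50)]
ys = [alpha + beta * x + random.gauss(0, sigma) for x in xs]

# Residuals from the true line of means should average roughly zero.
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
print(sum(residuals) / len(residuals))
```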
The Random Error
• The line of means, E(y) = α + βx, describes the average value of y for any fixed value of x.
• The population of measurements is generated as y deviates from the population line by ε. We estimate α and β using sample information.
The Method of Least Squares
• The equation of the best-fitting line is calculated using a set of n pairs (xi, yi).
• We choose our estimates a and b to estimate α and β so that the vertical distances of the points from the line are minimized.

Best-fitting line: ŷ = a + bx
Choose a and b to minimize SSE = Σ(y − ŷ)² = Σ(y − a − bx)²
Least Squares Estimators
Calculate the sums of squares:

Sxx = Σx² − (Σx)²/n    Syy = Σy² − (Σy)²/n    Sxy = Σxy − (Σx)(Σy)/n

Best-fitting line: ŷ = a + bx, where b = Sxy/Sxx and a = ȳ − b·x̄
Example
The table shows the math achievement test scores for a random sample of n = 10 college freshmen, along with their final calculus grades.

Student            1   2   3   4   5   6   7   8   9  10
Math test, x      39  43  21  64  57  47  28  75  34  52
Calculus grade, y 65  78  52  82  92  89  73  98  56  75

Use your calculator to find the sums and sums of squares:
Σx = 460, Σy = 760, Σx² = 23634, Σy² = 59816, Σxy = 36854, x̄ = 46, ȳ = 76
Example
Sxx = 23634 − (460)²/10 = 2474
Syy = 59816 − (760)²/10 = 2056
Sxy = 36854 − (460)(760)/10 = 1894

b = 1894/2474 = .76556 and a = 76 − .76556(46) = 40.78

Best-fitting line: ŷ = 40.78 + .77x
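The arithmetic above is easy to check in a few lines of Python; this sketch recomputes the sums of squares and the least squares estimates from the slide's data:

```python
# Math test scores (x) and final calculus grades (y) for the n = 10 freshmen.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)

# Sums of squares as defined on the previous slide.
Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Syy = sum(v * v for v in y) - sum(y) ** 2 / n
Sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n

b = Sxy / Sxx                    # slope estimate
a = sum(y) / n - b * sum(x) / n  # intercept estimate: a = ybar - b*xbar

print(Sxx, Syy, Sxy)             # 2474.0 2056.0 1894.0
print(round(b, 5), round(a, 2))  # 0.76556 40.78
```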
The Analysis of Variance
• The total variation in the experiment is measured by the total sum of squares:
Total SS = Syy = Σ(y − ȳ)²
• The Total SS is divided into two parts:
• SSR (sum of squares for regression): measures the variation explained by using x in the model.
• SSE (sum of squares for error): measures the leftover variation not explained by x.
The Analysis of Variance
We calculate
SSR = (Sxy)²/Sxx = 1894²/2474 = 1449.9741
SSE = Total SS − SSR = Syy − (Sxy)²/Sxx = 2056 − 1449.9741 = 606.0259
The ANOVA Table
Total df = n − 1; Regression df = 1; Error df = n − 1 − 1 = n − 2
Mean squares: MSR = SSR/1, MSE = SSE/(n − 2)

Source      df      SS        MS           F
Regression  1       SSR       SSR/1        MSR/MSE
Error       n − 2   SSE       SSE/(n − 2)
Total       n − 1   Total SS
The Calculus Problem
SSR = (Sxy)²/Sxx = 1894²/2474 = 1449.9741
SSE = Total SS − SSR = Syy − (Sxy)²/Sxx = 2056 − 1449.9741 = 606.0259

Source      df   SS          MS          F
Regression  1    1449.9741   1449.9741   19.14
Error       8    606.0259    75.7532
Total       9    2056.0000
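The ANOVA entries above can be reproduced numerically; a sketch continuing the same example from the summary statistics already computed:

```python
# Summary statistics for the calculus example.
Sxx, Syy, Sxy, n = 2474.0, 2056.0, 1894.0, 10

SSR = Sxy ** 2 / Sxx   # variation explained by the regression
SSE = Syy - SSR        # leftover (error) variation
MSR = SSR / 1          # regression mean square, df = 1
MSE = SSE / (n - 2)    # error mean square, df = n - 2
F = MSR / MSE

print(round(SSR, 4), round(SSE, 4))  # 1449.9741 606.0259
print(round(MSE, 4), round(F, 2))    # 75.7532 19.14
```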
Testing the Usefulness of the Model
• The first question to ask is whether the independent variable x is of any use in predicting y.
• If it is not, then the value of y does not change regardless of the value of x. This implies that the slope of the line, β, is zero.

H₀: β = 0 versus Hₐ: β ≠ 0
Testing the Usefulness of the Model
• The test statistic is a function of b, our best estimate of β. Using MSE as the best estimate of the random variation σ², we obtain a t statistic.

Test statistic: t = (b − 0)/√(MSE/Sxx), which has a t distribution with df = n − 2;
or a confidence interval: b ± t(α/2)·√(MSE/Sxx)
The Calculus Problem
• Is there a significant relationship between the calculus grades and the test scores at the 5% level of significance?

H₀: β = 0 versus Hₐ: β ≠ 0

t = (b − 0)/√(MSE/Sxx) = (.7656 − 0)/√(75.7532/2474) = 4.38

Reject H₀ when |t| > 2.306. Since t = 4.38 falls into the rejection region, H₀ is rejected: there is a significant linear relationship between the calculus grades and the test scores for the population of college freshmen.
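The t test above can be checked directly; a sketch using the slide's estimates (2.306 is the two-tailed 5% critical value with 8 df, taken from the slide):

```python
from math import sqrt

# Estimates from the calculus example.
b, MSE, Sxx, n = 0.76556, 75.7532, 2474.0, 10

# t statistic for H0: beta = 0.
t = (b - 0) / sqrt(MSE / Sxx)
print(round(t, 2))  # 4.38

# Two-tailed rejection region at alpha = .05 with n - 2 = 8 df.
t_crit = 2.306
print(abs(t) > t_crit)  # True -> reject H0
```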
The F Test
• You can test the overall usefulness of the model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE.

To test H₀: the model is useful in predicting y,
Test statistic: F = MSR/MSE
Reject H₀ if F > Fα with 1 and n − 2 df.

This test is exactly equivalent to the t test, with t² = F.
Minitab Output
Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x
Predictor Coef SE Coef T P
Constant 40.784 8.507 4.79 0.001
x 0.7656 0.1750 4.38 0.002

S = 8.70363 R-Sq = 70.5% R-Sq(adj) = 66.8%

Analysis of Variance
Source DF SS MS F P
Regression 1 1450.0 1450.0 19.14 0.002
Residual Error 8 606.0 75.8
Total 9 2056.0
Reading the Minitab output above: the regression equation gives the least squares regression line; the Coef column gives the regression coefficients a and b; the T value for x tests H₀: β = 0; S² = MSE; and t² = 4.38² ≈ 19.14 = F.
Measuring the Strength of the Relationship
• If the independent variable x is useful in predicting y, you will want to know how well the model fits.
• The strength of the relationship between x and y can be measured using:

Correlation coefficient: r = Sxy/√(Sxx·Syy)
Coefficient of determination: r² = (Sxy)²/(Sxx·Syy) = SSR/Total SS
Measuring the Strength of the Relationship
• Since Total SS = SSR + SSE, r² measures
• the proportion of the total variation in the responses that can be explained by using the independent variable x in the model;
• the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean ȳ to estimate y.

For the calculus problem, r² = SSR/Total SS = .705, or 70.5%. The model is working well!
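A quick check of r² (and the corresponding r) for the calculus data, sketched from the summary statistics:

```python
from math import sqrt

# Summary statistics for the calculus example.
Sxx, Syy, Sxy = 2474.0, 2056.0, 1894.0

r = Sxy / sqrt(Sxx * Syy)    # correlation coefficient
r2 = Sxy ** 2 / (Sxx * Syy)  # coefficient of determination = SSR / Total SS

print(round(r, 4))   # 0.8398
print(round(r2, 3))  # 0.705 -> 70.5% of the variation explained
```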
Correlation Analysis
• The strength of the relationship between x and y is measured using the coefficient of correlation:

r = Sxy/√(Sxx·Syy)

• Recall from Chapter 3 that
(1) −1 ≤ r ≤ 1
(2) r and b have the same sign
(3) r ≈ 0 means no linear relationship
(4) r ≈ 1 or −1 means a strong positive or negative relationship
Example
The table shows the heights and weights of n = 10 randomly selected college football players.

Player     1    2    3    4    5    6    7    8    9   10
Height, x  73   71   75   72   72   75   67   69   71   69
Weight, y  185  175  200  210  190  195  150  170  180  175

Use your calculator to find the sums and sums of squares:
Sxy = 328, Sxx = 60.4, Syy = 2610

r = 328/√((60.4)(2610)) = .8261
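The same computation in Python, a sketch from the summary statistics on the slide:

```python
from math import sqrt

# Summary statistics for the football players' heights and weights.
Sxy, Sxx, Syy = 328.0, 60.4, 2610.0

r = Sxy / sqrt(Sxx * Syy)
print(round(r, 4))  # 0.8261
```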
Football Players

[Scatterplot of Weight vs Height; r = .8261: strong positive correlation.]

As the player's height increases, so does his weight.
Some Correlation Patterns
• Use the Exploring Correlation applet to explore some correlation patterns:
• r = 0: no correlation
• r = .931: strong positive correlation
• r = 1: linear relationship
• r = −.67: weaker negative correlation
Inference using r
• The population coefficient of correlation is called ρ ("rho"). We can test for a significant correlation between x and y.

To test H₀: ρ = 0 versus Hₐ: ρ ≠ 0,
Test statistic: t = r·√((n − 2)/(1 − r²))
Reject H₀ if t > t(α/2) or t < −t(α/2) with n − 2 df.

This test is exactly equivalent to the t test for the slope, β = 0.
Example
• Is there a significant positive correlation between weight and height in the population of all college football players?

H₀: ρ = 0 versus Hₐ: ρ ≠ 0, with r = .8261 and n = 10.

Test statistic: t = r·√((n − 2)/(1 − r²)) = .8261·√(8/(1 − .8261²)) = 4.15

Use the t-table with n − 2 = 8 df to bound the p-value as p-value < .005. There is a significant positive correlation.
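The test statistic above, recomputed in a short Python sketch:

```python
from math import sqrt

r, n = 0.8261, 10  # sample correlation and sample size from the example

# t statistic for H0: rho = 0, with n - 2 = 8 df.
t = r * sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))  # 4.15
```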
Key Concepts
I. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the appropriate model is y = α + βx + ε.
2. The random error ε has a normal distribution with mean 0 and variance σ².
II. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to minimize SSE, the sum of the squared deviations about the regression line.
2. The least squares estimates are b = Sxy/Sxx and a = ȳ − b·x̄, giving ŷ = a + bx.
Key Concepts
III. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = Syy and SSR = (Sxy)²/Sxx.
2. The best estimate of σ² is MSE = SSE/(n − 2).
IV. Testing, Estimation, and Prediction
1. A test for the significance of the linear regression, H₀: β = 0, can be implemented using one of two test statistics:

t = b/√(MSE/Sxx)    or    F = MSR/MSE
Key Concepts
2. The strength of the relationship between x and y can be measured using

r² = SSR/Total SS

which gets closer to 1 as the relationship gets stronger.
Key Concepts
V. Correlation Analysis
1. Use the correlation coefficient to measure the relationship between x and y when both variables are random:

r = Sxy/√(Sxx·Syy)

2. The sign of r indicates the direction of the relationship; r near 0 indicates no linear relationship, and r near 1 or −1 indicates a strong linear relationship.
3. A test of the significance of the correlation coefficient is identical to the
test of the slope β.
TERIMA KASIH / THANK YOU
