6.3 SSK5210 Parametric Statistical Testing - Analysis of Variance LR and Correlation - 2

The document discusses linear regression and correlation analysis. It defines independent and dependent variables, and how to plot them on a scatter plot. It describes how to calculate the correlation coefficient to determine the strength of relationship between two variables. It explains how to perform linear regression to estimate the best-fit line using the method of least squares. It provides an example to illustrate how to calculate sums of squares, regression coefficients, and perform an analysis of variance (ANOVA) to test the significance of the regression model.

SSK5210

Parametric
STATISTICAL TESTING
The Analysis of Variance
Linear Regression & Correlation
Outline
• Scatter Plots and Correlation
• Regression
• Coefficient of Determination and Standard
Introduction
The purpose of this chapter is to answer these questions
statistically:
• Are two or more variables linearly related?
• If so, what is the strength of the relationship?
• What type of relationship exists?
• What kind of predictions can be made from the relationship?
Introduction
• Consider two variables: an independent variable and a dependent variable. For example, suppose we relate the number of hours a student studies to the grade he or she earns on an exam.
• Independent variable: the variable in regression that can be controlled or manipulated.
• Here, the number of hours of study is the independent variable, designated as the x variable. The student can regulate or control the number of hours he or she studies for the exam.
• Dependent variable: the variable in regression that cannot be controlled or manipulated.
• The grade the student received on the exam is the dependent variable, designated as the y variable. The grade the student earns depends on the number of hours the student studied.
• The independent and dependent variables can be plotted on a graph called a scatter plot:
• the independent variable x is plotted on the horizontal axis
• the dependent variable y is plotted on the vertical axis.
Scatter Plot
• A scatter plot is a graph of the ordered pairs (x, y) consisting of the independent variable x and the dependent variable y.
Correlation Coefficient
• The correlation coefficient is used to determine the strength of the linear relationship between two variables.
• The population correlation coefficient denoted by the Greek
letter ρ (rho) is the correlation computed by using all
possible pairs of data values (x, y) taken from a population.
• The linear correlation coefficient computed from the sample
data measures the strength and direction of a linear
relationship between two quantitative variables.
• The symbol for the sample correlation coefficient is r.
Example
• Let y be a student’s college achievement, measured by
his/her GPA. This might be a function of several variables:
• x1 = rank in high school class
• x2 = high school’s overall rating
• x3 = high school GPA
• x4 = SAT scores
• We want to predict y using knowledge of x1, x2, x3 and x4.
Example
• Let y be the monthly sales revenue for a company. This
might be a function of several variables:
• x1 = advertising expenditure
• x2 = time of year
• x3 = state of economy
• x4 = size of inventory
• We want to predict y using knowledge of x1, x2, x3 and x4.
Some Questions
• Which of the independent variables are useful and which are not?
• How could we create a prediction equation to allow us to predict y
using knowledge of x1, x2, x3 etc?
• How good is this prediction?

We start with the simplest case, in which the response y is a function of a single independent variable, x.
A Simple Linear Model
• Equation of a line to describe the relationship between y and x for a
sample of n pairs, (x, y).
• If we want to describe the relationship between y and x for the whole
population, there are two models we can choose

•Deterministic Model: y = α + βx
•Probabilistic Model:
y = deterministic model + random error
y = α + βx + ε
A Simple Linear Model
• Since the bivariate measurements that we observe do not generally fall exactly on a straight line, we choose to use the probabilistic model:
• y = α + βx + ε
• E(y) = α + βx
• Points deviate from the line of means by an amount ε, where ε has a normal distribution with mean 0 and variance σ².
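A short simulation makes the probabilistic model concrete. A minimal Python sketch, assuming illustrative parameter values (α = 40, β = 0.8, σ = 8 and the range of x are not from the slides):

```python
import random

random.seed(0)  # reproducible draws
alpha, beta, sigma = 40.0, 0.8, 8.0  # illustrative parameters (assumed)

# Generate points that deviate from the line of means E(y) = alpha + beta*x
# by a normal random error with mean 0 and standard deviation sigma.
xs = [random.uniform(20, 80) for _ in range(50)]
ys = [alpha + beta * x + random.gauss(0, sigma) for x in xs]

# Residuals from the true line of means should average roughly zero.
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
print(sum(residuals) / len(residuals))
```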
The Random Error
• The line of means, E(y) = α + βx, describes the average value of y for any fixed value of x.
• The population of measurements is generated as y deviates from the population line by ε. We estimate α and β using sample information.
The Method of Least Squares
• The equation of the best-fitting line is calculated using a set of n pairs (xi, yi).
• We choose our estimates a and b to estimate α and β so that the vertical distances of the points from the line are minimized.

Best-fitting line: ŷ = a + bx
Choose a and b to minimize SSE = Σ(y − ŷ)² = Σ(y − a − bx)²
Least Squares Estimators
Calculate the sums of squares:

Sxx = Σx² − (Σx)²/n    Syy = Σy² − (Σy)²/n    Sxy = Σxy − (Σx)(Σy)/n

Best-fitting line: ŷ = a + bx, where b = Sxy/Sxx and a = ȳ − b·x̄
Example
The table shows the math achievement test scores for a random sample of n = 10 college freshmen, along with their final calculus grades.

Student            1   2   3   4   5   6   7   8   9  10
Math test, x      39  43  21  64  57  47  28  75  34  52
Calculus grade, y 65  78  52  82  92  89  73  98  56  75

Use your calculator to find the sums and sums of squares:
Σx = 460, Σy = 760, Σx² = 23634, Σy² = 59816, Σxy = 36854, x̄ = 46, ȳ = 76
Example
Sxx = 23634 − (460)²/10 = 2474
Syy = 59816 − (760)²/10 = 2056
Sxy = 36854 − (460)(760)/10 = 1894

b = 1894/2474 = .76556 and a = 76 − .76556(46) = 40.78

Best-fitting line: ŷ = 40.78 + .77x
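The arithmetic above is easy to check in a few lines of Python; this sketch recomputes the sums of squares and the least squares estimates from the slide's data:

```python
# Math test scores (x) and final calculus grades (y) for the n = 10 freshmen.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)

# Sums of squares as defined on the previous slide.
Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Syy = sum(v * v for v in y) - sum(y) ** 2 / n
Sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n

b = Sxy / Sxx                    # slope estimate
a = sum(y) / n - b * sum(x) / n  # intercept estimate: a = ybar - b*xbar

print(Sxx, Syy, Sxy)             # 2474.0 2056.0 1894.0
print(round(b, 5), round(a, 2))  # 0.76556 40.78
```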
The Analysis of Variance
• The total variation in the experiment is measured by the total sum of squares:
Total SS = Syy = Σ(y − ȳ)²
• The Total SS is divided into two parts:
• SSR (sum of squares for regression): measures the variation explained by using x in the model.
• SSE (sum of squares for error): measures the leftover variation not explained by x.
The Analysis of Variance
We calculate
SSR = (Sxy)²/Sxx = 1894²/2474 = 1449.9741
SSE = Total SS − SSR = Syy − (Sxy)²/Sxx = 2056 − 1449.9741 = 606.0259
The ANOVA Table
Total df = n − 1; Regression df = 1; Error df = n − 1 − 1 = n − 2
Mean squares: MSR = SSR/1, MSE = SSE/(n − 2)

Source      df      SS        MS           F
Regression  1       SSR       SSR/1        MSR/MSE
Error       n − 2   SSE       SSE/(n − 2)
Total       n − 1   Total SS
The Calculus Problem
SSR = (Sxy)²/Sxx = 1894²/2474 = 1449.9741
SSE = Total SS − SSR = Syy − (Sxy)²/Sxx = 2056 − 1449.9741 = 606.0259

Source      df   SS          MS          F
Regression  1    1449.9741   1449.9741   19.14
Error       8    606.0259    75.7532
Total       9    2056.0000
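The ANOVA entries above can be reproduced numerically; a sketch continuing the same example from the summary statistics already computed:

```python
# Summary statistics for the calculus example.
Sxx, Syy, Sxy, n = 2474.0, 2056.0, 1894.0, 10

SSR = Sxy ** 2 / Sxx   # variation explained by the regression
SSE = Syy - SSR        # leftover (error) variation
MSR = SSR / 1          # regression mean square, df = 1
MSE = SSE / (n - 2)    # error mean square, df = n - 2
F = MSR / MSE

print(round(SSR, 4), round(SSE, 4))  # 1449.9741 606.0259
print(round(MSE, 4), round(F, 2))    # 75.7532 19.14
```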
Testing the Usefulness of the Model
• The first question to ask is whether the independent variable x is of any use in predicting y.
• If it is not, then the value of y does not change regardless of the value of x. This implies that the slope of the line, β, is zero.

H₀: β = 0 versus Hₐ: β ≠ 0
Testing the Usefulness of the Model
• The test statistic is a function of b, our best estimate of β. Using MSE as the best estimate of the random variation σ², we obtain a t statistic.

Test statistic: t = (b − 0)/√(MSE/Sxx), which has a t distribution with df = n − 2;
or a confidence interval: b ± t(α/2)·√(MSE/Sxx)
The Calculus Problem
• Is there a significant relationship between the calculus grades and the test scores at the 5% level of significance?

H₀: β = 0 versus Hₐ: β ≠ 0

t = (b − 0)/√(MSE/Sxx) = (.7656 − 0)/√(75.7532/2474) = 4.38

Reject H₀ when |t| > 2.306. Since t = 4.38 falls into the rejection region, H₀ is rejected: there is a significant linear relationship between the calculus grades and the test scores for the population of college freshmen.
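The t test above can be checked directly; a sketch using the slide's estimates (2.306 is the two-tailed 5% critical value with 8 df, taken from the slide):

```python
from math import sqrt

# Estimates from the calculus example.
b, MSE, Sxx, n = 0.76556, 75.7532, 2474.0, 10

# t statistic for H0: beta = 0.
t = (b - 0) / sqrt(MSE / Sxx)
print(round(t, 2))  # 4.38

# Two-tailed rejection region at alpha = .05 with n - 2 = 8 df.
t_crit = 2.306
print(abs(t) > t_crit)  # True -> reject H0
```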
The F Test
• You can test the overall usefulness of the model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE.

To test H₀: the model is useful in predicting y,
Test statistic: F = MSR/MSE
Reject H₀ if F > Fα with 1 and n − 2 df.

This test is exactly equivalent to the t test, with t² = F.
Minitab Output
Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x
Predictor Coef SE Coef T P
Constant 40.784 8.507 4.79 0.001
x 0.7656 0.1750 4.38 0.002

S = 8.70363 R-Sq = 70.5% R-Sq(adj) = 66.8%

Analysis of Variance
Source DF SS MS F P
Regression 1 1450.0 1450.0 19.14 0.002
Residual Error 8 606.0 75.8
Total 9 2056.0
Reading the Minitab output above: the regression equation gives the least squares regression line; the Coef column gives the regression coefficients a and b; the T value for x tests H₀: β = 0; S² = MSE; and t² = 4.38² ≈ 19.14 = F.
Measuring the Strength of the Relationship
• If the independent variable x is useful in predicting y, you will want to know how well the model fits.
• The strength of the relationship between x and y can be measured using:

Correlation coefficient: r = Sxy/√(Sxx·Syy)
Coefficient of determination: r² = (Sxy)²/(Sxx·Syy) = SSR/Total SS
Measuring the Strength of the Relationship
• Since Total SS = SSR + SSE, r² measures
• the proportion of the total variation in the responses that can be explained by using the independent variable x in the model;
• the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean ȳ to estimate y.

For the calculus problem, r² = SSR/Total SS = .705, or 70.5%. The model is working well!
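A quick check of r² (and the corresponding r) for the calculus data, sketched from the summary statistics:

```python
from math import sqrt

# Summary statistics for the calculus example.
Sxx, Syy, Sxy = 2474.0, 2056.0, 1894.0

r = Sxy / sqrt(Sxx * Syy)    # correlation coefficient
r2 = Sxy ** 2 / (Sxx * Syy)  # coefficient of determination = SSR / Total SS

print(round(r, 4))   # 0.8398
print(round(r2, 3))  # 0.705 -> 70.5% of the variation explained
```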
Correlation Analysis
• The strength of the relationship between x and y is measured using the coefficient of correlation:

r = Sxy/√(Sxx·Syy)

• Recall from Chapter 3 that
(1) −1 ≤ r ≤ 1
(2) r and b have the same sign
(3) r ≈ 0 means no linear relationship
(4) r ≈ 1 or −1 means a strong positive or negative relationship
Example
The table shows the heights and weights of n = 10 randomly selected college football players.

Player     1    2    3    4    5    6    7    8    9   10
Height, x  73   71   75   72   72   75   67   69   71   69
Weight, y  185  175  200  210  190  195  150  170  180  175

Use your calculator to find the sums and sums of squares:
Sxy = 328, Sxx = 60.4, Syy = 2610

r = 328/√((60.4)(2610)) = .8261
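The same computation in Python, a sketch from the summary statistics on the slide:

```python
from math import sqrt

# Summary statistics for the football players' heights and weights.
Sxy, Sxx, Syy = 328.0, 60.4, 2610.0

r = Sxy / sqrt(Sxx * Syy)
print(round(r, 4))  # 0.8261
```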
Football Players

[Scatterplot of Weight vs Height; r = .8261: strong positive correlation.]

As the player's height increases, so does his weight.
Some Correlation Patterns
• Use the Exploring Correlation applet to explore some correlation patterns:
• r = 0: no correlation
• r = .931: strong positive correlation
• r = 1: linear relationship
• r = −.67: weaker negative correlation
Inference using r
• The population coefficient of correlation is called ρ ("rho"). We can test for a significant correlation between x and y.

To test H₀: ρ = 0 versus Hₐ: ρ ≠ 0,
Test statistic: t = r·√((n − 2)/(1 − r²))
Reject H₀ if t > t(α/2) or t < −t(α/2) with n − 2 df.

This test is exactly equivalent to the t test for the slope, β = 0.
Example
• Is there a significant positive correlation between weight and height in the population of all college football players?

H₀: ρ = 0 versus Hₐ: ρ ≠ 0, with r = .8261 and n = 10.

Test statistic: t = r·√((n − 2)/(1 − r²)) = .8261·√(8/(1 − .8261²)) = 4.15

Use the t-table with n − 2 = 8 df to bound the p-value as p-value < .005. There is a significant positive correlation.
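The test statistic above, recomputed in a short Python sketch:

```python
from math import sqrt

r, n = 0.8261, 10  # sample correlation and sample size from the example

# t statistic for H0: rho = 0, with n - 2 = 8 df.
t = r * sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))  # 4.15
```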
Key Concepts
I. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the appropriate model is y = α + βx + ε.
2. The random error ε has a normal distribution with mean 0 and variance σ².
II. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to minimize SSE, the sum of the squared deviations about the regression line.
2. The least squares estimates are b = Sxy/Sxx and a = ȳ − b·x̄, giving ŷ = a + bx.
Key Concepts
III. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = Syy and SSR = (Sxy)²/Sxx.
2. The best estimate of σ² is MSE = SSE/(n − 2).
IV. Testing, Estimation, and Prediction
1. A test for the significance of the linear regression, H₀: β = 0, can be implemented using one of two test statistics:

t = b/√(MSE/Sxx)    or    F = MSR/MSE
Key Concepts
2. The strength of the relationship between x and y can be measured using

r² = SSR/Total SS

which gets closer to 1 as the relationship gets stronger.
Key Concepts
V. Correlation Analysis
1. Use the correlation coefficient to measure the relationship between x and y when both variables are random:

r = Sxy/√(Sxx·Syy)

2. The sign of r indicates the direction of the relationship; r near 0 indicates no linear relationship, and r near 1 or −1 indicates a strong linear relationship.
3. A test of the significance of the correlation coefficient is identical to the
test of the slope β.
TERIMA KASIH / THANK YOU
