Lecture 14: Regression Analysis: Nonlinear Relationship
A. Introduction
Regression analysis is the statistical method used when both the response variable and the explanatory variable are continuous variables.
The easiest way to know when regression is the appropriate analysis is to ask whether a scatterplot is the appropriate graphic.
From the scatter plot, it is often possible to visualize a smooth curve that approximates the data. Such a curve is called an approximating curve.
If the data appear to be approximated well by a straight line, we say that a linear relationship
exists between the variables.
If a relationship exists between two variables and the relationship is not linear, we call it a
nonlinear relationship.
The general problem of finding equations of approximating curves that fit given sets of data is called curve fitting. The common approximating curves are:

𝑌 = 𝛽0 + 𝛽1𝑋
𝑌 = 𝛽0 + 𝛽1𝑋 + 𝛽2𝑋²
𝑌 = 𝛽0 + 𝛽1𝑋 + 𝛽2𝑋² + 𝛽3𝑋³
𝑌 = 𝛽0 + 𝛽1𝑋 + 𝛽2𝑋² + 𝛽3𝑋³ + 𝛽4𝑋⁴
𝑌 = 𝛽0 + 𝛽1𝑋 + 𝛽2𝑋² + ⋯ + 𝛽𝑛𝑋ⁿ

The right sides of the above equations are called polynomials of the first, second, third, fourth and nth degrees, respectively.
The functions defined by the first four equations are sometimes called linear, quadratic, cubic and quartic functions, respectively.
The simplest type of approximating curve is a straight line, whose equation can be written as
follows:
𝑌 = 𝛽0 + 𝛽1 𝑋
where 𝛽0 is the y-intercept and 𝛽1 is the slope of the line that describes the relationship of Y and X.
Given any two points (𝑋1, 𝑌1) and (𝑋2, 𝑌2) on the line, the slope is

𝛽1 = (𝑌2 − 𝑌1)/(𝑋2 − 𝑋1)
A least-squares regression line (or line of best fit) is the line through the data points (𝑥𝑖, 𝑦𝑖) that has the smallest possible sum of squared deviations from the line.
The method of finding this line is called least-squares estimation.
The sum of squares of deviations can be computed as:

∑(𝑦𝑖 − 𝑦̅)² = ∑𝑦𝑖² − (∑𝑦𝑖)²/𝑛
Similarly, the sum of cross products can be computed as:

∑(𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅) = ∑𝑥𝑖𝑦𝑖 − (∑𝑥𝑖)(∑𝑦𝑖)/𝑛
Therefore, the computational version of 𝛽̂1 = ∑(𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅) / ∑(𝑥𝑖 − 𝑥̅)² becomes:

𝛽̂1 = [∑𝑥𝑖𝑦𝑖 − (∑𝑥𝑖)(∑𝑦𝑖)/𝑛] / [∑𝑥𝑖² − (∑𝑥𝑖)²/𝑛]
From the computational formula, we notice the following quantities (recall the computational formulas for sums of squares from Lecture 4):

a) The corrected sum of squares of the response variable:

𝑆𝑆𝑌 = ∑𝑦𝑖² − (∑𝑦𝑖)²/𝑛
b) The corrected sum of squares of the explanatory variable shown in the denominator:
𝑆𝑆𝑋 = ∑𝑥𝑖² − (∑𝑥𝑖)²/𝑛
c) The corrected sum of products shown in the numerator:
𝑆𝑆𝑋𝑌 = ∑𝑥𝑖𝑦𝑖 − (∑𝑥𝑖)(∑𝑦𝑖)/𝑛
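As a minimal illustration (not part of the original notes), these computational formulas translate directly into Python; the function names here are hypothetical:

def corrected_sums(x, y):
    # 'Famous five': sums, sums of squares, and the sum of cross products
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    ss_x = sum_x2 - sum_x ** 2 / n        # SSX
    ss_y = sum_y2 - sum_y ** 2 / n        # SSY
    ss_xy = sum_xy - sum_x * sum_y / n    # SSXY
    return ss_x, ss_y, ss_xy

def least_squares(x, y):
    # Parameter estimates from the computational formulas
    ss_x, ss_y, ss_xy = corrected_sums(x, y)
    b1 = ss_xy / ss_x                             # slope
    b0 = sum(y) / len(y) - b1 * sum(x) / len(x)   # intercept: ybar - b1 * xbar
    return b0, b1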
Example 13.1. Given the response variable Y and the explanatory variable X shown in the table below, determine the linear regression parameter estimates 𝛽0 and 𝛽1.
X Y
1 6
1 5
3 10
3 14
5 12
5 18
We start out by calculating the ‘famous five’. These are: ∑𝑦𝑖² and ∑𝑦𝑖, ∑𝑥𝑖² and ∑𝑥𝑖, and the sum of cross products, ∑𝑥𝑖𝑦𝑖.
Here is the table showing computations:
        x       y         x²      y²      xy
        1       6         1       36      6
        1       5         1       25      5
        3       10        9       100     30
        3       14        9       196     42
        5       12        25      144     60
        5       18        25      324     90
Total   18      65        70      825     233
Mean    3       10.8333
From the table, we get the following quantities:
𝑆𝑆𝑋 = ∑𝑥𝑖² − (∑𝑥𝑖)²/𝑛 = 70 − (18)²/6 = 16

𝑆𝑆𝑌 = ∑𝑦𝑖² − (∑𝑦𝑖)²/𝑛 = 825 − (65)²/6 = 120.8333

𝑆𝑆𝑋𝑌 = ∑𝑥𝑖𝑦𝑖 − (∑𝑥𝑖)(∑𝑦𝑖)/𝑛 = 233 − (18)(65)/6 = 38

The parameter estimates are therefore:

𝛽̂1 = 𝑆𝑆𝑋𝑌/𝑆𝑆𝑋 = 38/16 = 2.375

𝛽̂0 = 𝑦̅ − 𝛽̂1𝑥̅ = 10.8333 − 2.375(3) = 3.7083
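As a quick check (a sketch, not part of the original notes; numpy is assumed to be available), numpy.polyfit reproduces these estimates:

import numpy as np

x = np.array([1, 1, 3, 3, 5, 5])
y = np.array([6, 5, 10, 14, 12, 18])

# polyfit returns the highest-degree coefficient first: [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # approximately 2.375 and 3.7083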
E. Regression analysis
Now that we have the estimates of the regression line, we need to test if the regression model
fits the data well.
This is done by comparing the mean squares due to regression (𝑀𝑆𝑟𝑒𝑔 ) to the residual mean
squares (𝑀𝑆𝑟𝑒𝑠 ).
The two mean squares, which estimate variances, can be compared through an F-test:
𝐹 = 𝜎²𝑟𝑒𝑔 / 𝜎²𝑟𝑒𝑠
Example 13.2. For the data in Example 13.1, perform an analysis of variance to determine if the
model fits the data well.
Solution.

𝐻0: 𝜎²𝑟𝑒𝑔 / 𝜎²𝑟𝑒𝑠 = 1

𝐻𝑎: 𝜎²𝑟𝑒𝑔 / 𝜎²𝑟𝑒𝑠 > 1
The sums of squares are:

𝑆𝑆𝑡𝑜𝑡𝑎𝑙 = 𝑆𝑆𝑌 = 120.8333

𝑆𝑆𝑟𝑒𝑔 = (𝑆𝑆𝑋𝑌)²/𝑆𝑆𝑋 = (38)²/16 = 90.25

𝑆𝑆𝑟𝑒𝑠 = 𝑆𝑆𝑡𝑜𝑡𝑎𝑙 − 𝑆𝑆𝑟𝑒𝑔 = 120.8333 − 90.25 = 30.5833
i) Two parameters were estimated in the simple linear regression, 𝛽0 and 𝛽1. The regression degrees of freedom are 2 − 1 = 1.
ii) The residual degrees of freedom are the number of observations minus the number of parameters, i.e., 6 − 2 = 4.
The mean squares are calculated by dividing the sums of squares by their degrees of freedom.
The F-test for the model is calculated by dividing the mean squares for regression by the mean squares for residuals.
The following is the ANOVA table for the regression model in Example 13.1:

Source        df     SS          MS         F
Regression    1      90.25       90.25      11.80
Residual      4      30.5833     7.6458
Total         5      120.8333

Since the calculated F of 11.80 is greater than the tabulated F(0.05; 1, 4) of 7.71, we reject the null hypothesis and conclude that the model fits the data well.
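The same ANOVA can be reproduced in a few lines of Python (a sketch, assuming scipy is available for the p-value; the variable names are illustrative):

from scipy import stats

ss_x, ss_xy, ss_total = 16.0, 38.0, 120.8333
ss_reg = ss_xy ** 2 / ss_x                    # 90.25
ss_res = ss_total - ss_reg                    # 30.5833

df_reg, df_res = 1, 4                         # parameters - 1, and n - parameters
ms_reg = ss_reg / df_reg                      # 90.25
ms_res = ss_res / df_res                      # 7.6458
f_stat = ms_reg / ms_res                      # about 11.80
p_value = stats.f.sf(f_stat, df_reg, df_res)  # upper-tail p-value, about 0.026
print(f_stat, p_value)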
The following descriptive statistics can be calculated from the regression analysis. The first is the standard error of the estimate:

𝑆𝑦.𝑥 = √𝑀𝑆𝑟𝑒𝑠
Example 13.3. Calculate the standard error of the estimate for the model in Example 13.2.

Solution. 𝑆𝑦.𝑥 = √𝑀𝑆𝑟𝑒𝑠 = √7.6458 = 2.7651
Since the standard error of estimate is a descriptive statistic, describing the spread of
‘observations’ around the regression line, it behaves like a ‘standard deviation’ and,
technically, should not be called ‘standard error’: this term is generally reserved for
describing the variation of a ‘statistic’.
Some books do not use the term ‘standard error of estimate’ at all, using either ‘square root of the residual variance’ or ‘RMSE’ (root mean square error) instead.
The coefficient of determination shows the proportion of the variation in the response variable that is explained by the variation in the explanatory variable:

𝑅² = 𝑆𝑆𝑟𝑒𝑔/𝑆𝑆𝑡𝑜𝑡𝑎𝑙   or   𝑅² = 1 − 𝑆𝑆𝑟𝑒𝑠/𝑆𝑆𝑡𝑜𝑡𝑎𝑙
Example 13.4. Calculate the coefficient of determination for the model in Example 13.2 and interpret it.

Solution.

𝑅² = 𝑆𝑆𝑟𝑒𝑔/𝑆𝑆𝑡𝑜𝑡𝑎𝑙 = 90.25/120.8333 = 0.7469

or

𝑅² = 1 − 𝑆𝑆𝑟𝑒𝑠/𝑆𝑆𝑡𝑜𝑡𝑎𝑙 = 1 − 30.5833/120.8333 = 0.7469

This means that 74.7% of the variation in Y is explained by the variation in X.
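The same arithmetic in Python (a sketch using the sums of squares computed above):

ss_reg, ss_res, ss_total = 90.25, 30.5833, 120.8333
r_squared = ss_reg / ss_total                      # 0.7469
# The two formulas agree because SS_total = SS_reg + SS_res:
assert abs(r_squared - (1 - ss_res / ss_total)) < 1e-9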
The standard errors of the parameter estimates 𝛽̂0 and 𝛽̂1 are:

𝑆𝛽̂0 = √[(𝑀𝑆𝑟𝑒𝑠/𝑛)(∑𝑥𝑖²/𝑆𝑆𝑋)]

𝑆𝛽̂1 = √(𝑀𝑆𝑟𝑒𝑠/𝑆𝑆𝑋)
Example 13.5. Calculate the standard errors for 𝛽̂0 and 𝛽̂1 in Example 13.1.

Solution.

𝑆𝛽̂0 = √[(𝑀𝑆𝑟𝑒𝑠/𝑛)(∑𝑥𝑖²/𝑆𝑆𝑋)] = √[(7.6458/6)(70/16)] = 2.3612

𝑆𝛽̂1 = √(𝑀𝑆𝑟𝑒𝑠/𝑆𝑆𝑋) = √(7.6458/16) = 0.6913
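A sketch of the same standard-error calculations (the variable names are illustrative):

import math

ms_res, ss_x, n = 7.6458, 16.0, 6
sum_x2 = 70.0                                        # sum of x_i^2 from the table

se_b0 = math.sqrt((ms_res / n) * (sum_x2 / ss_x))    # about 2.3612
se_b1 = math.sqrt(ms_res / ss_x)                     # about 0.6913
print(se_b0, se_b1)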
Example 13.6. Is it reasonable to believe that the unknown population intercept of the Y-versus-X regression model in Example 13.1 is zero (i.e. the line runs through the origin)? Use α = 0.05.
Solution. Since there is no prior knowledge available about the directional preference for the rejection of the null hypothesis, a two-tailed test will be used.
𝐻0 : 𝛽0 = 0
𝐻𝑎 : 𝛽0 ≠ 0
𝛼 = 0.05
We use a t-test with 𝑛 − 2 = 4 degrees of freedom:

𝑡 = (𝛽̂0 − 0)/𝑆𝛽̂0 = (3.708 − 0)/2.3612 = 1.57

𝑡0.025,4 = 2.776
Since the calculated t of 1.57 is less than the tabulated t of 2.776, we fail to reject the null hypothesis. We conclude that the y-intercept is not significantly different from zero at the 5% level of significance.
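A sketch of this two-tailed t-test, with scipy's t distribution supplying the critical value (df = n − 2 = 4):

from scipy import stats

b0_hat, se_b0, df = 3.708, 2.3612, 4
t_stat = (b0_hat - 0) / se_b0              # 1.57
t_crit = stats.t.ppf(1 - 0.05 / 2, df)     # 2.776 for a two-tailed test
print(abs(t_stat) > t_crit)                # False: fail to reject H0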
Example 13.7. Is it reasonable to believe that the unknown population slope in Example 13.1 is
zero (i.e. it is a flat, horizontal line, meaning that there is no relationship between X and Y)? Use
a 0.05 level of significance.
Solution.
𝐻0 : 𝛽1 = 0
𝐻𝑎 : 𝛽1 ≠ 0
𝛼 = 0.05
We use a t-test with 𝑛 − 2 = 4 degrees of freedom:

𝑡 = (𝛽̂1 − 0)/𝑆𝛽̂1 = (2.375 − 0)/0.6913 = 3.4356

𝑡0.025,4 = 2.776
Since the calculated t of 3.44 is greater than the tabulated t of 2.776, we reject the null hypothesis. We conclude that the slope is significantly different from zero at the 5% level of significance.
Example 13.8. Is it reasonable to assume that the unknown population slope is at least 2? In
other words, for every cm increase in X, does Y in Example 13.1 increase by at least 2? Use a
0.05 level of significance.
Solution. Since the question asks whether the slope is at least 2, a one-tailed test will be used.

𝐻0: 𝛽1 = 2

𝐻𝑎: 𝛽1 > 2

𝛼 = 0.05

We use a t-test with 𝑛 − 2 = 4 degrees of freedom:

𝑡 = (𝛽̂1 − 2)/𝑆𝛽̂1 = (2.375 − 2)/0.6913 = 0.5425

𝑡0.05,4 = 2.132
Since the calculated t of 0.5425 is less than the tabulated t of 2.132, we fail to reject the null hypothesis. We conclude that the slope is not significantly greater than 2 at the 5% level of significance.
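The one-tailed version of the test for this example, again as a sketch (scipy assumed available):

from scipy import stats

b1_hat, se_b1, df = 2.375, 0.6913, 4
t_stat = (b1_hat - 2) / se_b1              # 0.5425
t_crit = stats.t.ppf(1 - 0.05, df)         # 2.132 for a one-tailed test
print(t_stat > t_crit)                     # False: fail to reject H0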