Regression Models - Follow
Regression is an approach for modeling the relationship between a quantitative dependent variable Y and one or more explanatory variables (or independent variables), represented by X(s). The case of one explanatory variable is called simple linear regression.
The main purposes of regression analysis are to understand the relationship between or among variables and to predict one variable based on the other(s).
At the completion of the Spring 2023 semester, students will be able to:
4.1 – Identify variables, visualize them using a scatter diagram, and use them in a regression model
4.2 – Develop simple linear regression equations from collected data and interpret the slope and intercept
4.3 – Compute the coefficient of determination and the coefficient of correlation and interpret their meanings
4.4 – List assumptions used in regression and use a residual plot to identify problems
4.7 – Develop a multiple regression model using Excel and use it for prediction
We are going to use the example below to go through learning objectives 4.1 to 4.6.
A cafeteria at a local college would like to come up with a regression model that would predict what a student would spend for lunch based on what they spent for breakfast. The cafeteria collected data from randomly selected students, and the results are shown below:
Speculation: There seems to be a negative linear relationship between what a given student spends for breakfast and lunch. For this scatter plot, $ breakfast is our input, independent, or explanatory variable, whereas $ lunch is our output, dependent, or response variable.
4.2 – Developing a simple linear model
The estimated regression line is Ŷ = b0 + b1X, where b0 and b1 are the estimated values of the intercept and the slope, chosen so that the errors are at a minimum. Note that errors still exist and can be computed as E = (actual value Y) – (predicted value Ŷ).
Here, the least-squares estimates are:
b1 = Σ(X – X̄)(Y – Ȳ) / Σ(X – X̄)²
b0 = Ȳ – b1X̄
The best way to do this is to develop a table (you can use Excel):
X     Y     (X–X̄)²           (X–X̄)(Y–Ȳ)
5     12    (5–8)²  =  9     (5–8)(12–7)  = –15
6     11    (6–8)²  =  4     (6–8)(11–7)  =  –8
7     9     (7–8)²  =  1     (7–8)(9–7)   =  –2
7     8     (7–8)²  =  1     (7–8)(8–7)   =  –1
9     4     (9–8)²  =  1     (9–8)(4–7)   =  –3
10    3     (10–8)² =  4     (10–8)(3–7)  =  –8
12    2     (12–8)² = 16     (12–8)(2–7)  = –20
Sum:  ΣX = 56, ΣY = 49, Σ(X–X̄)² = 36, Σ(X–X̄)(Y–Ȳ) = –57
X̄ = 56/7 = 8
Ȳ = 49/7 = 7
b1 = –57/36 = –1.58
b0 = 7 – (–1.58)(8) = 19.67
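The table computation above can be sketched in a few lines of plain Python (the data and column sums are the ones from the table; no library is needed):

```python
# Least-squares slope and intercept for the cafeteria example.
xs = [5, 6, 7, 7, 9, 10, 12]   # $ spent on breakfast (X)
ys = [12, 11, 9, 8, 4, 3, 2]   # $ spent on lunch (Y)

n = len(xs)
x_bar = sum(xs) / n            # 56/7 = 8
y_bar = sum(ys) / n            # 49/7 = 7

sxx = sum((x - x_bar) ** 2 for x in xs)                       # Σ(X-X̄)² = 36
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # Σ(X-X̄)(Y-Ȳ) = -57

b1 = sxy / sxx           # slope: -57/36 ≈ -1.58
b0 = y_bar - b1 * x_bar  # intercept: 7 - (-1.58)(8) ≈ 19.67

print(round(b1, 2), round(b0, 2))  # -1.58 19.67
```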
To know for a fact that the model developed is good enough to be used for prediction, we must start by computing the coefficient of determination (R²) and the coefficient of correlation (r).
The coefficient of determination (represented by R²) gives the proportion of the variation in the dependent variable (Y) that is predictable from the regression with the independent variable (X).
Interpretation: About 94% of the variation in Y (money spent on lunch) can be explained by the regression with X (money spent on breakfast). The remaining 6% is due to other factors (that is, to error).
Coefficient of correlation
The quantity r, called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables.
r = ±√R². Important: r has the same sign as the slope (b1) of the regression line.
The value of r is such that –1 ≤ r ≤ +1. The + and – signs are used for positive linear correlations and negative linear correlations, respectively.
Strength scale: |r| above 0.7 is strong, between 0.5 and 0.7 is moderate, and below 0.5 is weak; the sign gives the direction (negative toward –1, positive toward +1).
In this case, r = –√0.936 = –0.967.
This is negative because the slope of the line of regression is also negative.
Interpretation: there is a strong negative linear relationship between the amount of money spent on breakfast and
the amount of money spent on lunch.
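A minimal sketch of the R² and r computation, using R² = 1 – SSE/SST with the exact (unrounded) slope and intercept. Note the exact arithmetic gives R² ≈ 0.940, slightly higher than the 0.936 above, which comes from rounding b1 to –1.58 first; the conclusion is the same either way:

```python
import math

xs = [5, 6, 7, 7, 9, 10, 12]
ys = [12, 11, 9, 8, 4, 3, 2]
b1 = -57 / 36                  # exact slope from the table sums
b0 = 7 - b1 * 8                # exact intercept

y_bar = sum(ys) / len(ys)
sst = sum((y - y_bar) ** 2 for y in ys)                       # total variation
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))   # unexplained variation

r2 = 1 - sse / sst     # coefficient of determination, ≈ 0.94
r = -math.sqrt(r2)     # negative because the slope b1 is negative

print(round(r2, 2), round(r, 2))  # 0.94 -0.97
```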
We stated earlier that the linear regression model comes with errors because we are not dealing with a perfectly aligned set of points. In other terms, the SSE is not always equal to 0, and R² is not always 100%. Therefore, we have to make some assumptions about the errors in the regression model so that we can test it for significance. We must make the following assumptions about the errors: they are independent, they are normally distributed, they have a mean of zero, and they have constant variance.
When the assumptions are met, a plot of the errors against the independent variable should appear to be random.
In our example, we are going to plot X against the residual (Y – Ŷ) and check for randomness.
X     Y     Ŷ = 19.67 – 1.58X    Residual (Y – Ŷ)
5     12    11.77                 0.23
6     11    10.19                 0.81
7     9     8.61                  0.39
7     8     8.61                 –0.61
9     4     5.45                 –1.45
10    3     3.87                 –0.87
12    2     0.71                  1.29
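The residual table can be reproduced with a short loop, using the rounded coefficients exactly as in the table:

```python
xs = [5, 6, 7, 7, 9, 10, 12]
ys = [12, 11, 9, 8, 4, 3, 2]
b0, b1 = 19.67, -1.58   # rounded coefficients, as in the table above

preds = [b0 + b1 * x for x in xs]               # Ŷ for each X
residuals = [y - p for y, p in zip(ys, preds)]  # Y - Ŷ

for x, y, p, e in zip(xs, ys, preds, residuals):
    print(x, y, round(p, 2), round(e, 2))
```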
Using Excel:
[Residual plot: Residual (Y – Ŷ), from –2 to 1.5 on the vertical axis, plotted against X, from 4 to 13 on the horizontal axis. The points show no obvious pattern.]
We can see that the scatter plot appears to be random. You can use Figures 4.4A, 4.4B, and 4.4C on page 118 to check for the likelihood of randomness; we want the residual plot to look like Figure 4.4A.
The next step is to estimate the variance.
While the errors are assumed to have constant variance (σ²), that variance can only be estimated once a sample is collected. The mean squared error (MSE, or s²) is a good estimate of the population variance σ²:
s² = MSE = SSE/(n – k – 1), where n is the number of observations (pairs of points) and k is the number of independent variables.
From the sample variance, we can estimate the standard deviation by taking the square root of s².
Here, s = √1.15 = 1.07. This is also called the standard error of the estimate, or the standard deviation of the regression.
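The MSE and standard error computation can be sketched as follows (same data and rounded coefficients as the residual table):

```python
import math

xs = [5, 6, 7, 7, 9, 10, 12]
ys = [12, 11, 9, 8, 4, 3, 2]
b0, b1 = 19.67, -1.58
n, k = len(xs), 1       # 7 observations, 1 independent variable

# Sum of squared errors from the fitted line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

mse = sse / (n - k - 1)  # s², estimate of the error variance σ²
s = math.sqrt(mse)       # standard error of the estimate

print(round(mse, 2), round(s, 2))  # 1.15 1.07
```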
1. Determine the null hypothesis (H0) and the alternative hypothesis (H1). This is always H0: β1 = 0 (there is no linear relationship) versus H1: β1 ≠ 0 (the model is significant).
2. Select the level of significance α and find the critical value of F on the F table:
Locate df1, the degrees of freedom of the numerator (entry column on the F table). df1 is the number of independent variables, k.
Locate df2, the degrees of freedom of the denominator (entry row on the F table). The value of df2 is given by n – k – 1 (sample size – number of independent variables – 1).
The critical value of F, or F-critical(df1, df2), is the number located at the junction of the identified entry column (df1) and entry row (df2).
3. Compute the calculated value of F. For our course, we will read that value on the regression summary output.
4. Reject H0 if F-calculated is greater than F-critical (on the F table), and interpret the finding.
For our example:
Step 1: H0: β1 = 0; H1: β1 ≠ 0.
Step 2: α = 0.05, df1 = k = 1, df2 = n – k – 1 = 7 – 1 – 1 = 5.
Step 3: From the regression summary output, MSE = 1.15 and F-calculated = 78.1536.
Step 4: Decision: reject H0 if the test statistic is greater than F-critical (from the F table). We go to the F table in Appendix D, look for α = 0.05 (the first F distribution table), and go to the first column and fifth row: F0.05, 1, 5 = 6.61.
Here, since F-calculated of 78.1536 is greater than F-critical of 6.61, we are going to reject H0. Therefore, the regression model is significant. That is, predictions generated by the linear model ŷ = 19.67 – 1.58X will be reliable.
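The F test can be sketched as below. The F-critical value 6.61 is the table value quoted above; computing F from the rounded coefficients gives roughly 78.4, close to (but not exactly) the 78.1536 read from the regression output, because of rounding along the way:

```python
xs = [5, 6, 7, 7, 9, 10, 12]
ys = [12, 11, 9, 8, 4, 3, 2]
b0, b1 = 19.67, -1.58
n, k = len(xs), 1

y_bar = sum(ys) / n
sst = sum((y - y_bar) ** 2 for y in ys)                       # total variation
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))   # error variation
ssr = sst - sse                                               # explained variation

msr = ssr / k            # mean square due to regression
mse = sse / (n - k - 1)  # mean squared error
f_calc = msr / mse       # F test statistic, ≈ 78

f_critical = 6.61        # F(0.05; df1=1, df2=5) from the F table
print(f_calc > f_critical)  # True -> reject H0: the model is significant
```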
Try this
Additional example. Given the following pairs of points:
X:  4   5   6   8   10
Y:  13  16  8   3   2
a. Draw a scatter diagram and speculate on the linear relationship between X and Y.
b. Find the equation of the regression line.
c. Compute the coefficient of determination and explain what it means.
d. Find the coefficient of correlation and determine the strength of the relationship between X and Y.
e. Is the linear relationship significant? Use α = 0.05 to test this hypothesis for significance.