Simple Linear Regression Part 1
Ms RU Cruz
Overview
1. Pearson's Correlation (r)
2. Simple Regression Model
3. Least Squares Method
4. Coefficient of Determination
5. Model Assumptions
6. Testing of Significance
7. Residual Analysis
Decisions, Decisions, Decisions
Managerial decisions are often based on the relationship between two or more variables.
Example: advertising expenditures relative to sales. A marketing manager may need to predict sales for a given level of advertising expenditure.
Sometimes a manager will rely on intuition to judge how two variables are related. Pattern recognition and data analysis provide a more reliable framework for making such judgments.
Correlation
A quantitative relationship between two interval- or ratio-level variables.

Explanatory (Independent) Variable x    Response (Dependent) Variable y
Hours of Training                       Number of Accidents
Height                                  IQ
Example data: x = number of absences, y = final grade

x (Absences):     8    2    5   12   15    9    6
y (Final Grade): 78   92   90   58   43   74   81

[Scatter plot: Final Grade against Absences, 0 to 16 absences on the horizontal axis]

Compute for r.
Computation of r
 i     x     y      xy     x²      y²
 1     8    78     624     64    6084
 2     2    92     184      4    8464
 3     5    90     450     25    8100
 4    12    58     696    144    3364
 5    15    43     645    225    1849
 6     9    74     666     81    5476
 7     6    81     486     36    6561
Sum   57   516    3751    579   39898
r value:
r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
  = [7(3751) − (57)(516)] / √{[7(579) − 57²][7(39898) − 516²]}
  ≈ −0.975
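As a quick check, the computational formula above can be coded directly. The short Python sketch below is my own illustration (not part of the slides) and uses only the column totals from the table.

```python
# Pearson's r from the summary sums in the computation table.
import math

n = 7
sum_x, sum_y = 57, 516
sum_xy, sum_x2, sum_y2 = 3751, 579, 39898

numerator = n * sum_xy - sum_x * sum_y                   # 7(3751) - 57(516)
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                        (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(round(r, 3))  # -0.975: a strong negative linear correlation
```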
Practice: compute r for the following husband and wife ages.

Husband (x): 36  72  37  36  51  50  47  50  37  41
Wife (y):    35  67  33  35  50  46  47  42  36  41
Regression Terminologies
Regression analysis can be used to develop an equation showing how the variables are related.
Dependent variable: the variable being predicted or explained by the regression model. Denoted by y.
Independent variable: the variable that does the predicting or explaining. Denoted by x.
Simple linear regression: involves one independent variable and one dependent variable, with the relationship approximated by a straight line.
Multiple regression: involves two or more independent variables.
Simple Linear Regression Model
The regression model is the equation that describes how y is related to x and an error term.

Model: y = β₀ + β₁x + ε

where β₀ and β₁ are called the parameters of the model, and ε is a random variable called the error term.
Simple Linear Regression Equation
E(y) = β₀ + β₁x

β₀ is the y-intercept of the regression line.
β₁ is the slope of the regression line.
E(y) is the expected value of y for a given value of x.

[Figure: regression line of E(y) against x, with intercept β₀ and a positive slope β₁]
Scatter Plots and Types of Correlation
x = Math SAT score, y = GPA
[Scatter plot: GPA against Math SAT score, showing a positive linear correlation]
[Figure: regression line of E(y) against x, with intercept β₀ and a negative slope β₁]
Scatter Plots and Types of Correlation
x = hours of training, y = number of accidents
[Scatter plot: Accidents against Hours of Training, showing a negative linear correlation]
No Relationship
[Figure: horizontal regression line of E(y) against x, slope β₁ = 0]
Scatter Plots and Types of Correlation
x = height, y = IQ
[Scatter plot: IQ against Height]
No linear correlation
Estimated Simple Linear Regression Equation

ŷ = b₀ + b₁x

The graph of this equation is called the estimated regression line.
b₀ is the y-intercept of the line.
b₁ is the slope of the line.
ŷ is the estimated value of y for a given value of x.

y-intercept for the estimated regression equation: b₀ = ȳ − b₁x̄
Slope for the Estimated Regression Equation

b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
Computation Table for Slope
Week   TV Ads (x)   Cars Sold (y)   xᵢ − x̄   yᵢ − ȳ   (xᵢ − x̄)(yᵢ − ȳ)   (xᵢ − x̄)²
  1        1             14           −1       −6            6              1
  2        3             24            1        4            4              1
  3        2             18            0       −2            0              0
  4        1             17           −1       −3            3              1
  5        3             27            1        7            7              1
Totals: Σx = 10, Σy = 100, Σ(xᵢ − x̄)(yᵢ − ȳ) = 20, Σ(xᵢ − x̄)² = 4
Means: x̄ = 10/5 = 2, ȳ = 100/5 = 20

Slope: b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 20/4 = 5
Y-intercept: b₀ = ȳ − b₁x̄ = 20 − 5(2) = 10

Estimated regression equation: ŷ = b₀ + b₁x, so ŷ = 10 + 5x
(in slope-intercept form y = mx + b, this is y = 5x + 10)
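The same arithmetic can be written as a few lines of Python. The sketch below is my own illustration of the deviation formulas on these slides, applied to the TV ads data.

```python
# Least-squares slope and intercept for TV Ads (x) vs. Cars Sold (y).
xs = [1, 3, 2, 1, 3]
ys = [14, 24, 18, 17, 27]

x_bar = sum(xs) / len(xs)                       # 2
y_bar = sum(ys) / len(ys)                       # 20
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))      # 20 / 4 = 5
b0 = y_bar - b1 * x_bar                         # 20 - 5(2) = 10
print(f"y-hat = {b0} + {b1}x")                  # y-hat = 10.0 + 5.0x
```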
SEATWORK: 1/2 crosswise, by pair
The grades of a sample of 9 students on a midterm exam (X) and on the final exam (Y) are as follows:

x: 96  81  71  72  50  94  77  99  67
y: 99  47  78  34  66  85  82  99  68

a. Using the above information, develop the least-squares estimated regression line and write the equation.
   (Answer: ŷ = 12.062 + 0.777x)
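For checking work like this, a one-line library fit is handy. The snippet below is an illustration rather than part of the seatwork, and assumes NumPy is available; numpy.polyfit with degree 1 returns the slope and intercept.

```python
# Degree-1 polynomial fit: returns [slope, intercept].
import numpy as np

x = np.array([96, 81, 71, 72, 50, 94, 77, 99, 67])
y = np.array([99, 47, 78, 34, 66, 85, 82, 99, 68])

b1, b0 = np.polyfit(x, y, 1)
print(f"y-hat = {b0:.3f} + {b1:.3f}x")  # y-hat = 12.062 + 0.777x
```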
Testing for Significance: t Test
Rejection Rule:
p-value approach: Reject H₀ if p-value ≤ α
Critical value approach: Reject H₀ if t ≤ −t(α/2) or t ≥ t(α/2)
where t(α/2) is based on a t distribution with n − 2 degrees of freedom.
Steps of the t Test for Significance
1. Determine the hypotheses.
2. Specify the level of significance α.
3. Select the test statistic: t = b₁ / s(b₁), where s(b₁) is the estimated standard deviation of b₁.
4. State the rejection rule.
5. Compute the value of the test statistic.
6. Draw the conclusion using the rejection rule.
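As a worked illustration (my own, not carried out on the slides), the sketch below applies these steps to the TV ads data, using s(b₁) = s / √Σ(xᵢ − x̄)² with s = √(SSE / (n − 2)), and SciPy only for the critical value.

```python
# t test for H0: beta1 = 0 on the TV Ads data (illustrative computation).
import math
from scipy import stats

xs = [1, 3, 2, 1, 3]
ys = [14, 24, 18, 17, 27]
n = len(xs)
b0, b1 = 10, 5                                                # fitted earlier: y-hat = 10 + 5x

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))   # 14
s = math.sqrt(sse / (n - 2))                                  # ~2.16
x_bar = sum(xs) / n
s_b1 = s / math.sqrt(sum((x - x_bar) ** 2 for x in xs))       # ~1.08
t = b1 / s_b1                                                 # ~4.63

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)                 # ~3.182
print(t, t_crit, abs(t) >= t_crit)   # True: reject H0, the slope is significant
```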
Testing for Significance: F Test
Hypotheses: H₀: β₁ = 0
            Ha: β₁ ≠ 0
Test statistic: F = MSR/MSE
Rejection Rule:
p-value approach: Reject H₀ if p-value ≤ α
Critical value approach: Reject H₀ if F ≥ F(α)
where F(α) is based on an F distribution with 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator.
Steps of the F Test for Significance
1. Determine the hypotheses.
2. Specify the level of significance α.
3. Select the test statistic: F = MSR/MSE
4. State the rejection rule.
5. Compute the value of the test statistic.
6. Draw the conclusion using the rejection rule.
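Again as my own illustration on the TV ads data (the slides do not carry out this computation), the sketch below forms SSR, SSE, MSR, and MSE and compares F with the SciPy critical value.

```python
# F test for H0: beta1 = 0 on the TV Ads data (illustrative computation).
from scipy import stats

xs = [1, 3, 2, 1, 3]
ys = [14, 24, 18, 17, 27]
n = len(xs)
b0, b1 = 10, 5

y_bar = sum(ys) / n
sst = sum((y - y_bar) ** 2 for y in ys)                        # 114
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))    # 14
ssr = sst - sse                                                # 100

msr = ssr / 1
mse = sse / (n - 2)
F = msr / mse                                                  # ~21.4 (equals t^2 here)

f_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)                   # ~10.13
print(F, f_crit, F >= f_crit)   # True: reject H0
```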
Hypothesis Test for Significance
r is the correlation coefficient for the sample. The correlation coefficient for the population is ρ (rho).
For a two-tailed test for significance:
H₀: ρ = 0 (the correlation is not significant)
Ha: ρ ≠ 0 (the correlation is significant)
Standardized test statistic: t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom.
Test of Significance
The correlation between the number of times absent and the final grade is r = −0.975. There were seven pairs of data. Test the significance of this correlation; use α = 0.01.
Critical values (t table, df = n − 2 = 5, two-tailed, α = 0.01): t₀ = ±4.032.
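A short sketch of this test (my own illustration of the formula above, using SciPy only for the critical value):

```python
# Two-tailed t test for the significance of r (absences vs. final grade).
import math
from scipy import stats

r, n, alpha = -0.975, 7, 0.01
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # about -9.8
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)      # about 4.032
print(t, t_crit, abs(t) >= t_crit)   # True: reject H0, the correlation is significant
```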
Regression indicates the degree to which the variation in one variable, Y, is related to or can be explained by the variation in another variable, X.
Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or the least-squares line.
The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.
A residual is the vertical distance between an observed data point and the regression line: dᵢ = yᵢ − ŷᵢ.
[Figure: scatter plot with fitted regression line against Ad $; the vertical distances from the points to the line are the residuals]
Write the equation of the line of regression with x = number of absences and y = final grade. Calculate m and b.

 i     x     y      xy     x²      y²
 1     8    78     624     64    6084
 2     2    92     184      4    8464
 3     5    90     450     25    8100
 4    12    58     696    144    3364
 5    15    43     645    225    1849
 6     9    74     666     81    5476
 7     6    81     486     36    6561
Sum   57   516    3751    579   39898

[Scatter plot: Final Grade against Absences with the fitted regression line]
The regression equation for number of times absent and final grade is:
ŷ = −3.924x + 105.667
Use this equation to predict the expected grade for a student with a given number of absences.
The correlation coefficient of number of times absent and final grade is r = −0.975.
The coefficient of determination is r² = (−0.975)² = 0.9506.
Interpretation: about 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and may be due to sampling error or other variables such as intelligence, amount of time studied, etc.
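A small sketch of using the fitted line for prediction (my own illustration; the number of absences plugged in below is an arbitrary example, not a value from the slides):

```python
# Prediction with y-hat = -3.924x + 105.667, plus the coefficient of determination.
def predict_grade(absences: float) -> float:
    return -3.924 * absences + 105.667

print(round(predict_grade(10), 1))   # 66.4 (10 absences is just an illustrative input)
print(round((-0.975) ** 2, 4))       # 0.9506, matching the interpretation above
```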
ANOVA table
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F          p-value
Regression            SSR              1                    MSR = SSR/1         MSR/MSE
Error                 SSE              n − 2                MSE = SSE/(n − 2)
Total                 SST              n − 1
Residual Analysis
Residual analysis: the analysis of the residuals, used to determine whether the assumptions made about the regression model appear to be valid. Residual analysis is also used to identify outliers and influential observations.
Outlier: a data point or observation that does not fit the trend shown by the remaining data.
Influential observation: an observation that has a strong influence or effect on the regression results.
Residual plot: a graphical representation of the residuals that can be used to determine whether the assumptions made about the regression model appear to be valid.
Residual Analysis
If the assumptions about the error term ε appear questionable, the hypothesis tests about the significance of the regression relationship and the interval estimation results may not be valid. The residuals provide the best information about ε. Much of residual analysis is based on an examination of graphical plots.

Residual for observation i: yᵢ − ŷᵢ

If the assumption that the variance of ε is the same for all values of x is valid, and the assumed regression model is an adequate representation of the relationship between the variables, then the residual plot should give an overall impression of a horizontal band of points.
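As a minimal plotting sketch (my own illustration; the slides only show the finished plots), the residuals of the TV ads fit can be plotted against x with matplotlib:

```python
# Residual plot against x for the TV Ads example.
import matplotlib.pyplot as plt

xs = [1, 3, 2, 1, 3]
ys = [14, 24, 18, 17, 27]
b0, b1 = 10, 5

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]   # y_i - y-hat_i

plt.scatter(xs, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("TV Ads (x)")
plt.ylabel("Residual")
plt.title("Residual Plot Against x")
plt.show()
```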
Residual Plot Against x
[Figure: residuals (y − ŷ) against x forming a horizontal band of points: good pattern]
Residual Plot Against x
[Figure: residuals (y − ŷ) against x with a spread that changes with x: nonconstant variance]
Residual Plot Against x
[Figure: residuals (y − ŷ) against x showing a curved pattern: model form not adequate]
Residual Plot Against x
[Figure: residual plot against x for the TV Ads example]
Standardized Residuals
Standardized residual: the value obtained by dividing a residual by its standard deviation.
Normal probability plot: a graph of the standardized residuals plotted against values of the normal scores. This plot helps determine whether the assumption that the error term has a normal probability distribution appears to be valid.
High leverage points: observations with extreme values for the independent variables.
Standardized Residual
Standardized residual for observation i: (yᵢ − ŷᵢ) / s(yᵢ − ŷᵢ)

Standard deviation of the ith residual: s(yᵢ − ŷᵢ) = s √(1 − hᵢ)

Leverage of observation i: hᵢ = 1/n + (xᵢ − x̄)² / Σ(xᵢ − x̄)²
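A closing sketch (my own illustration on the TV ads data) that computes the leverage and standardized residual for each observation using the formulas above:

```python
# Leverage and standardized residuals for the TV Ads data.
import math

xs = [1, 3, 2, 1, 3]
ys = [14, 24, 18, 17, 27]
n = len(xs)
b0, b1 = 10, 5

x_bar = sum(xs) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))       # sqrt(SSE / (n - 2))

for i, (x, e) in enumerate(zip(xs, residuals), start=1):
    h = 1 / n + (x - x_bar) ** 2 / sxx                        # leverage h_i
    std_resid = e / (s * math.sqrt(1 - h))                    # residual / s(y_i - y-hat_i)
    print(i, round(h, 3), round(std_resid, 3))
```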