Section 9.2, Linear Regression

This document discusses linear regression analysis. It defines the regression line as the best-fit line for predicting a dependent variable (y) from an independent variable (x). The regression line takes the form ŷ = mx + b, where m and b are calculated using formulas involving sums of the x and y values and their products. The residual is the difference between the observed and predicted y values. The coefficient of determination, r2, indicates what proportion of the variation in y is explained by the regression line. Examples are provided to demonstrate calculating the regression line and r2 from sample data. Limitations of linear regression like its reliance on a linear relationship and issues with extrapolation are also noted.

Uploaded by

Han Myo
Copyright
© © All Rights Reserved

Section 9.2, Linear Regression

Our goal for this section will be to write the equation of the “best-fit” line through the points on
a scatter plot for paired data. This helps us to predict values of the response variable when the
explanatory variable is given.
The regression line is the best-fit line through the points in the data set. For an independent
variable x and dependent variable y, it has the form
ŷ = mx + b,
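where ŷ is the predicted y-value for a given x-value,
$$m = \frac{n\sum(xy) - \left(\sum x\right)\left(\sum y\right)}{n\sum(x^2) - \left(\sum x\right)^2} \quad\text{and}$$
$$b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n}.$$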
The line always passes through the point (x̄, ȳ).
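The two formulas can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name regression_line and the four data points are made up for the example, and n is taken to be the number of data pairs.

```python
def regression_line(xs, ys):
    """Return (m, b) for the line y-hat = m*x + b using the sum formulas."""
    n = len(xs)                                   # number of data pairs
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope, then intercept b = y-bar - m * x-bar.
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n
    return m, b


# Illustrative data only (not from this section).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
m, b = regression_line(xs, ys)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
print("m =", m, "b =", b)
# The fitted line evaluated at x-bar should reproduce y-bar.
print("line at x-bar:", m * x_bar + b, "y-bar:", y_bar)
```

The denominator is zero when every x-value is the same, so the sketch assumes the data contains at least two distinct x-values.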
The residual, d, is the difference between the observed y-value and the predicted y-value:
d = (observed y-value) − (predicted y-value) = y − ŷ.
The regression line (found with these formulas) minimizes the sum of the squares of the residuals.
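One way to see the minimizing property is to compare the sum of squared residuals for the regression line against slightly nudged lines. The sketch below uses a small made-up data set, for which the formulas for m and b give m = 0.8 and b = 1.5 (from Σx = 10, Σy = 14, Σxy = 39, Σx² = 30). The helper names are hypothetical.

```python
# Illustrative data only (not from this section).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]

# Slope and intercept that the sum formulas give for this data.
m, b = 0.8, 1.5


def residuals(slope, intercept):
    """Observed y minus predicted y for every data pair."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def sum_sq_residuals(slope, intercept):
    """Sum of the squared residuals for the given line."""
    return sum(d * d for d in residuals(slope, intercept))


best = sum_sq_residuals(m, b)
print("regression line:", best)

# Nudging the slope or the intercept in either direction makes the sum larger.
for dm, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    nudged = sum_sq_residuals(m + dm, b + db)
    print(f"m{dm:+.1f}, b{db:+.1f}:", nudged, nudged > best)
```

Trying four nearby lines is only a spot check, not a proof, but it shows what "minimizes" means here.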
The coefficient of determination, r², is the proportion of the variation in y that is explained by the regression line.
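The phrase "proportion of the variation explained" can be checked numerically. The sketch below relies on two standard facts that are not spelled out in this section, so treat them as assumptions here: the usual sum formula for the correlation coefficient r, and the identity r² = 1 − (sum of squared residuals)/(total variation of y about ȳ) for a least-squares line. The data is made up for illustration.

```python
import math

# Illustrative data only (not from this section).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Regression line from the sum formulas.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n

# Variation left unexplained by the line vs. total variation of y.
y_bar = sum_y / n
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r2_from_variation = 1 - ss_res / ss_tot

# Correlation coefficient from the usual sum formula (assumed here).
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print("1 - SS_res/SS_tot =", round(r2_from_variation, 4))
print("r squared         =", round(r * r, 4))
```

Both routes should agree up to floating-point rounding.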
Examples
1. The number of officers on duty in a Boston city park and the number of muggings for that day
are:
Officers   Muggings
10         5
15         2
16         1
1          9
4          7
6          8
18         1
12         5
14         3
7          6
Calculate the regression line for this data, and the residual for the first observation, (10, 5).
What percentage of variation is explained by the regression line?
From the calculations we did in Section 9.1, we found that Σx = 103, Σy = 47, Σxy = 343, and Σx² = 1347. So,
$$m = \frac{10 \cdot 343 - 103 \cdot 47}{10 \cdot 1347 - 103^2} = -0.493 \quad\text{and}$$
$$b = \frac{47}{10} - (-0.493) \cdot \frac{103}{10} = 9.780.$$
Then, the equation of the regression line is ŷ = −0.493x + 9.780.
To find the residual, we first need ŷ when x = 10: ŷ = −0.493 · 10 + 9.780 = 4.848 (computed with the unrounded values of m and b), so d = 5 − 4.848 = 0.152. Whenever possible, use the original numbers for m and b in calculations instead of the rounded numbers.
In Section 9.1, we calculated that r = −0.969, so r² = 0.939 and 93.9% of the variation is explained by the regression line (and 6.1% is due to random and unexplained factors).
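The advice about rounding in Example 1 can be illustrated with a short sketch. It starts from the four sums quoted in the example and compares the residual for (10, 5) computed with full-precision m and b against the one computed after rounding both to three decimal places. Variable names are chosen for this illustration only.

```python
# Sums quoted in the text for Example 1 (n = 10 data pairs).
n, sum_x, sum_y, sum_xy, sum_x2 = 10, 103, 47, 343, 1347

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n

x_obs, y_obs = 10, 5  # the first observation

# Residual using m and b at full floating-point precision.
residual_full = y_obs - (m * x_obs + b)

# Residual after first rounding m and b to three decimal places.
m_rounded, b_rounded = round(m, 3), round(b, 3)
residual_rounded = y_obs - (m_rounded * x_obs + b_rounded)

print(f"m = {m:.3f}, b = {b:.3f}")
print(f"residual with unrounded m, b: {residual_full:.3f}")
print(f"residual with rounded m, b:   {residual_rounded:.3f}")
```

The two residuals differ in the third decimal place (0.152 versus 0.150), which is why the unrounded values are preferred for intermediate steps.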
2. A study involved comparing the per capita income (in thousands of dollars) to the number
of medical doctors per 10,000 residents. Six small cities in Oregon had the observations:
Per capita income ($1000s)   Doctors (per 10,000)
8.6                          9.6
9.3                          18.5
10.1                         20.9
8.0                          10.2
8.3                          11.4
8.7                          13.1
The data has a correlation coefficient of r = 0.934. Calculate the regression line for this
data. What percentage of variation is explained by the regression line? Predict the number of
doctors per 10,000 residents in a town with a per capita income of $8500.
Calculating from the data, we see that Σx = 53, Σy = 83.7, Σ(xy) = 755.89, and Σx² = 471.04. Then,
$$m = \frac{6 \cdot 755.89 - 53 \cdot 83.7}{6 \cdot 471.04 - 53^2} = 5.756 \quad\text{and}$$
$$b = \frac{83.7}{6} - 5.756 \cdot \frac{53}{6} = -36.898.$$
The equation of the line is ŷ = 5.756x − 36.898.
The proportion of variation explained by the line is r² = 0.934² = 0.872, so 87.2% of the variation is explained by the line.
A town with a per capita income of $8500 (x = 8.5) will have approximately ŷ = 5.756 · 8.5 − 36.898 = 12.03 doctors per 10,000 residents.
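The prediction step in Example 2 includes a unit conversion that is easy to miss: x is measured in thousands of dollars, so $8500 becomes x = 8.5. A minimal sketch, starting from the sums quoted in the example, with a hypothetical helper named predict_doctors:

```python
# Sums quoted in the text for Example 2 (n = 6 cities).
n, sum_x, sum_y, sum_xy, sum_x2 = 6, 53.0, 83.7, 755.89, 471.04

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n


def predict_doctors(income_dollars):
    """Predicted doctors per 10,000 residents for a per capita income in dollars."""
    x = income_dollars / 1000  # the regression uses thousands of dollars
    return m * x + b


print(f"m = {m:.3f}, b = {b:.3f}")
print(f"predicted doctors per 10,000 at $8500: {predict_doctors(8500):.2f}")
```

Passing 8500 directly as x, without dividing by 1000, would give a wildly different and meaningless number, so keeping the conversion inside the helper is one way to guard against that slip.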
Some problems with Linear Regression:
• It works best to predict values when the relationship between variables is linear. If r is close
to zero, ŷ will not be a good predictor of y, in general.
• Extrapolation: The line is intended to predict values of y for values of x that are close to the
data. Using the line far outside that range may produce unrealistic forecasts.
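The extrapolation problem shows up clearly with the line from Example 1, where the observed numbers of officers range from 1 to 18. The sketch below uses a hypothetical helper, predict, that also reports whether x lies inside that observed range. It is one possible way to flag extrapolation, not a method given in these notes.

```python
# Sums quoted in the text for Example 1 (officers vs. muggings).
n, sum_x, sum_y, sum_xy, sum_x2 = 10, 103, 47, 343, 1347

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n

# Smallest and largest x-values in the Example 1 data.
x_min, x_max = 1, 18


def predict(x):
    """Return (y_hat, inside_range) for a given number of officers."""
    return m * x + b, x_min <= x <= x_max


for officers in (10, 25):
    y_hat, inside = predict(officers)
    note = "within the data" if inside else "extrapolation"
    print(f"{officers} officers -> {y_hat:.2f} predicted muggings ({note})")
```

For 25 officers the line predicts a negative number of muggings, which cannot happen; this is the kind of unrealistic forecast that appears when the line is used far outside the data.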
