Lecture 8

Linear Regression

Lesson 2: Linear Regression


Correlation and Prediction
In the previous lesson, we learned to measure the strength of the
linear relationship between two variables with the correlation
coefficient r.
When there is a strong linear relationship between two variables, we can
use the value of the predictor variable to estimate the value of the
response variable.

A Motivating Example
A person takes in more oxygen when exercising than when at
rest. The oxygen is supplied to the muscles by the heart, which
must beat faster as the exercise level is increased.
Suppose that we wish to determine the oxygen uptake of subjects
at various levels of activity.
• Measuring oxygen uptake directly requires the use of
specialized and costly equipment in a lab environment.
• Measuring a person’s heart rate is simple, inexpensive, and
convenient.
If a person’s oxygen uptake can be predicted accurately from the
heart rate, we may be able to use the predicted uptake values
instead of direct measurements for our research purposes.
Oxygen Uptake Data

Suppose the heart rate (HR) and oxygen uptake (VO2) for a subject
exercising on a treadmill were recorded during a 20-minute workout, and
the following data were recorded:

Time   HR    VO2
  1    96    .753
  2    95    .929
  3    95    .939
  4    94    .832
  5    95    .983
  6    94   1.049
  7   104   1.178
  8   104   1.176
  9   106   1.292
 10   109   1.379
 11   108   1.403
 12   110   1.499
 13   113   1.529
 14   113   1.599
 15   118   1.749
 16   115   1.746
 17   121   1.897
 18   127   2.040
 19   131   2.231
 20   130   2.301

The correlation coefficient, r = .986, indicates a strong, positive,
linear relationship between heart rate and oxygen uptake.

Scatter Plot

[Figure: scatter plot of oxygen uptake (0.0-2.5) vs. heart rate (90-130)]

From the scatter plot, observe:
• the trend in the data: the oxygen uptake increases as the heart rate
increases.
• the sample points do not fall on a single line, but they do appear to
be scattered about a central line, the line of best fit.

We can use the line of best fit to estimate the oxygen uptake from the
measured heart rate. For example, at a heart rate of 100 beats per
minute, the predicted oxygen uptake is approximately 1.1 (the units are
not specified in the source data).
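
Since we will fit this line formally later in the lesson, here is a
minimal sketch (ours, assuming Python with numpy is available) that
reproduces both the correlation and the prediction at a heart rate of
100 from the data table above:

```python
import numpy as np

hr = np.array([96, 95, 95, 94, 95, 94, 104, 104, 106, 109,
               108, 110, 113, 113, 118, 115, 121, 127, 131, 130])
vo2 = np.array([0.753, 0.929, 0.939, 0.832, 0.983, 1.049, 1.178,
                1.176, 1.292, 1.379, 1.403, 1.499, 1.529, 1.599,
                1.749, 1.746, 1.897, 2.040, 2.231, 2.301])

print(np.corrcoef(hr, vo2)[0, 1])  # ≈ 0.986, the r quoted above
b1, b0 = np.polyfit(hr, vo2, 1)    # least-squares line of best fit
print(b1 * 100 + b0)               # ≈ 1.09, i.e. about 1.1 at HR = 100
```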

Example 1: Fitting a Line to a Bivariate Data Set
P 216, #3. For the following data set:

x   3   4   5   7   8
y   4   5   7  15  14

(a) Draw a scatter diagram treating x as the predictor variable and y as
the response variable.

[Figure: scatter plot of y vs. x]

(b) Select two points from the scatter diagram and find an equation of
the line (in the form y = mx + b) containing the points.

Take the points (3, 4) and (8, 14).

The slope is: m = (14 − 4)/(8 − 3) = 10/5 = 2

Substitute m = 2, x = 3, and y = 4 to obtain the y-intercept:

4 = (2)(3) + b  ⇒  4 = 6 + b  ⇒  b = −2,  so  y = 2x − 2

[Figure: scatter plot with the line y = 2x − 2 drawn through (3, 4) and (8, 14)]

Errors and Residuals
A predicted value of y is denoted with the symbol ŷ (“y-hat”).
The difference between the actual value of y and the predicted value
of y is the error, or residual. That is,

error = observed y value − predicted y value = y − ŷ

For any predicted value ŷ, the squared error is (y − ŷ)².

For any given line of fit, the sum of the squared errors is

SSE = Σ(y − ŷ)²

The least squares regression line is the line of fit that minimizes the
sum of the squared errors.
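
As a quick illustration of this definition, here is a short sketch
(ours, not part of the original slides) that computes the SSE for the
Example 1 data and the two-point line y = 2x − 2; Example 2 below
carries out the same computation by hand.

```python
# Sum of squared errors for the line y = 2x - 2 on the Example 1 data.
x = [3, 4, 5, 7, 8]
y = [4, 5, 7, 15, 14]

y_hat = [2 * xi - 2 for xi in x]                       # predicted values
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # SSE = sum of (y - y_hat)^2
print(sse)  # 11, matching Example 2 below
```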

Example 2: Fitting a Line to a Bivariate Data Set
P 216, #3, continued.

(c) Compute the sum of the squared errors for the line ŷ = 2x − 2.

x         3   4   5   7   8
y         4   5   7  15  14
ŷ         4   6   8  12  14
y − ŷ     0  −1  −1   3   0
(y − ŷ)²  0   1   1   9   0

SSE = Σ(y − ŷ)² = 11

[Figure: scatter plot with the line ŷ = 2x − 2]

Equation of the Least Squares Regression Line
The equation of the least-squares regression line is given by

ŷ = b1 x + b0

where the slope of the least-squares line is

b1 = Sxy / Sxx

Recall:

Sxx = Σ(xi − x̄)² = Σxi² − (Σxi)²/n

Sxy = Σ(xi − x̄)(yi − ȳ) = Σxi yi − (Σxi)(Σyi)/n

and the intercept of the least-squares line is

b0 = ȳ − b1 x̄

Note: The value ŷ predicts the mean value of the response variable y
for a specific value of x. For this reason, the graph of the
least-squares equation is also known as the line of means of the data.
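
These formulas translate directly into code. The sketch below (an
illustration of ours, not from the slides) computes b1 and b0 from the
sum formulas; it keeps full precision, so its intercept differs slightly
from Example 3 below, which rounds b1 to 2.38 before computing b0.

```python
# Least-squares slope and intercept from the sum formulas above.
def least_squares(x, y):
    n = len(x)
    sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n                  # Sxx
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n  # Sxy
    b1 = sxy / sxx                                                    # slope
    b0 = sum(y) / n - b1 * sum(x) / n                                 # intercept
    return b1, b0

b1, b0 = least_squares([3, 4, 5, 7, 8], [4, 5, 7, 15, 14])
print(b1, b0)  # ≈ 2.384 and ≈ -3.87 (Example 3 reports 2.38 and -3.85 after rounding)
```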

Example 3: Finding the Least Squares Regression Line
P 216, #3, continued.

(d) Find the equation of the least squares regression line.

x    3   4   5   7   8
y    4   5   7  15  14
x²   9  16  25  49  64
xy  12  20  35 105 112

Σx = 27   Σy = 45   x̄ = 5.4   ȳ = 9   Σx² = 163   Σxy = 284

Sxx = 163 − 27²/5 = 17.2
Sxy = 284 − (27)(45)/5 = 41

The slope of the least-squares line is b1 = Sxy/Sxx = 41/17.2 ≈ 2.38.

The intercept of the least-squares line is
b0 = ȳ − b1 x̄ = 9 − (2.38)(5.4) = −3.85.

Thus, ŷ = 2.38x − 3.85.

(e) Graph the least squares regression line on the scatter diagram.

[Figure: scatter plot of the data with the line ŷ = 2.38x − 3.85]

(f) Compute the sum of the squared errors for the regression line.

x         3    4    5    7    8
y         4    5    7   15   14
ŷ        3.3  5.7  8.1 12.8 15.2
y − ŷ     .7  −.7 −1.1  2.2 −1.2
(y − ŷ)²  .49  .49 1.21 4.84 1.44

SSE = Σ(y − ŷ)² = 8.47

Note that this SSE is lower than that of the first line (SSE = 11),
i.e., the fit is best for the least squares line.
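
A quick numerical check of this claim (a sketch of ours, not from the
slides): with unrounded predictions the least-squares SSE is about 8.27;
the 8.47 in the table comes from rounding each ŷ to one decimal first.

```python
# Compare SSE for the two-point line and the least-squares line.
x = [3, 4, 5, 7, 8]
y = [4, 5, 7, 15, 14]

def sse(b1, b0):
    return sum((yi - (b1 * xi + b0)) ** 2 for xi, yi in zip(x, y))

print(sse(2, -2))        # 11.0, the two-point line from Example 2
print(sse(2.38, -3.85))  # ≈ 8.27 with unrounded y-hats (8.47 in the table above)
```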
Interpretation of the Slope (m or b1)

For every one-unit increase in the independent variable, the dependent
variable increases or decreases, on average, by the value of the slope.

Interpretation of the y-Intercept (b or b0)

The y-intercept is the value of the dependent variable (y) when the
independent variable (x) is 0.

The Coefficient of Determination (R²)
- the proportion of the variance in the dependent variable that is
explained by the independent variable
- a measure of model fit

The Coefficient of Alienation (1 − r²)
- the proportion of the variance in the dependent variable that is not
explained by the independent variable

Example 4: Weight vs. Mileage Rating
P 218, #14. The data represent the weight of various domestic cars and
their city mileage rating (in mpg) for the 2001 model year.

Weight (pounds)   Miles Per Gallon
3565              19
3440              20
3970              17
3305              19
3340              20
3200              20
3230              19
2560              28
2520              28
3065              20
3600              18
3300              19
3625              19
3590              19
2605              23
2370              28

(a) Find the least squares regression line treating weight as the
predictor variable (x) and mileage as the response variable (y).

Using the TI-83, the equation of the least-squares line is:

ŷ = −.0073x + 44.3
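
For readers without a TI-83, the same line can be obtained with numpy
(a sketch of ours, assuming numpy is available; the arrays are copied
from the table above):

```python
import numpy as np

weight = np.array([3565, 3440, 3970, 3305, 3340, 3200, 3230, 2560,
                   2520, 3065, 3600, 3300, 3625, 3590, 2605, 2370])
mpg = np.array([19, 20, 17, 19, 20, 20, 19, 28, 28, 20,
                18, 19, 19, 19, 23, 28])

b1, b0 = np.polyfit(weight, mpg, 1)  # least-squares slope and intercept
print(b1, b0)          # ≈ -0.0073 and ≈ 44.3
print(b1 * 3625 + b0)  # ≈ 18 mpg, the prediction used in part (c) below
```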

(b) Interpret the slope and intercept, if possible.

The slope m = −0.0073 means that the mileage is reduced by an average of
0.0073 mpg for each one-pound increase in the weight of the car.

Since a weight of x = 0 lbs is not possible, there is no meaningful
interpretation of the intercept.

(c) Predict the mileage of an Oldsmobile Aurora (3625 lbs) and compute
the residual error.

The predicted mileage is ŷ = −.0073(3625) + 44.3 ≈ 18 mpg.

The observed mileage for the 3625-lb car is 19 mpg, so the residual
error is 19 − 18 = +1 mpg.

Is the mileage of an Aurora above or below average for cars of this
weight? Since the residual is positive, the Aurora is above average for
cars of its weight.
(d) Draw the least-squares regression line on the scatter diagram of the
data and label the residual.

[Figure: scatter plot of city mileage (mpg, 15-30) vs. weight
(2000-4000 lbs) with the least-squares line drawn and the Aurora's
residual labeled]

(e) Would it be reasonable to use the least-squares regression line to
predict the mileage of a Honda Insight, a hybrid gas and electric car?
Why?

No. Since the hybrid uses a different fuel source, we cannot expect its
mileage to be predicted by this model.

(f) Compute the model fit, or coefficient of determination (R²), and
interpret.

R² = (−0.92)² ≈ 85%. This means that 85% of the variance in a car's
mileage can be explained by the variance in its weight; in other words,
weight accounts for 85% of the variability in mileage.

The remaining 15% is the coefficient of alienation.
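
The values of r and R² can be verified with numpy (a sketch of ours;
the arrays are repeated here so the snippet runs on its own):

```python
import numpy as np

weight = np.array([3565, 3440, 3970, 3305, 3340, 3200, 3230, 2560,
                   2520, 3065, 3600, 3300, 3625, 3590, 2605, 2370])
mpg = np.array([19, 20, 17, 19, 20, 20, 19, 28, 28, 20,
                18, 19, 19, 19, 23, 28])

r = np.corrcoef(weight, mpg)[0, 1]  # correlation coefficient
print(r)       # ≈ -0.92
print(r ** 2)  # ≈ 0.84 at full precision; (-0.92)^2 ≈ 0.85 as above
```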

Limitations of the Regression Model
If the least-squares regression line is used to make predictions based
on values of the predictor variable that are much larger or smaller
than the observed values, we say the researcher is working outside
the scope of the model.
Never use a least-squares regression line to make predictions outside
the scope of the model because we can’t be sure the linear relation
continues to exist.
If the correlation coefficient is near zero, indicating a weak or non-
existent linear relationship between the variables, use the mean value
of the response variable as the predicted value.

Example 5: Brain Size and Intelligence
P 219, #17. Researchers interested in whether a person's brain size is
related to mental capacity selected a sample of 20 students who had SAT
scores higher than 1350 and administered an IQ test. Brain size was
determined by an MRI scan.

Gender   MRI Count   IQ        Gender   MRI Count   IQ
Female   816932      133       Male     949395      140
Female   951545      137       Male     1001121     140
Female   991305      138       Male     1038437     139
Female   833868      132       Male     965353      133
Female   856472      140       Male     955466      133
Female   852244      132       Male     1079549     141
Female   790619      135       Male     924059      135
Female   866662      130       Male     955003      139
Female   857782      133       Male     935494      141
Female   948066      133       Male     949589      144

(a) Find the least-squares regression line treating MRI count as the
predictor variable and IQ as the response variable.

ŷ = 0.000029x + 110
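
A sketch of part (a) in numpy (ours, not from the slides), pooling the
female and male rows of the table into single arrays:

```python
import numpy as np

mri = np.array([816932, 951545, 991305, 833868, 856472, 852244, 790619,
                866662, 857782, 948066,            # females
                949395, 1001121, 1038437, 965353, 955466, 1079549,
                924059, 955003, 935494, 949589])   # males
iq = np.array([133, 137, 138, 132, 140, 132, 135, 130, 133, 133,
               140, 140, 139, 133, 133, 141, 135, 139, 141, 144])

b1, b0 = np.polyfit(mri, iq, 1)
print(b1, b0)      # ≈ 0.000029 and ≈ 110, as reported above
print(iq.mean())   # ≈ 136, the mean used for prediction in part (c)
```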

(b) What do you notice about the value of the slope?

The slope is near zero. Why does this result seem reasonable based on
the correlation coefficient calculated earlier?

(c) When there is no relation between the predictor and response
variables, we use the mean value ȳ to predict. Predict the IQ of an
individual whose MRI count is 1,000,000.

ȳ = 136

[Figure: scatter plot of brain size vs. intelligence, IQ (125-145) vs.
MRI count (×1000, 800-1100)]
