Chapter 12 Notes
INTRODUCTION
Professionals often want to know how two or more numeric variables are related.
For example, is there a relationship between the grade on the second math exam
a student takes and the grade on the final exam? If there is a relationship, what is
the relationship and how strong is it?
In another example, your income may be determined by your education, your
profession, your years of experience, and your ability. The amount you pay a
repair person for labor is often determined by an initial amount plus an hourly fee.
The type of data described in the examples is bivariate data — "bi" for two
variables. In reality, statisticians use multivariate data, meaning many variables.
In this chapter, you will be studying the simplest form of regression, "linear
regression" with one independent variable (x). This involves data that fits a line in
two dimensions. You will also study correlation, which measures how strong the
relationship is.
FIGURE 12.1
Linear regression and correlation can help you determine if an auto mechanic’s salary
is related to his work experience. (credit: modification of work “USPS commissions local
repair-shop for some needed work on its older trucks” by Joshua Rothhaas/ Flickr, CC
BY 2.0)
Typically, you choose a value to substitute for the independent variable and then solve
for the dependent variable.
LINEAR EQUATIONS
From algebra, recall that a linear equation has the form y = a + bx. The slope b
is a number that describes the steepness of the line, and the y-intercept a is
the y-coordinate of the point (0, a) where the line crosses the y-axis.
EXAMPLE OF LINEAR EQUATIONS
Aaron's Word Processing Service (AWPS) does word processing. The rate for
services is $32 per hour plus a $31.50 one-time charge. The total cost to a
customer depends on the number of hours it takes to complete the job. Find the
equation that expresses the total cost in terms of the number of hours required
to complete the job.
Let x = the number of hours it takes to get the job done.
Let y = the total cost to the customer.
The $31.50 is a fixed cost.
If it takes x hours to complete the job, then (32)(x) is the cost of the word
processing only.
The total cost is: y = 31.50 + 32x
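The AWPS cost equation can be checked numerically. The following is a minimal Python sketch; the function name total_cost and the sample job lengths are illustrative assumptions, not part of the original example.

```python
def total_cost(hours):
    """Total AWPS cost: $31.50 one-time charge plus $32 per hour."""
    fixed_charge = 31.50   # y-intercept a: the one-time charge
    hourly_rate = 32.00    # slope b: the cost per hour
    return fixed_charge + hourly_rate * hours

# Illustrative job lengths in hours (assumed values, not from the notes)
for hours in (0, 1, 2.5):
    print(hours, total_cost(hours))
```

At x = 0 the function returns the fixed charge alone, which is the y-intercept; each additional hour adds the slope, $32.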
EXAMPLE OF LINEAR EQUATIONS
Exercise 2.
A vacation resort rents SCUBA equipment to
certified divers. The resort charges an up-front fee
of $25 and another fee of $12.50 an hour.
Find the equation that expresses the total fee in
terms of the number of hours the equipment is
rented.
Exercise 3.
Is the equation y = 10 + 5x – 3x² linear? Why or
why not?
12.2 | SCATTER PLOTS
SCATTERPLOT
Scatter plot showing the number of m-commerce users (in millions) by year.
FIGURE 12.9
Scatter plot showing the scores on the final exam based on scores from the third exam.
EXAMPLE
Does the scatter plot appear linear? Strong or weak? Positive or negative?
EXAMPLE
Does the scatter plot appear linear? Strong or weak? Positive or negative?
EXAMPLE
Does the scatter plot appear linear? Strong or weak? Positive or negative?
EXAMPLE
A random sample of ten professional athletes produced the following data where x is the numb
SOLUTION
EXAMPLE
12.3 | THE REGRESSION EQUATION
LINE OF BEST FIT
The third exam score, x, is the independent variable and the final exam score, y, is the
dependent variable. We will plot a regression line that best "fits" the data.
If each of you were to fit a line "by eye," you would draw different lines. We can use
what is called a least-squares regression line to obtain the best fit line.
Y-HAT
The term y₀ – ŷ₀ = ε₀ is called the "error" or residual. It is not an error in the sense of
a mistake. The absolute value of a residual measures the vertical distance between
the actual value of y and the estimated value of y. In other words, it measures the
vertical distance between the actual data point and the predicted point on the line.
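As a rough illustration of a single residual, the sketch below uses the best-fit line quoted later in these notes (ŷ = –173.51 + 4.83x) and the first data point from the sample table, (65, 175). The names predict, x0, y0, and y_hat0 are assumptions chosen for this example.

```python
def predict(x):
    # Best-fit line quoted later in these notes: y-hat = -173.51 + 4.83x
    return -173.51 + 4.83 * x

x0, y0 = 65, 175            # one observed (third exam, final exam) pair
y_hat0 = predict(x0)        # predicted final exam score on the line
residual = y0 - y_hat0      # epsilon_0 = y0 - y-hat0

print(round(y_hat0, 2), round(residual, 2))
```

A positive residual means the observed point lies above the line; a negative one means it lies below.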
SUM OF SQUARED ERRORS (SSE)
A random sample of 11 statistics students produced the following data, where x is
the third exam score out of 80, and y is the final exam score out of 200. What is
the best-fit line for this data?
x (third exam score)    y (final exam score)
65    175
67    133
71    185
71    163
66    126
75    198
67    153
70    163
71    159
69    151
69    159
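One way to find the least-squares line for this sample is to apply the standard closed-form formulas for the slope and intercept. The sketch below is a plain-Python illustration under that assumption; the variable names are hypothetical, and in practice a calculator or statistics package would normally do this work.

```python
# Third exam scores (x) and final exam scores (y) for the 11 students
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Least-squares slope b and intercept a for the line y-hat = a + bx
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Sum of squared errors: squared vertical distances from points to the line
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

print(round(a, 2), round(b, 2))
```

The rounded intercept and slope agree with the best-fit line used in the rest of these notes, ŷ = –173.51 + 4.83x. Any other line drawn through the same points would give a larger SSE.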
PREDICTION WARNING
Besides looking at the scatter plot and seeing that a line seems reasonable,
how can you tell if the line is a good predictor? Use the correlation coefficient
as another indicator (besides the scatterplot) of the strength of the relationship
between x and y.
The correlation coefficient, r, developed by Karl Pearson in the early 1900s, is
numerical and provides a measure of strength and direction of the linear
association between the independent variable x and the dependent variable y.
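The correlation coefficient is calculated as
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
where n = the number of data points.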
If you suspect a linear relationship between x and y, then r can measure how
strong the linear relationship is.
(NEVER CALCULATE BY HAND)
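Since r is rarely computed by hand, a few lines of code can do it instead. The sketch below applies the standard computational formula for Pearson's r to the third exam/final exam sample; the variable names are assumptions made for this illustration.

```python
import math

# Third exam scores (x) and final exam scores (y) for the 11 students
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Pearson correlation coefficient
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 4))
```

The result rounds to the value r = 0.6631 reported for this example later in the notes.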
WHAT THE VALUE OF r TELLS US
(a) A scatter plot showing data with a positive correlation. 0 < r < 1
(b) A scatter plot showing data with a negative correlation. –1 < r < 0
(c) A scatter plot showing data with zero correlation. r = 0
THE COEFFICIENT OF DETERMINATION
“R²”
The variable r² is called the coefficient of determination and is the square
of the correlation coefficient, but is usually stated as a percent, rather
than in decimal form. It has an interpretation in the context of the data:
• r², when expressed as a percent, represents the percent of variation in
the dependent (predicted) variable y that can be explained by variation in
the independent (explanatory) variable x using the regression (best-fit)
line.
• 1 – r², when expressed as a percentage, represents the percent of
variation in y that is NOT explained by variation in x using the regression
line. This can be seen as the scattering of the observed data points about
the regression line.
THE COEFFICIENT OF DETERMINATION
“R²”
Consider the third exam/final exam example introduced in the previous section.
• The line of best fit is: ŷ = –173.51 + 4.83x
• The correlation coefficient is r = 0.6631
• The coefficient of determination is r² = (0.6631)² = 0.4397
• Interpretation of r² in the context of this example:
• Approximately 44% of the variation (0.4397 is approximately 0.44) in the final exam
grades can be explained by the variation in the grades on the third exam, using the
best-fit regression line.
• Therefore, approximately 56% of the variation (1 – 0.44 = 0.56) in the final exam
grades cannot be explained by the variation in the grades on the third exam, using the
best-fit regression line. (This is seen as the scattering of the points about the line.)
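The arithmetic behind these percentages is short enough to check directly. The following sketch starts from the reported r = 0.6631; the variable names are assumptions for illustration.

```python
r = 0.6631                       # correlation coefficient from the example
r_squared = r ** 2               # coefficient of determination

explained = round(r_squared * 100)   # percent of variation in y explained by x
unexplained = 100 - explained        # percent of variation left unexplained

print(round(r_squared, 4), explained, unexplained)
```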
12.5 | PREDICTION
PREDICTION
Recall the third exam/final exam example. We examined the scatter plot and
showed that the correlation coefficient is significant. We found the equation
of the best-fit line for the final exam grade as a function of the grade on the
third exam. We can now use the least-squares regression line for prediction.
Suppose you want to estimate, or predict, the mean final exam score of
statistics students who received 73 on the third exam. The exam scores
(x-values) range from 65 to 75. Since 73 is between the x-values 65 and 75,
substitute x = 73 into the equation.
Then: ŷ = −173.51 + 4.83(73) = 179.08
We predict that statistics students who earn a grade of 73 on the third exam
will earn a grade of 179.08 on the final exam, on average.
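The prediction step can be sketched as a small function. The names predict_final, X_MIN, and X_MAX are hypothetical, and raising an error for x-values outside the observed range of 65 to 75 is a design choice made for this sketch, reflecting the point above that 73 was used because it lies within the data.

```python
X_MIN, X_MAX = 65, 75   # observed range of third exam scores in the sample

def predict_final(third_exam):
    """Predict the mean final exam score from a third exam score."""
    if not X_MIN <= third_exam <= X_MAX:
        # Outside the observed x-values the line may not describe the data
        raise ValueError("third exam score is outside the observed range")
    return -173.51 + 4.83 * third_exam

print(round(predict_final(73), 2))
```

Calling predict_final(73) reproduces the 179.08 computed above, while a score such as 90 is rejected rather than extrapolated.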