
Chapter 12 Notes

Chapter 12 of 'Introductory Statistics 2E' focuses on linear regression and correlation, exploring how to analyze the relationship between two or more numeric variables. It introduces concepts such as scatter plots, the line of best fit, and the correlation coefficient, which measures the strength and direction of the relationship between variables. The chapter emphasizes the importance of interpreting the slope and understanding the coefficient of determination in the context of data analysis.

Uploaded by

meenahershey52

INTRODUCTORY STATISTICS 2E

Chapter 12 LINEAR REGRESSION AND CORRELATION

POWERPOINTS
INTRODUCTION
Professionals often want to know how two or more numeric variables are related.
For example, is there a relationship between the grade on the second math exam
a student takes and the grade on the final exam? If there is a relationship, what is
the relationship and how strong is it?
In another example, your income may be determined by your education, your
profession, your years of experience, and your ability. The amount you pay a
repair person for labor is often determined by an initial amount plus an hourly fee.
The type of data described in the examples is bivariate data — "bi" for two
variables. In reality, statisticians use multivariate data, meaning many variables.
In this chapter, you will be studying the simplest form of regression, "linear
regression" with one independent variable (x). This involves data that fits a line in
two dimensions. You will also study correlation which measures how strong the
relationship is.
FIGURE 12.1

Linear regression and correlation can help you determine if an auto mechanic’s salary
is related to his work experience. (credit: modification of work “USPS commissions local
repair-shop for some needed work on its older trucks” by Joshua Rothhaas/ Flickr, CC
BY 2.0)
FIGURE 12.2
FIGURE 12.3
FIGURE 12.4

Three possible graphs of y = a + bx.


(a) If b > 0, the line slopes upward to the right.
(b) If b = 0, the line is horizontal.
(c) If b < 0, the line slopes downward to the right.
LINEAR EQUATIONS

Linear regression for two variables is based on a linear equation with one independent variable.

The equation has the form: y = a + bx

a and b are constant numbers.
x is the independent variable.
y is the dependent variable.

Typically, you choose a value to substitute for the independent variable and then solve
for the dependent variable.
LINEAR EQUATIONS

The graph of a linear equation of the form y = a + bx is a straight line.
Any line that is not vertical can be described by this equation.
SLOPE AND Y-INTERCEPT OF A LINEAR EQUATION

For the linear equation y = a + bx, b = slope and a = y-intercept.

From algebra recall that the slope is a number that describes the
steepness of a line, and the y-intercept is the y-coordinate of the
point (0, a) where the line crosses the y-axis.
EXAMPLE OF LINEAR EQUATIONS

Aaron's Word Processing Service (AWPS) does word processing. The rate for
services is $32 per hour plus a $31.50 one-time charge. The total cost to a
customer depends on the number of hours it takes to complete the job. Find the
equation that expresses the total cost in terms of the number of hours required
to complete the job.
Let x= the number of hours it takes to get the job done.
Let y= the total cost to the customer.
The $31.50 is a fixed cost.
If it takes x hours to complete the job, then (32)(x) is the cost of the word
processing only.
The total cost is: y= 31.50 + 32x
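The AWPS equation can be checked with a few lines of Python. This is only a sketch: the function name total_cost and its parameter names are made up for illustration, while the fee and rate come from the example.

```python
def total_cost(hours, fixed_fee=31.50, hourly_rate=32.0):
    """Total cost y = a + b*x: a one-time fee (a) plus an hourly rate (b)."""
    return fixed_fee + hourly_rate * hours


# Cost of jobs that take 0, 1, and 4 hours.
for hours in (0, 1, 4):
    print(hours, total_cost(hours))
# prints:
# 0 31.5
# 1 63.5
# 4 159.5
```

At x = 0 the cost equals the y-intercept (the fixed fee), and each extra hour adds the slope, $32.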
EXAMPLE OF LINEAR EQUATIONS

Svetlana tutors to make extra money for college. For each
tutoring session, she charges a one-time fee of $25 plus $15 per
hour of tutoring. A linear equation that expresses the total amount
of money Svetlana earns for each session she tutors is y = 25 + 15x.
What are the independent and dependent variables? What is the
y-intercept and what is the slope? Interpret them using complete
sentences.
SOLUTION TO EXAMPLE

The independent variable (x) is the number of hours Svetlana
tutors each session.
The dependent variable (y) is the amount, in dollars, Svetlana
earns for each session.
The y-intercept is 25 (a = 25).
At the start of the tutoring session, Svetlana charges a one-time
fee of $25 (this is when x= 0).
The slope is 15 (b= 15).
For each session, Svetlana earns $15 for each hour she tutors.
YOU TRY
Exercise 1.
A vacation resort rents SCUBA equipment to certified
divers. The resort charges an up-front fee of $25 and
another fee of $12.50 an hour.
What are the dependent and independent variables?

Exercise 2.
A vacation resort rents SCUBA equipment to
certified divers. The resort charges an up-front fee
of $25 and another fee of $12.50 an hour.
Find the equation that expresses the total fee in
terms of the number of hours the equipment is
rented.

Exercise 3.
Is the equation y = 10 + 5x – 3x² linear? Why or
why not?
12.2 | SCATTER PLOTS
SCATTERPLOT

Before we take up the discussion of linear regression and
correlation, we need to examine a way to display the relation
between two variables x and y. The most common and easiest
way is a scatter plot. The following example illustrates a scatter
plot.
SCATTERPLOT
A scatter plot shows the direction of a relationship between the variables.
A clear direction happens when there is either:
• High values of one variable occurring with high values of the other variable
or low values of one variable occurring with low values of the other variable.
• High values of one variable occurring with low values of the other variable.
You can determine the strength of the relationship by looking at the scatter
plot and seeing how close the points are to a line, a power function, an
exponential function, or to some other type of function.
For a linear relationship there is an exception. Consider a scatter plot where
all the points fall on a horizontal line providing a "perfect fit."
The horizontal line would in fact show no relationship.
TRENDS IN SCATTERPLOTS

When you look at a scatterplot, you want to notice the overall
pattern and any deviations from the pattern. The following
scatterplot examples illustrate these concepts.
PATTERNS IN SCATTERPLOTS
SCATTERPLOT TO LINEAR REGRESSION

In this chapter, we are interested in scatter plots that show a linear
pattern. Linear patterns are quite common.
The linear relationship is strong if the points are close to a straight line,
except in the case of a horizontal line where there is no relationship.
If we think that the points show a linear relationship, we would like to
draw a line on the scatter plot. This line can be calculated through a
process called linear regression.
However, we only calculate a regression line if one of the variables
helps to explain or predict the other variable. If x is the independent
variable and y the dependent variable, then we can use a regression
line to predict y for a given value of x.
FIGURE 12.5

Scatter plot showing the number of m-commerce users (in millions) by year.
FIGURE 12.9

Scatter plot showing the scores on the final exam based on scores from the third exam.
EXAMPLE
Does the scatter plot appear linear? Strong or weak? Positive or negative?
EXAMPLE
Does the scatter plot appear linear? Strong or weak? Positive or negative?
EXAMPLE
Does the scatter plot appear linear? Strong or weak? Positive or negative?
EXAMPLE
A random sample of ten professional athletes produced the following data where x is the numb
SOLUTION
EXAMPLE
12.3 | THE REGRESSION EQUATION
LINE OF BEST FIT

Data rarely fit a straight line exactly. Usually, you must
be satisfied with rough predictions. Typically, you have
a set of data whose scatter plot appears to "fit" a
straight line. This is called a Line of Best Fit or Least-Squares Line.
LEAST-SQUARES REGRESSION LINE

The third exam score, x, is the independent variable and the final exam score, y, is the
dependent variable. We will plot a regression line that best "fits" the data.
If each of you were to fit a line "by eye," you would draw different lines. We can use
what is called a least-squares regression line to obtain the best fit line.
Y-HAT

ŷ is read "y hat" and is the estimated value of y. It is the value of y
obtained using the regression line. It is not generally equal to y from data.
RESIDUAL OR ERROR

The term y₀ – ŷ₀ = ε₀ is called the "error" or residual. It is not an error in the sense of
a mistake. The absolute value of a residual measures the vertical distance between
the actual value of y and the estimated value of y. In other words, it measures the
vertical distance between the actual data point and the predicted point on the line.
SUM OF SQUARED ERRORS (SSE)

ε = the Greek letter epsilon

For each data point, you can calculate the residuals or errors,
yᵢ – ŷᵢ = εᵢ for i = 1, 2, 3, ..., 11.
Each |ε| is a vertical distance.
If you square each ε and add them together, you get the Sum of Squared
Errors (SSE).
Using calculus, you can determine the values of a and b that make the
SSE a minimum. When you make the SSE a minimum, you have
determined the points that are on the line of best fit.
It turns out that the line of best fit has the equation: ŷ = a + bx
LINEAR REGRESSION

The process of fitting the best-fit line is called linear
regression.
The idea behind finding the best-fit line is based on the
assumption that the data are scattered about a straight line.
The criterion for the best-fit line is that the sum of the squared
errors (SSE) is minimized, that is, made as small as possible.
Any other line you might choose would have a higher SSE
than the best-fit line.
This best-fit line is called the least-squares regression line.
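The SSE criterion can be illustrated with a short Python sketch. The five data points below are hypothetical, chosen only for illustration, and the slides do not give the closed-form expressions for a and b. The ones used here are the standard least-squares results, b = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)² and a = ȳ – b·x̄.

```python
def sse(xs, ys, a, b):
    """Sum of squared errors: add up (y - y_hat)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line, using the standard closed form."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, not taken from the chapter.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
best = sse(xs, ys, a, b)
print(f"best-fit line: y-hat = {a:.2f} + {b:.2f}x, SSE = {best:.3f}")

# Any other line has a larger SSE; two nudged lines as a check.
print(sse(xs, ys, a + 0.5, b) > best)   # True
print(sse(xs, ys, a, b - 0.2) > best)   # True
```

Shifting the intercept or tilting the slope away from the fitted values increases the SSE, which is what "least squares" means.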
FIGURE 12.10
FIGURE 12.11
FIGURE 12.12
FIGURE 12.13
FIGURE 12.14
PRACTICE OF USING THE CALCULATOR

A random sample of 11 statistics students produced the following data,
where x is the third exam score out of 80, and y is the final exam score out of 200.
What is the best-fit line for this data?

x    y
65   175
67   133
71   185
71   163
66   126
75   198
67   153
70   163
71   159
69   151
69   159
PREDICTION WARNING

Remember, it is always important to plot a scatter diagram first.
If the scatter plot indicates that there is a linear relationship between
the variables, then it is reasonable to use a best fit line to make
predictions for y given x within the domain of x-values in the sample
data, but not necessarily for x-values outside that domain.
You could use the line to predict the final exam score for a student
who earned a grade of 73 on the third exam. You should NOT use the
line to predict the final exam score for a student who earned a grade
of 50 on the third exam, because 50 is not within the domain of the x-
values in the sample data, which are between 65 and 75.
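One way to build this warning into a calculation is to refuse predictions outside the sampled x-range. The sketch below is illustrative only: the function name and the choice to raise an error are assumptions, and the coefficients –173.51 and 4.83 are the ones the chapter reports for the third exam/final exam line.

```python
def predict_final(third_exam_score, a=-173.51, b=4.83, x_min=65, x_max=75):
    """Predict the final exam score, but only inside the sampled x-range."""
    if not (x_min <= third_exam_score <= x_max):
        raise ValueError(
            f"{third_exam_score} is outside the sample domain "
            f"[{x_min}, {x_max}]; the line should not be used here"
        )
    return a + b * third_exam_score


print(round(predict_final(73), 2))  # 179.08

try:
    predict_final(50)
except ValueError as err:
    print("No prediction:", err)
```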
UNDERSTANDING SLOPE

The slope of the line, b, describes how changes in the variables
are related. It is important to interpret the slope of the line in the
context of the situation represented by the data. You should be
able to write a sentence interpreting the slope in plain English.
INTERPRETATION OF THE SLOPE: The slope of the best-fit line
tells us how the dependent variable (y) changes for every one unit
increase in the independent (x) variable, on average.
THIRD EXAM vs FINAL EXAM EXAMPLE
Slope: The slope of the line is b = 4.83.
Interpretation: For a one-point increase in the score on the third
exam, the final exam score increases by 4.83 points, on average.
THE CORRELATION COEFFICIENT “R”

Besides looking at the scatter plot and seeing that a line seems reasonable,
how can you tell if the line is a good predictor? Use the correlation coefficient
as another indicator (besides the scatterplot) of the strength of the relationship
between x and y.
The correlation coefficient, r, developed by Karl Pearson in the early 1900s, is
numerical and provides a measure of strength and direction of the linear
association between the independent variable x and the dependent variable y.
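The correlation coefficient is calculated as
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
where n = the number of data points.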
If you suspect a linear relationship between x and y, then r can measure how
strong the linear relationship is.
(NEVER CALCULATE BY HAND)
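Since r is not meant to be computed by hand, here is a minimal Python sketch that does the arithmetic. It uses the standard computational form of Pearson's r, with n·Σxy – (Σx)(Σy) in the numerator. The function name pearson_r is made up for illustration, and the data are the third exam/final exam scores from the practice slide.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

r = pearson_r(third_exam, final_exam)
print(round(r, 4))  # 0.6631
```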
WHAT THE VALUE OF R TELLS US

• The value of r is always between –1 and +1: –1 ≤ r ≤ 1.
• The size of the correlation r indicates the strength of the linear
relationship between x and y. Values of r close to –1 or to +1
indicate a stronger linear relationship between x and y.
• If r = 0 there is absolutely no linear relationship between x and
y (no linear correlation).
• If r = 1, there is perfect positive correlation. If r = –1, there is
perfect negative correlation. In both these cases, all of the original
data points lie on a straight line. Of course, in the real world, this
will not generally happen.
WHAT THE SIGN OF R TELLS US
• A positive value of r means that when x increases, y tends to increase
and when x decreases, y tends to decrease (positive correlation).
• A negative value of r means that when x increases, y tends to decrease
and when x decreases, y tends to increase (negative correlation).
• The sign of r is the same as the sign of the slope, b, of the best-fit line.
NOTE Strong correlation does not suggest that x causes y or y causes x.
We say "correlation does not imply causation."
FIGURE 12.15

(a) A scatter plot showing data with a positive correlation. 0 < r < 1
(b) A scatter plot showing data with a negative correlation. –1 < r < 0
(c) A scatter plot showing data with zero correlation. r = 0
THE COEFFICIENT OF DETERMINATION “R²”
The variable r² is called the coefficient of determination and is the square
of the correlation coefficient, but is usually stated as a percent, rather
than in decimal form. It has an interpretation in the context of the data:
• r², when expressed as a percent, represents the percent of variation in
the dependent (predicted) variable y that can be explained by variation in
the independent (explanatory) variable x using the regression (best-fit)
line.
• 1 – r², when expressed as a percentage, represents the percent of
variation in y that is NOT explained by variation in x using the regression
line. This can be seen as the scattering of the observed data points about
the regression line.
THE COEFFICIENT OF DETERMINATION “R²”

Consider the third exam/final exam example introduced in the previous section.
• The line of best fit is: ŷ = –173.51 + 4.83x
• The correlation coefficient is r = 0.6631
• The coefficient of determination is r² = (0.6631)² = 0.4397
• Interpretation of r² in the context of this example:
• Approximately 44% of the variation (0.4397 is approximately 0.44) in the final-exam
grades can be explained by the variation in the grades on the third exam, using the
best-fit regression line.
• Therefore, approximately 56% of the variation (1 – 0.44 = 0.56) in the final exam
grades cannot be explained by the variation in the grades on the third exam, using the
best-fit regression line. (This is seen as the scattering of the points about the line.)
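The "explained variation" reading of r² can be checked numerically. The sketch below fits the line with the standard closed-form least-squares expressions, which the slides do not state, and then compares the scatter about the line (SSE) with the total variation in y. The name SST for that total is an assumption of this example, not a term used in the chapter.

```python
third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(third_exam)
x_bar = sum(third_exam) / n
y_bar = sum(final_exam) / n

# Least-squares slope and intercept (standard closed form).
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(third_exam, final_exam))
     / sum((x - x_bar) ** 2 for x in third_exam))
a = y_bar - b * x_bar

# SSE: variation left over around the line; SST: total variation in y.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(third_exam, final_exam))
sst = sum((y - y_bar) ** 2 for y in final_exam)

r_squared = 1 - sse / sst
print(f"explained: {r_squared:.1%}, unexplained: {1 - r_squared:.1%}")
# prints: explained: 44.0%, unexplained: 56.0%
```

The value 1 – SSE/SST agrees with squaring r = 0.6631, so both routes give about 0.4397.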
12.5 | PREDICTION
PREDICTION

Recall the third exam/final exam example. We examined the scatterplot and
showed that the correlation coefficient is significant. We found the equation
of the best-fit line for the final exam grade as a function of the grade on the
third-exam. We can now use the least-squares regression line for prediction.
Suppose you want to estimate, or predict, the mean final exam score of
statistics students who received 73 on the third exam. The exam scores (x-values)
range from 65 to 75. Since 73 is between the x-values 65 and 75,
substitute x = 73 into the equation.
Then: ŷ = −173.51 + 4.83(73) = 179.08
We predict that statistics students who earn a grade of 73 on the third exam
will earn a grade of 179.08 on the final exam, on average.
FIGURE 12.20
FIGURE 12.21
FIGURE 12.22
FIGURE 12.28
FIGURE 12.29
FIGURE 12.30
FIGURE 12.32
EXAMPLE
