Lec 4

The document discusses regression analysis, focusing on how to explore associations between numerical variables using scatterplots and correlation coefficients. It explains the importance of visualizing data trends, measuring the strength of associations, and modeling linear trends with regression equations. Additionally, it emphasizes the need for careful interpretation of results, including the slope and y-intercept of regression equations, and warns against extrapolation and assuming causation from correlation.

Uploaded by

slenderwather
Copyright
© © All Rights Reserved

Regression Analysis:

Exploring Associations
between Variables
Topics

• Explore associations between numerical
variables graphically and numerically
• Model linear trends using a regression line
SECTION 4.1 VISUALIZING
VARIABILITY WITH A SCATTERPLOT

• Use Technology to
Create a Scatterplot
• Use Scatterplots to
Investigate
Associations
Between Numerical
Variables
Visualizing Variability with a Scatterplot

Scatterplot
• The primary tool for examining relationships
between two numerical variables.
• Each point in the scatterplot represents one
observation.
• Usually created using technology such as a
computer software program or a graphing
calculator.
Median Age of Marriage for Women

Each point in the scatterplot represents one U.S. state or the
District of Columbia. Its coordinates give the median age of
marriage for women and for men in that state, so each data point
has the form (median age for women, median age for men).
Examining Scatterplots

Note three features:


1. Trend (like center)
2. Strength (like spread)
3. Shape
Trend

The general tendency of the scatterplot as
you read from left to right. Typical trends:
1. Increasing (uphill), called a positive association
2. Decreasing (downhill), called a negative
association
3. No trend, if there is neither an uphill nor
downhill tendency
Example 1: Positive Trend

This scatterplot shows a positive trend because the graph goes
uphill as you scan from left to right. This means as the age of the
car increases, the mileage also tends to increase.
Example 2: Negative Trend

This scatterplot shows a negative trend because the graph goes
downhill as you scan from left to right. This means as literacy
rate increases, total births per woman tends to decrease.
Example 3: No Trend

This scatterplot shows no trend because the points seem to
follow no predictable pattern. This means that for every age
group we can find relatively fast and relatively slow runners.
Marathon running speed does not seem to be related to age of
runner.
Example 4: Trend, Neither Positive nor
Negative

This data set shows an association between two variables,
but it cannot be characterized as either positive or negative.
Strength of an Association

Scatterplots with large amounts of scatter or
vertical variation indicate a weak association.

Scatterplots with small amounts of scatter or little
vertical variation indicate a strong association.
Example 5: Strength of Association (1 of 2)

Is there a stronger association between height and
weight or between waist size and weight?
There seems to be a stronger association between waist
size and weight (less vertical variation in the graph).
Shape: Linear
Scatterplots that cluster around a line model linear
trends. This scatterplot shows there is a linear
association between volume of searches for the
word “vampire” and the word “zombie.”
Shape: Non-Linear
Sometimes there are trends in data that are non-linear
– trends that are better modeled by a curve rather than
a line. This scatterplot shows there is a non-linear
trend between temperature and pollutant ozone levels.
Writing Descriptions of Associations

When writing a description of an association
between two numerical variables, always
include:
1. Trend
2. Shape
3. Strength
In addition, mention any observations that don’t fit
the general trend (if any).
Example 6: Describing Associations
(1 of 2)

How would you describe the association between median
age of marriage for women and median age of marriage for
men in the 50 states and the District of Columbia?
Example 6: Describing Associations
(2 of 2)

The association between median age of marriage for women and
the median age of marriage for men is positive and linear. In
other words, women who marry at an older age tend to marry
men who are also older. The association is strong because
there is very little vertical variation in the graph.
Be Careful Describing Associations

• Always use a phrase like “tends to” when
describing an association because the trend you
are describing has variability – the association
you are describing may not be true for all
individuals.
• Always point out any data points that appear to
be unusual or not part of the general pattern.
SECTION 4.2 MEASURING
STRENGTH OF ASSOCIATION
WITH CORRELATION
• Find and Interpret
the Correlation
Coefficient
Correlation Coefficient

• A number that measures the strength of a linear
relationship
• Symbol: r
• Always between −1 and +1
• r values close to −1 or +1 indicate a strong linear
association
• r values close to 0 indicate a weak linear association
r Values of 1 and −1

Correlation coefficients of 1 and −1 indicate perfect
positive and perfect negative associations. The
data points lie exactly on a line.
Visualizing the Correlation Coefficient

Notice that as the magnitude of r increases toward 1, there is less
vertical variation in the data (the trend is stronger).
Computing the Correlation Coefficient

Background:
Data are converted to z-scores which are multiplied
together. These products are then added and the
resulting sum is divided by n − 1.
In practice: The correlation coefficient is found
using technology.
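Written as a formula, the background description above corresponds to the expression below. The notation is an assumption added here for illustration: z_x and z_y denote the z-scores of the paired x and y values, computed with the sample standard deviations s_x and s_y.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i}\, z_{y_i},
\qquad
z_{x_i} = \frac{x_i - \bar{x}}{s_x},
\quad
z_{y_i} = \frac{y_i - \bar{y}}{s_y}
```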
Example 7
The table below shows the heights and weights for
6 women. Compute and interpret r, the correlation
coefficient.

Height 61 62 63 64 66 68
Weight 104 110 141 125 170 160
Stat Crunch Output (1 of 2)
• Simple linear regression results:
Dependent Variable: Weight
Independent Variable: Height
Weight = −442.88235 + 9.0294118 Height
Sample size: 6
R ( correlation coefficient ) = 0.88093363
R-sq = 0.77604407
Page 1 of the output has a lot of information, but
we can see r = 0.881. Since r is close to 1, we
would say there is a strong linear association
between height and weight.
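As a rough check on the software output, the z-score recipe from the background note can be sketched in plain Python. The function names `z_scores` and `correlation` are illustrative choices, and the sketch assumes the sample standard deviation (dividing by n − 1) is used for the z-scores; it is not a description of how StatCrunch works internally.

```python
import math

heights = [61, 62, 63, 64, 66, 68]        # inches, from Example 7
weights = [104, 110, 141, 125, 170, 160]  # pounds, from Example 7


def z_scores(values):
    """Convert values to z-scores using the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def correlation(xs, ys):
    """Multiply paired z-scores, add the products, and divide by n - 1."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / (len(xs) - 1)


r = correlation(heights, weights)
print(round(r, 3))  # 0.881
```

The result agrees with the r reported in the output above, which suggests the by-hand recipe and the software are computing the same quantity.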
Stat Crunch Output (2 of 2)

Page 2 provides a graph of the data, including a
graph of the line that best fits the data.
Notes About the Correlation Coefficient

• Changing the order of the variables does not
change r.
• Adding a constant or multiplying by a positive
constant does not affect r.
• r is unitless.
• r is only useful to measure a linear trend –
always graph your data first before computing r
to make sure the association is linear!
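The first three notes can be checked numerically with the height and weight data from Example 7. This is a minimal sketch: `correlation` is an illustrative helper using the sums-of-squares form of r (algebraically the same as the z-score recipe), the conversion to centimeters stands in for "multiplying by a positive constant", and the 10-pound shift is a made-up constant chosen only for the demonstration.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient from centered sums of products and squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


heights_in = [61, 62, 63, 64, 66, 68]        # inches
weights_lb = [104, 110, 141, 125, 170, 160]  # pounds

r_original = correlation(heights_in, weights_lb)

# Switching the order of the variables
r_swapped = correlation(weights_lb, heights_in)

# Multiplying by a positive constant (inches to centimeters)
heights_cm = [2.54 * h for h in heights_in]
r_rescaled = correlation(heights_cm, weights_lb)

# Adding a constant (hypothetical 10-pound shift, for illustration only)
weights_shifted = [w + 10 for w in weights_lb]
r_shifted = correlation(heights_in, weights_shifted)

for label, value in [("original", r_original), ("swapped", r_swapped),
                     ("rescaled", r_rescaled), ("shifted", r_shifted)]:
    print(f"{label:>9}: r = {value:.4f}")
```

All four values agree to within floating-point rounding, and because changing units leaves r unchanged, the sketch also illustrates why r is unitless.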
SECTION 4.3 MODELING LINEAR
TRENDS

• Use Technology to
Write the
Regression Equation
• Use the Regression
Equation to Make
Appropriate
Predictions
Regression Line

• A tool for making predictions about future
observed values
• Has the form y = a + bx, where a is the y-intercept
and b is the slope
• Usually generated using appropriate technology
Example 8: Regression Equation

The scatterplot shows a fairly strong positive linear trend.
The regression equation has a slope of 2.16 and a
y-intercept of 30.46. The positive trend indicates that players
who hit more home runs tend to have more RBIs.
Example 9: Using the Regression
Equation (1 of 2)

The scatterplot shows a negative linear trend. As age of car
increases, value tends to decrease. The regression
equation is: predicted value = 21,375 − 1215 age
Example 9: Using the Regression
Equation (2 of 2)

predicted value = 21,375 − 1215 age

Use the regression equation to predict the value of
a car that is 12 years old.

predicted value = 21,375 − 1215 age
predicted value = 21,375 − 1215 (12)
predicted value = $6795
Finding the Regression Equation

• To find the regression equation using
technology, follow the same steps as for finding
the correlation coefficient.
Example 10
The table below shows the heights and weights for
six women. Find the regression equation that
describes the relationship between height and
weight.
Height 61 62 63 64 66 68
Weight 104 110 141 125 170 160

Note: We previously determined that this data
followed a linear trend, so it is appropriate to find
the regression equation.
Stat Crunch Output
• Simple linear regression results:
Dependent Variable: Weight
Independent Variable: Height
Weight = −442.88235 + 9.0294118 Height
Sample size: 6
R ( correlation coefficient ) = 0.88093363
R-sq = 0.77604407
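The slides do not say how the software arrives at this equation. A minimal sketch, assuming the usual least-squares formulas (slope = sum of centered products divided by sum of squared x deviations, intercept = mean of y minus slope times mean of x), reproduces the same coefficients. The helper name `fit_line` is illustrative.

```python
heights = [61, 62, 63, 64, 66, 68]        # inches (x, explanatory variable)
weights = [104, 110, 141, 125, 170, 160]  # pounds (y, response variable)


def fit_line(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y = a + b x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


a, b = fit_line(heights, weights)
# Print with the same number of decimals as the output above
print(f"Weight = {a:.5f} + {b:.7f} Height")
```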
Example 11: Using the Regression
Equation

Weight = −442.882 + 9.03 Height

Use the regression equation to predict the weight
of a woman who is 65 inches tall.
Weight = −442.882 + 9.03 Height
Weight = −442.882 + 9.03 ( 65 )
Weight = 144.07 pounds
Notes About the Regression Equation

• Order matters. If x and y are switched, the
regression equation will change.
• We use the x-variable to make predictions about
the y-variable, so the x-variable is called the
explanatory or predictor variable. It is also called
the independent variable.
• The y-variable is the response or predicted
variable. It is also called the dependent
variable.
Example 12

The table below shows the heights and weights for
six women. Find the regression equation that
describes the relationship between height and
weight. This time use weight as the predictor or
explanatory variable (x) and height as the
predicted or response variable (y).
Height 61 62 63 64 66 68
Weight 104 110 141 125 170 160
Example 13
Simple linear regression results:
Dependent Variable: Height
Independent Variable: Weight
Height = 52.397256 + 0.085946249 Weight
Sample size: 6
R ( correlation coefficient ) = 0.88093363
R-sq = 0.77604407
Note: r ( correlation coefficient ) remains the same;
The regression equation is different from our
previous result.
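The effect of switching x and y can be sketched by fitting the line both ways with the same data. The helpers `fit_line` and `correlation` are illustrative names, and the least-squares formulas are assumed rather than taken from the software's documentation.

```python
import math

heights = [61, 62, 63, 64, 66, 68]        # inches
weights = [104, 110, 141, 125, 170, 160]  # pounds


def centered_sums(xs, ys):
    """Return (sxy, sxx, syy, mean_x, mean_y) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy, mean_x, mean_y


def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting ys from xs."""
    sxy, sxx, _, mean_x, mean_y = centered_sums(xs, ys)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def correlation(xs, ys):
    sxy, sxx, syy, _, _ = centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


a_w, b_w = fit_line(heights, weights)  # Weight predicted from Height
a_h, b_h = fit_line(weights, heights)  # Height predicted from Weight

print(f"Weight = {a_w:.3f} + {b_w:.3f} Height")
print(f"Height = {a_h:.3f} + {b_h:.4f} Weight")
print(f"r either way: {correlation(heights, weights):.4f}, "
      f"{correlation(weights, heights):.4f}")
```

The slope formula divides by the spread of whichever variable plays the role of x, which is why the two equations differ, while the formula for r treats both variables symmetrically.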
Interpreting the Slope of the
Regression Equation

• Slope tells us how much the predicted y-variable changes
when the x-variable is increased by 1 unit.
• A slope of 0 means there is no linear
relationship between x and y. (Whether a slope looks
close to 0 depends on the units of x and y.)
Example 14: Interpreting the Slope

Weight = −442.882 + 9.03 Height

The slope of this line is 9.03. The y-variable is
weight and the x-variable is height.
Interpretation:
For every additional inch in height, weight tends to
increase by 9.03 pounds.
Every increase of 1 inch in height is associated
with an increase in weight of 9.03 pounds.
Example 15: Interpreting Slope
In a previous example on the association between
age of car and value of car, the regression
equation was: predicted value = 21,375 − 1215 age
Interpret the slope of the regression equation.
Slope = −1215, x-variable is age, y-variable is value.
Interpretation:
For each additional year of age, value of car tends
to decrease by $1215.
Each additional year of age is associated with a
decrease of $1215 in value.
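One way to see the slope interpretation is to compare the predictions for two cars whose ages differ by one year. This is a small sketch using the regression equation from the example; the function name `predicted_value` and the choice of ages 11 and 12 are illustrative.

```python
def predicted_value(age_years):
    """Predicted car value in dollars: 21,375 - 1215 * age."""
    return 21375 - 1215 * age_years


value_at_11 = predicted_value(11)
value_at_12 = predicted_value(12)

# The difference between predictions one year apart is the slope
change_per_year = value_at_12 - value_at_11
print(value_at_11, value_at_12, change_per_year)  # 8010 6795 -1215
```

The same difference of −1215 appears for any pair of ages one year apart, because the equation is a straight line.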
Interpreting the y-Intercept of the
Regression Equation

• The y-intercept is the predicted value when x is 0.
• The y-intercept is meaningful only if it makes
sense for x to equal 0.
Example 16: Interpreting the y-Intercept
(1 of 2)

In a previous example on the association between
age of car and value of car, the regression
equation was: predicted value = 21,375 − 1215 age

Interpret the y-intercept of the equation, if
appropriate.
y-intercept = 21,375. It is the predicted value when
x (age) is 0. In other words, when the car is new,
its predicted value is $21,375.
Example 16: Interpreting the y-Intercept
(2 of 2)

In a previous example on the association between
height and weight in women, the regression
equation was:
Weight = −442.882 + 9.03 Height

Interpret the y-intercept, if appropriate.
y-intercept = −442.882. It is the predicted value for weight if
x (height) is 0. It is impossible to weigh −442 pounds and it
is impossible for a woman to be 0 inches tall, so in this
case the y-intercept is meaningless.
SECTION 4.4 EVALUATING THE
LINEAR MODEL

• Use Linear Models
to Describe
Associations Only
When Appropriate
• Compute and
Interpret the
Coefficient of
Determination
Cautionary Notes Regarding
Regression
• Don’t use linear models to describe non-linear
associations. Always look at a scatterplot first!
• Correlation is not causation! An association between two
variables is not sufficient evidence to conclude that a
cause-and-effect relationship exists between the
variables.
• Beware of outliers that can have a big effect on r. Always
check the scatterplot for outliers first.
• Don’t extrapolate! Don’t make predictions beyond the
range of the data, because we are not sure that the linear
trend will continue beyond the range of the data.
Example 17: Extrapolation (1 of 2)

In a previous example we found there was a
strong linear relationship between heights and
weights in women, and the regression
equation is Weight = −442.882 + 9.03 Height.

What weight does this equation predict for a
woman who is 36 inches tall?
Example 17: Extrapolation (2 of 2)

Weight = −442.882 + 9.03 Height

Weight = −442.882 + 9.03 ( 36 ) = −117.8 pounds

Note: The data covered women 61 to
68 inches tall. It is not appropriate to use the
regression equation to predict the weight of a
36-inch-tall woman since 36 is beyond the range of
the data (extrapolation).
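One way to guard against accidental extrapolation is to have the prediction step refuse inputs outside the observed range. The sketch below is a hypothetical helper, not a feature of any particular statistics package; `predict_weight` and its error message are illustrative, and the coefficients are the rounded ones used in the example.

```python
observed_heights = [61, 62, 63, 64, 66, 68]  # inches, the data behind the equation


def predict_weight(height_in, heights=observed_heights):
    """Predict weight in pounds, but only within the observed height range."""
    low, high = min(heights), max(heights)
    if not low <= height_in <= high:
        raise ValueError(
            f"height {height_in} is outside the data range {low} to {high}; "
            "predicting here would be extrapolation"
        )
    return -442.882 + 9.03 * height_in


print(round(predict_weight(65), 2))  # within range: 144.07

try:
    predict_weight(36)
except ValueError as err:
    print("Refused:", err)
```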
Coefficient of Determination: r Squared

• The square of r, the correlation coefficient
• Usually converted to a percentage, so always
between 0% and 100%
• Measures how much variation in the response
variable is explained by the explanatory variable
• The larger r², the smaller the amount of
variation or scatter about the regression line.
Example 18: r Squared

For the data on car age and predicted value,
r = −0.778. Compute and interpret r².
r² = (−0.778)² = 0.605, so r² = 60.5%.

Car age explains about 60.5% of the variation in
car value.
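The "explained variation" reading of r squared can be sketched numerically. The first part squares the r given in this example. The second part uses the height and weight data from the earlier examples, since the raw car data are not shown in the slides, and relies on a standard result that is assumed here rather than stated in the lecture: for a least-squares line, r squared equals 1 minus (variation left around the line) divided by (total variation in y). The helper name `fit_line` is illustrative.

```python
# Part 1: the car example, squaring the given correlation coefficient
r_car = -0.778
r_squared_car = r_car ** 2
print(f"{r_squared_car:.1%}")  # 60.5%

# Part 2: explained variation for the height/weight data
heights = [61, 62, 63, 64, 66, 68]        # inches
weights = [104, 110, 141, 125, 170, 160]  # pounds


def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


a, b = fit_line(heights, weights)
mean_w = sum(weights) / len(weights)

# Total variation in weight around its mean
total_variation = sum((w - mean_w) ** 2 for w in weights)
# Variation left over around the regression line
leftover_variation = sum((w - (a + b * h)) ** 2 for h, w in zip(heights, weights))

r_squared = 1 - leftover_variation / total_variation
print(f"{r_squared:.1%}")  # 77.6%
```

The second value matches the R-sq reported in the regression output for the height and weight data.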
Section 4.1 Question
The scatterplot shows what type of relationship
between median age of marriage for men and women?

A. A strong positive relationship
B. A weak positive relationship
C. A strong negative relationship
D. A weak negative relationship
Section 4.2 Question 1
There is a negative association between the
percentage of smoke-free homes and the
percentage of high school students who smoke.
This means:
A. As the percentage of smoke-free homes has
increased, the percentage of high school smokers
has also increased.
B. As the percentage of smoke-free homes has
increased, the percentage of high school smokers
has decreased.
C. We cannot predict any trends from the given
information.
Section 4.4 Question
For a certain group of cars, there is a strong association
between city and highway mileage that can be described
by the equation:
Predicted Hwy MPG = 7.79 + 0.95 City MPG
Which of the following is an interpretation of the slope?
A. Each increase of 1 MPG in highway mileage is
associated with an increase of 0.95 in city mileage.
B. Each increase of 1 MPG in city mileage is associated
with an increase of 0.95 in highway mileage.
C. Each increase of 1 MPG in city mileage is associated
with an increase of 7.79 in highway mileage.
D. The slope of this equation is meaningless.
Section 4.2 Question 2

Which of the following correlation coefficient
values indicates the strongest association
between two variables?
A. 0.12
B. 0.42
C. 0.78
D. −0.92.
(Closest to +1 or −1)
Section 4.3 Question

When doing a regression analysis on a data
set, which of the following remain the same
no matter which variable is chosen for x and
which is chosen for y?
A. The y-intercept of the regression equation
B. The slope of the regression equation
C. The correlation coefficient
D. All of the above
