Looking at Data: Relationships: Least-Squares Regression

The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible. The equation completely describes the regression line.

Uploaded by

crutili


Looking at data: relationships
Least-squares regression
IPS chapter 2.3

© 2006 W. H. Freeman and Company

Objectives (IPS chapter 2.3)
Least-squares regression

• The regression line
• Making predictions: interpolation
• Coefficient of determination, r²
• Transforming relationships
Correlation tells us about
strength (scatter) and direction
of the linear relationship
between two quantitative
variables.

In addition, we would like to have a numerical description of how both
variables vary together. For instance, is one variable increasing faster
than the other one? And we would like to make predictions based on that
numerical description.
But which line best
describes our data?
The regression line
The least-squares regression line is the unique line such that the sum
of the squared vertical (y) distances between the data points and the
line is the smallest possible.

The distances between the points and the line are squared so that all are
positive values. This is done so that the distances can be properly added
(Pythagoras).
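As a rough illustration of the criterion above, the sketch below computes the sum of squared vertical distances for a line and checks that nudging the least-squares line makes that sum larger. The five data points and the helper name `sum_squared_residuals` are made up for this example; `numpy.polyfit` is used only as a convenient least-squares fitter.

```python
import numpy as np

# Made-up sample data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])


def sum_squared_residuals(a, b, x, y):
    """Sum of squared vertical distances between the points and the line a + b*x."""
    residuals = y - (a + b * x)
    return float(np.sum(residuals ** 2))


# Least-squares fit; np.polyfit returns the slope first, then the intercept
b_ls, a_ls = np.polyfit(x, y, 1)
sse_ls = sum_squared_residuals(a_ls, b_ls, x, y)

# Shifting or tilting the line away from the fit increases the sum of squares
sse_shifted = sum_squared_residuals(a_ls + 0.2, b_ls, x, y)
sse_tilted = sum_squared_residuals(a_ls, b_ls + 0.1, x, y)

print(sse_ls, sse_shifted, sse_tilted)
```

The shift and tilt amounts (0.2 and 0.1) are arbitrary; any other change to the intercept or slope would also give a larger sum.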
Properties
The least-squares regression line can be shown to have this equation:

$\hat{y} = \left(\bar{y} - r\,\frac{s_y}{s_x}\,\bar{x}\right) + r\,\frac{s_y}{s_x}\,x$, or $\hat{y} = a + bx$

ŷ is the predicted y value (y hat)

b is the slope
a is the y-intercept

"a" is in units of y
"b" is in units of y / units of x
How to:
First we calculate the slope of the line, b, from statistics we already know:

$b = r\,\frac{s_y}{s_x}$

r is the correlation.
s_y is the standard deviation of the response variable y.
s_x is the standard deviation of the explanatory variable x.

Once we know b, the slope, we can calculate a, the y-intercept:

$a = \bar{y} - b\,\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample
means of the x and y variables

This means that we don't have to calculate a lot of squared distances to find the least-
squares regression line for a data set. We can instead rely on the equation.

But typically, we use a 2-var stats calculator or stats software.
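The two-step recipe above (slope from r, s_x and s_y, then intercept from the means) can be sketched in a few lines. The data are made up, and the comparison against `numpy.polyfit` is only a sanity check that the formulas agree with a direct least-squares fit.

```python
import numpy as np

# Made-up sample data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Ingredients: correlation and the two sample standard deviations
r = np.corrcoef(x, y)[0, 1]
s_x = np.std(x, ddof=1)  # ddof=1 gives the sample standard deviation
s_y = np.std(y, ddof=1)

# Step 1: slope b = r * s_y / s_x
b = r * s_y / s_x
# Step 2: intercept a = mean(y) - b * mean(x)
a = np.mean(y) - b * np.mean(x)

# Direct least-squares fit for comparison (slope first, then intercept)
slope_check, intercept_check = np.polyfit(x, y, 1)

print(f"y-hat = {a:.4f} + {b:.4f} x")
```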

BEWARE!!!
Not all calculators and software use the same convention:

$\hat{y} = a + bx$
Some use instead:

$\hat{y} = ax + b$
Make sure you know what YOUR
calculator gives you for a and b before
you answer homework or exam questions.
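The same caution applies to software libraries. As one example, NumPy itself offers two line-fitting functions that return the coefficients in opposite orders. The data below are made up; the point is only that the order of the returned values must be checked before labeling them a and b.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Made-up sample data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# np.polyfit orders coefficients from the highest power down: [slope, intercept]
coef_high_first = np.polyfit(x, y, 1)

# numpy.polynomial.polynomial.polyfit orders them from the lowest power up:
# [intercept, slope]
coef_low_first = P.polyfit(x, y, 1)

print("np.polyfit:           ", coef_high_first)
print("polynomial.polyfit:   ", coef_low_first)
```

Both calls describe the same line; only the position of the slope and intercept in the result differs.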
Software output

[Figure: two examples of software output, with the intercept, slope, r, and R² values labeled]
The equation completely describes the regression line.

To plot the regression line you only need to plug two x values into the
equation, get ŷ for each, and draw the line that goes through those two points.
Hint: The regression line always passes through the point (x̄, ȳ), the means of x and y.

The points you use for drawing the regression line are derived from the
equation.

They are NOT points from your sample data (except by pure coincidence).
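A minimal sketch of that plotting recipe, using made-up data: fit the line, plug two x values into the equation to get the two points that define it, and check the hint that the line passes through the point of means. The helper name `predict` is chosen for this example.

```python
import numpy as np

# Made-up sample data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b, a = np.polyfit(x, y, 1)  # slope, intercept


def predict(x_value):
    """Predicted y (y-hat) from the fitted equation a + b*x."""
    return a + b * x_value


# Two x values (here the ends of the observed range) are enough to draw the line
x_ends = np.array([x.min(), x.max()])
y_ends = predict(x_ends)
print("Draw the line through", (x_ends[0], y_ends[0]), "and", (x_ends[1], y_ends[1]))

# The line passes through the mean of x and the mean of y
y_at_mean_x = predict(np.mean(x))
print(y_at_mean_x, np.mean(y))
```

Note that the two plotted points come from the equation, not from the data arrays.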
The distinction between explanatory and response variables is crucial in
regression. If you exchange y for x in calculating the regression line, you
will get the wrong line.

Regression examines the distance of all points from the line in the y
direction only.

Hubble telescope data about galaxies moving away from Earth:

These two lines are the two regression lines calculated either
correctly (x = distance, y = velocity, solid line) or incorrectly
(x = velocity, y = distance, dotted line).
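To see why swapping the variables gives a different line, the sketch below fits both regressions on made-up data (not the Hubble data) and rewrites the swapped fit in the same y-versus-x form so the two slopes can be compared. Variable names such as `slope_swapped` are chosen for this example.

```python
import numpy as np

# Made-up data with some scatter (not the Hubble data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.8, 2.9, 4.6, 4.1, 6.3])

# Regression of y on x: minimizes vertical (y) distances
b_yx, a_yx = np.polyfit(x, y, 1)

# Roles swapped: regression of x on y minimizes horizontal (x) distances
b_xy, a_xy = np.polyfit(y, x, 1)

# Rewrite x = a_xy + b_xy * y as y = intercept + slope * x for comparison
slope_swapped = 1.0 / b_xy
intercept_swapped = -a_xy / b_xy

r = np.corrcoef(x, y)[0, 1]
print("y on x slope:          ", b_yx)
print("x on y slope (as y~x): ", slope_swapped)
print("product of the two regression slopes:", b_yx * b_xy, " r^2:", r ** 2)
```

The two lines coincide only when r = ±1; otherwise the product of the two regression slopes equals r², so the swapped line is steeper when drawn on the same axes.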
Correlation versus regression

The correlation is a measure of spread (scatter) in both the x and y
directions in the linear relationship.

In regression we examine the variation in the response variable (y) given
change in the explanatory variable (x).
Making predictions:
interpolation
The equation of the least-squares regression line allows us to predict y for
any x within the range studied. This is called interpolating.

$\hat{y} = 0.0144x + 0.0008$

Nobody in the study drank 6.5 beers, but by finding the value
of ŷ from the regression line for x = 6.5 we would expect a blood
alcohol content of 0.094 mg/ml.

$\hat{y} = 0.0144 \times 6.5 + 0.0008$

$\hat{y} = 0.0936 + 0.0008 = 0.0944$ mg/ml
[Figure: scatterplot of dead manatees against powerboats registered (in 1000s), with the regression line $\hat{y} = 0.125x - 41.4$]

Year   Powerboats (in 1000s)   Dead manatees
1977   447                     13
1978   460                     21
1979   481                     24
1980   498                     16
1981   513                     24
1982   512                     20
1983   526                     15
1984   559                     34
1985   585                     33
1986   614                     33
1987   645                     39
1988   675                     43
1989   711                     50
1990   719                     47

There is a positive linear relationship between the number of powerboats
registered and the number of manatee deaths.

The least-squares regression line has the equation: $\hat{y} = 0.125x - 41.4$

Thus if we were to limit the number of powerboat registrations to 500,000
(x = 500, since powerboats are counted in thousands), what could we expect
for the number of manatee deaths?

$\hat{y} = 0.125(500) - 41.4 \Rightarrow \hat{y} = 62.5 - 41.4 = 21.1$

Roughly 21 manatees.
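The manatee line can be reproduced from the table with stats software. The sketch below uses `numpy.polyfit` as one possible tool (the slides do not say which software was used) and then predicts the number of deaths for 500 thousand powerboats.

```python
import numpy as np

# Data from the table: powerboats registered (in 1000s) and dead manatees, 1977-1990
powerboats = np.array([447, 460, 481, 498, 513, 512, 526,
                       559, 585, 614, 645, 675, 711, 719], dtype=float)
manatees = np.array([13, 21, 24, 16, 24, 20, 15,
                     34, 33, 33, 39, 43, 50, 47], dtype=float)

# Least-squares fit of manatee deaths on powerboats (slope first, then intercept)
slope, intercept = np.polyfit(powerboats, manatees, 1)
print(f"y-hat = {slope:.3f} x {intercept:+.1f}")

# Prediction for 500,000 registrations, that is x = 500 (in 1000s)
predicted = intercept + slope * 500
print(f"Predicted manatee deaths at x = 500: {predicted:.1f}")
```

The fitted coefficients agree with the rounded equation on the slide; the small difference from 21.1 in the worked example comes from rounding the slope and intercept before predicting.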
Extrapolation
Extrapolation is the use of a regression line for predictions
outside the range of x values used to obtain the line.

This can be a very stupid thing to do, as seen here.

[Figure: two plots, y-axis: Height in Inches, each marked "!!!"]
Example: Bacterial growth rate over time in closed cultures

If you only observed bacterial growth in a test tube during a small subset of the
time shown here, you could get almost any regression line imaginable.
Extrapolation = big mistake.
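A small simulation can make the danger concrete. The sketch below assumes a logistic (S-shaped) growth curve as a stand-in for a closed culture; the curve, its parameters, and the observation window are all hypothetical and are not taken from the slides. A line fitted to a short window of observations is then used to predict far outside that window.

```python
import numpy as np


def logistic_count(t, capacity=1000.0, rate=0.05, midpoint=120.0):
    """Hypothetical S-shaped growth curve (all parameters are made up)."""
    return capacity / (1.0 + np.exp(-rate * (t - midpoint)))


# Observe the culture only between 90 and 150 minutes
t_observed = np.arange(90, 151, 10, dtype=float)
counts_observed = logistic_count(t_observed)

# Fit a straight line to that short window (slope first, then intercept)
slope, intercept = np.polyfit(t_observed, counts_observed, 1)

# Extrapolate far beyond the observed range
t_new = 400.0
extrapolated = intercept + slope * t_new
actual = logistic_count(t_new)

print(f"Line predicts {extrapolated:.0f}, the assumed curve gives {actual:.0f}")
```

Inside the observed window the line fits well; at 400 minutes it predicts several times more bacteria than the assumed curve allows.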
The y intercept

Sometimes the y-intercept is not biologically possible. Here we have
negative blood alcohol content, which makes no sense…

But the negative value is appropriate for the equation of the regression line.

There is a lot of scatter in the data, and the line is just an estimate.

[Figure annotation: y-intercept shows negative blood alcohol]
Coefficient of determination, r²
r², the coefficient of determination, is the square of the correlation
coefficient.

r² represents the percentage of the variance in y that can be explained by
changes in x. The vertical scatter from the regression line is the part
left unexplained.

$b = r\,\frac{s_y}{s_x}$

r = −1, r² = 1: Changes in x explain 100% of the variations in y. y can be
entirely predicted for any given value of x.

r = 0, r² = 0: Changes in x explain 0% of the variations in y. The value(s)
y takes has (have) no linear relationship with what value x takes.

r = 0.87, r² = 0.76: Here the change in x only explains 76% of the change in
y. The rest of the change in y (the vertical scatter, shown as red arrows)
must be explained by something other than x.
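The "explained variation" reading of r² can be checked numerically. The sketch below, on made-up data, computes r² as one minus the ratio of the leftover vertical scatter around the line to the total scatter of y around its mean, and compares it with the square of the correlation.

```python
import numpy as np

# Made-up data with some scatter (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.8, 2.9, 4.6, 4.1, 6.3])

b, a = np.polyfit(x, y, 1)  # slope, intercept
y_hat = a + b * x

# Total variation of y around its mean
ss_total = np.sum((y - np.mean(y)) ** 2)
# Variation left over around the regression line (the vertical scatter)
ss_residual = np.sum((y - y_hat) ** 2)

# Fraction of the variation in y explained by the regression on x
r_squared = 1.0 - ss_residual / ss_total

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}, explained fraction = {r_squared:.3f}")
```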
r = 0.7, r² = 0.49
There is quite some variation in BAC for the same number of beers drunk.
A person’s blood volume is a factor in the equation that was overlooked here.

We changed number of beers to number of beers/weight of person in lb.

r = 0.9, r² = 0.81

• In the first plot, number of beers only explains 49% of the variation in blood alcohol content.
• But number of beers / weight explains 81% of the variation in blood alcohol content.
• Additional factors contribute to variations in BAC among individuals (like maybe some genetic ability to process alcohol).
Grade performance

If class attendance explains 16% of the variation in grades, what is
the correlation between percent of classes attended and grade?

1. We need to make an assumption: attendance and grades are
positively correlated. So r will be positive too.

2. r² = 0.16, so r = +√0.16 = +0.4

A weak correlation.
Transforming relationships
A scatterplot might show a clear relationship between two quantitative
variables, but issues of influential points or nonlinearity prevent us
from using correlation and regression tools.

Transforming the data – changing the scale in which one or both of the
variables are expressed – can make the shape of the relationship
linear in some cases.

Example: Patterns of growth are often exponential, at least in their initial
phase. Changing the response variable y into log(y) or ln(y) will transform
the pattern from an upward-curved exponential to a straight line.
Exponential bacterial growth
In ideal environments, bacteria multiply through binary fission. The
number of bacteria can double every 20 minutes in that way.

[Figure, left panel: bacterial count (0 to 5000) against time (0 to 240 min). Exponential growth 2^n, not suitable for regression.]
[Figure, right panel: log of bacterial count (0 to 4) against time (0 to 240 min). Taking the log changes the growth pattern into a straight line.]

1 - 2 - 4 - 8 - 16 - 32 - 64 - …

$\log(2^n) = n \log(2) \approx 0.3n$
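The effect of the log transformation can be sketched with idealized counts. The code assumes perfect doubling every 20 minutes starting from a single bacterium, as in the sequence above, and fits a line to the base-10 log of the count; real cultures would show scatter around this line.

```python
import numpy as np

# Idealized growth: one bacterium doubling every 20 minutes for 4 hours
minutes = np.arange(0, 241, 20, dtype=float)
generations = minutes / 20.0
count = 2.0 ** generations          # 1, 2, 4, 8, ...

# Transform the response variable: log10 of the count
log_count = np.log10(count)

# On the log scale the pattern is a straight line
slope, intercept = np.polyfit(minutes, log_count, 1)
r_log = np.corrcoef(minutes, log_count)[0, 1]
r_raw = np.corrcoef(minutes, count)[0, 1]

print(f"log10(count) = {intercept:.3f} + {slope:.5f} * minutes")
print(f"r on raw counts: {r_raw:.3f}, r on log counts: {r_log:.3f}")
```

The slope equals log10(2) per 20 minutes, which matches the ≈ 0.3 per generation in the identity above.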
Body weight and brain weight in 96 mammal species
r = 0.86, but this is misleading.

The elephant is an influential point. Most mammals are very small in
comparison. Without this point, r = 0.50 only.

Now we plot the log of brain weight against the log of body weight.

The pattern is linear, with r = 0.96.

The vertical scatter is homogeneous
→ good for predictions of brain weight
from body weight (in the log scale).
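The same idea can be tried on simulated data. The sketch below does NOT use the mammal data set from the slides: it generates 96 hypothetical species whose brain weight follows a power law of body weight with random scatter (the exponent 0.75, the constant, and the noise level are assumptions for illustration), then compares the correlation before and after taking logs of both variables.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n_species = 96

# Hypothetical body weights spread over several orders of magnitude
log_body = rng.uniform(-2.0, 4.0, n_species)

# Assumed power-law relationship with scatter on the log scale
log_brain = 1.0 + 0.75 * log_body + rng.normal(0.0, 0.15, n_species)

body = 10.0 ** log_body
brain = 10.0 ** log_brain

# Correlation on the original scale and on the log-log scale
r_raw = np.corrcoef(body, brain)[0, 1]
r_log = np.corrcoef(np.log10(body), np.log10(brain))[0, 1]

# Regression on the log-log scale (slope first, then intercept)
slope, intercept = np.polyfit(np.log10(body), np.log10(brain), 1)

print(f"r (raw scale) = {r_raw:.2f}, r (log-log scale) = {r_log:.2f}")
print(f"log10(brain) = {intercept:.2f} + {slope:.2f} * log10(body)")
```

On the log-log scale the fitted slope recovers the assumed exponent, and predictions are made in log units and converted back with a power of 10.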
