Linear Regression
Like correlation, linear regression shows the direction and strength of the
relationship between two numeric variables. Regression goes a step further: it uses
the best-fitting straight line through the points on a scatter plot to predict Y
values from X values. With correlation, the values of X and Y are interchangeable.
With regression, the results of the analysis change if X and Y are swapped.
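To see why swapping X and Y matters for regression but not for correlation, here is a minimal Python sketch. The sample sizes and prices are hypothetical values invented for illustration, and the helper names (mean, correlation, slope) are not from the original unit.

```python
# Hypothetical sample data, invented for illustration only.
sizes = [1100, 1400, 1700, 2000, 2450]               # square feet
prices = [225000, 250000, 300000, 320000, 380000]    # dollars

def mean(values):
    return sum(values) / len(values)

def correlation(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def slope(x, y):
    """Least-squares slope of the line that predicts y from x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Correlation is the same in both directions.
print(correlation(sizes, prices), correlation(prices, sizes))
# The regression slope is not: dollars per square foot vs. square feet per dollar.
print(slope(sizes, prices), slope(prices, sizes))
```

The two correlation values match, while the two slopes differ, because each regression answers a different prediction question.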
Note
Concepts in this unit are adapted from Introduction to Statistics.
The regression line is the best-fitting straight line through the points on the
scatter plot. In other words, it is the line that keeps the overall distance
between the points and the line as small as possible. Formally, it minimizes the
sum of the squared vertical distances from each point to the line.
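As a rough illustration of what "best-fitting" means, the sketch below fits a line using the standard least-squares formulas and checks that a slightly different line fits worse. The data points are hypothetical, and the function names fit_line and sum_squared_errors are chosen here for illustration.

```python
# Hypothetical (square feet, price) pairs, invented for illustration only.
data = [(1100, 225000), (1400, 250000), (1700, 300000),
        (2000, 320000), (2450, 380000)]

def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sum_squared_errors(points, slope, intercept):
    """Total squared vertical distance from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

slope, intercept = fit_line(data)
best = sum_squared_errors(data, slope, intercept)

# Any other line, such as one with a slightly steeper slope, fits worse.
nudged = sum_squared_errors(data, slope + 5, intercept)
print(best < nudged)  # True
```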
Why is this line useful? Once we have the equation of the regression line, we can
calculate, or predict, a Y value for any known X value.
A Regression Example
Let's say you want to predict how much you will need to spend to buy a house that
is 1,500 square feet. Let's use linear regression to make that prediction.
Place the variable that you want to predict, home prices, on the y-axis (this is
also called the dependent variable).
Place the variable you're basing your predictions on, square footage, on the x-axis
(this is also called the independent variable).
Here is a scatter plot showing house prices (y-axis) and square footage (x-axis).
The scatter plot shows homes with more square feet tend to have higher prices, but
how much will you have to spend for a house that measures 1,500 square feet?
To help answer that question, create a line through the points. This is linear
regression. The regression line will help you to predict what a typical house of a
certain square footage will cost. In this example, the equation for the regression
line is:
Y = (113 * X) + 98,653
To find Y, multiply the value of X by 113 and then add 98,653. Start with a house
that has no square footage, so the value of X is 0.
Y = (113 * 0) + 98,653
Y = 0 + 98,653
Y = 98,653
The value 98,653 is called the y-intercept because this is where the line crosses,
or intercepts, the y-axis. It is the value of Y when X equals 0.
The number 113 is the slope of the line. Slope is a number that describes both the
direction and the steepness of the line. In this case, the slope forecasts that for
every additional square foot, the house price will increase by $113.
So, here's what you can expect to spend on a 1,500-square-foot house:
Y = (113 * 1,500) + 98,653
Y = 169,500 + 98,653
Y = 268,153
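The same arithmetic can be sketched in Python. The slope and intercept come from the text; the function name predict_price and the constant names are chosen here for illustration.

```python
SLOPE = 113          # dollars per additional square foot (from the text)
INTERCEPT = 98_653   # predicted price when square footage is 0 (from the text)

def predict_price(square_feet):
    """Predicted price from the regression equation Y = (113 * X) + 98,653."""
    return SLOPE * square_feet + INTERCEPT

print(predict_price(0))                             # 98653, the y-intercept
print(predict_price(1500))                          # 268153
print(predict_price(1501) - predict_price(1500))    # 113, the slope
```

The last line shows the slope at work: each additional square foot raises the predicted price by $113.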
Take another look at this scatter plot. The blue marks are the actual data. You can
see that you have data for homes between 1,100 and 2,450 square feet.
Figure: A scatter plot with blue marks, a gray regression line, and orange lines
showing where X and Y meet on the regression line.
Note that this equation cannot be used to predict the price of all houses. Since a
500-square-foot house and a 10,000-square-foot house are both outside of the range
of the actual data, you would need to be careful about making predictions with
those values using this equation.
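One way to build that caution into a calculation is to flag any prediction whose X value falls outside the observed range. This is only a sketch: the 1,100 to 2,450 square foot range and the equation come from the text, while the function name and the wording of the warning are assumptions made for illustration.

```python
SLOPE = 113
INTERCEPT = 98_653
MIN_SQFT, MAX_SQFT = 1_100, 2_450   # range of the observed data in the text

def predict_with_range_check(square_feet):
    """Return (predicted price, True if X lies inside the observed range)."""
    in_range = MIN_SQFT <= square_feet <= MAX_SQFT
    return SLOPE * square_feet + INTERCEPT, in_range

for sqft in (500, 1500, 10_000):
    price, in_range = predict_with_range_check(sqft)
    note = "" if in_range else " (outside observed data; treat with caution)"
    print(f"{sqft} sq ft: ${price:,}{note}")
```

The equation still returns a number for 500 or 10,000 square feet, but the flag makes it clear that the data does not support those predictions.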
The r-squared value is a statistical measure of how close the data is to the
regression line, or how well the model fits your observations. If every data point
falls exactly on the line, the r-squared value is 1, or 100%, meaning that your
model fits perfectly.
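For our home price data, the r-squared value is 0.70, or 70%. In other words,
square footage accounts for about 70% of the variation in home prices in this data.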
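The text does not say how the 0.70 was calculated. A common definition of r-squared is 1 minus the ratio of the squared error around the line to the squared variation around the mean of Y, and the sketch below assumes that definition. The scattered data points are hypothetical, so the result is not expected to reproduce 0.70.

```python
def r_squared(points, slope, intercept):
    """1 - (squared error around the line / squared variation around the mean)."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_total = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_residual / ss_total

# Points that sit exactly on the line Y = (113 * X) + 98,653 give a perfect fit.
on_line = [(x, 113 * x + 98_653) for x in (1100, 1500, 2000, 2450)]
print(r_squared(on_line, 113, 98_653))   # 1.0

# Hypothetical points scattered around the same line fit less well.
scattered = [(1100, 240000), (1500, 250000), (2000, 345000), (2450, 360000)]
print(r_squared(scattered, 113, 98_653))  # a value between 0 and 1
```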
Being familiar with the statistical concepts of correlation and regression helps
you to explore and understand the data you work with by examining relationships.