An introduction to simple linear regression
Published on February 19, 2020 by Rebecca Bevans. Revised on October 26, 2020.
Regression models describe the relationship between variables by fitting a line to the observed
data. Linear regression models use a straight line, while logistic and nonlinear regression models
use a curved line. Regression allows you to estimate how a dependent variable changes as the
independent variable(s) change.
Simple linear regression is used to estimate the relationship between two quantitative
variables. You can use simple linear regression when you want to know:
1. How strong the relationship is between two variables (e.g. the relationship between
rainfall and soil erosion).
2. The value of the dependent variable at a certain value of the independent variable (e.g.
the amount of soil erosion at a certain level of rainfall).
Example You are a social researcher interested in the relationship between income and
happiness. You survey 500 people whose incomes range from $15k to $75k and ask them to rank
their happiness on a scale from 1 to 10.
Your independent variable (income) and dependent variable (happiness) are both quantitative, so
you can do a regression analysis to see if there is a linear relationship between them.
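As a sketch of what such an analysis looks like in practice, the snippet below fits a simple linear regression with SciPy. The survey data itself is not provided in the article, so the income and happiness values here are simulated stand-ins with an assumed linear relationship:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated stand-in for the survey: 500 incomes between $15k and $75k.
# The true relationship (slope 0.1, intercept 1) is an assumption for
# illustration only, not the article's actual data.
income = rng.uniform(15, 75, 500)                     # income in $1,000s
happiness = 1 + 0.1 * income + rng.normal(0, 1, 500)  # rated roughly 1-10

result = stats.linregress(income, happiness)
print(f"slope (B1):     {result.slope:.3f}")
print(f"intercept (B0): {result.intercept:.3f}")
print(f"r-squared:      {result.rvalue**2:.3f}")
```

With 500 observations, the estimated slope and intercept land close to the values used to simulate the data, and the r-squared summarizes how strong the linear relationship is.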
If you have more than one independent variable, use multiple linear regression instead.
Simple linear regression is a parametric test, and it makes four assumptions about the data:
1. Homogeneity of variance (homoscedasticity): the size of the error in our prediction does not change significantly across the values of the independent variable.
2. Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.
3. Normality: the data follows a normal distribution.
4. The relationship between the independent and dependent variable is linear: the line of
best fit through the data points is a straight line (rather than a curve or some sort of
grouping factor).
If your data do not meet the assumptions of homoscedasticity or normality, you may be able to
use a nonparametric test instead, such as the Spearman rank test.
Example: Data that doesn’t meet the assumptions. You think there is a linear relationship between
cured meat consumption and the incidence of colorectal cancer in the U.S. However, you find
that much more data has been collected at high rates of meat consumption than at low rates of
meat consumption, with the result that there is much more variation in the estimate of cancer
rates at the low range than at the high range. Because the data violate the assumption of
homoscedasticity, they don’t work for linear regression, so you perform a Spearman rank test instead.
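The Spearman rank test correlates the ranks of the two variables rather than their raw values, so it makes no homoscedasticity assumption. The sketch below runs it on simulated heteroscedastic data (assumed for illustration; it is not the cancer dataset described above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical heteroscedastic data: the noise is largest at low x,
# mimicking the shape of the meat-consumption example.
x = rng.uniform(0, 10, 200)
noise_sd = 2.0 / (1.0 + x)          # much more spread at the low range
y = 0.5 * x + rng.normal(0, noise_sd)

# spearmanr works on ranks, so unequal variance across x is not a problem.
rho, p_value = stats.spearmanr(x, y)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3g}")
```

A rho near 1 (or -1) with a small p-value indicates a strong monotonic relationship, even though the spread of the errors changes across the range of x.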
If your data violate the assumption of independence of observations (e.g. if observations are
repeated over time), you may be able to perform a linear mixed-effects model that accounts for
the additional structure in the data.
The formula for a simple linear regression is: y = B0 + B1x + e, where:
y is the predicted value of the dependent variable (y) for any given value of the
independent variable (x).
B0 is the intercept, the predicted value of y when x is 0.
B1 is the regression coefficient: how much we expect y to change as x increases by one unit.
x is the independent variable (the variable we expect is influencing y).
e is the error of the estimate, or how much variation there is in our estimate of the
regression coefficient.
Linear regression finds the line of best fit through your data by searching for the regression
coefficient (B1) that minimizes the total error (e) of the model.
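For one predictor, minimizing the total squared error has a well-known closed-form solution: B1 = cov(x, y) / var(x) and B0 = mean(y) - B1 * mean(x). A minimal sketch of that calculation:

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Ordinary least squares for a single predictor.

    Minimizing the sum of squared errors gives the closed-form estimates
    B1 = cov(x, y) / var(x) and B0 = mean(y) - B1 * mean(x).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Tiny worked example: these points lie exactly on y = 2 + 3x.
b0, b1 = fit_simple_linear_regression([1, 2, 3, 4], [5, 8, 11, 14])
print(b0, b1)  # → 2.0 3.0
```

Because the points here fall exactly on a line, the error e is zero and the fitted intercept and slope recover it perfectly; with noisy data the same formulas return the line that makes the total squared error as small as possible.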
While you can perform a linear regression by hand, this is a tedious process, so most people use
statistical programs to help them quickly analyze the data.