Linear Regression

Linear regression is a statistical technique that uses one or more independent variables to predict a dependent variable. It fits a straight line through data points to establish a relationship between variables. Simple linear regression uses one independent variable to predict one dependent variable, while multiple linear regression uses multiple independent variables. Linear regression is commonly used in fields like business, science, and machine learning. It works by plotting variables on a graph, finding the line of best fit, and generating an equation to make predictions. The assumptions of linear regression include a linear relationship between variables and homoscedasticity.

Uploaded by

Hamida patino
Copyright © All Rights Reserved


What is linear regression?

Linear regression is a data analysis technique that predicts the value of unknown data by using another
related and known data value. It mathematically models the unknown or dependent variable and the
known or independent variable as a linear equation. For instance, suppose that you have data about
your expenses and income for last year. Linear regression techniques analyze this data and determine
that your expenses are half your income. They then calculate an unknown future expense by halving a
future known income.

Why is linear regression important?

Linear regression models are relatively simple and provide an easy-to-interpret mathematical formula to
generate predictions. Linear regression is an established statistical technique and applies easily to
software and computing. Businesses use it to reliably and predictably convert raw data into business
intelligence and actionable insights. Scientists in many fields, including biology and the behavioral,
environmental, and social sciences, use linear regression to conduct preliminary data analysis and
predict future trends. Many data science methods, such as machine learning and artificial intelligence,
use linear regression to solve complex problems.

How does linear regression work?

At its core, a simple linear regression technique attempts to plot a line graph between two data
variables, x and y. As the independent variable, x is plotted along the horizontal axis. Independent
variables are also called explanatory variables or predictor variables. The dependent variable, y, is
plotted on the vertical axis. You can also refer to y values as response variables or predicted variables.

Steps in linear regression

For this overview, consider the simplest form of the line graph equation between y and x: y = c*x + m,
where c and m are constants for all possible values of x and y. So, for example, suppose that the input
dataset for (x,y) is (1,5), (2,8), and (3,11). To identify the linear regression equation, you would take the
following steps:
Plot a straight line through the first data point, (1,5).

Keep changing the direction of the straight line for new values (2,8) and (3,11) until all values fit.

Identify the linear regression equation as y=3*x+2.

Extrapolate or predict that y is 14 when x is 4.
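These steps can be checked numerically. Below is a minimal sketch in Python (the later examples in this document use R); the points and the resulting equation are the ones from the example above:

```python
# Fit y = c*x + m to the example points by ordinary least squares.
xs = [1, 2, 3]
ys = [5, 8, 11]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

print(slope, intercept)        # 3.0 2.0  -> y = 3*x + 2
print(slope * 4 + intercept)   # 14.0     -> predicted y when x = 4
```

The closed-form slope and intercept reproduce y = 3*x + 2 exactly because the example points happen to lie on a perfect line.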


What are the types of linear regression?

Some types of regression analysis are more suited to handle complex datasets than others. The
following are some examples.

Simple linear regression

Simple linear regression is defined by the linear function:

Y = β0 + β1*X + ε

β0 and β1 are two unknown constants: β0 is the intercept and β1 is the regression slope, whereas ε
(epsilon) is the error term.

You can use simple linear regression to model the relationship between two variables, such as these:

Rainfall and crop yield

Age and height in children

Temperature and expansion of the metal mercury in a thermometer

Multiple linear regression

In multiple linear regression analysis, the dataset contains one dependent variable and multiple
independent variables. The linear regression line function changes to include more factors as follows:

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

Each additional predictor variable adds a corresponding β coefficient to the model.

Multiple linear regression models multiple variables and their impact on an outcome:
Rainfall, temperature, and fertilizer use on crop yield

Diet and exercise on heart disease

Wage growth and inflation on home loan rates
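As a sketch of how the β coefficients are estimated, the normal equations (XᵀX)β = Xᵀy can be solved directly. The Python example below uses synthetic data generated from Y = 1 + 2·X1 + 3·X2, so the recovered coefficients are known in advance; this is an illustration, not a production implementation:

```python
# Fit Y = b0 + b1*X1 + b2*X2 by solving the normal equations (X^T X) b = X^T y.
# Data is synthetic, generated from Y = 1 + 2*X1 + 3*X2, so the fit is exact.
rows = [  # (x1, x2, y)
    (0, 0, 1), (1, 0, 3), (0, 1, 4), (1, 1, 6), (2, 1, 8),
]
X = [[1.0, x1, x2] for x1, x2, _ in rows]   # leading 1 column for the intercept b0
y = [float(v) for _, _, v in rows]

p = 3  # number of coefficients: b0, b1, b2
XtX = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(p)] for i in range(p)]
Xty = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(p)]

# Solve the 3x3 system with Gaussian elimination (partial pivoting).
A = [XtX[i] + [Xty[i]] for i in range(p)]
for col in range(p):
    piv = max(range(col, p), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    for r in range(col + 1, p):
        f = A[r][col] / A[col][col]
        for c in range(col, p + 1):
            A[r][c] -= f * A[col][c]
beta = [0.0] * p
for i in reversed(range(p)):
    beta[i] = (A[i][p] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]

print([round(b, 6) for b in beta])  # [1.0, 2.0, 3.0]
```

Statistical packages do the same thing (with better numerics) behind the scenes; in practice you would call R's lm() or a library routine rather than hand-rolled elimination.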

Logistic regression

Data scientists use logistic regression to measure the probability of an event occurring. The prediction is
a value between 0 and 1, where 0 indicates an event that is unlikely to happen, and 1 indicates a
maximum likelihood that it will happen. Logistic equations use logarithmic functions to compute the
regression line.

These are some examples:

The probability of a win or loss in a sporting match

The probability of passing or failing a test

The probability of an image being a fruit or an animal
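While fitting a logistic regression is beyond this overview, the logistic (sigmoid) function itself, which maps a linear predictor onto a probability between 0 and 1, is easy to sketch. The coefficients below are made up purely for illustration:

```python
import math

def logistic(z):
    """Map a linear predictor z = b0 + b1*x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model: probability of passing a test given hours studied.
b0, b1 = -4.0, 1.0   # made-up coefficients for illustration only
for hours in (0, 4, 8):
    p = logistic(b0 + b1 * hours)
    print(hours, round(p, 3))  # probability rises toward 1 with more hours
```

Large negative predictors give probabilities near 0, large positive ones near 1, which is why the output can be read as "unlikely" versus "likely".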

Simple linear regression is used to estimate the relationship between two quantitative variables. You
can use simple linear regression when you want to know:
How strong the relationship is between two variables (e.g., the relationship between rainfall and soil
erosion).

The value of the dependent variable at a certain value of the independent variable (e.g., the amount of
soil erosion at a certain level of rainfall).

Regression models describe the relationship between variables by fitting a line to the observed data.
Linear regression models use a straight line, while logistic and nonlinear regression models use a curved
line. Regression allows you to estimate how a dependent variable changes as the independent
variable(s) change.

Simple linear regression example

You are a social researcher interested in the relationship between income and happiness. You survey
500 people whose incomes range from 15k to 75k and ask them to rank their happiness on a scale from
1 to 10.

Your independent variable (income) and dependent variable (happiness) are both quantitative, so you
can do a regression analysis to see if there is a linear relationship between them.

If you have more than one independent variable, use multiple linear regression instead.

Table of contents

Assumptions of simple linear regression

How to perform a simple linear regression

Interpreting the results

Presenting the results

Can you predict values outside the range of your data?

Frequently asked questions about simple linear regression

Assumptions of simple linear regression

Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data.
These assumptions are:
Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change
significantly across the values of the independent variable.

Independence of observations: the observations in the dataset were collected using statistically valid
sampling methods, and there are no hidden relationships among observations.

Normality: The data follows a normal distribution.

Linear regression makes one additional assumption:

The relationship between the independent and dependent variable is linear: the line of best fit through
the data points is a straight line (rather than a curve or some sort of grouping factor).

If your data do not meet the assumptions of homoscedasticity or normality, you may be able to use a
nonparametric test instead, such as the Spearman rank test.

Example: Data that doesn’t meet the assumptions

You think there is a linear relationship between cured meat consumption and the incidence of colorectal
cancer in the U.S. However, you find that much more data has been collected at high rates of meat
consumption than at low rates of meat consumption, with the result that there is much more variation
in the estimate of cancer rates at the low range than at the high range. Because the data violate the
assumption of homoscedasticity, they don't work for a linear regression, so you perform a Spearman rank
test instead.

If your data violate the assumption of independence of observations (e.g., if observations are repeated
over time), you may be able to perform a linear mixed-effects model that accounts for the additional
structure in the data.

How to perform a simple linear regression

Simple linear regression formula

The formula for a simple linear regression is:

y = β0 + β1X + ε

y is the predicted value of the dependent variable (y) for any given value of the independent variable (x).

β0 is the intercept, the predicted value of y when x is 0.

β1 is the regression coefficient: how much we expect y to change as x increases.

x is the independent variable (the variable we expect is influencing y).

ε is the error of the estimate, or how much variation there is in our estimate of the regression
coefficient.

Linear regression finds the line of best fit through your data by searching for the regression
coefficient (β1) that minimizes the total error (ε) of the model.

While you can perform a linear regression by hand, this is a tedious process, so most people use
statistical programs to help them quickly analyze the data.
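The by-hand calculation is short enough to write out: β1 is the covariance of x and y over the variance of x, and β0 follows from the means. The Python sketch below uses synthetic data (not the article's income.data); the final check confirms that the fitted pair really does minimize the total squared error:

```python
# Least squares "by hand": B1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2), B0 = ȳ - B1*x̄.
# Synthetic data, hypothetical, not the article's income.data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

def rss(slope, intercept):
    """Residual sum of squares: the total error the regression minimizes."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

print(round(b1, 3), round(b0, 3))  # 1.96 0.14
# Any other slope or intercept gives a larger total error than the fitted pair.
print(rss(b1, b0) < rss(b1 + 0.1, b0) and rss(b1, b0) < rss(b1, b0 + 0.5))  # True
```

This is exactly what a statistical program automates, along with the standard errors and test statistics discussed below.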

Simple linear regression in R

R is a free, powerful, and widely-used statistical program. Download the dataset to try it yourself using
our income and happiness example.

Load the income.data dataset into your R environment, and then run the following command to
generate a linear model describing the relationship between income and happiness:

R code for simple linear regression

income.happiness.lm <- lm(happiness ~ income, data = income.data)

This code takes the data you have collected (data = income.data) and calculates the effect that the
independent variable income has on the dependent variable happiness, using the linear model function
lm().

To learn more, follow our full step-by-step guide to linear regression in R.


Interpreting the results

To view the results of the model, you can use the summary() function in R:

summary(income.happiness.lm)

This function takes the most important parameters from the linear model and puts them into a table,
which looks like this:

Simple linear regression summary output in R

This output table first repeats the formula that was used to generate the results (‘Call’), then
summarizes the model residuals (‘Residuals’), which give an idea of how well the model fits the real
data.

Next is the ‘Coefficients’ table. The first row gives the estimates of the y-intercept, and the second row
gives the regression coefficient of the model.

Row 1 of the table is labeled (Intercept). This is the y-intercept of the regression equation, with a value
of 0.20. You can plug this into your regression equation if you want to predict happiness values across
the range of income that you have observed:

happiness = 0.20 + 0.71*income ± 0.018


The next row in the ‘Coefficients’ table is income. This is the row that describes the estimated effect of
income on reported happiness:

The Estimate column is the estimated effect, also called the regression coefficient. The
number in the table (0.713) tells us that for every one-unit increase in income (where one unit of income
= 10,000) there is a corresponding 0.71-unit increase in reported happiness (where happiness is a scale
of 1 to 10).
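Plugging the reported coefficients into the regression equation gives concrete predictions. A small Python sketch (the function name is made up; income is in units of $10,000 and happiness is on a 1-10 scale, as in the article):

```python
# Fitted equation reported above: happiness = 0.20 + 0.71 * income.
def predict_happiness(income_10k):
    """Predicted happiness for an income given in units of $10,000."""
    return 0.20 + 0.71 * income_10k

print(round(predict_happiness(3.0), 2))  # $30,000 -> 2.33
print(round(predict_happiness(7.0), 2))  # $70,000 -> 5.17
```

As the article stresses later, such predictions are only trustworthy within the observed income range.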

The Std. Error column displays the standard error of the estimate. This number shows how much
variation there is in our estimate of the relationship between income and happiness.

The t value column displays the test statistic. Unless you specify otherwise, the test statistic used in
linear regression is the t value from a two-sided t test. The larger the test statistic, the less likely it is that
our results occurred by chance.
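Concretely, the t value is just the estimate divided by its standard error. Using the rounded values reported in this walkthrough (0.713 and 0.018):

```python
# t statistic = coefficient estimate / standard error of the estimate.
estimate = 0.713     # regression coefficient from the summary table
std_error = 0.018    # its standard error (rounded, as reported in the text)
t_value = estimate / std_error
print(round(t_value, 1))  # 39.6: a very large t, so chance is an unlikely explanation
```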

The Pr(>| t |) column shows the p value. This number tells us how likely we are to see the estimated
effect of income on happiness if the null hypothesis of no effect were true.

Because the p value is so low (p < 0.001), we can reject the null hypothesis and conclude that income
has a statistically significant effect on happiness.

The last three lines of the model summary are statistics about the model as a whole. The most
important thing to notice here is the p value of the model. Here it is significant (p < 0.001), which means
that the model fits the observed data significantly better than a model with no predictors.

Presenting the results

When reporting your results, include the estimated effect (i.e. the regression coefficient), standard error
of the estimate, and the p value. You should also interpret your numbers to make it clear to your
readers what your regression coefficient means:
We found a significant relationship (p < 0.001) between income and happiness (b = 0.71 ± 0.018), with
a 0.71-unit increase in reported happiness for every 10,000 increase in income.

It can also be helpful to include a graph with your results. For a simple linear regression, you can simply
plot the observations on the x and y axis and then include the regression line and regression function:

Simple linear regression graph

Can you predict values outside the range of your data?

No! We often say that regression models can be used to predict the value of the dependent variable at
certain values of the independent variable. However, this is only true for the range of values where we
have actually measured the response.

We can use our income and happiness regression analysis as an example. Between 15,000 and 75,000,
we found a regression coefficient of 0.73 ± 0.0193. But what if we did a second survey of people making
between 75,000 and 150,000?

Extrapolating data in R

The regression coefficient for the relationship between income and happiness is now 0.21, or a 0.21-unit
increase in reported happiness for every 10,000 increase in income. While the relationship is still
statistically significant (p<0.001), the slope is much smaller than before.

Extrapolating data in R graph

What if we hadn’t measured this group, and instead extrapolated the line from the 15–75k incomes to
the 75–150k incomes?
You can see that if we simply extrapolated from the 15–75k income data, we would overestimate the
happiness of people in the 75–150k income range.

Curved data line

If we instead fit a curve to the data, it seems to fit the actual pattern much better.

It looks as though happiness actually levels off at higher incomes, so we can’t use the same regression
line we calculated from our lower-income data to predict happiness at higher levels of income.

Even when you see a strong pattern in your data, you can’t know for certain whether that pattern
continues beyond the range of values you have actually measured. Therefore, it’s important to avoid
extrapolating beyond what the data actually tell you.
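This trap can be reproduced with synthetic data (a hypothetical leveling-off curve standing in for the real survey): fit a straight line on the observed low range only, then extrapolate far beyond it.

```python
import math

# Synthetic "happiness" that levels off at higher "income" (hypothetical curve).
def true_happiness(income):
    return 10 * (1 - math.exp(-income / 40))

xs = [float(x) for x in range(15, 76, 5)]   # observed range only: 15k-75k
ys = [true_happiness(x) for x in xs]

# Fit a straight line on the low range by ordinary least squares.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

# Extrapolate to 150k: the straight line overshoots the leveling-off curve.
line_150 = b0 + b1 * 150
true_150 = true_happiness(150)
print(round(line_150, 2), round(true_150, 2))
print(line_150 > true_150)  # True: the line overestimates outside the observed range
```

The line fitted on 15–75 overshoots the flattening curve at 150 (it even exceeds the 1–10 happiness scale), which is exactly the overestimate described above.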

Frequently asked questions about simple linear regression

What is a regression model?

What is simple linear regression?

How is the error calculated in a linear regression model?

Cite this Scribbr article


Bevans, R. (2022, November 15). Simple Linear Regression | An Easy Introduction & Examples. Scribbr.
Retrieved June 8, 2023, from https://www.scribbr.com/statistics/simple-linear-regression/

Rebecca Bevans

Rebecca is working on her PhD in soil ecology and spends her free time writing. She's very happy to be
able to nerd out about statistics with all of you.

