Module 2 - Intro To Regression Analysis

Regression Analysis

One of the most important types of data analysis is regression analysis.

Consider this scenario: Suppose you're a sales manager trying to predict next month's numbers.
You know that dozens, perhaps even hundreds of factors, from the weather to a competitor's promotion
to the rumor of a new and improved model, can impact the numbers. Perhaps people in your organization
even have a theory about what will have the biggest effect on sales: "The more rain we have, the more we
sell." "Six weeks after the competitor's promotion, sales jump."

Regression analysis is a way of mathematically sorting out which of those variables does indeed have
an impact. It answers the questions: Which factors matter most? Which can we ignore? How do those
factors interact with one another? And, perhaps most important, how certain are we about all these factors?

In regression analysis, those factors are called "variables." You have your dependent variable, the
main factor that you're trying to understand or predict; in this example, that is monthly sales. And then
you have your independent variables, the factors you suspect have an impact on your dependent variable.

How Does it Work?

To conduct a regression analysis, you gather the data on the variables in question. (Reminder: You likely
don’t have to do this yourself, but it’s helpful for you to understand the process your data analyst colleague
uses.) You take all your monthly sales numbers for, say, the past three years and any data on the independent
variables you’re interested in. So, in this case, let’s say you find out the average monthly rainfall for the
past three years as well. Then you plot all that information on a scatter chart. The y-axis is the amount of
sales (the dependent variable, the thing you're interested in, is always on the y-axis), and the x-axis is the
total rainfall. Each point represents one month's data: how much it rained that month and how many sales
you made that same month.

Glancing at this data, you probably notice that sales are higher in months when it rains a lot. That's
interesting to know, but by how much? If it rains three inches, do you know how much you’ll sell? What
about if it rains four inches?

Now imagine drawing a line through the chart above, one that runs roughly through the middle of all the
data points. This line will help you answer, with some degree of certainty, how much you typically sell
when it rains a certain amount.

This is called the "regression line," and it's drawn (using a statistics program like SPSS or Stata or
even Excel) to show the line that best fits the data. In other words, the regression line is the best
explanation of the relationship between the independent variable and the dependent variable.

In addition to drawing the line, your statistics program also outputs a formula that explains the slope of the
line and looks something like this:

y = 200 + 5x + error term

Ignore the error term for now. It refers to the fact that regression isn’t perfectly precise. Just focus on the
model:

y = 200 + 5x

This formula tells you that if x is zero (no rain at all), then y = 200.


Note: The y-axis is the amount of sales (dependent), x-axis is the total rainfall (independent)

So, historically, when it didn't rain at all, you made an average of 200 sales, and you can expect to do the
same going forward, assuming other variables stay the same. And in the past, for every additional inch of
rain, you made an average of five more sales: for every increment that x goes up by one, y goes up by five.

Now let’s return to the error term. You might be tempted to say that rain has a big impact on sales if for
every inch you get five more sales, but whether this variable is worth your attention will depend on the
error term. A regression line always has an error term because, in real life, independent variables are
never perfect predictors of the dependent variables. Rather the line is an estimate based on the
available data. So, the error term tells you how certain you can be about the formula. The larger it
is, the less certain the regression line.

The above example uses only one variable to predict the factor of interest — in this case, rain to predict
sales. Typically you start a regression analysis wanting to understand the impact of several independent
variables. You might include not just rain but also data about a competitor's promotion. You keep doing
this until the error term is very small: you're trying to get the line that fits best with your data. Although
there can be dangers in trying to include too many variables in a regression analysis, skilled analysts can
minimize those risks. And considering the impact of multiple variables at once is one of the biggest
advantages of regression analysis.

How Do Companies Use It?

Regression analysis is the “go-to method in analytics”. And smart companies use it to make decisions
about all sorts of business issues. “As managers, we want to figure out how we can affect sales, retain
employees, or recruit the best people. It helps us figure out what we can do.”

Most companies use regression analysis to explain a phenomenon they want to understand (for example,
Why did customer service calls drop last month?); predict things about the future (for example, What will
sales look like over the next six months?); or to decide what to do (for example, Should we go with this
promotion or a different one?).

A note about “correlation is not causation”: Whenever you work with regression analysis or any other
analysis that tries to explain the impact of one factor on another, you need to remember the important adage:
Correlation is not causation. This is critical. Here’s why: It’s easy to say that there is a correlation between
rain and monthly sales. The regression shows that they are indeed related. But it’s an entirely different thing
to say that rain caused the sales. Unless you’re selling umbrellas, it might be difficult to prove that there is
cause and effect.

Sometimes factors that are obviously not connected by cause and effect turn out to be correlated, but more
often in business, it's not so obvious. When you see a correlation from a regression analysis, you can't make
assumptions. Instead, you must go out and see what's happening in the real world. What's the physical
mechanism that's causing the relationship? Observe consumers buying your product in the rain, talk to
them, and find out what is actually causing them to make the purchase. A lot of people skip this step. The
goal is not to figure out what is going on in the data but to figure out what is going on in the world. You
must pound the pavement.

What Mistakes Do People Make When Working with Regression Analysis?

As a consumer of regression analysis, you need to keep several things in mind. First, don’t tell your data
analysts to figure out what is affecting sales. It’s your job to identify the factors that you suspect are
having an impact and ask your analyst to look at those. If you tell a data scientist to go on a fishing
expedition, or to tell you something you don't know, then you deserve what you get, which is bad analysis.
In other words, don’t ask your analysts to look at every variable they can possibly get their hands on all at
once. If you do, you’ll probably find relationships that don’t really exist. It’s the same principle as flipping
a coin: Do it enough times and you’ll eventually think you see something interesting, like a bunch of heads
all in a row.

Also ask how actionable a variable is. You can't change how much it rains, so how important is it to
understand that? We can't do anything about the weather or our competitor's promotion, but we can affect
our own promotions or add features.

Second, analyses are very sensitive to bad data, so be careful about the data you collect and how you
collect it, and know whether you can trust it. Not all of the data needs to be correct or perfect: if the
decisions you'll make as a result don't have a huge impact on your business, then it's OK if the data is
"kind of leaky." But if you're trying to decide whether to build 8 or 10 of something and each one costs
$1 million to build, then it's a bigger deal.

Many people make the mistake of ignoring the error term. This is dangerous because they treat the
relationship as more certain than it is. Oftentimes the results come out of a computer and managers think,
"That's great, let's use this going forward." But remember that the results are always uncertain. If the
regression explains 90% of the relationship, that's great. But if it explains 10%, and you act like it's 90%,
that's not good. The point of the analysis is to quantify the certainty that something will happen. It's not
telling you how rain will influence your sales; it's telling you the probability that rain may influence your
sales.

Ask yourself whether the results fit with your understanding of the situation. And if you see something that
doesn't make sense, ask whether the data was right or whether there is indeed a large error term.

Summarizing:
What is Regression Analysis?

Predictive modelling techniques such as regression analysis may be used to determine the relationship
between a dataset's dependent (target) and independent variables. It is widely used when the dependent and
independent variables are linked in a linear or non-linear fashion and the target variable has a set of
continuous values. Regression analysis approaches are thus useful for modelling relationships between
variables, modelling time series, and forecasting. Regression analysis, for example, is a good way to
examine the relationship between sales and advertising expenditure for a corporation.

What is the purpose of a regression model?

Regression analysis is used for one of two purposes: predicting the value of the dependent variable when
information about the independent variables is known, or predicting the effect of an independent variable
on the dependent variable.

Types of Regression

There are various types of regression used in data science and machine learning. Each type has its own
importance in different scenarios, but at the core, all regression methods analyze the effect of the
independent variables on the dependent variable. Some important types of regression are discussed below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest regression algorithms and shows the relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence the name linear regression.
o If there is only one input variable (x), the model is called simple linear regression. If there is more than
one input variable, it is called multiple linear regression.
o The relationship between the variables in a linear regression model can be visualized with a scatter plot;
a typical example is predicting the salary of an employee based on years of experience.

o Below is the mathematical equation for linear regression:

Y = aX + b

Here, Y is the dependent variable (target variable), X is the independent variable (predictor variable), and
a and b are the linear coefficients.

Some popular applications of linear regression are:

o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic
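A minimal sketch of how such a model might be fitted in R using base R's lm(); the data and variable names below are purely illustrative:

# Hypothetical data: years of experience (X) and salary in thousands (Y)
experience <- c(1, 2, 3, 4, 5, 6)
salary     <- c(30, 35, 41, 44, 50, 56)

model <- lm(salary ~ experience)   # fits Y = aX + b by least squares
summary(model)                     # shows the estimated coefficients and fit statistics
coef(model)                        # intercept (b) and slope (a)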

Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the classification
problems. In classification problems, we have dependent variables in a binary or discrete format such
as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or
False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in how it is
used.
o Logistic regression uses the sigmoid function (also called the logistic function) to map predictions to
probabilities. The sigmoid function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between 0 and 1
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it produces an S-shaped curve between 0 and 1.

o It uses the concept of threshold levels: values above the threshold are rounded up to 1, and values below
the threshold are rounded down to 0.

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
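A minimal sketch of binary logistic regression in R, using base R's glm() with a binomial family on made-up pass/fail data (all names and values are illustrative):

# Hypothetical data: hours studied and whether the student passed (1) or failed (0)
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8)
passed <- c(0, 0, 0, 1, 0, 1, 1, 1)

logit_model <- glm(passed ~ hours, family = binomial)   # logistic regression
probs <- predict(logit_model, type = "response")        # sigmoid outputs between 0 and 1
predicted_class <- ifelse(probs > 0.5, 1, 0)            # apply a 0.5 threshold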

Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using a linear
model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and
corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear fashion, so for
such case, linear regression will not best fit to those datapoints. To cover such datapoints, we need
Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial features of given
degree and then modeled using a linear model. Which means the datapoints are best fitted using a
polynomial line.

o The equation for polynomial regression is derived from the linear regression equation: the linear equation
Y = b0 + b1x is extended to the polynomial equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is our
independent/input variable.
o The model is still linear because it is linear in the coefficients, even though the features (x, x^2, x^3, ...)
are non-linear transformations of x.
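A minimal sketch in R on made-up curved data; poly() generates the polynomial features and lm() still fits a model that is linear in the coefficients:

# Hypothetical data with a curved relationship
x <- seq(1, 10)
y <- c(2, 5, 9, 16, 24, 35, 48, 62, 80, 99)

poly_model <- lm(y ~ poly(x, degree = 2, raw = TRUE))   # fits Y = b0 + b1*x + b2*x^2
coef(poly_model)                                        # b0, b1, b2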

Support Vector Regression:

Support Vector Machine is a supervised learning algorithm which can be used for regression as well as
classification problems. So if we use it for regression problems, then it is termed as Support Vector
Regression.

Support Vector Regression is a regression algorithm which works for continuous variables. Below are some
keywords which are used in Support Vector Regression:

o Kernel: It is a function used to map lower-dimensional data into a higher-dimensional space.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a line which
helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a margin for
datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane and opposite
class.
In SVR, we always try to determine a hyperplane with a maximum margin, so that the maximum number
of datapoints is covered within that margin. The main goal of SVR is to include as many datapoints as
possible between the boundary lines, with the hyperplane (best-fit line) passing through as many datapoints
as possible. In a plot of such a model, the central line is the hyperplane and the two parallel lines on either
side of it are the boundary lines.
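A minimal sketch of support vector regression in R, assuming the widely used e1071 package is installed; the data and names are illustrative:

library(e1071)   # provides svm(); assumed to be installed

# Hypothetical continuous data
dat <- data.frame(x = seq(1, 20))
dat$y <- sin(dat$x / 3) * 10 + rnorm(20)

svr_model <- svm(y ~ x, data = dat, type = "eps-regression", kernel = "radial")
fitted_y  <- predict(svr_model, dat)   # predictions for the continuous target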

Decision Tree Regression:


o Decision Tree is a supervised learning algorithm which can be used for solving both classification and
regression problems.
o It can solve problems for both categorical and numerical data
o Decision Tree regression builds a tree-like structure in which each internal node represents a "test" on an
attribute, each branch represents the result of the test, and each leaf node represents the final decision or
result.
o A decision tree is constructed starting from the root node/parent node (the dataset), which splits into left
and right child nodes (subsets of the dataset). These child nodes are further divided into their own children
and so become the parent nodes of those nodes.
o A typical example of decision tree regression is a model predicting a person's choice between a sports
car and a luxury car.
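A minimal sketch of decision tree regression in R, assuming the commonly used rpart package; the data, column names, and control settings are illustrative:

library(rpart)   # provides rpart() and rpart.control(); assumed to be installed

# Hypothetical data: predicting car price from age and mileage
cars <- data.frame(
  age     = c(1, 2, 3, 5, 7, 9, 11, 13),
  mileage = c(10, 25, 30, 55, 80, 95, 120, 140),
  price   = c(30, 27, 25, 20, 15, 12, 9, 7)
)

tree_model <- rpart(price ~ age + mileage, data = cars, method = "anova",
                    control = rpart.control(minsplit = 2))   # regression tree
predict(tree_model, data.frame(age = 4, mileage = 40))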

Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms and is capable of performing
both regression and classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple decision trees and
predicts the final output based on the average of the individual tree outputs. The combined decision trees
are called base models, and the ensemble can be represented more formally as:

g(x) = f0(x) + f1(x) + f2(x) + ....

o Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble learning, in which the
aggregated decision trees are trained in parallel and do not interact with each other.
o With the help of Random Forest regression, we can reduce overfitting in the model by training each tree
on a random subset of the dataset.
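A minimal sketch in R, assuming the randomForest package; the data and variable names are made up for illustration:

library(randomForest)   # provides randomForest(); assumed to be installed

# Hypothetical data: predicting crop yield from rainfall and fertilizer
fields <- data.frame(
  rainfall   = runif(100, 20, 100),
  fertilizer = runif(100, 0, 50)
)
fields$yield <- 2 + 0.05 * fields$rainfall + 0.1 * fields$fertilizer + rnorm(100)

rf_model <- randomForest(yield ~ rainfall + fertilizer, data = fields, ntree = 200)
predict(rf_model, data.frame(rainfall = 60, fertilizer = 25))   # averaged tree prediction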

Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression, in which a small amount of bias
is introduced so that we can get better long-term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. The penalty term is
computed by multiplying lambda (λ) by the squared weight of each individual feature and summing the
results.
o The cost function minimized by ridge regression is therefore:

Sum of squared residuals + λ × (sum of the squared coefficients)

o A general linear or polynomial regression will fail if there is high collinearity (strong correlation)
between the independent variables; to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique used to reduce the complexity of the model. It is also
called L2 regularization.
o It also helps when we have more parameters than samples.

Lasso Regression:
o Lasso regression is another regularization technique used to reduce the complexity of the model.
o It is similar to Ridge Regression, except that the penalty term contains the absolute values of the weights
instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas Ridge Regression can only
shrink it close to 0.
o It is also called L1 regularization. The cost function minimized by Lasso regression is:

Sum of squared residuals + λ × (sum of the absolute values of the coefficients)
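A minimal sketch in R, assuming the glmnet package, which fits both ridge (alpha = 0) and lasso (alpha = 1) models; the data and the lambda value are illustrative:

library(glmnet)   # provides glmnet(); assumed to be installed

# Hypothetical data: 50 observations, 5 predictors, some with zero true effect
set.seed(1)
X <- matrix(rnorm(50 * 5), nrow = 50)
y <- as.numeric(X %*% c(3, 0, 1.5, 0, 2) + rnorm(50))

ridge_model <- glmnet(X, y, alpha = 0, lambda = 0.5)   # L2 penalty on squared weights
lasso_model <- glmnet(X, y, alpha = 1, lambda = 0.5)   # L1 penalty on absolute weights

coef(ridge_model)   # all coefficients shrunk toward 0 but usually non-zero
coef(lasso_model)   # some coefficients shrunk exactly to 0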

Simple Linear Regression
Simple linear regression is a statistical method you can use to understand the relationship between two
variables, x and y.

One variable, x, is known as the predictor variable.

The other variable, y, is known as the response variable.

Simple linear regression models the relationship between the magnitude of one variable and that of a
second: for example, as X increases, Y also increases, or as X increases, Y decreases.
Correlation is another way to measure how two variables are related. The difference is that while correlation
measures the strength of an association between two variables, regression quantifies the nature of the
relationship.

Simple linear regression is used to find out the best relationship between a single input variable (predictor,
independent variable, input feature, input parameter) & output variable (predicted, dependent variable,
output feature, output parameter), provided that both variables are continuous in nature. This relationship
represents how the input variable is related to the output variable and is depicted by a straight line.

To understand this concept, let us have a look at scatter plots. Scatter diagrams or plots provide a graphical
representation of the relationship between two continuous variables.

After looking at a scatter plot, we can understand:
1. The direction
2. The strength
3. The linearity

These characteristics describe the relationship between variable Y and variable X. Suppose the scatter plot
shows that variable Y and variable X possess a strong positive linear relationship. Then we can project a
straight line which describes the data in the most accurate way possible.

If the relationship between variable X and variable Y is strong and linear, then we conclude that particular
independent variable X is the effective input variable to predict dependent variable Y.

To check the collinearity ( collinearity of two variables means that strong correlation exists between them)
between variable X and variable Y, we have correlation coefficient (r), which will give you numerical value
of correlation between two variables.

You can have a strong, moderate, or weak correlation between two variables. The higher the value of "r",
the higher the preference given to the particular input variable X for predicting output variable Y. A few
properties of "r" are listed as follows:
1. Range of r: -1 to +1
2. Perfect positive relationship: +1
3. Perfect negative relationship: -1
4. No Linear relationship: 0
5. Strong correlation: r > 0.85 (depends on business scenario)

The command used to calculate "r" in R (RStudio) is:

> cor(X, Y)

where X is the independent variable and Y is the dependent variable. Now, if the result of the above
command is greater than 0.85, then choose simple linear regression.
If r < 0.85, then transform the data to increase the value of "r" and build a simple linear regression model
on the transformed data.

Steps to Implement Simple Linear Regression:
1. Analyze data (analyze scatter plot for linearity)
2. Get sample data for model building
3. Then design a model that explains the data
4. And use the same developed model on the whole population to make predictions.

The equation that represents how an independent variable x is related to a dependent variable y is:

y = B0 + B1x + e

• y is the predicted value of the dependent variable for any given value of the independent variable (x).
• B0 is the intercept, the predicted value of y when x is 0.
• B1 is the regression coefficient: how much we expect y to change as x increases by one unit.
• x is the independent variable (the variable we expect is influencing y).
• e is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.

Example:

Let us understand simple linear regression by considering an example. Suppose we want to predict weight
gain based only on calories consumed, using a small historical dataset of calories consumed and weight
gained.

Suppose we want to predict the weight gain when 2,500 calories are consumed. First, we visualize the data
by drawing a scatter plot to confirm that calories consumed is a suitable independent variable X for
predicting the dependent variable Y.

We can also calculate "r" using cor(). Since r = 0.9910422, which is greater than 0.85, we take calories
consumed as the independent variable (X) and weight gain as the dependent variable (Y) to be predicted.

Now, try to imagine a straight line drawn in a way that is as close as possible to every data point in the
scatter diagram.

To predict the weight gain for a consumption of 2,500 calories, you can simply extend the straight line and
read off the y-axis value at x = 2,500. This projected y value gives you the rough weight gain. This straight
line is the regression line.

Similarly, if we substitute the x value into the equation of the regression model, y = B0 + B1x, the
corresponding y value will be predicted.

In R, a linear regression model can be built with the lm() command, which outputs the estimated intercept
(B0) and slope (B1). Substituting these values, together with x = 2500, into the equation gives the predicted
y.

So, the weight gain predicted by our simple linear regression model is 4.49 kg after consumption of 2,500
calories.
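A minimal sketch of what that workflow might look like in R; the data values here are hypothetical stand-ins (the original dataset is not reproduced), so the printed coefficients will differ from the 4.49 kg figure above:

# Hypothetical calories / weight-gain data (illustrative values only)
calories    <- c(1500, 1800, 2000, 2200, 2600, 3000)
weight_gain <- c(1.1, 2.0, 2.6, 3.2, 4.6, 6.1)

cor(calories, weight_gain)                      # check that r > 0.85 before fitting

model <- lm(weight_gain ~ calories)             # estimates B0 (intercept) and B1 (slope)
coef(model)

predict(model, data.frame(calories = 2500))     # predicted weight gain at 2,500 calories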

Fitted values
The fitted values for a linear regression model are the predicted values of the outcome variable for the data
that is used to fit the model. A fitted value is a statistical model’s prediction of the mean response value
when you input the values of the predictors, factor levels, or components into the model.

Suppose you have the following regression equation: y = 3X + 5. If you enter a value of 5 for the predictor,
the fitted value is 20. Fitted values are also called predicted values.

Residuals
In linear regression, the residuals are the differences between each of the fitted values and true values of
the outcome variable. They can be calculated by subtracting the fitted values from the true values.

The difference between the observed data value and the predicted value (the value on the straight line) is
the error or residual.

Residual = Observed – Predicted
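A minimal sketch in R showing fitted values and residuals for a fitted model, using base R's fitted() and resid() accessors on illustrative data:

x <- c(1, 2, 3, 4, 5)
y <- c(5.2, 8.1, 10.9, 14.2, 16.8)

model <- lm(y ~ x)

fitted(model)       # predicted (fitted) values for the data used to fit the model
resid(model)        # residuals = observed y minus fitted y
y - fitted(model)   # the same residuals, computed by hand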

Least Squares
The least squares method is a form of mathematical regression analysis used to determine the line of best
fit for a set of data, providing a visual demonstration of the relationship between the data points. Each point
of data represents the relationship between a known independent variable and an unknown dependent
variable.

During the process of finding the relationship between two variables, the trend of outcomes is estimated
quantitatively; this process is termed regression analysis. Curve fitting is one approach to regression
analysis, and the least squares method is the method of fitting the equation that best approximates the
given raw data.

Consider:
Let us look at a simple example, Ms. Dolma said in the class "Hey students who spend more time on their
assignments are getting better grades". A student wants to estimate his grade for spending 2.3 hours on an
assignment. Through the magic of the least-squares method, it is possible to determine the predictive model
that will help him estimate the grades far more accurately. This method is much simpler because it requires
nothing more than some data and maybe a calculator.

Least Square Method Definition

The least-squares method is a statistical method used to find the line of best fit of the form of an equation
such as y = mx + b to the given data. The curve of the equation is called the regression line.

Our main objective in this method is to reduce the sum of the squares of errors as much as possible. This is
the reason this method is called the least-squares method. This method is often used in data fitting where
the best fit result is assumed to reduce the sum of squared errors that is considered to be the difference
between the observed values and corresponding fitted value. The sum of squared errors helps in finding the
variation in observed data. For example, given four data points, this method produces the straight line that
minimizes the total squared vertical distance between the points and the line.

Least Square Method Formula

The least squares method finds the curve that best fits a set of observations with a minimum sum of squared
residuals or errors. Let us assume that the given data points are (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn), in
which all x's are independent variables, while all y's are dependent ones. This method is used to find
a linear line of the form y = mx + b, where y and x are variables, m is the slope, and b is the y-intercept.
The formula to calculate slope m and the value of b is given by:
m = (n∑xy - ∑y∑x) / (n∑x² - (∑x)²)

b = (∑y - m∑x)/n

Here, n is the number of data points.

Following are the steps to calculate the least square using the above formulas.

• Step 1: Draw a table with 4 columns where the first two columns are for the x and y points.
• Step 2: In the next two columns, find xy and x².
• Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
• Step 4: Find the value of slope m using the above formula.
• Step 5: Calculate the value of b using the above formula.
• Step 6: Substitute the value of m and b in the equation y = mx + b

Let us look at an example to understand this better.

Example: Let's say we have the data shown below.

x: 1 2 3 4 5
y: 2 5 3 8 7

Solution: We will follow the steps to find the linear line.

 x    y    xy    x²
 1    2     2     1
 2    5    10     4
 3    3     9     9
 4    8    32    16
 5    7    35    25

∑x = 15   ∑y = 25   ∑xy = 88   ∑x² = 55

Find the value of m by using the formula:

m = (n∑xy - ∑y∑x) / (n∑x² - (∑x)²)

m = [(5×88) - (15×25)] / [(5×55) - (15)²]

m = (440 - 375) / (275 - 225)

m = 65/50 = 1.3

Find the value of b by using the formula:

b = (∑y - m∑x) / n

b = (25 - 1.3×15) / 5

b = (25 - 19.5) / 5

b = 5.5/5 = 1.1

So, the required least squares equation is y = mx + b = 1.3x + 1.1.
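A minimal sketch in R that reproduces this calculation, both with the summation formulas above and with lm(), which performs least squares fitting internally:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 5, 3, 8, 7)
n <- length(x)

m <- (n * sum(x * y) - sum(y) * sum(x)) / (n * sum(x^2) - sum(x)^2)   # slope: 1.3
b <- (sum(y) - m * sum(x)) / n                                        # intercept: 1.1

coef(lm(y ~ x))   # lm() gives the same intercept (1.1) and slope (1.3)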

Assumptions of Linear Regression
To conduct a simple linear regression, one has to make certain assumptions about the data. This is because
it is a parametric test. The assumptions used while performing a simple linear regression are as follows:

• Homogeneity of variance (homoscedasticity): one of the main assumptions in simple linear regression is
that the size of the error stays constant. This simply means that the error size does not change significantly
across the values of the independent variable.
• Independence of observations: the observations in the dataset were collected using valid sampling
methods, and there are no hidden relationships among the observations.
• Normality: the residuals (errors) of the model follow a normal distribution.

However, there is one additional assumption that has to be taken into consideration while specifically
conducting a linear regression.

• The relationship is a straight line: there is no curve or grouping factor in a linear regression; there is a
linear relationship between the dependent variable and the independent variable. If the data fail the
assumptions of homoscedasticity or linearity, a non-linear model or a transformation of the data may be
more appropriate.

Applications of Simple Linear Regression


• Marks scored by students based on the number of hours studied (ideally): here the marks scored are the
dependent variable and the number of hours studied is the independent variable.
• Predicting crop yield based on the amount of rainfall: yield is the dependent variable while the amount
of rainfall is the independent variable.
• Predicting the salary of a person based on years of experience: experience is the independent variable
and salary is the dependent variable.

Limitations of Simple Linear Regression


Even the best data does not tell a complete story. Regression analysis is commonly used in research to
establish that a relationship exists between variables. However, correlation is not the same as causation:
a relationship between two variables does not mean that one causes the other to occur. Even a simple
linear regression line that fits the data points well does not guarantee a cause-and-effect relationship.

Using a linear regression model will allow you to discover whether a relationship between variables exists
at all. To see precisely what that relationship is, and whether one variable causes another, you will need
additional research and statistical analysis.

Multiple Linear Regression
Multiple linear regression is used to estimate the relationship between two or more
independent variables and one dependent variable. You can use multiple linear
regression when you want to know:

1. How strong the relationship is between two or more independent variables and one
dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added
affect crop growth).
2. The value of the dependent variable at a certain value of the independent variables
(e.g. the expected yield of a crop at certain levels of rainfall, temperature, and
fertilizer addition).

Multiple Linear Regression Formula

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ϵ

Where:

• yi is the dependent or predicted variable
• β0 is the y-intercept, i.e., the value of y when all the independent variables are 0.
• β1 and β2 are the regression coefficients representing the change in y relative to a
one-unit change in xi1 and xi2, respectively.
• βp is the slope coefficient for each independent variable
• ϵ is the model's random error (residual) term.
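A minimal sketch of multiple linear regression in R using base R's lm(); the crop example follows the rainfall/temperature/fertilizer illustration above, and all data values and names are made up:

# Hypothetical data: crop yield as a function of rainfall, temperature, and fertilizer
crops <- data.frame(
  rainfall    = c(30, 45, 60, 75, 90, 50, 65, 80),
  temperature = c(18, 20, 22, 24, 26, 21, 23, 25),
  fertilizer  = c(5, 10, 15, 20, 25, 12, 18, 22),
  yield       = c(2.1, 3.0, 3.8, 4.5, 5.3, 3.2, 4.1, 4.9)
)

multi_model <- lm(yield ~ rainfall + temperature + fertilizer, data = crops)
summary(multi_model)   # coefficient estimates, significance, and R-squared

# Predicted yield at chosen levels of the independent variables
predict(multi_model, data.frame(rainfall = 70, temperature = 23, fertilizer = 15))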

Understanding Multiple Linear Regression

Simple linear regression enables statisticians to predict the value of one variable using the
available information about another variable. Linear regression attempts to establish the
relationship between the two variables along a straight line.

Multiple regression is a type of regression where the dependent variable shows
a linear relationship with two or more independent variables. It can also be non-linear,
where the dependent and independent variables do not follow a straight line.

Both linear and non-linear regression track a particular response using two or more variables
graphically. However, non-linear regression is usually difficult to execute since it is created
from assumptions derived from trial and error.

Assumptions of Multiple Linear Regression

Multiple linear regression is based on the following assumptions:

1. A linear relationship between the dependent and independent variables

The first assumption of multiple linear regression is that there is a linear relationship
between the dependent variable and each of the independent variables. The best way to
check the linear relationships is to create scatterplots and then visually inspect the
scatterplots for linearity. If the relationship displayed in the scatterplot is not linear, then the
analyst will need to run a non-linear regression or transform the data using statistical
software, such as SPSS.

2. The independent variables are not highly correlated with each other

The data should not show multicollinearity, which occurs when the independent variables
(explanatory variables) are highly correlated. When independent variables show
multicollinearity, there will be problems figuring out the specific variable that contributes
to the variance in the dependent variable. The best method to test for the assumption is the
Variance Inflation Factor method.

3. The variance of the residuals is constant

Multiple linear regression assumes that the amount of error in the residuals is similar at each
point of the linear model (the size of the error in our prediction doesn’t change significantly
across the values of the independent variable). This scenario is known as homoscedasticity.

When analyzing the data, the analyst should plot the standardized residuals against the
predicted values to determine if the points are distributed fairly across all the values of
independent variables. To test the assumption, the data can be plotted on a scatterplot or by
using statistical software to produce a scatterplot that includes the entire model.

4. Independence of observations

The model assumes that the observations should be independent of one another. Simply put,
the model assumes that the values of residuals are independent. To test for this assumption,
we use the Durbin Watson statistic.

The observations in the dataset were collected using statistically valid sampling methods,
and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are
actually correlated with one another, so it is important to check these before developing the
regression model. If two independent variables are too highly correlated (r2 > ~0.6), then
only one of them should be used in the regression model.

5. Multivariate normality

Multivariate normality occurs when residuals are normally distributed. To test this
assumption, look at how the values of residuals are distributed. It can also be tested using
two main methods, i.e., a histogram with a superimposed normal curve or the Normal
Probability Plot method.
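A minimal sketch of how these assumptions might be checked in R; vif() and dwtest() assume the car and lmtest packages respectively, and the small model fitted here is purely illustrative:

library(car)      # assumed installed; provides vif()
library(lmtest)   # assumed installed; provides dwtest()

# Fit a small illustrative model (any lm() object would do)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(40)
m <- lm(y ~ x1 + x2, data = d)

plot(fitted(m), rstandard(m))   # homoscedasticity: residual spread should look constant
vif(m)                          # multicollinearity: values well above ~5 are a warning sign
dwtest(m)                       # independence of residuals (Durbin-Watson test)
hist(resid(m))                  # normality: residuals should look roughly normal
qqnorm(resid(m)); qqline(resid(m))   # normal probability plot of the residuals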
