Module 2 - Intro To Regression Analysis

Regression Analysis

One of the most important types of data analysis is regression analysis.

Consider this scenario: Suppose you're a sales manager trying to predict next month's numbers.
You know that dozens, perhaps even hundreds of factors, from the weather to a competitor's promotion
to the rumor of a new and improved model, can impact the numbers. Perhaps people in your organization
even have a theory about what will have the biggest effect on sales: "The more rain we have, the more we
sell." "Six weeks after the competitor's promotion, sales jump."

Regression analysis is a way of mathematically sorting out which of those variables does indeed have
an impact. It answers the questions: Which factors matter most? Which can we ignore? How do those
factors interact with one another? And, perhaps most important, how certain are we about all these factors?

In regression analysis, those factors are called "variables." You have your dependent variable, the
main factor that you're trying to understand or predict; in this example, that is monthly sales. And then
you have your independent variables, the factors you suspect have an impact on your dependent variable.

How Does it Work?

To conduct a regression analysis, you gather the data on the variables in question. (Reminder: You likely
don’t have to do this yourself, but it’s helpful for you to understand the process your data analyst colleague
uses.) You take all your monthly sales numbers for, say, the past three years and any data on the independent
variables you’re interested in. So, in this case, let’s say you find out the average monthly rainfall for the
past three years as well. Then you plot all that information on a scatter chart. The y-axis is the amount of
sales (the dependent variable, the thing you're interested in, is always on the y-axis), and the x-axis is the
total rainfall. Each point represents one month's data: how much it rained that month and how many sales
you made that same month.

Glancing at this data, you probably notice that sales are higher in months when it rains a lot. That's
interesting to know, but by how much? If it rains three inches, do you know how much you’ll sell? What
about if it rains four inches?

Now imagine drawing a line through the chart above, one that runs roughly through the middle of all the
data points. This line will help you answer, with some degree of certainty, how much you typically sell
when it rains a certain amount.

This is called the "regression line," and it's drawn (using a statistics program like SPSS or Stata or
even Excel) to show the line that best fits the data. In other words, the regression line is the best
explanation of the relationship between the independent variable and the dependent variable.

In addition to drawing the line, your statistics program also outputs a formula that explains the slope of the
line and looks something like this:

y = 200 + 5x + error term

Ignore the error term for now. It refers to the fact that regression isn’t perfectly precise. Just focus on the
model:

y = 200 + 5x

This formula tells you that if x is zero (no rain at all), then y = 200.


Note: The y-axis is the amount of sales (dependent), x-axis is the total rainfall (independent)

So, historically, when it didn't rain at all, you made an average of 200 sales, and you can expect to do the
same going forward, assuming other variables stay the same. And in the past, for every additional inch of
rain, you made an average of five more sales: for every increment that x goes up by one, y goes up by five.

Now let’s return to the error term. You might be tempted to say that rain has a big impact on sales if for
every inch you get five more sales, but whether this variable is worth your attention will depend on the
error term. A regression line always has an error term because, in real life, independent variables are
never perfect predictors of the dependent variables. Rather the line is an estimate based on the
available data. So, the error term tells you how certain you can be about the formula. The larger it
is, the less certain the regression line.

The above example uses only one variable to predict the factor of interest — in this case, rain to predict
sales. Typically you start a regression analysis wanting to understand the impact of several independent
variables. You might include not just rain but also data about a competitor's promotion. You keep doing
this until the error term is very small: you're trying to get the line that fits best with your data. Although
there can be dangers in trying to include too many variables in a regression analysis, skilled analysts can
minimize those risks. And considering the impact of multiple variables at once is one of the biggest
advantages of regression analysis.

How Do Companies Use It?

Regression analysis is the “go-to method in analytics”. And smart companies use it to make decisions
about all sorts of business issues. “As managers, we want to figure out how we can affect sales, retain
employees, or recruit the best people. It helps us figure out what we can do.”

Most companies use regression analysis to explain a phenomenon they want to understand (for example,
Why did customer service calls drop last month?); predict things about the future (for example, What will
sales look like over the next six months?); or to decide what to do (for example, Should we go with this
promotion or a different one?).

A note about “correlation is not causation”: Whenever you work with regression analysis or any other
analysis that tries to explain the impact of one factor on another, you need to remember the important adage:
Correlation is not causation. This is critical. Here’s why: It’s easy to say that there is a correlation between
rain and monthly sales. The regression shows that they are indeed related. But it’s an entirely different thing
to say that rain caused the sales. Unless you’re selling umbrellas, it might be difficult to prove that there is
cause and effect.

Sometimes factors that are obviously not connected by cause and effect turn out to be correlated, but more
often in business, it's not so obvious. When you see a correlation from a regression analysis, you can't make
assumptions. Instead, you must go out and see what's happening in the real world. What's the physical
mechanism that's causing the relationship? Observe consumers buying your product in the rain, talk to
them, and find out what is actually causing them to make the purchase. A lot of people skip this step. The
goal is not to figure out what is going on in the data but to figure out what is going on in the world. You
must pound the pavement.

What Mistakes Do People Make When Working with Regression Analysis?

As a consumer of regression analysis, you need to keep several things in mind. First, don’t tell your data
analysts to figure out what is affecting sales. It’s your job to identify the factors that you suspect are
having an impact and ask your analyst to look at those. If you tell a data scientist to go on a fishing
expedition, or to tell you something you don't know, then you deserve what you get, which is bad analysis.
In other words, don’t ask your analysts to look at every variable they can possibly get their hands on all at
once. If you do, you’ll probably find relationships that don’t really exist. It’s the same principle as flipping
a coin: Do it enough times and you’ll eventually think you see something interesting, like a bunch of heads
all in a row.

Also ask how actionable a variable is. You can't change how much it rains, so how important is it to
understand that? We can't do anything about the weather or our competitor's promotion, but we can affect
our own promotions or add features.

Second, analyses are very sensitive to bad data, so be careful about the data you collect and how you
collect it, and know whether you can trust it. Not all of the data needs to be correct or perfect: if the
decisions you'll make as a result don't have a huge impact on your business, then it's OK if the data is
"kind of leaky." But if you're trying to decide whether to build 8 or 10 of something and each one costs
$1 million to build, then it's a bigger deal.

Many people make the mistake of ignoring the error term. This is dangerous because they treat the
relationship as more certain than it is. Oftentimes the results come out of a computer and managers think,
"That's great, let's use this going forward." But remember that the results are always uncertain. If the
regression explains 90% of the relationship, that's great. But if it explains 10%, and you act like it's 90%,
that's not good. The point of the analysis is to quantify the certainty that something will happen. It's not
telling you how rain will influence your sales; it's telling you the probability that rain may influence your
sales.

Ask yourself whether the results fit with your understanding of the situation. And if you see something that
doesn't make sense, ask whether the data was right or whether there is indeed a large error term.

Summarizing:
What is Regression Analysis?

Predictive modelling techniques such as regression analysis may be used to determine the relationship
between a dataset's dependent (target) and independent variables. It is widely used when the dependent and
independent variables are linked in a linear or non-linear fashion and the target variable has a set of
continuous values. Regression analysis approaches are thus useful for modelling relationships between
variables, modelling time series, and forecasting. Regression analysis, for example, is a good way to
examine the relationship between sales and advertising expenditure for a corporation.

What is the purpose of a regression model?

Regression analysis is used for one of two purposes: predicting the value of the dependent variable when
information about the independent variables is known, or predicting the effect of an independent variable
on the dependent variable.

Types of Regression

There are various types of regression used in data science and machine learning. Each type has its own
importance in different scenarios, but at the core, all regression methods analyze the effect of the
independent variables on the dependent variable. Some important types of regression are discussed below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest regression algorithms and shows the relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence the name linear regression.
o If there is only one input variable (x), the model is called simple linear regression. If there is more than
one input variable, it is called multiple linear regression.
o The relationship between the variables in a linear regression model can be visualized with a scatter plot;
a typical example is predicting the salary of an employee based on years of experience.

o Below is the mathematical equation for linear regression:

Y = aX + b

Here, Y is the dependent variable (target variable), X is the independent variable (predictor variable), and
a and b are the linear coefficients.

Some popular applications of linear regression are:

o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic
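A minimal sketch of how such a model might be fitted in R using base R's lm(); the data and variable names below are purely illustrative:

# Hypothetical data: years of experience (X) and salary in thousands (Y)
experience <- c(1, 2, 3, 4, 5, 6)
salary     <- c(30, 35, 41, 44, 50, 56)

model <- lm(salary ~ experience)   # fits Y = aX + b by least squares
summary(model)                     # shows the estimated coefficients and fit statistics
coef(model)                        # intercept (b) and slope (a)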

Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the classification
problems. In classification problems, we have dependent variables in a binary or discrete format such
as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or
False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in how it is
used.
o Logistic regression uses the sigmoid function (also called the logistic function) to map predictions to
probabilities. The sigmoid function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between 0 and 1
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it produces an S-shaped curve between 0 and 1.

o It uses the concept of threshold levels: values above the threshold are rounded up to 1, and values below
the threshold are rounded down to 0.

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
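A minimal sketch of binary logistic regression in R, using base R's glm() with a binomial family on made-up pass/fail data (all names and values are illustrative):

# Hypothetical data: hours studied and whether the student passed (1) or failed (0)
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8)
passed <- c(0, 0, 0, 1, 0, 1, 1, 1)

logit_model <- glm(passed ~ hours, family = binomial)   # logistic regression
probs <- predict(logit_model, type = "response")        # sigmoid outputs between 0 and 1
predicted_class <- ifelse(probs > 0.5, 1, 0)            # apply a 0.5 threshold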

Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using a linear
model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and
corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear fashion, so for
such case, linear regression will not best fit to those datapoints. To cover such datapoints, we need
Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial features of given
degree and then modeled using a linear model. Which means the datapoints are best fitted using a
polynomial line.

o The equation for polynomial regression is derived from the linear regression equation: the linear equation
Y = b0 + b1x is extended to the polynomial equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is our
independent/input variable.
o The model is still linear because it is linear in the coefficients, even though the features (x, x^2, x^3, ...)
are non-linear transformations of x.
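A minimal sketch in R on made-up curved data; poly() generates the polynomial features and lm() still fits a model that is linear in the coefficients:

# Hypothetical data with a curved relationship
x <- seq(1, 10)
y <- c(2, 5, 9, 16, 24, 35, 48, 62, 80, 99)

poly_model <- lm(y ~ poly(x, degree = 2, raw = TRUE))   # fits Y = b0 + b1*x + b2*x^2
coef(poly_model)                                        # b0, b1, b2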

Support Vector Regression:

Support Vector Machine is a supervised learning algorithm which can be used for regression as well as
classification problems. So if we use it for regression problems, then it is termed as Support Vector
Regression.

Support Vector Regression is a regression algorithm which works for continuous variables. Below are some
keywords which are used in Support Vector Regression:

o Kernel: It is a function used to map lower-dimensional data into a higher-dimensional space.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a line which
helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a margin for
datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane and opposite
class.
In SVR, we always try to determine a hyperplane with a maximum margin, so that the maximum number
of datapoints is covered within that margin. The main goal of SVR is to include as many datapoints as
possible between the boundary lines, with the hyperplane (best-fit line) passing through as many datapoints
as possible. In a plot of such a model, the central line is the hyperplane and the two parallel lines on either
side of it are the boundary lines.
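A minimal sketch of support vector regression in R, assuming the widely used e1071 package is installed; the data and names are illustrative:

library(e1071)   # provides svm(); assumed to be installed

# Hypothetical continuous data
dat <- data.frame(x = seq(1, 20))
dat$y <- sin(dat$x / 3) * 10 + rnorm(20)

svr_model <- svm(y ~ x, data = dat, type = "eps-regression", kernel = "radial")
fitted_y  <- predict(svr_model, dat)   # predictions for the continuous target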

Decision Tree Regression:


o Decision Tree is a supervised learning algorithm which can be used for solving both classification and
regression problems.
o It can solve problems for both categorical and numerical data
o Decision Tree regression builds a tree-like structure in which each internal node represents a "test" on an
attribute, each branch represents the result of the test, and each leaf node represents the final decision or
result.
o A decision tree is constructed starting from the root node/parent node (the dataset), which splits into left
and right child nodes (subsets of the dataset). These child nodes are further divided into their own children
and so become the parent nodes of those nodes.
o A typical example of decision tree regression is a model predicting a person's choice between a sports
car and a luxury car.
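A minimal sketch of decision tree regression in R, assuming the commonly used rpart package; the data, column names, and control settings are illustrative:

library(rpart)   # provides rpart() and rpart.control(); assumed to be installed

# Hypothetical data: predicting car price from age and mileage
cars <- data.frame(
  age     = c(1, 2, 3, 5, 7, 9, 11, 13),
  mileage = c(10, 25, 30, 55, 80, 95, 120, 140),
  price   = c(30, 27, 25, 20, 15, 12, 9, 7)
)

tree_model <- rpart(price ~ age + mileage, data = cars, method = "anova",
                    control = rpart.control(minsplit = 2))   # regression tree
predict(tree_model, data.frame(age = 4, mileage = 40))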

Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms and is capable of performing
both regression and classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple decision trees and
predicts the final output based on the average of the individual tree outputs. The combined decision trees
are called base models, and the ensemble can be represented more formally as:

g(x) = f0(x) + f1(x) + f2(x) + ....

o Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble learning, in which the
aggregated decision trees are trained in parallel and do not interact with each other.
o With the help of Random Forest regression, we can reduce overfitting in the model by training each tree
on a random subset of the dataset.
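A minimal sketch in R, assuming the randomForest package; the data and variable names are made up for illustration:

library(randomForest)   # provides randomForest(); assumed to be installed

# Hypothetical data: predicting crop yield from rainfall and fertilizer
fields <- data.frame(
  rainfall   = runif(100, 20, 100),
  fertilizer = runif(100, 0, 50)
)
fields$yield <- 2 + 0.05 * fields$rainfall + 0.1 * fields$fertilizer + rnorm(100)

rf_model <- randomForest(yield ~ rainfall + fertilizer, data = fields, ntree = 200)
predict(rf_model, data.frame(rainfall = 60, fertilizer = 25))   # averaged tree prediction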

Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression, in which a small amount of bias
is introduced so that we can get better long-term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. The penalty term is
computed by multiplying lambda (λ) by the squared weight of each individual feature and summing the
results.
o The cost function minimized by ridge regression is therefore:

Sum of squared residuals + λ × (sum of the squared coefficients)

o A general linear or polynomial regression will fail if there is high collinearity (strong correlation)
between the independent variables; to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique used to reduce the complexity of the model. It is also
called L2 regularization.
o It also helps when we have more parameters than samples.

Lasso Regression:
o Lasso regression is another regularization technique used to reduce the complexity of the model.
o It is similar to Ridge Regression, except that the penalty term contains the absolute values of the weights
instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas Ridge Regression can only
shrink it close to 0.
o It is also called L1 regularization. The cost function minimized by Lasso regression is:

Sum of squared residuals + λ × (sum of the absolute values of the coefficients)
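A minimal sketch in R, assuming the glmnet package, which fits both ridge (alpha = 0) and lasso (alpha = 1) models; the data and the lambda value are illustrative:

library(glmnet)   # provides glmnet(); assumed to be installed

# Hypothetical data: 50 observations, 5 predictors, some with zero true effect
set.seed(1)
X <- matrix(rnorm(50 * 5), nrow = 50)
y <- as.numeric(X %*% c(3, 0, 1.5, 0, 2) + rnorm(50))

ridge_model <- glmnet(X, y, alpha = 0, lambda = 0.5)   # L2 penalty on squared weights
lasso_model <- glmnet(X, y, alpha = 1, lambda = 0.5)   # L1 penalty on absolute weights

coef(ridge_model)   # all coefficients shrunk toward 0 but usually non-zero
coef(lasso_model)   # some coefficients shrunk exactly to 0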

Simple Linear Regression
Simple linear regression is a statistical method you can use to understand the relationship between two
variables, x and y.

One variable, x, is known as the predictor variable.

The other variable, y, is known as the response variable.

Simple linear regression models the relationship between the magnitude of one variable and that of a
second: for example, as X increases, Y also increases, or as X increases, Y decreases.
Correlation is another way to measure how two variables are related. The difference is that while correlation
measures the strength of an association between two variables, regression quantifies the nature of the
relationship.

Simple linear regression is used to find out the best relationship between a single input variable (predictor,
independent variable, input feature, input parameter) & output variable (predicted, dependent variable,
output feature, output parameter), provided that both variables are continuous in nature. This relationship
represents how the input variable is related to the output variable and is depicted by a straight line.

To understand this concept, let us have a look at scatter plots. Scatter diagrams or plots provide a graphical
representation of the relationship between two continuous variables.

After looking at a scatter plot, we can understand:
1. The direction
2. The strength
3. The linearity

These characteristics describe the relationship between variable Y and variable X. Suppose the scatter plot
shows that variable Y and variable X possess a strong positive linear relationship. Then we can project a
straight line which describes the data in the most accurate way possible.

If the relationship between variable X and variable Y is strong and linear, then we conclude that particular
independent variable X is the effective input variable to predict dependent variable Y.

To check the collinearity ( collinearity of two variables means that strong correlation exists between them)
between variable X and variable Y, we have correlation coefficient (r), which will give you numerical value
of correlation between two variables.

You can have a strong, moderate, or weak correlation between two variables. The higher the value of "r",
the higher the preference given to the particular input variable X for predicting output variable Y. A few
properties of "r" are listed as follows:
1. Range of r: -1 to +1
2. Perfect positive relationship: +1
3. Perfect negative relationship: -1
4. No Linear relationship: 0
5. Strong correlation: r > 0.85 (depends on business scenario)

The command used to calculate "r" in R (RStudio) is:

> cor(X, Y)

where X is the independent variable and Y is the dependent variable. Now, if the result of the above
command is greater than 0.85, then choose simple linear regression.
If r < 0.85, then transform the data to increase the value of "r" and build a simple linear regression model
on the transformed data.

Steps to Implement Simple Linear Regression:
1. Analyze data (analyze scatter plot for linearity)
2. Get sample data for model building
3. Then design a model that explains the data
4. And use the same developed model on the whole population to make predictions.

The equation that represents how an independent variable x is related to a dependent variable y is:

y = B0 + B1x + e

• y is the predicted value of the dependent variable for any given value of the independent variable (x).
• B0 is the intercept, the predicted value of y when x is 0.
• B1 is the regression coefficient: how much we expect y to change as x increases by one unit.
• x is the independent variable (the variable we expect is influencing y).
• e is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.

Example:

Let us understand simple linear regression by considering an example. Suppose we want to predict weight
gain based only on calories consumed, using a small historical dataset of calories consumed and weight
gained.

Suppose we want to predict the weight gain when 2,500 calories are consumed. First, we visualize the data
by drawing a scatter plot to confirm that calories consumed is a suitable independent variable X for
predicting the dependent variable Y.

We can also calculate "r" using cor(). Since r = 0.9910422, which is greater than 0.85, we take calories
consumed as the independent variable (X) and weight gain as the dependent variable (Y) to be predicted.

Now, try to imagine a straight line drawn in a way that is as close as possible to every data point in the
scatter diagram.

To predict the weight gain for a consumption of 2,500 calories, you can simply extend the straight line and
read off the y-axis value at x = 2,500. This projected y value gives you the rough weight gain. This straight
line is the regression line.

Similarly, if we substitute the x value into the equation of the regression model, y = B0 + B1x, the
corresponding y value will be predicted.

In R, a linear regression model can be built with the lm() command, which outputs the estimated intercept
(B0) and slope (B1). Substituting these values, together with x = 2500, into the equation gives the predicted
y.

So, the weight gain predicted by our simple linear regression model is 4.49 kg after consumption of 2,500
calories.
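A minimal sketch of what that workflow might look like in R; the data values here are hypothetical stand-ins (the original dataset is not reproduced), so the printed coefficients will differ from the 4.49 kg figure above:

# Hypothetical calories / weight-gain data (illustrative values only)
calories    <- c(1500, 1800, 2000, 2200, 2600, 3000)
weight_gain <- c(1.1, 2.0, 2.6, 3.2, 4.6, 6.1)

cor(calories, weight_gain)                      # check that r > 0.85 before fitting

model <- lm(weight_gain ~ calories)             # estimates B0 (intercept) and B1 (slope)
coef(model)

predict(model, data.frame(calories = 2500))     # predicted weight gain at 2,500 calories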

Fitted values
The fitted values for a linear regression model are the predicted values of the outcome variable for the data
that is used to fit the model. A fitted value is a statistical model’s prediction of the mean response value
when you input the values of the predictors, factor levels, or components into the model.

Suppose you have the following regression equation: y = 3X + 5. If you enter a value of 5 for the predictor,
the fitted value is 20. Fitted values are also called predicted values.

Residuals
In linear regression, the residuals are the differences between each of the fitted values and true values of
the outcome variable. They can be calculated by subtracting the fitted values from the true values.

The difference between the observed data value and the predicted value (the value on the straight line) is
the error or residual.

Residual = Observed – Predicted
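A minimal sketch in R showing fitted values and residuals for a fitted model, using base R's fitted() and resid() accessors on illustrative data:

x <- c(1, 2, 3, 4, 5)
y <- c(5.2, 8.1, 10.9, 14.2, 16.8)

model <- lm(y ~ x)

fitted(model)       # predicted (fitted) values for the data used to fit the model
resid(model)        # residuals = observed y minus fitted y
y - fitted(model)   # the same residuals, computed by hand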

Least Squares
The least squares method is a form of mathematical regression analysis used to determine the line of best
fit for a set of data, providing a visual demonstration of the relationship between the data points. Each point
of data represents the relationship between a known independent variable and an unknown dependent
variable.

During the process of finding the relationship between two variables, the trend of outcomes is estimated
quantitatively; this process is termed regression analysis. Curve fitting is one approach to regression
analysis, and the least squares method is the method of fitting the equation that best approximates the
given raw data.

Consider:
Let us look at a simple example, Ms. Dolma said in the class "Hey students who spend more time on their
assignments are getting better grades". A student wants to estimate his grade for spending 2.3 hours on an
assignment. Through the magic of the least-squares method, it is possible to determine the predictive model
that will help him estimate the grades far more accurately. This method is much simpler because it requires
nothing more than some data and maybe a calculator.

Least Square Method Definition

The least-squares method is a statistical method used to find the line of best fit of the form of an equation
such as y = mx + b to the given data. The curve of the equation is called the regression line.

Our main objective in this method is to reduce the sum of the squares of errors as much as possible. This is
the reason this method is called the least-squares method. This method is often used in data fitting where
the best fit result is assumed to reduce the sum of squared errors that is considered to be the difference
between the observed values and corresponding fitted value. The sum of squared errors helps in finding the
variation in observed data. For example, given four data points, this method produces the straight line that
minimizes the total squared vertical distance between the points and the line.

Least Square Method Formula

The least squares method finds the curve that best fits a set of observations with a minimum sum of squared
residuals or errors. Let us assume that the given data points are (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn), in
which all x's are independent variables, while all y's are dependent ones. This method is used to find
a linear line of the form y = mx + b, where y and x are variables, m is the slope, and b is the y-intercept.
The formula to calculate slope m and the value of b is given by:
m = (n∑xy - ∑y∑x) / (n∑x² - (∑x)²)

b = (∑y - m∑x)/n

Here, n is the number of data points.

Following are the steps to calculate the least square using the above formulas.

• Step 1: Draw a table with 4 columns where the first two columns are for the x and y points.
• Step 2: In the next two columns, find xy and x².
• Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
• Step 4: Find the value of slope m using the above formula.
• Step 5: Calculate the value of b using the above formula.
• Step 6: Substitute the value of m and b in the equation y = mx + b

Let us look at an example to understand this better.

Example: Let's say we have the data shown below.

x: 1 2 3 4 5
y: 2 5 3 8 7

Solution: We will follow the steps to find the linear line.

 x    y    xy    x²
 1    2     2     1
 2    5    10     4
 3    3     9     9
 4    8    32    16
 5    7    35    25

∑x = 15   ∑y = 25   ∑xy = 88   ∑x² = 55

Find the value of m by using the formula:

m = (n∑xy - ∑y∑x) / (n∑x² - (∑x)²)

m = [(5×88) - (15×25)] / [(5×55) - (15)²]

m = (440 - 375) / (275 - 225)

m = 65/50 = 1.3

Find the value of b by using the formula:

b = (∑y - m∑x) / n

b = (25 - 1.3×15) / 5

b = (25 - 19.5) / 5

b = 5.5/5 = 1.1

So, the required least squares equation is y = mx + b = 1.3x + 1.1.
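A minimal sketch in R that reproduces this calculation, both with the summation formulas above and with lm(), which performs least squares fitting internally:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 5, 3, 8, 7)
n <- length(x)

m <- (n * sum(x * y) - sum(y) * sum(x)) / (n * sum(x^2) - sum(x)^2)   # slope: 1.3
b <- (sum(y) - m * sum(x)) / n                                        # intercept: 1.1

coef(lm(y ~ x))   # lm() gives the same intercept (1.1) and slope (1.3)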

Assumptions of Linear Regression
To conduct a simple linear regression, one has to make certain assumptions about the data. This is because
it is a parametric test. The assumptions used while performing a simple linear regression are as follows:

• Homogeneity of variance (homoscedasticity): one of the main assumptions in simple linear regression is
that the size of the error stays constant. This simply means that the error size does not change significantly
across the values of the independent variable.
• Independence of observations: the observations in the dataset were collected using valid sampling
methods, and there are no hidden relationships among the observations.
• Normality: the residuals (errors) of the model follow a normal distribution.

However, there is one additional assumption that has to be taken into consideration while specifically
conducting a linear regression.

• The relationship is a straight line: there is no curve or grouping factor in a linear regression; there is a
linear relationship between the dependent variable and the independent variable. If the data fail the
assumptions of homoscedasticity or linearity, a non-linear model or a transformation of the data may be
more appropriate.

Applications of Simple Linear Regression


• Marks scored by students based on the number of hours studied (ideally): here the marks scored are the
dependent variable and the number of hours studied is the independent variable.
• Predicting crop yield based on the amount of rainfall: yield is the dependent variable while the amount
of rainfall is the independent variable.
• Predicting the salary of a person based on years of experience: experience is the independent variable
and salary is the dependent variable.

Limitations of Simple Linear Regression


Even the best data does not tell a complete story. Regression analysis is commonly used in research to
establish that a relationship exists between variables. However, correlation is not the same as causation:
a relationship between two variables does not mean that one causes the other to occur. Even a simple
linear regression line that fits the data points well does not guarantee a cause-and-effect relationship.

Using a linear regression model will allow you to discover whether a relationship between variables exists
at all. To see precisely what that relationship is, and whether one variable causes another, you will need
additional research and statistical analysis.

Multiple Linear Regression
Multiple linear regression is used to estimate the relationship between two or more
independent variables and one dependent variable. You can use multiple linear
regression when you want to know:

1. How strong the relationship is between two or more independent variables and one
dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added
affect crop growth).
2. The value of the dependent variable at a certain value of the independent variables
(e.g. the expected yield of a crop at certain levels of rainfall, temperature, and
fertilizer addition).

Multiple Linear Regression Formula

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ϵ

Where:

• yi is the dependent or predicted variable
• β0 is the y-intercept, i.e., the value of y when all the independent variables are 0.
• β1 and β2 are the regression coefficients representing the change in y relative to a
one-unit change in xi1 and xi2, respectively.
• βp is the slope coefficient for each independent variable
• ϵ is the model's random error (residual) term.
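A minimal sketch of multiple linear regression in R using base R's lm(); the crop example follows the rainfall/temperature/fertilizer illustration above, and all data values and names are made up:

# Hypothetical data: crop yield as a function of rainfall, temperature, and fertilizer
crops <- data.frame(
  rainfall    = c(30, 45, 60, 75, 90, 50, 65, 80),
  temperature = c(18, 20, 22, 24, 26, 21, 23, 25),
  fertilizer  = c(5, 10, 15, 20, 25, 12, 18, 22),
  yield       = c(2.1, 3.0, 3.8, 4.5, 5.3, 3.2, 4.1, 4.9)
)

multi_model <- lm(yield ~ rainfall + temperature + fertilizer, data = crops)
summary(multi_model)   # coefficient estimates, significance, and R-squared

# Predicted yield at chosen levels of the independent variables
predict(multi_model, data.frame(rainfall = 70, temperature = 23, fertilizer = 15))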

Understanding Multiple Linear Regression

Simple linear regression enables statisticians to predict the value of one variable using the
available information about another variable. Linear regression attempts to establish the
relationship between the two variables along a straight line.

Multiple regression is a type of regression where the dependent variable shows
a linear relationship with two or more independent variables. It can also be non-linear,
where the dependent and independent variables do not follow a straight line.

Both linear and non-linear regression track a particular response using two or more variables
graphically. However, non-linear regression is usually difficult to execute since it is created
from assumptions derived from trial and error.

Assumptions of Multiple Linear Regression

Multiple linear regression is based on the following assumptions:

1. A linear relationship between the dependent and independent variables

The first assumption of multiple linear regression is that there is a linear relationship
between the dependent variable and each of the independent variables. The best way to
check the linear relationships is to create scatterplots and then visually inspect the
scatterplots for linearity. If the relationship displayed in the scatterplot is not linear, then the
analyst will need to run a non-linear regression or transform the data using statistical
software, such as SPSS.

2. The independent variables are not highly correlated with each other

The data should not show multicollinearity, which occurs when the independent variables
(explanatory variables) are highly correlated. When independent variables show
multicollinearity, there will be problems figuring out the specific variable that contributes
to the variance in the dependent variable. The best method to test for the assumption is the
Variance Inflation Factor method.

3. The variance of the residuals is constant

Multiple linear regression assumes that the amount of error in the residuals is similar at each
point of the linear model (the size of the error in our prediction doesn’t change significantly
across the values of the independent variable). This scenario is known as homoscedasticity.

When analyzing the data, the analyst should plot the standardized residuals against the
predicted values to determine if the points are distributed fairly across all the values of
independent variables. To test the assumption, the data can be plotted on a scatterplot or by
using statistical software to produce a scatterplot that includes the entire model.

4. Independence of observations

The model assumes that the observations should be independent of one another. Simply put,
the model assumes that the values of residuals are independent. To test for this assumption,
we use the Durbin Watson statistic.

The observations in the dataset were collected using statistically valid sampling methods,
and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are
actually correlated with one another, so it is important to check these before developing the
regression model. If two independent variables are too highly correlated (r2 > ~0.6), then
only one of them should be used in the regression model.

5. Multivariate normality

Multivariate normality occurs when residuals are normally distributed. To test this
assumption, look at how the values of residuals are distributed. It can also be tested using
two main methods, i.e., a histogram with a superimposed normal curve or the Normal
Probability Plot method.
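A minimal sketch of how these assumptions might be checked in R; vif() and dwtest() assume the car and lmtest packages respectively, and the small model fitted here is purely illustrative:

library(car)      # assumed installed; provides vif()
library(lmtest)   # assumed installed; provides dwtest()

# Fit a small illustrative model (any lm() object would do)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(40)
m <- lm(y ~ x1 + x2, data = d)

plot(fitted(m), rstandard(m))   # homoscedasticity: residual spread should look constant
vif(m)                          # multicollinearity: values well above ~5 are a warning sign
dwtest(m)                       # independence of residuals (Durbin-Watson test)
hist(resid(m))                  # normality: residuals should look roughly normal
qqnorm(resid(m)); qqline(resid(m))   # normal probability plot of the residuals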
