Chapter 3: First Application - Linear Regression
I. Introduction
Linear regression is one of the most important regression models used in machine
learning. In a regression model, the output variable to be predicted must be a
continuous variable, such as the weight of a person in a class.
The regression model follows the supervised learning method: to build the model,
we use past, labeled data, which helps us predict the output variable in the
future.
Using the linear regression model, we predict the relationship between two factors/
variables. The variable we are trying to predict is called the dependent variable,
and the variable used to make the prediction is called the independent variable.
• Simple linear regression: contains only one independent variable, which we use to
predict the dependent variable using one straight line.
• Multiple linear regression: includes more than one independent variable.
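As a sketch of the difference between the two model types, both can be fit by ordinary least squares. The tiny dataset below is entirely made up for illustration; the budget, salesperson, and sales figures are not from the chapter's data:

```python
import numpy as np

# Hypothetical data: marketing budget (X1), salespeople (X2), sales (Y).
# These numbers are invented purely to illustrate the two model types.
X1 = np.array([100.0, 200.0, 300.0, 400.0])
X2 = np.array([5.0, 8.0, 6.0, 10.0])
Y = np.array([8.0, 14.0, 18.0, 25.0])

# Simple linear regression: one independent variable (X1).
A_simple = np.column_stack([np.ones_like(X1), X1])
beta_simple, *_ = np.linalg.lstsq(A_simple, Y, rcond=None)
print("simple:  intercept=%.4f slope=%.4f" % tuple(beta_simple))

# Multiple linear regression: two independent variables (X1 and X2).
A_multi = np.column_stack([np.ones_like(X1), X1, X2])
beta_multi, *_ = np.linalg.lstsq(A_multi, Y, rcond=None)
print("multiple:", np.round(beta_multi, 4))
```

With one predictor we recover a single slope and intercept; with two predictors we get one coefficient per independent variable plus the intercept.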
We have data from a company containing the amount spent on marketing and the sales
corresponding to that marketing budget (a sample of the data appears in Figure 6 below).
Using Microsoft Excel charts, we can make a scatter plot of the above data, which
looks like the following.
[Figure: scatter plot of sales (0 to 20, in millions) against marketing budget (0 to 400, in thousands)]
The plot above shows all the data points from our dataset.
Now, we have to fit a straight line through the data points that helps us predict future sales.
In slope-intercept form, a straight line is written as
y = mx + c
In regression notation, the same line is written as
Y = β0 + β1X
where β0 is the intercept and β1 is the slope.
Many straight lines can be drawn through the data points. We must find the
best-fit line that can serve as a model for future predictions. To find the best-fit
line among all the candidates, we introduce a quantity called the residual (e).
The residual is the difference between the actual Y value and the Y value predicted
by the straight-line equation for that particular X.
Let’s say we have a scatter plot and a straight line as shown in the following figure.
Now, using the above figure, the residual value for x = 2 is:
e = 3 − 4 = −1
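The residual computation above can be written as a one-line helper (a trivial sketch, using the values from the figure):

```python
def residual(y_actual, y_pred):
    """Residual e: actual Y minus the Y predicted by the line."""
    return y_actual - y_pred

print(residual(3, 4))  # -> -1, matching the worked example above
```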
Similarly, we have a residual value for every data point, which is the difference between the
actual Y value and predicted Y value.
eᵢ = yᵢ − ŷᵢ
So, to find the best-fit line, we use a method called the Ordinary Least Squares
method, also known as the Residual Sum of Squares (RSS) method.
RSS = e₁² + e₂² + e₃² + ... + eₘ²
The RSS value is smallest for the best-fit line.
Typically, machine learning models define a cost function for a particular problem,
which we then minimize or maximize based on our requirement. In the above
regression model, RSS is the cost function; we want to minimize it and find
the β0 and β1 of the straight-line equation.
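For simple linear regression, minimizing RSS has a well-known closed-form solution for β0 and β1. The chapter does not derive it, so the sketch below simply applies the standard formulas, assuming NumPy is available:

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS: the beta0, beta1 that minimize RSS."""
    x_bar, y_bar = x.mean(), y.mean()
    # beta1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Sanity check: points lying exactly on Y = 2 + 0.5X should be recovered.
x = np.array([0.0, 10.0, 20.0, 30.0])
y = 2.0 + 0.5 * x
b0, b1 = ols_fit(x, y)
print(f"beta0={b0:.4f}, beta1={b1:.4f}")  # beta0=2.0000, beta1=0.5000
```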
Now, let’s come back to our marketing dataset in the Excel sheet. Using the Linear
Forecast option under Trendline for the above scatter plot, we get the best-fit line
directly, without manually calculating the residual values.
[Figure: the same scatter plot with the Excel trendline (best-fit line) overlaid; X-axis 0 to 400]
As we can see,
Slope(β1) = 0.0528
Intercept(β0) = 3.3525
Let us calculate the predicted sales (Ŷ) for all the data points (X) using the above
straight-line equation.
After that, let’s also calculate the residual square value for each data point.
The Excel sheet after applying this formula looks as follows.
Now, RSS is the sum of all the Residual square values from the above sheet.
RSS = 28.77190461
Since this is the best-fit line, the RSS value we got here is the minimum.
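As a cross-check, we can recompute the residual squares for the eight rows shown in Figure 6 using the trendline coefficients. Note that the full dataset has more rows than Figure 6 shows, so this partial sum is smaller than the total RSS reported below:

```python
import numpy as np

# The eight (X, Y) rows shown in Figure 6; the full dataset has more rows,
# so the partial RSS here is smaller than the chapter's total of 28.77.
X = np.array([127.4, 364.4, 150.0, 128.7, 285.9, 200.0, 303.3, 315.7])
Y = np.array([10.5, 21.4, 10.0, 9.6, 17.4, 12.5, 20.0, 21.0])

beta0, beta1 = 3.3525, 0.0528    # trendline coefficients from Excel
Y_pred = beta0 + beta1 * X       # predicted sales for each budget
residual_sq = (Y - Y_pred) ** 2  # e_i squared for each data point

print(np.round(residual_sq, 6))
print("partial RSS over these rows:", residual_sq.sum())
```

The first value matches Figure 6's 0.177055808 for X = 127.4, and the X = 200 row gives 1.99515625, also as shown in the table.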
If we observe RSS value here, it is an absolute quantity. In the future, if we change the
problem setting where we measure sales in terms of billions instead of millions, the RSS
quantity is going to change.
So, we need to define an alternate measure that is relative rather than absolute. That
alternate measure is called the Total Sum of Squares (TSS). Using TSS, we’ll calculate
the R² value, which determines whether the model is viable:
TSS = (Y₁ − Ȳ)² + (Y₂ − Ȳ)² + ... + (Yₘ − Ȳ)²
R² = 1 − (RSS/TSS)
Where,
Y₁, Y₂, Y₃, ..., Yₘ are the values from the data points, and
Ȳ is the average value of the Y-axis column.
If R² is close to 1, the model is excellent, and we can use it for predictive
analysis. If the value is close to 0, the model is not suitable for predictive analysis.
First, we’ll find the (Yₙ − Ȳ)² value for every data point; the average Y value (Ȳ) is
15.56470588.
Marketing Budget (X)  Actual Sales (Y)  Predicted Sales (Ŷ)  Residual Square  Sum of Squares
in Thousands          in Millions       in Millions          (y − ŷ)²         (y − ȳ)²
127.4                 10.5              10.07922             0.177055808      25.65124567
364.4                 21.4              22.59282             1.422819552      34.05065744
150                   10                11.2725              1.61925625       30.96595156
128.7                 9.6               10.14786             0.30015058       35.57771626
285.9                 17.4              18.44802             1.09834592       3.368304498
200                   12.5              13.9125              1.99515625       9.392422145
303.3                 20                19.36674             0.401018228      19.67183391
315.7                 21                20.02146             0.957540532      29.54242215
Figure 6: Computing the Sum of Squares using the Y-value and average of all Y-values
TSS = 297.5188235
We have already calculated the RSS above, so let’s find the value of R²:
R² = 1 − (RSS/TSS) = 0.903293834
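The same R² calculation in code, using the RSS and TSS values computed above:

```python
# R^2 from the chapter's RSS and TSS values.
RSS = 28.77190461
TSS = 297.5188235

r_squared = 1 - RSS / TSS
print(f"R^2 = {r_squared:.4f}")  # R^2 = 0.9033
```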
If we look at the scatter plot with the best-fit line above, Excel has already
displayed the R² value of 0.9033 below the straight-line equation, which matches
the result of our calculations.
Since the R² value is above 90%, this model is highly recommended for predicting
future sales.
IV. Conclusion
The regression model is one of the essential models in machine learning. Using this
model, we can predict the outcome of a continuous variable. If the output variable is
categorical, we use another type of model called a classification model.