0% found this document useful (0 votes)
15 views8 pages

Chapter3 First Application Linear Regression

Uploaded by

Paterson Nguepi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views8 pages

Chapter3 First Application Linear Regression

Uploaded by

Paterson Nguepi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

School year 2023-2024/ Second semester/ FET / Computer Engineering

Artificial Intelligence and machine learning

Chapter 3: First Machine Learning Algorithm Linear


regression

I. Introduction

Linear regression is one of the most important regression models which are used in machine
learning. In the regression model, the output variable, which has to be predicted, should be a
continuous variable, such as predicting the weight of a person in a class.

The regression model also follows the supervised learning method, which means that to build
the model, we’ll use past data with labels, which helps predict the output variable in the
future.

Using the linear regression model, we’ll predict the relationship between the two factors/
variables. The variable which we are expecting is called the dependent variable.

The linear regression model is of two types:

• Simple linear regression: It contains only one independent variable, which we use to
predict the dependent variable using one straight line.
• Multiple linear regression, which includes more than one independent variable.

In this chapter, we will concentrate on the Simple linear regression model.

II. Simple Linear Regression

We have data from a company containing the amount spent on Marketing and its sales
corresponding to that marketing budget. The data looks like this,

Proposed by Dr. SOP DEFFO Lionel Landry Page 1


School year 2023-2024/ Second semester/ FET / Computer Engineering

Marketing Budget (X) in Actual Sales(Y) in


Thousands Millions
127.4 10.5
364.4 21.4
150 10
128.7 9.6
285.9 17.4
200 12.5
303.3 20
315.7 21
169.8 14.7
104.9 10.1
297.7 21.5
256.4 16.6
249.1 17.1
323.1 20.7
223 15.5
235 13.5
200 12.5

Figure 1: Sample Marketing Data

Using Microsoft Excel charts, we can make a scatter plot that looks like the following for the
above data.

Actual Sales(Y) in Millions


25

20

15

10

0
0 50 100 150 200 250 300 350 400

Figure 2 : Scatter Plot for the above data

Proposed by Dr. SOP DEFFO Lionel Landry Page 2


School year 2023-2024/ Second semester/ FET / Computer Engineering

The above plot signifies the scatter plot of all the data points according to our given data.
Now, we have to fit a straight line through the data points, which helps us predict future sales.

We know that a straight line is represented as:

y = mx + c

Here, we call the line as Regression Line, which is represented as:

Y = β0 + β1X

Now, there can be so many straight lines that can be passed through the data points. We must
find out the best fit line that can be a model to use it for future predictions. To find the best fit
line among all the lines, we’ll introduce a parameter called Residual(e).

Residual is the difference between Y-axis’s actual value and the Y-axis’s predicted value
based on the straight-line equation for that particular X.

Let’s say we have the scatter plot and straight line like the following figure,

Figure 3 : Image by Author — Calculating Residual value using the graph

Proposed by Dr. SOP DEFFO Lionel Landry Page 3


School year 2023-2024/ Second semester/ FET / Computer Engineering

Now, using the above figure, the residual value for x = 2 is:

Residual(e) = Actual value of Y — the predicted value of Y using the line

e = 3–4 = -1

So, the residual for the value x = 2 is -1.

Similarly, we have a residual value for every data point, which is the difference between the
actual Y value and predicted Y value.

𝑒𝑖 = 𝑦𝑖 — ⏞
𝑦𝑖

So, to find out the best fit line, we’ll use a method called Ordinary Least Squares Method
or Residual Sum of Square (RSS) Method.

𝑅𝑆𝑆 = 𝑒1 2 + 𝑒2 2 + 𝑒3 2 + . . . +𝑒𝑚 2

The RSS value will be least for the best fit line.

III Cost Function

Typically, machine learning models define a cost function for a particular problem. Then we
try to minimize or maximize the cost function based on our requirement. In the above
regression model, the RSS is the cost function; we would like to reduce the cost and find out
the β0 and β1 for the straight-line equation.

Now, let’s come back to our marketing dataset in the excel sheet. Using the Linear
Forecast option in Trendline for the above scatter plot, we’ll directly get the best-fit line
for scatter plot without manually calculating the residual values.

Proposed by Dr. SOP DEFFO Lionel Landry Page 4


School year 2023-2024/ Second semester/ FET / Computer Engineering

Actual Sales(Y) in Millions


25
y = 0.0528x + 3.3525
20

15

10

0
0 50 100 150 200 250 300 350 400

Figure 4 : Best-fit line using Microsoft Excel scatter plot options

As we can see,

Slope(β1) = 0.0528

Intercept(β0) = 3.3525

Let us calculate the predicted sales(Y) for all the data points(X) using the above straight-line
equation.

The Predicted sales will be,

Marketing Budget (X) in Actual Sales(Y) in Predicted Sales(Y-pred)


Thousands Millions in Millions
127.4 10.5 10.07922
364.4 21.4 22.59282
150 10 11.2725
128.7 9.6 10.14786
285.9 17.4 18.44802
200 12.5 13.9125
303.3 20 19.36674
315.7 21 20.02146
169.8 14.7 12.31794
104.9 10.1 8.89122
297.7 21.5 19.07106
256.4 16.6 16.89042
249.1 17.1 16.50498

Proposed by Dr. SOP DEFFO Lionel Landry Page 5


School year 2023-2024/ Second semester/ FET / Computer Engineering

323.1 20.7 20.41218


223 15.5 15.1269
235 13.5 15.7605
200 12.5 13.9125

Figure 5 : Predicting the sales using (y=0.0528x+3.33525) equation

After that, let’s also calculate the Residual Square value for each data point.

Residual Square = (Actual Y value — Predicted Y value)²

Let’s see the excel sheet after applying the above formula to calculate residual square.

Marketing Budget (X) in Actual Sales(Y) in Predicted Sales(Y^) in Residual Square


Thousands Millions Millions (y_y-pred)^2
127.4 10.5 10.07922 0.177055808
364.4 21.4 22.59282 1.422819552
150 10 11.2725 1.61925625
128.7 9.6 10.14786 0.30015058
285.9 17.4 18.44802 1.09834592
200 12.5 13.9125 1.99515625
303.3 20 19.36674 0.401018228
315.7 21 20.02146 0.957540532
169.8 14.7 12.31794 5.674209844
104.9 10.1 8.89122 1.461149088
297.7 21.5 19.07106 5.899749524
256.4 16.6 16.89042 0.084343776
249.1 17.1 16.50498 0.3540488
323.1 20.7 20.41218 0.082840352
223 15.5 15.1269 0.13920361
235 13.5 15.7605 5.10986025
200 12.5 13.9125 1.99515625

Figure 6 : Dataset after calculating the Residual Squares

Now, RSS is the sum of all the Residual square values from the above sheet.

RSS = 28.77190461

Since this is the best-fit line, the RSS value we got here is the minimum.

Proposed by Dr. SOP DEFFO Lionel Landry Page 6


School year 2023-2024/ Second semester/ FET / Computer Engineering

If we observe RSS value here, it is an absolute quantity. In the future, if we change the
problem setting where we measure sales in terms of billions instead of millions, the RSS
quantity is going to change.

So, we need to define an alternate measure that is relative and not an absolute quantity. That
alternate measure is called Total Sum of Squares (TSS). Using TSS, we’ll calculate the R²
value, which will determine if the model is viable or not.

TSS = (Y1-Ȳ)² + (Y2-Ȳ)² + (Y3-Ȳ)² + ……. + (Ym-Ȳ)²

Where,
Y1, Y2, Y3,….., Ym are the values from the data points.
Ȳ is the average value of the Y-axis column

Now, after calculating TSS, we will compute R².


R² = 1-(RSS/TSS)

R² value always lies between 0 and 1.

If R² is close to 1, then our model is excellent, and we can use the model to predict the
analysis. The model is not suitable for the predictive analysis if the value is close to 0.

Now, let’s calculate the TSS in our excel dataset.

First, we’ll find out the (Yn-Ȳ)² value for every data point, the Ȳ (Average Y-value) is
15.56470588

Now, the dataset looks like,

Marketing Budget (X) in Actual Sales(Y) in Predicted Sales(Y^) in Residual Square Sum of Square
Thousands Millions Millions (y-y-pred)^2 (y-yaverage)
127.4 10.5 10.07922 0.177055808 25.65124567
364.4 21.4 22.59282 1.422819552 34.05065744
150 10 11.2725 1.61925625 30.96595156
128.7 9.6 10.14786 0.30015058 35.57771626
285.9 17.4 18.44802 1.09834592 3.368304498
200 12.5 13.9125 1.99515625 9.392422145
303.3 20 19.36674 0.401018228 19.67183391
315.7 21 20.02146 0.957540532 29.54242215

Proposed by Dr. SOP DEFFO Lionel Landry Page 7


School year 2023-2024/ Second semester/ FET / Computer Engineering

169.8 14.7 12.31794 5.674209844 0.747716263


104.9 10.1 8.89122 1.461149088 29.86301038
297.7 21.5 19.07106 5.899749524 35.22771626
256.4 16.6 16.89042 0.084343776 1.07183391
249.1 17.1 16.50498 0.3540488 2.357128028
323.1 20.7 20.41218 0.082840352 26.37124567
223 15.5 15.1269 0.13920361 0.004186851
235 13.5 15.7605 5.10986025 4.263010381
200 12.5 13.9125 1.99515625 9.392422145

Figure 6: Computing the Sum of Squares using the Y-value and average of all Y-values

TSS = Sum of all the Sum of Squares from the dataset

TSS = 297.5188235

Since we have already calculated the RSS above. Let’s find out the value of R²,
R² = 1-(RSS/TSS) = 0.903293834.

If we observe the scatter plot graph with the best-fit line above, there below the straight-line
equation, the excel already calculated the R² value as 0.9033, which is what we got using all
the calculations.

Since the R² value is more than 90%, this model is highly recommended to predict future
analysis.

IV Conclusion

The regression model is one of the essential models in machine learning. Using this model,
we can predict the outcome of the variable. If the output variable is categorical, we’ll use
another type of model called the Classification model.

Proposed by Dr. SOP DEFFO Lionel Landry Page 8

You might also like