BookSlides - 7 Part A - Error-Based Learning

The document outlines the fundamentals of multivariate linear regression using gradient descent, emphasizing the iterative adjustment of parameters to minimize prediction error. It discusses the importance of error functions, particularly the sum of squared errors, and illustrates these concepts with a dataset of office rental prices. The document also introduces the concept of error surfaces and how they relate to finding optimal model parameters.


Error-based Learning
Sections 7.1, 7.2, 7.3

Dr. Mohamed Brahimi and Prof. Ahmed Guessoum



1 Big Idea

2 Fundamentals
Simple Linear Regression
Measuring Error
Error Surfaces
3 Standard Approach: Multivariate Linear Regression
with Gradient Descent
Multivariate Linear Regression
Gradient Descent
Choosing Learning Rates & Initial Weights
A Worked Example

4 Summary

Big Idea

When solving problems, it is common practice for humans to progressively adjust their solution, aiming to reduce the gap (error) between their current attempt and an acceptable ("right") solution.
A parameterised prediction model is initialised with a set of random parameters, and an error function is used to judge how well this initial model performs when making predictions for instances in a training dataset.
Based on the value of the error function, the parameters are iteratively adjusted to create a more and more accurate model.
A family of error-based machine learning algorithms takes
this same approach.

Fundamentals

Simple Linear Regression

Table: The office rentals dataset: a dataset that includes office rental prices and a number of descriptive features for 10 Dublin city-centre offices.

ID   SIZE (SQ FT)   FLOOR   BROADBAND RATE   ENERGY RATING   RENTAL PRICE
 1        500         4            8               C              320
 2        550         7           50               A              380
 3        620         9            7               A              400
 4        630         5           24               B              390
 5        665         8          100               C              385
 6        700         4            8               B              410
 7        770        10            7               B              480
 8        880        12           50               A              600
 9        920        14            8               C              570
10      1,000         9           24               B              620

We start by considering only SIZE and RENTAL PRICE.



Simple Linear Regression

Table: The SIZE and RENTAL PRICE features from the office rentals dataset.

ID   SIZE   RENTAL PRICE
 1    500        320
 2    550        380
 3    620        400
 4    630        390
 5    665        385
 6    700        410
 7    770        480
 8    880        600
 9    920        570
10  1,000        620
Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset.

Simple Linear Regression

From the scatter plot it appears that there is a linear relationship between SIZE and RENTAL PRICE.
The equation of a line can be written as:

y = mx + b    (1)

Simple Linear Regression

The scatter plot below shows the same data as the previous figure, with a simple linear model added to capture the relationship between office sizes and office rental prices.
This model is:

RENTAL PRICE = 6.47 + 0.62 × SIZE

Figure: A scatter plot of the SIZE and RENTAL PRICE features with this simple linear model overlaid.

Simple Linear Regression

RENTAL PRICE = 6.47 + 0.62 × SIZE

Using this model, we can determine the expected rental price of a 730 square foot office:

RENTAL PRICE = 6.47 + 0.62 × 730 = 459.07
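
This calculation is easy to check in code; a minimal Python sketch (the function name is ours, the weights are the fitted values quoted above):

def rental_price(size):
    # simple linear regression model: RENTAL PRICE = 6.47 + 0.62 * SIZE
    return 6.47 + 0.62 * size

print(rental_price(730))  # 459.07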

Simple Linear Regression

This simple linear regression model can be written in terms of a weight vector w as:

M_w(d) = w[0] + w[1] \times d[1]    (2)


Measuring Error

Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset. A collection of possible simple linear regression models capturing the relationship between these two features is also shown. For all models w[0] is set to 6.47. From top to bottom the models use 0.4, 0.5, 0.62, 0.7, and 0.8 respectively for w[1].

Measuring Error

Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset showing a candidate prediction model (with w[0] = 6.47 and w[1] = 0.62) and the resulting errors.

Measuring Error

There are many different kinds of error functions, but for measuring the fit of simple linear regression models, the most commonly used is the sum of squared errors error function, or L_2:

L_2(M_w, D) = \frac{1}{2} \sum_{i=1}^{n} \left(t_i - M_w(d_i[1])\right)^2    (3)

            = \frac{1}{2} \sum_{i=1}^{n} \left(t_i - (w[0] + w[1] \times d_i[1])\right)^2    (4)

Measuring Error

Table: Calculating the sum of squared errors for the candidate model (with w[0] = 6.47 and w[1] = 0.62) making predictions for the office rentals dataset.

ID   RENTAL PRICE   Model Prediction    Error   Squared Error
 1        320            316.79          3.21         10.32
 2        380            347.82         32.18      1,035.62
 3        400            391.26          8.74         76.32
 4        390            397.47         -7.47         55.80
 5        385            419.19        -34.19      1,169.13
 6        410            440.91        -30.91        955.73
 7        480            484.36         -4.36         19.01
 8        600            552.63         47.37      2,243.90
 9        570            577.46         -7.46         55.59
10        620            627.11         -7.11         50.51
                                          Sum      5,671.64
                  Sum of squared errors (Sum/2)    2,835.82
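
As a sketch, this table can be reproduced in a few lines of Python; note that with the rounded weights (6.47, 0.62) the result differs very slightly from the table, which evidently uses unrounded weights:

# office rentals data as (SIZE, RENTAL PRICE) pairs
data = [(500, 320), (550, 380), (620, 400), (630, 390), (665, 385),
        (700, 410), (770, 480), (880, 600), (920, 570), (1000, 620)]

w0, w1 = 6.47, 0.62  # candidate model weights

# L2 = 1/2 * sum of (target - prediction)^2 over the dataset
sse = 0.5 * sum((t - (w0 + w1 * d)) ** 2 for d, t in data)
print(sse)  # approx. 2,837, close to the table's 2,835.82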

Measuring Error

If we perform the same calculation for the other candidate models shown in the earlier figure, we find that with w[1] set to 0.4, 0.5, 0.7, and 0.8, the sums of squared errors are 136,218, 42,712, 20,092, and 90,978, respectively.
The errors for these models are all larger than for the model with w[1] set to 0.62, which confirms our visual intuition that this model most accurately fits the training data.

Error Surfaces

For every possible combination of weights, w[0] and w[1], there is a corresponding sum of squared errors value; plotted together, these values form a surface.

Figure: (a) A 3D surface plot and (b) a contour plot of the error surface generated by plotting the sum of squared errors value for the office rentals training set for each possible combination of values for w[0] (from the range [−10, 20]) and w[1] (from the range [−2, 3]).

Error Surfaces

The x-y plane is known as a weight space, and the surface is known as an error surface.
The model that best fits the training data is the model corresponding to the lowest point on the error surface.
Because the model is linear in the weights and the error function is a sum of squares, the error surface is convex and possesses a global minimum.
So we can find the optimal weights at the point where the partial derivatives of the error surface with respect to w[0] and w[1] are equal to 0.

Error Surfaces

Using Equation (4), we can formally define this point on the error surface as the point at which:

\frac{\partial}{\partial w[0]} \frac{1}{2} \sum_{i=1}^{n} \left(t_i - (w[0] + w[1] \times d_i[1])\right)^2 = 0    (5)

and

\frac{\partial}{\partial w[1]} \frac{1}{2} \sum_{i=1}^{n} \left(t_i - (w[0] + w[1] \times d_i[1])\right)^2 = 0    (6)

There are a number of different ways to find this point.
We will describe a guided search approach known as the gradient descent algorithm.
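
As an aside, for linear regression these equations can also be solved directly in closed form; a quick numpy sketch (a check on the weights quoted earlier, not the approach taken in these slides):

import numpy as np

sizes = np.array([500, 550, 620, 630, 665, 700, 770, 880, 920, 1000])
prices = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620])

# design matrix with a column of 1s for the intercept w[0]
X = np.column_stack([np.ones_like(sizes, dtype=float), sizes])
w, *_ = np.linalg.lstsq(X, prices, rcond=None)
print(w)  # approx. [6.47, 0.62]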

Standard Approach: Multivariate Linear Regression with Gradient Descent

Multivariate Linear Regression

The most common approach to error-based machine learning for predictive analytics is to use multivariate linear regression with gradient descent to train a best-fit model for a given training dataset.

Multivariate Linear Regression

Table: A dataset that includes office rental prices and a number of descriptive features for 10 Dublin city-centre offices.

ID   SIZE   FLOOR   BROADBAND RATE   ENERGY RATING   RENTAL PRICE
 1    500     4            8               C              320
 2    550     7           50               A              380
 3    620     9            7               A              400
 4    630     5           24               B              390
 5    665     8          100               C              385
 6    700     4            8               B              410
 7    770    10            7               B              480
 8    880    12           50               A              600
 9    920    14            8               C              570
10  1,000     9           24               B              620

Multivariate Linear Regression

We can define a multivariate linear regression model as:

M_w(d) = w[0] + w[1] \times d[1] + \cdots + w[m] \times d[m]    (7)

       = w[0] + \sum_{j=1}^{m} w[j] \times d[j]    (8)

Multivariate Linear Regression

We can make Equation (8) look a little neater by inventing a dummy descriptive feature, d[0], that is always equal to 1:

M_w(d) = w[0] \times d[0] + w[1] \times d[1] + \cdots + w[m] \times d[m]    (9)

       = \sum_{j=0}^{m} w[j] \times d[j]    (10)

       = w \cdot d    (11)
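
In code the dummy feature reduces the whole model to a single dot product; a minimal sketch with hypothetical weights and feature values (assuming numpy):

import numpy as np

w = np.array([0.5, 1.0, -2.0, 0.25])  # w[0]..w[3], hypothetical values
d = np.array([1.0, 3.0, 7.0, 2.0])    # d[0] = 1 is the dummy feature

print(w @ d)  # w[0]*1 + w[1]*d[1] + w[2]*d[2] + w[3]*d[3] = -10.0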

Multivariate Linear Regression

The sum of squared errors loss function, L_2, that we defined in Equation (4) changes only very slightly to reflect the new regression equation:

L_2(M_w, D) = \frac{1}{2} \sum_{i=1}^{n} \left(t_i - M_w(d_i)\right)^2    (12)

            = \frac{1}{2} \sum_{i=1}^{n} \left(t_i - (w \cdot d_i)\right)^2    (13)

Multivariate Linear Regression

This multivariate model allows us to include all but one of the descriptive features in the table above (the categorical ENERGY RATING feature is excluded) in a regression model to predict office rental prices.
The resulting multivariate regression model equation is:

RENTAL PRICE = w[0] + w[1] × SIZE + w[2] × FLOOR + w[3] × BROADBAND RATE

Multivariate Linear Regression

We will see in the next section how the best-fit set of weights for this equation is found, but for now we will set:

w[0] = −0.1513, w[1] = 0.6270, w[2] = −0.1781, w[3] = 0.0714

This means that the model can be rewritten as:

RENTAL PRICE = −0.1513 + 0.6270 × SIZE − 0.1781 × FLOOR + 0.0714 × BROADBAND RATE

Multivariate Linear Regression

Using this model:

RENTAL PRICE = −0.1513 + 0.6270 × SIZE − 0.1781 × FLOOR + 0.0714 × BROADBAND RATE

we can, for example, predict the expected rental price of a 690 square foot office on the 11th floor of a building with a broadband rate of 50 Mb per second as:

RENTAL PRICE = −0.1513 + 0.6270 × 690 − 0.1781 × 11 + 0.0714 × 50 = 434.0896
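
The same prediction as a dot product, using the dummy-feature form from Equation (11) (a sketch assuming numpy):

import numpy as np

w = np.array([-0.1513, 0.6270, -0.1781, 0.0714])
d = np.array([1, 690, 11, 50])  # dummy feature, SIZE, FLOOR, BROADBAND RATE

print(w @ d)  # 434.0896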

Gradient Descent

Figure: (a) A 3D surface plot and (b) a contour plot of the same error surface. The lines indicate the path that the gradient descent algorithm would take across this error surface from different starting positions to the global minimum, marked as the white dot in the centre.

The next figure shows the journey across the error surface taken by the gradient descent algorithm when training the simple version of the office rentals example, involving just SIZE and RENTAL PRICE.

Figure: (a) A 3D surface plot and (b) a contour plot of the error surface for the office rentals dataset showing the path that the gradient descent algorithm takes towards the best-fit model.
Figure: A selection of the simple linear regression models developed during the gradient descent process for the office rentals dataset. The final panel shows the sum of squared errors values generated during the gradient descent process.

Gradient Descent

Require: a set of training instances D
Require: a learning rate α that controls how quickly the algorithm converges
Require: a function, errorDelta, that determines the direction in which to adjust a given weight, w[j], so as to move down the slope of an error surface determined by the dataset D
Require: a convergence criterion that indicates that the algorithm has completed

1: w ← random starting point in the weight space
2: repeat
3:   for each w[j] in w do
4:     w[j] ← w[j] + α × errorDelta(D, w[j])
5:   end for
6: until convergence occurs

The gradient descent algorithm for training multivariate linear regression models.
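
A minimal Python rendering of this pseudocode (a sketch: names are ours, errorDelta is supplied by the caller and takes the full weight vector, since each delta depends on all the weights, and a fixed iteration count stands in for a real convergence test):

import random

def gradient_descent(D, alpha, error_delta, n_weights, max_iters=100):
    # line 1: random starting point in the weight space
    w = [random.uniform(-0.2, 0.2) for _ in range(n_weights)]
    # lines 2-6: repeatedly adjust every weight
    for _ in range(max_iters):
        # compute all deltas from the current w, then update (batch behaviour)
        w = [w[j] + alpha * error_delta(D, w, j) for j in range(n_weights)]
    return w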

Gradient Descent

The most important part of the gradient descent algorithm is line 4, on which the weights are updated:

w[j] ← w[j] + α × errorDelta(D, w[j])

Each weight is considered independently, and for each one a small adjustment is made by adding a small delta value to the current weight, w[j].
This adjustment should ensure that the change in the weight leads to a move downwards on the error surface.

Gradient Descent

Imagine for a moment that our training dataset, D, contains just one training instance: (d, t).
The gradient of the error surface is given as the partial derivative of L_2 with respect to each weight, w[j]:

\frac{\partial}{\partial w[j]} L_2(M_w, D) = \frac{\partial}{\partial w[j]} \frac{1}{2} \left(t - M_w(d)\right)^2    (14)

= (t - M_w(d)) \times \frac{\partial}{\partial w[j]} \left(t - M_w(d)\right)    (15)

= (t - M_w(d)) \times \frac{\partial}{\partial w[j]} \left(t - (w \cdot d)\right)    (16)

= (t - M_w(d)) \times (-d[j])    (17)

Gradient Descent

Adjusting the calculation to take into account multiple training instances:

\frac{\partial}{\partial w[j]} L_2(M_w, D) = \sum_{i=1}^{n} \left((t_i - M_w(d_i)) \times (-d_i[j])\right)

We use this equation to define errorDelta in our gradient descent algorithm:

w[j] \leftarrow w[j] + \alpha \underbrace{\sum_{i=1}^{n} \left((t_i - M_w(d_i)) \times d_i[j]\right)}_{errorDelta(D, w[j])}

This rule is known as the weight update rule for multivariate linear regression with gradient descent.
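
Filling in errorDelta from this equation completes the sketch started after the pseudocode; here we assume each instance already carries the dummy feature d[0] = 1:

def error_delta(D, w, j):
    # sum over all instances of (t - M_w(d)) * d[j]
    total = 0.0
    for d, t in D:
        prediction = sum(w[k] * d[k] for k in range(len(w)))  # M_w(d) = w . d
        total += (t - prediction) * d[j]
    return total

# example usage with the gradient_descent sketch above:
# D = [((1, 500), 320), ((1, 550), 380), ...]  # (features, target) pairs
# w = gradient_descent(D, alpha=0.00000002, error_delta=error_delta, n_weights=2)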

Gradient Descent

The approach to training multivariate linear regression models described so far is known as batch gradient descent.
The word batch is used because only one adjustment is made to each weight at each iteration of the algorithm, based on summing the squared error made by the candidate model for each instance in the training dataset.

Choosing Learning Rates & Initial Weights

The (constant) learning rate, α, determines the size of the adjustment made to each weight at each step in the process.
Unfortunately, choosing learning rates is not a well-defined science.
Most practitioners use rules of thumb and trial and error.

Choosing Learning Rates & Initial Weights

Figure: Plots of the journeys made across the error surface for the simple office rentals prediction problem for different learning rates: (a) a very small learning rate (0.002), (b) a medium learning rate (0.08), and (c) a very large learning rate (0.18). Panels (d), (e), and (f) plot the corresponding sum of squared errors against training iteration.

Choosing Learning Rates & Initial Weights

A typical range for learning rates is [0.00001, 10].
Based on empirical evidence, choosing random initial weights uniformly from the range [−0.2, 0.2] tends to work well.
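
These rules of thumb are simple to apply in code; a short sketch (assuming numpy):

import numpy as np

alpha = 0.08                              # a 'medium' learning rate, per the figure above
w = np.random.uniform(-0.2, 0.2, size=4)  # random initial weights w[0]..w[3]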

A Worked Example

We are now in a position to build a linear regression model that uses all of the continuous descriptive features in the office rentals dataset.
The general structure of the model is:

RENTAL PRICE = w[0] + w[1] × SIZE + w[2] × FLOOR + w[3] × BROADBAND RATE

A Worked Example

Table: The office rentals dataset: a dataset that includes office rental prices and a number of descriptive features for 10 Dublin city-centre offices.

ID   SIZE   FLOOR   BROADBAND RATE   ENERGY RATING   RENTAL PRICE
 1    500     4            8               C              320
 2    550     7           50               A              380
 3    620     9            7               A              400
 4    630     5           24               B              390
 5    665     8          100               C              385
 6    700     4            8               B              410
 7    770    10            7               B              480
 8    880    12           50               A              600
 9    920    14            8               C              570
10  1,000     9           24               B              620

A Worked Example

For this example let's assume that:

α = 0.00000002 (i.e., 2 × 10⁻⁸), and
the initial weights are chosen from a uniform random distribution in the range [−0.2, 0.2].

Initial Weights
w[0]: −0.146   w[1]: 0.185   w[2]: −0.044   w[3]: 0.119

A Worked Example

Iteration 1

                                                          errorDelta(D, w[j])
ID   RENTAL PRICE    Pred.    Error   Squared Error      w[0]        w[1]       w[2]      w[3]
 1       320         93.26   226.74     51411.08       226.74   113370.05    906.96    1813.92
 2       380        107.41   272.59     74307.70       272.59   149926.92   1908.16   13629.72
 3       400        115.15   284.85     81138.96       284.85   176606.39   2563.64    1993.94
 4       390        119.21   270.79     73327.67       270.79   170598.22   1353.95    6498.98
 5       385        134.64   250.36     62682.22       250.36   166492.17   2002.91   25036.42
 6       410        130.31   279.69     78226.32       279.69   195782.78   1118.76    2237.52
 7       480        142.89   337.11    113639.88       337.11   259570.96   3371.05    2359.74
 8       600        168.32   431.68    186348.45       431.68   379879.24   5180.17   21584.05
 9       570        170.63   399.37    159499.37       399.37   367423.83   5591.23    3194.99
10       620        187.58   432.42    186989.95       432.42   432423.35   3891.81   10378.16
Sum                                   1067571.59      3185.61  2412073.90  27888.65   88727.43
Sum of squared errors (Sum/2)          533785.80

A Worked Example

w[j] \leftarrow w[j] + \alpha \underbrace{\sum_{i=1}^{n} \left((t_i - M_w(d_i)) \times d_i[j]\right)}_{errorDelta(D, w[j])}

Initial Weights
w[0]: −0.146   w[1]: 0.185   w[2]: −0.044   w[3]: 0.119

Example
w[1] ← 0.185 + 0.00000002 × 2,412,074 = 0.23324148

New Weights (Iteration 1)
w[0]: −0.146   w[1]: 0.233   w[2]: −0.043   w[3]: 0.121
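
The arithmetic of this update is easy to verify (a sketch; the delta value is read from the iteration 1 table):

alpha = 0.00000002
delta_w1 = 2412073.90           # errorDelta(D, w[1]) from the iteration 1 table
w1 = 0.185 + alpha * delta_w1
print(w1)                       # approx. 0.23324148, i.e. the new w[1] = 0.233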

A Worked Example

Iteration 2

                                                          errorDelta(D, w[j])
ID   RENTAL PRICE    Pred.    Error   Squared Error      w[0]        w[1]       w[2]      w[3]
 1       320        117.40   202.60     41047.92       202.60   101301.44    810.41    1620.82
 2       380        134.03   245.97     60500.69       245.97   135282.89   1721.78   12298.44
 3       400        145.08   254.92     64985.12       254.92   158051.51   2294.30    1784.45
 4       390        149.65   240.35     57769.68       240.35   151422.55   1201.77    5768.48
 5       385        166.90   218.10     47568.31       218.10   145037.57   1744.81   21810.16
 6       410        164.10   245.90     60468.86       245.90   172132.91    983.62    1967.23
 7       480        180.06   299.94     89964.69       299.94   230954.68   2999.41    2099.59
 8       600        210.87   389.13    151424.47       389.13   342437.01   4669.60   19456.65
 9       570        215.03   354.97    126003.34       354.97   326571.94   4969.57    2839.76
10       620        187.58   432.42    186989.95       432.42   432423.35   3891.81   10378.16
Sum                                    886723.04      2884.32  2195615.84  25287.08   80023.74
Sum of squared errors (Sum/2)          443361.52

A Worked Example

w[j] \leftarrow w[j] + \alpha \underbrace{\sum_{i=1}^{n} \left((t_i - M_w(d_i)) \times d_i[j]\right)}_{errorDelta(D, w[j])}

Initial Weights (Iteration 2)
w[0]: −0.146   w[1]: 0.233   w[2]: −0.043   w[3]: 0.121

Exercise: compute the update for w[1], with α = 0.00000002.

Solution:
w[1] ← 0.233 + 0.00000002 × 2195616.08 = 0.27691232

New Weights (Iteration 2)
w[0]: −0.145   w[1]: 0.277   w[2]: −0.043   w[3]: 0.123

A Worked Example

The algorithm then keeps iteratively applying the weight update rule until it converges on a stable set of weights, beyond which little improvement in model accuracy is possible.
After 100 iterations the final values for the weights are:

w[0] = −0.1513, w[1] = 0.6270, w[2] = −0.1781, w[3] = 0.0714

which results in a sum of squared errors value of 2,913.5.

A Worked Example

Note that a careful examination of the previous tables shows why such a low learning rate is used in this example.
The large values of the RENTAL PRICE feature, [320, 620], cause the squared errors and, in turn, the error delta values to become very large.
So a very low learning rate is required to ensure that the changes made to the weights at each iteration of the learning process are small enough for the algorithm to work effectively.
Using normalization on the features can help avoid these large squared errors, and will be done in most examples from now on; a sketch of range normalization follows below.
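
Range normalization rescales each feature into a fixed interval; a minimal sketch (assuming numpy and a target range of [0, 1]):

import numpy as np

def range_normalize(x, low=0.0, high=1.0):
    # map feature values linearly from [x.min(), x.max()] into [low, high]
    return low + (x - x.min()) * (high - low) / (x.max() - x.min())

sizes = np.array([500, 550, 620, 630, 665, 700, 770, 880, 920, 1000])
print(range_normalize(sizes))  # SIZE rescaled into [0, 1]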

1 Big Idea

2 Fundamentals
Simple Linear Regression
Measuring Error
Error Surfaces
3 Standard Approach: Multivariate Linear Regression
with Gradient Descent
Multivariate Linear Regression
Gradient Descent
Choosing Learning Rates & Initial Weights
A Worked Example

4 Summary

Slide Acknowledgment

The slides used in this course are based on the official textbook materials, with modifications made where necessary to suit the course requirements and enhance the learning experience.
