Day 3, Lecture 1: ML Regression
Outline:
- Non-Linear Relationships
- Fundamentals
- Simple Linear Regression
- Measuring Error
- Error Surfaces
- Standard Approach: Multivariate Linear Regression with Gradient Descent
- Multivariate Linear Regression
- Gradient Descent
- Choosing Learning Rates & Initial Weights
- Big Idea
Fundamentals
Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset.
Measuring Error
Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset. A collection of possible simple linear regression models capturing the relationship between these two features is also shown. For all models w[0] is set to 6.47. From top to bottom the models use 0.4, 0.5, 0.62, 0.7 and 0.8 respectively for w[1].
Measuring Error
Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset showing a candidate prediction model (with w[0] = 6.47 and w[1] = 0.62) and the resulting errors.
Measuring Error
$$L_2(\mathbb{M}_{\mathbf{w}}, \mathcal{D}) = \frac{1}{2} \sum_{i=1}^{n} \left( t_i - \mathbb{M}_{\mathbf{w}}(\mathbf{d}_i[1]) \right)^2 \tag{3}$$

$$= \frac{1}{2} \sum_{i=1}^{n} \left( t_i - (\mathbf{w}[0] + \mathbf{w}[1] \times \mathbf{d}_i[1]) \right)^2 \tag{4}$$
Measuring Error
Table: Calculating the sum of squared errors for the candidate model (with w[0] = 6.47 and w[1] = 0.62) making predictions for the office rentals dataset.
 ID   RENTAL PRICE   Model Prediction    Error   Squared Error
  1        320            316.79          3.21          10.32
  2        380            347.82         32.18       1,035.62
  3        400            391.26          8.74          76.32
  4        390            397.47         -7.47          55.80
  5        385            419.19        -34.19       1,169.13
  6        410            440.91        -30.91         955.73
  7        480            484.36         -4.36          19.01
  8        600            552.63         47.37       2,243.90
  9        570            577.46         -7.46          55.59
 10        620            627.11         -7.11          50.51
                                          Sum        5,671.64
               Sum of squared errors (Sum/2)         2,835.82
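The totals in this table can be reproduced directly from its RENTAL PRICE and Prediction columns. A minimal sketch (the final digits differ slightly from the table because the tabulated errors are rounded, while the table was computed at full precision):

```python
# Recompute the squared-error totals from the table's own columns.
prices = [320, 380, 400, 390, 385, 410, 480, 600, 570, 620]
preds = [316.79, 347.82, 391.26, 397.47, 419.19,
         440.91, 484.36, 552.63, 577.46, 627.11]

errors = [t - p for t, p in zip(prices, preds)]
sum_sq = sum(e ** 2 for e in errors)

print(f"sum of squared residuals: {sum_sq:.2f}")  # ~5671.56 (table: 5,671.64)
print(f"L2 = sum / 2:             {sum_sq / 2:.2f}")  # ~2835.78 (table: 2,835.82)
```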
Error Surfaces
Figure: (a) A 3D surface plot and (b) a contour plot of the error surface generated by plotting the sum of squared errors for the office rentals training set for each possible combination of values for w[0] (from the range [−10, 20]) and w[1] (from the range [−2, 3]).
Error Surfaces
$$\frac{\partial}{\partial \mathbf{w}[0]} \, \frac{1}{2} \sum_{i=1}^{n} \left( t_i - (\mathbf{w}[0] + \mathbf{w}[1] \times \mathbf{d}_i[1]) \right)^2 = 0 \tag{5}$$

and

$$\frac{\partial}{\partial \mathbf{w}[1]} \, \frac{1}{2} \sum_{i=1}^{n} \left( t_i - (\mathbf{w}[0] + \mathbf{w}[1] \times \mathbf{d}_i[1]) \right)^2 = 0 \tag{6}$$
There are a number of different ways to find this point.
We will describe a guided search approach known as the
gradient descent algorithm.
RENTAL PRICE = ?
Gradient Descent
Figure: (a) A 3D surface plot and (b) a contour plot of the same error surface. The lines indicate the paths that the gradient descent algorithm would take across this error surface from different starting positions to the global minimum, marked as the white dot in the centre.
The journey that the gradient descent algorithm takes across the error surface when training the simple version of the office rentals example (involving just SIZE and RENTAL PRICE) is shown below.
Figure: (a) A 3D surface plot and (b) a contour plot of the error surface for the office rentals dataset showing the path that the gradient descent algorithm takes towards the best-fit model.
Figure: Snapshots of the candidate model (RENTAL PRICE vs. SIZE) at successive iterations of gradient descent, alongside the sum of squared errors plotted against training iteration.
Gradient Descent
$$\frac{\partial}{\partial \mathbf{w}[j]} L_2(\mathbb{M}_{\mathbf{w}}, \mathcal{D}) = \frac{\partial}{\partial \mathbf{w}[j]} \, \frac{1}{2} \left( t - \mathbb{M}_{\mathbf{w}}(\mathbf{d}) \right)^2 \tag{14}$$

$$= (t - \mathbb{M}_{\mathbf{w}}(\mathbf{d})) \times \frac{\partial}{\partial \mathbf{w}[j]} \left( t - \mathbb{M}_{\mathbf{w}}(\mathbf{d}) \right) \tag{15}$$

$$= (t - \mathbb{M}_{\mathbf{w}}(\mathbf{d})) \times \frac{\partial}{\partial \mathbf{w}[j]} \left( t - (\mathbf{w} \cdot \mathbf{d}) \right) \tag{16}$$

$$= (t - \mathbb{M}_{\mathbf{w}}(\mathbf{d})) \times (-\mathbf{d}[j]) \tag{17}$$
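To make the descent concrete, here is a sketch of batch gradient descent for the simple model Mw(d) = w[0] + w[1] × d[1], stepping against the gradient in Equation (17) via the update w[j] ← w[j] + α × errorDelta(D, w[j]) used in the worked example below. The data, initial weights, learning rate, and iteration count here are illustrative placeholders, not the slide values.

```python
# A sketch of batch gradient descent for the simple linear model
# M_w(d) = w0 + w1 * x. All numbers below are illustrative placeholders.
def gradient_descent(xs, ts, w0=0.0, w1=0.0, alpha=1e-7, iterations=100):
    for _ in range(iterations):
        errors = [t - (w0 + w1 * x) for x, t in zip(xs, ts)]
        # errorDelta(D, w[j]) = sum_i (t_i - M_w(d_i)) * d_i[j]
        delta_w0 = sum(errors)                          # d_i[0] = 1 (bias)
        delta_w1 = sum(e * x for e, x in zip(errors, xs))
        # Descend: w[j] <- w[j] + alpha * errorDelta(D, w[j])
        w0 += alpha * delta_w0
        w1 += alpha * delta_w1
    return w0, w1

# Hypothetical usage with made-up (size, rental price) pairs:
sizes = [500, 700, 900]
rents = [320, 480, 570]
print(gradient_descent(sizes, rents))
```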
Gradient Descent
Figure: Plots of the journeys made across the error surface for the
simple office rentals prediction problem for different learning rates: (a)
a very small learning rate (0.002), (b) a medium learning rate (0.08)
and (c) a very large learning rate (0.18).
A Worked Example
Iteration 1
                                                       errorDelta(D, w[j])
 ID   RENTAL PRICE    Pred.    Error   Squared Error      w[0]        w[1]        w[2]       w[3]
  1        320         93.26   226.74     51411.08        226.74   113370.05     906.96    1813.92
  2        380        107.41   272.59     74307.70        272.59   149926.92    1908.16   13629.72
  3        400        115.15   284.85     81138.96        284.85   176606.39    2563.64    1993.94
  4        390        119.21   270.79     73327.67        270.79   170598.22    1353.95    6498.98
  5        385        134.64   250.36     62682.22        250.36   166492.17    2002.91   25036.42
  6        410        130.31   279.69     78226.32        279.69   195782.78    1118.76    2237.52
  7        480        142.89   337.11    113639.88        337.11   259570.96    3371.05    2359.74
  8        600        168.32   431.68    186348.45        431.68   379879.24    5180.17   21584.05
  9        570        170.63   399.37    159499.37        399.37   367423.83    5591.23    3194.99
 10        620        187.58   432.42    186989.95        432.42   432423.35    3891.81   10378.16
                         Sum             1067571.59      3185.61  2412073.90   27888.65   88727.43
      Sum of squared errors (Sum/2)       533785.80
A Worked Example
$$\mathbf{w}[j] \leftarrow \mathbf{w}[j] + \alpha \underbrace{\sum_{i=1}^{n} \left( (t_i - \mathbb{M}_{\mathbf{w}}(\mathbf{d}_i)) \times \mathbf{d}_i[j] \right)}_{errorDelta(\mathcal{D},\, \mathbf{w}[j])}$$
Initial Weights
w[0]: -0.146 w[1]: 0.185 w[2]: -0.044 w[3]: 0.119
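With these initial weights, one pass of the update rule can be checked by hand. A small sketch using the errorDelta sums from the Iteration 1 table above and the learning rate α = 0.00000002 quoted in the exercise below (the results are only as precise as the tabulated sums):

```python
# One batch update w[j] <- w[j] + alpha * errorDelta(D, w[j]), using the
# initial weights above and the errorDelta sums from the Iteration 1 table.
w = [-0.146, 0.185, -0.044, 0.119]
error_delta = [3185.61, 2412073.90, 27888.65, 88727.43]
alpha = 0.00000002  # the learning rate quoted in the exercise below

w = [wj + alpha * d for wj, d in zip(w, error_delta)]
print(w)  # ~[-0.1459, 0.2332, -0.0434, 0.1208]
```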
A Worked Example
Iteration 2
                                                       errorDelta(D, w[j])
 ID   RENTAL PRICE    Pred.    Error   Squared Error      w[0]        w[1]        w[2]       w[3]
  1        320        117.40   202.60     41047.92        202.60   101301.44     810.41    1620.82
  2        380        134.03   245.97     60500.69        245.97   135282.89    1721.78   12298.44
  3        400        145.08   254.92     64985.12        254.92   158051.51    2294.30    1784.45
  4        390        149.65   240.35     57769.68        240.35   151422.55    1201.77    5768.48
  5        385        166.90   218.10     47568.31        218.10   145037.57    1744.81   21810.16
  6        410        164.10   245.90     60468.86        245.90   172132.91     983.62    1967.23
  7        480        180.06   299.94     89964.69        299.94   230954.68    2999.41    2099.59
  8        600        210.87   389.13    151424.47        389.13   342437.01    4669.60   19456.65
  9        570        215.03   354.97    126003.34        354.97   326571.94    4969.57    2839.76
 10        620        187.58   432.42    186989.95        432.42   432423.35    3891.81   10378.16
                         Sum              886723.04      2884.32  2195615.84   25287.08   80023.74
      Sum of squared errors (Sum/2)       443361.52
A Worked Example
$$\mathbf{w}[j] \leftarrow \mathbf{w}[j] + \alpha \underbrace{\sum_{i=1}^{n} \left( (t_i - \mathbb{M}_{\mathbf{w}}(\mathbf{d}_i)) \times \mathbf{d}_i[j] \right)}_{errorDelta(\mathcal{D},\, \mathbf{w}[j])}$$
Exercise
$$\mathbf{w}[1] \leftarrow\ ?, \qquad \alpha = 0.00000002$$
All the data points above the decision boundary will result
in a negative value when plugged into the decision
boundary equation, while all data points below the decision
boundary will result in a positive value.
Figure: (a) A surface showing the value of Equation (6) for all values of RPM and VIBRATION, with the decision boundary given in Equation (6) highlighted. (b) The same surface linearly thresholded at zero to operate as a predictor.
The logistic function:

$$\mathrm{Logistic}(x) = \frac{1}{1 + e^{-x}} \tag{8}$$

where x is a numeric value and e is Euler's number, approximately equal to 2.7183.
Figure: A plot of the logistic function, Logistic(x), for x in the range [−10, 10].
$$\mathbb{M}_{\mathbf{w}}(\mathbf{d}) = \mathrm{Logistic}(\mathbf{w} \cdot \mathbf{d}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{d}}} \tag{9}$$

Example

$$\mathbb{M}_{\mathbf{w}}(\langle \text{RPM}, \text{VIBRATION} \rangle) = \frac{1}{1 + e^{-(-0.4077 + 4.1697 \times \text{RPM} + 6.0460 \times \text{VIBRATION})}}$$
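A small sketch of this model in code. The nearby plots suggest RPM and VIBRATION are range-normalized to [−1, 1] before being fed to the model, so the inputs here are assumed to be already normalized, and the sample inputs are made up.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# The logistic regression model from the slide; RPM and VIBRATION are
# assumed to be range-normalized to [-1, 1], as in the nearby plots.
def predict(rpm, vibration):
    return logistic(-0.4077 + 4.1697 * rpm + 6.0460 * vibration)

# Made-up inputs: outputs above 0.5 fall on one side of the decision
# boundary, outputs below 0.5 on the other.
print(predict(0.2, 0.4))    # ~0.94
print(predict(-0.3, -0.5))  # ~0.01
```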
Figure: The logistic regression decision boundary over the normalized RPM and VIBRATION features at successive iterations of training, alongside the sum of squared errors plotted against training iteration.
$$\mathbf{w}[j] \leftarrow \mathbf{w}[j] + \alpha \times \sum_{i=1}^{n} \left( (t_i - \mathbb{M}_{\mathbf{w}}(\mathbf{d}_i)) \times \mathbb{M}_{\mathbf{w}}(\mathbf{d}_i) \times (1 - \mathbb{M}_{\mathbf{w}}(\mathbf{d}_i)) \times \mathbf{d}_i[j] \right)$$
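A sketch of a training loop implementing this update rule. The learning rate, iteration count, and the convention that each instance starts with a bias term of 1.0 and is range-normalized are assumptions for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# A sketch of batch gradient descent for logistic regression using the
# update rule above. Each row of X is assumed to start with a bias term
# of 1.0 and to be range-normalized; alpha and iterations are illustrative.
def train_logistic(X, t, alpha=0.02, iterations=1000):
    w = [0.0] * len(X[0])
    for _ in range(iterations):
        deltas = [0.0] * len(w)
        for d, ti in zip(X, t):
            p = logistic(sum(wj * dj for wj, dj in zip(w, d)))
            # (t_i - M_w(d_i)) * M_w(d_i) * (1 - M_w(d_i)) * d_i[j]
            g = (ti - p) * p * (1.0 - p)
            for j, dj in enumerate(d):
                deltas[j] += g * dj
        w = [wj + alpha * dj for wj, dj in zip(w, deltas)]
    return w
```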
ID   RPM   VIBRATION   STATUS        ID   RPM   VIBRATION   STATUS
1 498 604 faulty 35 501 463 good
2 517 594 faulty 36 526 443 good
3 541 574 faulty 37 536 412 good
4 555 587 faulty 38 564 394 good
5 572 537 faulty 39 584 398 good
6 600 553 faulty 40 602 398 good
7 621 482 faulty 41 610 428 good
8 632 539 faulty 42 638 389 good
9 656 476 faulty 43 652 394 good
10 653 554 faulty 44 659 336 good
11 679 516 faulty 45 662 364 good
12 688 524 faulty 46 672 308 good
13 684 450 faulty 47 691 248 good
14 699 512 faulty 48 694 401 good
15 703 505 faulty 49 718 313 good
16 717 377 faulty 50 720 410 good
17 740 377 faulty 51 723 389 good
18 749 501 faulty 52 744 227 good
19 756 492 faulty 53 741 397 good
20 752 381 faulty 54 770 200 good
21 762 508 faulty 55 764 370 good
22 781 474 faulty 56 790 248 good
23 781 480 faulty 57 786 344 good
24 804 460 faulty 58 792 290 good
25 828 346 faulty 59 818 268 good
26 830 366 faulty 60 845 232 good
27 864 344 faulty 61 867 195 good
28 882 403 faulty 62 878 168 good
29 891 338 faulty 63 895 218 good
30 921 362 faulty 64 916 221 good
31 941 301 faulty 65 950 156 good
32 965 336 faulty 66 956 174 good
33 976 297 faulty 67 973 134 good
34 994 287 faulty 68 1002 121 good
Figure: A scatter plot of the RPM and VIBRATION features from the generators dataset, snapshots of the decision boundary over the normalized features at successive iterations of training, and the sum of squared errors plotted against training iteration.
$$\mathbb{M}_{\mathbf{w}}(\langle \text{RPM}, \text{VIBRATION} \rangle) = \frac{1}{1 + e^{-(-0.4077 + 4.1697 \times \text{RPM} + 6.0460 \times \text{VIBRATION})}}$$
Modeling Non-Linear Relationships
Figure: A scatter plot of the RAIN and GROWTH features from the grass growth dataset.
The best linear model we can learn for this data is:

$$\text{GROWTH} = 13.510 - 0.667 \times \text{RAIN}$$

Figure: This best-fit linear model overlaid on the scatter plot of the RAIN and GROWTH features.
We can do better by introducing a set of basis functions and learning a model that is linear in them:

$$\text{GROWTH} = \sum_{j=0}^{2} \mathbf{w}[j]\, \phi_j(\text{RAIN})$$

where

$$\phi_0(\text{RAIN}) = 1, \qquad \phi_1(\text{RAIN}) = \text{RAIN}, \qquad \phi_2(\text{RAIN}) = \text{RAIN}^2$$
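In code, the only change from ordinary linear regression is that each raw RAIN value is first pushed through the basis functions; a minimal sketch (the weight values would come from training):

```python
# Basis expansion: each raw RAIN value becomes (phi_0, phi_1, phi_2)
# = (1, rain, rain**2). The model is still linear in w, so it can be
# trained with exactly the same gradient descent machinery as before.
def phi(rain):
    return [1.0, rain, rain ** 2]

def predict_growth(w, rain):
    return sum(wj * pj for wj, pj in zip(w, phi(rain)))
```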
Figure: Snapshots of the basis-function model over the normalized RAIN and GROWTH features at successive iterations of training.
Figure: A scatter plot of the P20 and P45 features from the EEG dataset. 'Positive' images are shown as crosses, and 'negative' images are shown as triangles.
$$\mathbb{M}_{\mathbf{w}}(\mathbf{d}) = \frac{1}{1 + e^{-\sum_{j=0}^{b} \mathbf{w}[j]\, \phi_j(\mathbf{d})}} \tag{11}$$
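A sketch of Equation (11): the logistic model composed with an arbitrary basis. The basis passed in here is a placeholder; for the EEG example it would be a higher-order basis over P20 and P45.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Equation (11): logistic regression over basis functions. `basis` maps an
# instance d to [phi_0(d), phi_1(d), ..., phi_b(d)]; the particular basis
# and weights supplied by the caller are placeholders in this sketch.
def predict(w, d, basis):
    return logistic(sum(wj * pj for wj, pj in zip(w, basis(d))))
```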
Figure: Snapshots of the decision boundary over the normalized P20 and P45 features from the EEG dataset at successive iterations of training.