
Fundamentals of Machine Learning


Error-based Learning Using Regression
1. Big Idea
2. Fundamentals
     Simple Linear Regression
     Measuring Error
     Error Surfaces
3. Standard Approach: Multivariate Linear Regression with Gradient Descent
     Multivariate Linear Regression
     Gradient Descent
     Choosing Learning Rates & Initial Weights
4. Handling Categorical Descriptive Features
5. Handling Categorical Target Features: Logistic Regression
6. Modeling Non-linear Relationships
Big Idea

A parameterised prediction model is initialised with a set of random parameters, and an error function is used to judge how well this initial model performs when making predictions for instances in a training dataset.
Based on the value of the error function, the parameters are iteratively adjusted to create a more and more accurate model.
Fundamentals

Simple Linear Regression

Table: The office rentals dataset: a dataset that includes office rental prices and a number of descriptive features for 10 Dublin city-centre offices.

ID   SIZE   FLOOR   BROADBAND RATE   ENERGY RATING   RENTAL PRICE
 1    500       4                8               C            320
 2    550       7               50               A            380
 3    620       9                7               A            400
 4    630       5               24               B            390
 5    665       8              100               C            385
 6    700       4                8               B            410
 7    770      10                7               B            480
 8    880      12               50               A            600
 9    920      14                8               C            570
10  1,000       9               24               B            620

Simple Linear Regression

Table: The office rentals dataset: a dataset that includes office rental prices and a number of descriptive features for 10 Dublin city-centre offices.

ID   SIZE   RENTAL PRICE
 1    500            320
 2    550            380
 3    620            400
 4    630            390
 5    665            385
 6    700            410
 7    770            480
 8    880            600
 9    920            570
10  1,000            620

Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset.

Simple Linear Regression

From the scatter plot it appears that there is a linear relationship between SIZE and RENTAL PRICE.
The equation of a line can be written as:

y = m x + b    (1)

Simple Linear Regression

The scatter plot below shows the same scatter plot as shown in Figure 1 [8] with a simple linear model added to capture the relationship between office sizes and office rental prices.
This model is:

RENTAL PRICE = 6.47 + 0.62 × SIZE

Figure: The office rentals scatter plot (SIZE vs. RENTAL PRICE) with the simple linear model overlaid.

Simple Linear Regression

RENTAL PRICE = 6.47 + 0.62 × SIZE

Using this model, determine the expected rental price of the 730 square foot office:

Simple Linear Regression

RENTAL PRICE = 6.47 + 0.62 × SIZE

Using this model, determine the expected rental price of the 730 square foot office:

RENTAL PRICE = 6.47 + 0.62 × 730
             = 459.07

Simple Linear Regression

M_w(d) = w[0] + w[1] × d[1]    (2)

Measuring Error


Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset. A collection of possible simple linear regression models capturing the relationship between these two features is also shown. For all models w[0] is set to 6.47. From top to bottom, the models use 0.4, 0.5, 0.62, 0.7 and 0.8 respectively for w[1].

Measuring Error


Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset showing a candidate prediction model (with w[0] = 6.47 and w[1] = 0.62) and the resulting errors.

Measuring Error

L_2(M_w, D) = \frac{1}{2} \sum_{i=1}^{n} (t_i - M_w(d_i[1]))^2    (3)

            = \frac{1}{2} \sum_{i=1}^{n} (t_i - (w[0] + w[1] × d_i[1]))^2    (4)

Measuring Error

Table: Calculating the sum of squared errors for the candidate model (with w[0] = 6.47 and w[1] = 0.62) making predictions for the office rentals dataset.

ID   RENTAL PRICE   Model Prediction     Error   Squared Error
 1            320             316.79      3.21           10.32
 2            380             347.82     32.18        1,035.62
 3            400             391.26      8.74           76.32
 4            390             397.47     -7.47           55.80
 5            385             419.19    -34.19        1,169.13
 6            410             440.91    -30.91          955.73
 7            480             484.36     -4.36           19.01
 8            600             552.63     47.37        2,243.90
 9            570             577.46     -7.46           55.59
10            620             627.11     -7.11           50.51
                                          Sum          5,671.64
                      Sum of squared errors (Sum/2)    2,835.82
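
As an aside, here is a minimal Python sketch (not part of the original slides) of this calculation, using the rounded weights w[0] = 6.47 and w[1] = 0.62; because the table was produced with unrounded weights, the result will be close to, but not exactly, 2,835.82.

sizes  = [500, 550, 620, 630, 665, 700, 770, 880, 920, 1000]
prices = [320, 380, 400, 390, 385, 410, 480, 600, 570, 620]
w0, w1 = 6.47, 0.62

def predict(size):
    # simple linear regression: RENTAL PRICE = w[0] + w[1] * SIZE
    return w0 + w1 * size

# L2: half the sum of squared errors over the training set, as in Equation (4)
sse = 0.5 * sum((t - predict(d)) ** 2 for d, t in zip(sizes, prices))
print(round(sse, 2))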

Error Surfaces

For every possible combination of weights, w[0] and w[1], there is a corresponding sum of squared errors value; joining these values together produces a surface.


Figure: (a) A 3D surface plot and (b) a contour plot of the error
surface generated by plotting the sum of squared errors value for the
office rentals training set for each possible combination of values for
w[0] (from the range [−10, 20]) and w[1] (from the range [−2, 3]).

Error Surfaces

The x-y plane is known as a weight space and the surface is known as an error surface.
The model that best fits the training data is the model corresponding to the lowest point on the error surface.

Error Surfaces

Using Equation (4) [16] we can formally define this point on the error surface as the point at which:

\frac{\partial}{\partial w[0]} \frac{1}{2} \sum_{i=1}^{n} (t_i - (w[0] + w[1] × d_i[1]))^2 = 0    (5)

and

\frac{\partial}{\partial w[1]} \frac{1}{2} \sum_{i=1}^{n} (t_i - (w[0] + w[1] × d_i[1]))^2 = 0    (6)

There are a number of different ways to find this point.
We will describe a guided search approach known as the gradient descent algorithm.
Standard Approach: Multivariate Linear Regression with Gradient Descent

Multivariate Linear Regression

Table: A dataset that includes office rental prices and a number of descriptive features for 10 Dublin city-center offices.

ID   SIZE   FLOOR   BROADBAND RATE   ENERGY RATING   RENTAL PRICE
 1    500       4                8               C            320
 2    550       7               50               A            380
 3    620       9                7               A            400
 4    630       5               24               B            390
 5    665       8              100               C            385
 6    700       4                8               B            410
 7    770      10                7               B            480
 8    880      12               50               A            600
 9    920      14                8               C            570
10  1,000       9               24               B            620

Multivariate Linear Regression

We can define a multivariate linear regression model as:

M_w(d) = w[0] + w[1] × d[1] + ··· + w[m] × d[m]    (7)

       = w[0] + \sum_{j=1}^{m} w[j] × d[j]    (8)

Multivariate Linear Regression

We can make Equation (8) [23] look a little neater by inventing a dummy descriptive feature, d[0], that is always equal to 1:

M_w(d) = w[0] × d[0] + w[1] × d[1] + ... + w[m] × d[m]    (9)

       = \sum_{j=0}^{m} w[j] × d[j]    (10)

       = w · d    (11)

Multivariate Linear Regression

The definition of the sum of squared errors loss function, L2, that we gave in Equation (4) [16] changes only very slightly to reflect the new regression equation:

L_2(M_w, D) = \frac{1}{2} \sum_{i=1}^{n} (t_i - M_w(d_i))^2    (12)

            = \frac{1}{2} \sum_{i=1}^{n} (t_i - (w · d_i))^2    (13)

Multivariate Linear Regression

This multivariate model allows us to include all but one of the descriptive features in Table 3 [17] in a regression model to predict office rental prices.
The resulting multivariate regression model equation is:

RENTAL PRICE = w[0] + w[1] × SIZE + w[2] × FLOOR + w[3] × BROADBAND RATE

Multivariate Linear Regression

We will see in the next section how the best-fit set of weights for this equation is found, but for now we will set:
w[0] = −0.1513, w[1] = 0.6270, w[2] = −0.1781, w[3] = 0.0714.
This means that the model is rewritten as:

RENTAL PRICE = −0.1513 + 0.6270 × SIZE − 0.1781 × FLOOR + 0.0714 × BROADBAND RATE

Multivariate Linear Regression

Using this model:

RENTAL PRICE = −0.1513 + 0.6270 × SIZE − 0.1781 × FLOOR + 0.0714 × BROADBAND RATE

we can, for example, predict the expected rental price of a 690 square foot office on the 11th floor of a building with a broadband rate of 50 Mb per second as:

RENTAL PRICE = ?

Multivariate Linear Regression

Using this model:

RENTAL PRICE = −0.1513 + 0.6270 × SIZE − 0.1781 × FLOOR + 0.0714 × BROADBAND RATE

we can, for example, predict the expected rental price of a 690 square foot office on the 11th floor of a building with a broadband rate of 50 Mb per second as:

RENTAL PRICE = −0.1513 + 0.6270 × 690 − 0.1781 × 11 + 0.0714 × 50
             = 434.0896
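
A small Python sketch (an illustration, not from the slides) of the same prediction written as the dot product w · d from Equation (11), with the dummy feature d[0] = 1 absorbing the intercept:

w = [-0.1513, 0.6270, -0.1781, 0.0714]   # w[0] .. w[3]
d = [1, 690, 11, 50]                     # [dummy, SIZE, FLOOR, BROADBAND RATE]

rental_price = sum(wj * dj for wj, dj in zip(w, d))
print(round(rental_price, 4))            # approximately 434.0896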

Gradient Descent

Figure: (a) A 3D surface plot and (b) a contour plot of the same error surface. The lines indicate the path that the gradient descent algorithm would take across this error surface from different starting positions to the global minimum, marked as the white dot in the centre.

The next figure shows the journey across the error surface that is taken by the gradient descent algorithm when training the simple version of the office rentals example, involving just SIZE and RENTAL PRICE.

Figure: (a) A 3D surface plot and (b) a contour plot of the error surface for the office rentals dataset showing the path that the gradient descent algorithm takes towards the best fit model.

Figure: A selection of the simple linear regression models developed during the gradient descent process for the office rentals dataset. The final panel shows the sum of squared error values generated during the gradient descent process.

Gradient Descent

Require: set of training instances D
Require: a learning rate α that controls how quickly the algorithm converges
Require: a function, errorDelta, that determines the direction in which to adjust a given weight, w[j], so as to move down the slope of an error surface determined by the dataset, D
Require: a convergence criterion that indicates that the algorithm has completed

1: w ← random starting point in the weight space
2: repeat
3:   for each w[j] in w do
4:     w[j] ← w[j] + α × errorDelta(D, w[j])
5:   end for
6: until convergence occurs

The gradient descent algorithm for training multivariate linear regression models.

Gradient Descent

The most important part of the gradient descent algorithm is line 4, on which the weights are updated.

w[j] ← w[j] + α × errorDelta(D, w[j])

Each weight is considered independently, and for each one a small adjustment is made by adding a small delta value to the current weight, w[j].
This adjustment should ensure that the change in the weight leads to a move downwards on the error surface.

Gradient Descent

Imagine for a moment that our training dataset, D, contains just one training example: (d, t).
The gradient of the error surface is given as the partial derivative of L2 with respect to each weight, w[j]:

\frac{\partial}{\partial w[j]} L_2(M_w, D) = \frac{\partial}{\partial w[j]} \frac{1}{2} (t - M_w(d))^2    (14)

  = (t - M_w(d)) × \frac{\partial}{\partial w[j]} (t - M_w(d))    (15)

  = (t - M_w(d)) × \frac{\partial}{\partial w[j]} (t - (w · d))    (16)

  = (t - M_w(d)) × (−d[j])    (17)

Gradient Descent

Adjusting the calculation to take into account multiple training instances:

\frac{\partial}{\partial w[j]} L_2(M_w, D) = \sum_{i=1}^{n} ((t_i - M_w(d_i)) × d_i[j])

We use this equation to define errorDelta in our gradient descent algorithm:

w[j] ← w[j] + α \underbrace{\sum_{i=1}^{n} ((t_i - M_w(d_i)) × d_i[j])}_{errorDelta(D, w[j])}
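
A minimal Python sketch of the full algorithm, following the pseudocode and the errorDelta definition above. It is illustrative rather than definitive: the fixed iteration count standing in for the convergence criterion, and the variable names, are assumptions.

import random

def predict(w, d):
    # M_w(d) = w . d, where d[0] is the dummy feature fixed at 1
    return sum(wj * dj for wj, dj in zip(w, d))

def error_delta(D, targets, w, j):
    # errorDelta(D, w[j]) = sum_i (t_i - M_w(d_i)) * d_i[j]
    return sum((t - predict(w, d)) * d[j] for d, t in zip(D, targets))

def gradient_descent(D, targets, alpha, iterations=100):
    m = len(D[0])
    w = [random.uniform(-0.2, 0.2) for _ in range(m)]        # random starting point
    for _ in range(iterations):                              # stand-in convergence criterion
        deltas = [error_delta(D, targets, w, j) for j in range(m)]
        w = [wj + alpha * dj for wj, dj in zip(w, deltas)]   # line 4 of the pseudocode
    return w

# office rentals data (continuous features only) with a dummy feature prepended
D = [[1, 500, 4, 8], [1, 550, 7, 50], [1, 620, 9, 7], [1, 630, 5, 24],
     [1, 665, 8, 100], [1, 700, 4, 8], [1, 770, 10, 7], [1, 880, 12, 50],
     [1, 920, 14, 8], [1, 1000, 9, 24]]
targets = [320, 380, 400, 390, 385, 410, 480, 600, 570, 620]
w = gradient_descent(D, targets, alpha=0.00000002)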

Choosing Learning Rates & Initial Weights

The learning rate, α, determines the size of the adjustment made to each weight at each step in the process.
Unfortunately, choosing learning rates is not a well-defined science.
Most practitioners use rules of thumb and trial and error.

Choosing Learning Rates & Initial Weights

Figure: Plots of the journeys made across the error surface for the simple office rentals prediction problem for different learning rates: (a) a very small learning rate (0.002), (b) a medium learning rate (0.08), and (c) a very large learning rate (0.18).

Choosing Learning Rates & Initial Weights

A typical range for learning rates is [0.00001, 10].
Based on empirical evidence, choosing random initial weights uniformly from the range [−0.2, 0.2] tends to work well.

A Worked Example

We are now in a position to build a linear regression model that uses all of the continuous descriptive features in the office rentals dataset.
The general structure of the model is:

RENTAL PRICE = w[0] + w[1] × SIZE + w[2] × FLOOR + w[3] × BROADBAND RATE

A Worked Example

Table: The office rentals dataset: a dataset that includes office rental prices and a number of descriptive features for 10 Dublin city-centre offices.

ID   SIZE   FLOOR   BROADBAND RATE   ENERGY RATING   RENTAL PRICE
 1    500       4                8               C            320
 2    550       7               50               A            380
 3    620       9                7               A            400
 4    630       5               24               B            390
 5    665       8              100               C            385
 6    700       4                8               B            410
 7    770      10                7               B            480
 8    880      12               50               A            600
 9    920      14                8               C            570
10  1,000       9               24               B            620

A Worked Example

For this example let's assume that:

α = 0.00000002

Initial Weights
w[0]: -0.146   w[1]: 0.185   w[2]: -0.044   w[3]: 0.119

A Worked Example

Iteration 1

                                                          errorDelta(D, w[j])
ID   RENTAL PRICE    Pred.    Error   Squared Error     w[0]        w[1]       w[2]       w[3]
 1            320    93.26   226.74        51411.08   226.74   113370.05     906.96    1813.92
 2            380   107.41   272.59        74307.70   272.59   149926.92    1908.16   13629.72
 3            400   115.15   284.85        81138.96   284.85   176606.39    2563.64    1993.94
 4            390   119.21   270.79        73327.67   270.79   170598.22    1353.95    6498.98
 5            385   134.64   250.36        62682.22   250.36   166492.17    2002.91   25036.42
 6            410   130.31   279.69        78226.32   279.69   195782.78    1118.76    2237.52
 7            480   142.89   337.11       113639.88   337.11   259570.96    3371.05    2359.74
 8            600   168.32   431.68       186348.45   431.68   379879.24    5180.17   21584.05
 9            570   170.63   399.37       159499.37   399.37   367423.83    5591.23    3194.99
10            620   187.58   432.42       186989.95   432.42   432423.35    3891.81   10378.16
Sum                                      1067571.59  3185.61  2412073.90   27888.65   88727.43
Sum of squared errors (Sum/2)             533785.80

A Worked Example

w[j] ← w[j] + α \underbrace{\sum_{i=1}^{n} ((t_i - M_w(d_i)) × d_i[j])}_{errorDelta(D, w[j])}

Initial Weights
w[0]: -0.146   w[1]: 0.185   w[2]: -0.044   w[3]: 0.119

Example

w[1] ← 0.185 + 0.00000002 × 2,412,074 = 0.23324148

New Weights (Iteration 1)
w[0]: -0.146   w[1]: 0.233   w[2]: -0.043   w[3]: 0.121

A Worked Example

Iteration 2

                                                          errorDelta(D, w[j])
ID   RENTAL PRICE    Pred.    Error   Squared Error     w[0]        w[1]       w[2]       w[3]
 1            320   117.40   202.60        41047.92   202.60   101301.44     810.41    1620.82
 2            380   134.03   245.97        60500.69   245.97   135282.89    1721.78   12298.44
 3            400   145.08   254.92        64985.12   254.92   158051.51    2294.30    1784.45
 4            390   149.65   240.35        57769.68   240.35   151422.55    1201.77    5768.48
 5            385   166.90   218.10        47568.31   218.10   145037.57    1744.81   21810.16
 6            410   164.10   245.90        60468.86   245.90   172132.91     983.62    1967.23
 7            480   180.06   299.94        89964.69   299.94   230954.68    2999.41    2099.59
 8            600   210.87   389.13       151424.47   389.13   342437.01    4669.60   19456.65
 9            570   215.03   354.97       126003.34   354.97   326571.94    4969.57    2839.76
10            620   187.58   432.42       186989.95   432.42   432423.35    3891.81   10378.16
Sum                                       886723.04  2884.32  2195615.84   25287.08   80023.74
Sum of squared errors (Sum/2)             443361.52

A Worked Example

w[j] ← w[j] + α \underbrace{\sum_{i=1}^{n} ((t_i - M_w(d_i)) × d_i[j])}_{errorDelta(D, w[j])}

Initial Weights (Iteration 2)
w[0]: -0.146   w[1]: 0.233   w[2]: -0.043   w[3]: 0.121

Exercise
w[1] ← ?,   α = 0.00000002

New Weights (Iteration 2)
w[0]: ?   w[1]: ?   w[2]: ?   w[3]: ?

A Worked Example

w[j] ← w[j] + α \underbrace{\sum_{i=1}^{n} ((t_i - M_w(d_i)) × d_i[j])}_{errorDelta(D, w[j])}

Initial Weights (Iteration 2)
w[0]: -0.146   w[1]: 0.233   w[2]: -0.043   w[3]: 0.121

Exercise

w[1] ← 0.233 + 0.00000002 × 2,195,616.08 = 0.27691232

New Weights (Iteration 2)
w[0]: -0.145   w[1]: 0.277   w[2]: -0.043   w[3]: 0.123

A Worked Example

The algorithm then keeps iteratively applying the weight update rule until it converges on a stable set of weights, beyond which little improvement in model accuracy is possible.
After 100 iterations the final values for the weights are:
w[0] = −0.1513, w[1] = 0.6270, w[2] = −0.1781, w[3] = 0.0714,
which results in a sum of squared errors value of 2,913.5.

Learning rate decay allows the learning rate to start at a large value and then decay over time according to a predefined schedule.
A good approach is to use the following decay schedule:

α_τ = α_0 \frac{c}{c + τ}    (4)
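
A small Python sketch of this schedule; the values of α_0 and c are illustrative assumptions, not taken from the slides.

def decayed_learning_rate(alpha_0, c, tau):
    # alpha_tau = alpha_0 * c / (c + tau), as in Equation (4)
    return alpha_0 * c / (c + tau)

alpha_0, c = 0.1, 10
rates = [decayed_learning_rate(alpha_0, c, tau) for tau in range(5)]
# the learning rate shrinks over the iterations: 0.1, 0.0909..., 0.0833..., ...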
Handling Categorical Descriptive Features

The basic structure of the multivariate linear regression model allows for only continuous descriptive features, so we need a way to handle categorical descriptive features.
The most common approach to handling categorical features uses a transformation that converts a single categorical descriptive feature into a number of continuous descriptive feature values that can encode the levels of the categorical feature.
For example, the ENERGY RATING descriptive feature would be converted into three new continuous descriptive features, as it has three distinct levels: 'A', 'B', and 'C'.

Table: The office rentals dataset adjusted to handle the categorical ENERGY RATING descriptive feature in linear regression models.

ID   SIZE   FLOOR   BROADBAND RATE   ENERGY RATING A   ENERGY RATING B   ENERGY RATING C   RENTAL PRICE
 1    500       4                8                 0                 0                 1            320
 2    550       7               50                 1                 0                 0            380
 3    620       9                7                 1                 0                 0            400
 4    630       5               24                 0                 1                 0            390
 5    665       8              100                 0                 0                 1            385
 6    700       4                8                 0                 1                 0            410
 7    770      10                7                 0                 1                 0            480
 8    880      12               50                 1                 0                 0            600
 9    920      14                8                 0                 0                 1            570
10  1,000       9               24                 0                 1                 0            620

Returning to our example, the regression equation for this RENTAL PRICE model would change to:

RENTAL PRICE = w[0] + w[1] × SIZE + w[2] × FLOOR
             + w[3] × BROADBAND RATE
             + w[4] × ENERGY RATING A
             + w[5] × ENERGY RATING B
             + w[6] × ENERGY RATING C

where the newly added categorical features allow the original ENERGY RATING feature to be included.
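
A minimal sketch of this transformation in Python (the helper name is illustrative, not from the slides): one categorical feature with levels 'A', 'B' and 'C' becomes three 0/1 continuous features.

def encode_energy_rating(level, levels=('A', 'B', 'C')):
    # one new 0/1 feature per level of the categorical feature
    return [1 if level == l else 0 for l in levels]

print(encode_energy_rating('C'))   # [0, 0, 1], matching row 1 of the table above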
Handling Categorical Target Features: Logistic Regression
Table: A dataset listing features for a number of generators.

ID   RPM   VIBRATION   STATUS      ID   RPM   VIBRATION   STATUS
1 568 585 good 29 562 309 faulty
2 586 565 good 30 578 346 faulty
3 609 536 good 31 593 357 faulty
4 616 492 good 32 626 341 faulty
5 632 465 good 33 635 252 faulty
6 652 528 good 34 658 235 faulty
7 655 496 good 35 663 299 faulty
8 660 471 good 36 677 223 faulty
9 688 408 good 37 685 303 faulty
10 696 399 good 38 698 197 faulty
11 708 387 good 39 699 311 faulty
12 701 434 good 40 712 257 faulty
13 715 506 good 41 722 193 faulty
14 732 485 good 42 735 259 faulty
15 731 395 good 43 738 314 faulty
16 749 398 good 44 753 113 faulty
17 759 512 good 45 767 286 faulty
18 773 431 good 46 771 264 faulty
19 782 456 good 47 780 137 faulty
20 797 476 good 48 784 131 faulty
21 794 421 good 49 798 132 faulty
22 824 452 good 50 820 152 faulty
23 835 441 good 51 834 157 faulty
24 862 372 good 52 858 163 faulty
25 879 340 good 53 888 91 faulty
26 892 370 good 54 891 156 faulty
27 913 373 good 55 911 79 faulty
28 933 330 good 56 939 99 faulty

Figure: A scatter plot of the RPM and VIBRATION descriptive features from the generators dataset shown in Table 4 [18], where 'good' generators are shown as crosses and 'faulty' generators are shown as triangles.

Figure: A scatter plot of the RPM and VIBRATION descriptive features from the generators dataset shown in Table 4 [18]. A decision boundary separating 'good' generators (crosses) from 'faulty' generators (triangles) is also shown.

As the decision boundary is a linear separator, it can be defined using the equation of a line as:

VIBRATION = 830 − 0.667 × RPM    (5)

or

830 − 0.667 × RPM − VIBRATION = 0    (6)

Applying Equation (6) [21] to the instance RPM = 810, VIBRATION = 495, which is above the decision boundary, gives the following result:

830 − 0.667 × 810 − 495 = −205.27

By contrast, if we apply Equation (6) [21] to the instance RPM = 650, VIBRATION = 240, which is below the decision boundary, we get:

830 − 0.667 × 650 − 240 = 156.45
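
A quick Python check (an illustration, not from the slides) of Equation (6) for these two instances:

def boundary(rpm, vibration):
    # Equation (6): 830 - 0.667 * RPM - VIBRATION
    return 830 - 0.667 * rpm - vibration

print(round(boundary(810, 495), 2))   # -205.27, above the boundary
print(round(boundary(650, 240), 2))   #  156.45, below the boundary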



All the data points above the decision boundary will result
in a negative value when plugged into the decision
boundary equation, while all data points below the decision
boundary will result in a positive value.

Reverting to our previous notation we have:

M_w(d) = 1 if w · d ≥ 0, and 0 otherwise    (7)

The surface defined by this rule is known as a decision surface.

Figure: (a) A surface showing the value of Equation (6) [21] for all values of RPM and VIBRATION. The decision boundary given in Equation (6) [21] is highlighted. (b) The same surface linearly thresholded at zero to operate as a predictor.

The hard decision boundary given in Equation (7) [24] is discontinuous, so it is not differentiable and we cannot calculate the gradient of the error surface.
Furthermore, the model always makes completely confident predictions of 0 or 1, whereas a little more subtlety is desirable.
We address these issues by using a more sophisticated threshold function that is continuous, and therefore differentiable, and that allows for the subtlety desired: the logistic function.

The logistic function

Logistic(x) = \frac{1}{1 + e^{−x}}    (8)

where x is a numeric value and e is Euler's number (approximately 2.7183).
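
A minimal Python sketch of Equation (8):

import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0.0))     # 0.5
print(logistic(10.0))    # ~0.99995
print(logistic(-10.0))   # ~0.0000454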

Figure: A plot of logistic(x) for x in the range [−10, 10].

To build a logistic regression model, we simply pass the output of the basic linear regression model through the logistic function:

M_w(d) = Logistic(w · d) = \frac{1}{1 + e^{−w·d}}    (9)

A note on training logistic regression models:
1. Before we train a logistic regression model we map the binary target feature levels to 0 or 1.
2. The error of the model on each instance is then the difference between the target feature (0 or 1) and the value of the prediction, which lies in the range [0, 1].

Example

M_w(⟨RPM, VIBRATION⟩) = \frac{1}{1 + e^{−(−0.4077 + 4.1697 × RPM + 6.0460 × VIBRATION)}}

The decision surface for the example logistic regression


model.

P(t = 'faulty' | d) = M_w(d)
P(t = 'good' | d) = 1 − M_w(d)

Figure: A selection of the logistic regression models developed during the gradient descent process for the machinery dataset from Table 4 [18]. The bottom-right panel shows the sum of squared error values generated during the gradient descent process.

To repurpose the gradient descent algorithm for training logistic regression models, the only change that needs to be made is in the weight update rule.
See p. 360 in the book for details of how to derive the new weight update rule.
The new weight update rule is:

w[j] ← w[j] + α × \sum_{i=1}^{n} ((t_i − M_w(d_i)) × M_w(d_i) × (1 − M_w(d_i)) × d_i[j])
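
A minimal Python sketch of one application of this update rule, reusing the logistic() helper from the earlier sketch; the function and variable names are illustrative assumptions.

def logistic_predict(w, d):
    # M_w(d) = Logistic(w . d), with d[0] the dummy feature fixed at 1
    return logistic(sum(wj * dj for wj, dj in zip(w, d)))

def logistic_update(D, targets, w, alpha):
    # one pass of the logistic regression weight update rule above
    new_w = list(w)
    for j in range(len(w)):
        delta = 0.0
        for d, t in zip(D, targets):
            p = logistic_predict(w, d)              # M_w(d_i)
            delta += (t - p) * p * (1 - p) * d[j]
        new_w[j] = w[j] + alpha * delta
    return new_w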
ID   RPM   VIBRATION   STATUS      ID   RPM   VIBRATION   STATUS
1 498 604 faulty 35 501 463 good
2 517 594 faulty 36 526 443 good
3 541 574 faulty 37 536 412 good
4 555 587 faulty 38 564 394 good
5 572 537 faulty 39 584 398 good
6 600 553 faulty 40 602 398 good
7 621 482 faulty 41 610 428 good
8 632 539 faulty 42 638 389 good
9 656 476 faulty 43 652 394 good
10 653 554 faulty 44 659 336 good
11 679 516 faulty 45 662 364 good
12 688 524 faulty 46 672 308 good
13 684 450 faulty 47 691 248 good
14 699 512 faulty 48 694 401 good
15 703 505 faulty 49 718 313 good
16 717 377 faulty 50 720 410 good
17 740 377 faulty 51 723 389 good
18 749 501 faulty 52 744 227 good
19 756 492 faulty 53 741 397 good
20 752 381 faulty 54 770 200 good
21 762 508 faulty 55 764 370 good
22 781 474 faulty 56 790 248 good
23 781 480 faulty 57 786 344 good
24 804 460 faulty 58 792 290 good
25 828 346 faulty 59 818 268 good
26 830 366 faulty 60 845 232 good
27 864 344 faulty 61 867 195 good
28 882 403 faulty 62 878 168 good
29 891 338 faulty 63 895 218 good
30 921 362 faulty 64 916 221 good
31 941 301 faulty 65 950 156 good
32 965 336 faulty 66 956 174 good
33 976 297 faulty 67 973 134 good
34 994 287 faulty 68 1002 121 good

Figure: A scatter plot of the extended generators dataset given in Table 35 [35], which results in instances with the different target levels overlapping with each other. 'good' generators are shown as crosses, and 'faulty' generators are shown as triangles.

For logistic regression models we recommend that descriptive feature values always be normalized.
In this example, before the training process begins, both descriptive features are normalized to the range [−1, 1].

Figure: A selection of the logistic regression models developed during the gradient descent process for the extended generators dataset in Table 35 [35]. The bottom-right panel shows the sum of squared error values generated during the gradient descent process.

The final model found is:

M_w(⟨RPM, VIBRATION⟩) = \frac{1}{1 + e^{−(−0.4077 + 4.1697 × RPM + 6.0460 × VIBRATION)}}
Modeling Non-linear Relationships

Table: A dataset describing grass growth on Irish farms during July 2012.

ID   RAIN    GROWTH     ID   RAIN    GROWTH     ID   RAIN    GROWTH
 1   2.153   14.016     12   3.754   11.420     23   3.960   10.307
 2   3.933   10.834     13   2.809   13.847     24   3.592   12.069
 3   1.699   13.026     14   1.809   13.757     25   3.451   12.335
 4   1.164   11.019     15   4.114    9.101     26   1.197   10.806
 5   4.793    4.162     16   2.834   13.923     27   0.723    7.822
 6   2.690   14.167     17   3.872   10.795     28   1.958   14.010
 7   3.982   10.190     18   2.174   14.307     29   2.366   14.088
 8   3.333   13.525     19   4.353    8.059     30   1.530   12.701
 9   1.942   13.899     20   3.684   12.041     31   0.847    9.012
10   2.876   13.949     21   2.140   14.641     32   3.843   10.885
11   4.277    8.643     22   2.783   14.138     33   0.976    9.876

Figure: A scatter plot of the RAIN and GROWTH features from the grass growth dataset.

The best linear model we can learn for this data is:

GROWTH = 13.510 − 0.667 × RAIN

Figure: A simple linear regression model trained to capture the relationship between grass growth and rainfall.

In order to handle non-linear relationships we transform the data rather than the model, using a set of basis functions:

M_w(d) = \sum_{k=0}^{b} w[k] × φ_k(d)    (10)

where φ_0 to φ_b are a series of b basis functions that each transform the input vector d in a different way.
The advantage of this is that, except for introducing the mechanism of basis functions, we do not need to make any other changes to the approach we have presented so far.

The relationship between rainfall and grass growth in the grass growth dataset can be accurately represented as a second-order polynomial through the following model:

GROWTH = w[0] × φ_0(RAIN) + w[1] × φ_1(RAIN) + w[2] × φ_2(RAIN)

where

φ_0(RAIN) = 1
φ_1(RAIN) = RAIN
φ_2(RAIN) = RAIN²
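
A small Python sketch (illustrative, not from the slides) of this formulation; note that the model is still linear in the weights, only the input is transformed.

def basis_expand(rain):
    # [phi_0(RAIN), phi_1(RAIN), phi_2(RAIN)] = [1, RAIN, RAIN^2]
    return [1.0, rain, rain ** 2]

def predict_growth(w, rain):
    return sum(wj * pj for wj, pj in zip(w, basis_expand(rain)))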

Figure: A selection of the models developed during the gradient descent process for the grass growth dataset from Table 5 [46]. (Note that the RAIN and GROWTH features have been range normalized to the [−1, 1] range.)

GROWTH = 0.3707 × φ_0(RAIN) + 0.8475 × φ_1(RAIN) − 1.717 × φ_2(RAIN)

φ_0(RAIN) = 1
φ_1(RAIN) = RAIN
φ_2(RAIN) = RAIN²

What is the predicted GROWTH for the following RAIN values:
1. RAIN = −0.75
2. RAIN = 0.1
3. RAIN = 0.9

Basis functions can also be used:
1. for multivariate linear regression models in the same way, the only extra requirement being the definition of more basis functions.
2. to train logistic regression models for categorical prediction problems that involve non-linear relationships.

Table: A dataset showing participants' responses to viewing 'positive' and 'negative' images, measured on the EEG P20 and P45 potentials.

ID     P20      P45   TYPE         ID     P20      P45   TYPE
 1  0.4497   0.4499   negative     26  0.0656   0.2244   positive
 2  0.8964   0.9006   negative     27  0.6336   0.2312   positive
 3  0.6952   0.3760   negative     28  0.4453   0.4052   positive
 4  0.1769   0.7050   negative     29  0.9998   0.8493   positive
 5  0.6904   0.4505   negative     30  0.9027   0.6080   positive
 6  0.7794   0.9190   negative     31  0.3319   0.1473   positive
 ...                               ...

Figure: A scatter plot of the P20 and P45 features from the EEG
dataset. ’positive’ images are shown as crosses, and ’negative’
images are shown as triangles.

A logistic regression model using basis functions is defined as follows:

M_w(d) = \frac{1}{1 + e^{−\sum_{j=0}^{b} w[j] φ_j(d)}}    (11)

We will use the following basis functions for the EEG problem:

φ_0(⟨P20, P45⟩) = 1          φ_4(⟨P20, P45⟩) = P45²
φ_1(⟨P20, P45⟩) = P20        φ_5(⟨P20, P45⟩) = P20³
φ_2(⟨P20, P45⟩) = P45        φ_6(⟨P20, P45⟩) = P45³
φ_3(⟨P20, P45⟩) = P20²       φ_7(⟨P20, P45⟩) = P20 × P45
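
A minimal Python sketch of this basis expansion feeding the logistic regression model of Equation (11); it reuses the logistic() helper from the earlier sketch, and the names are illustrative assumptions.

def eeg_basis(p20, p45):
    # the eight basis functions phi_0 .. phi_7 listed above
    return [1.0, p20, p45, p20 ** 2, p45 ** 2, p20 ** 3, p45 ** 3, p20 * p45]

def eeg_predict(w, p20, p45):
    # M_w(d) in [0, 1], interpreted as the probability of one target level
    return logistic(sum(wj * pj for wj, pj in zip(w, eeg_basis(p20, p45))))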

Figure: A selection of the models developed during the gradient descent process for the EEG dataset from Table 6 [55]. The final panel shows the decision surface generated.
