06 Logistic Regression PDF
Classification
Where y is a discrete value
Develop the logistic regression algorithm to determine what class a new input
should fall into
Classification problems
Email -> spam/not spam?
Online transactions -> fraudulent?
Tumor -> Malignant/benign
Variable in these problems is Y
Y is either 0 or 1
0 = negative class (absence of something)
1 = positive class (presence of something)
Start with binary class problems
Later look at multiclass classification problem, although this is just an extension of
binary classification
How do we develop a classification algorithm?
Tumour size vs malignancy (0 or 1)
We could use linear regression
Then threshold the classifier output (i.e. anything over some value is yes, else
no)
In the tumour example, linear regression with thresholding seems to work
It does a reasonable job of stratifying the data points into one of two classes
But a single additional Yes example lying far away from the others (an outlier) can shift the fitted line and its threshold
This would lead to classifying some of the existing Yeses as Nos
Another issue with linear regression
We know y is 0 or 1
But the hypothesis can give values larger than 1 or less than 0
So logistic regression generates a value that is always between 0 and 1
Logistic regression is a classification algorithm - don't be confused
Hypothesis representation
What function is used to represent our hypothesis in classification
We want our classifier to output values between 0 and 1
When using linear regression we did hθ(x) = θ^T x
For classification hypothesis representation we do hθ(x) = g(θ^T x)
Where we define g(z)
z is a real number
g(z) = 1 / (1 + e^(-z))
This is the sigmoid function, or the logistic function
If we combine these equations we can write out the hypothesis as
hθ(x) = 1 / (1 + e^(-θ^T x))
When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated
probability that y=1 on input x
Example
If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
hθ(x) = 0.7
Tells a patient they have a 70% chance of a tumor being malignant
We can write this using the following notation
hθ(x) = P(y=1|x ; θ)
What does this mean?
Probability that y=1, given x, parameterized by θ
Since this is a binary classification task we know y = 0 or 1
So the following must be true
P(y=1|x ; θ) + P(y=0|x ; θ) = 1
P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
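To make this concrete, a minimal Octave sketch of the sigmoid hypothesis (the parameter and feature values below are made up purely for illustration, not taken from the notes):

    % Sigmoid (logistic) function g(z) = 1 / (1 + e^-z), applied elementwise
    sigmoid = @(z) 1 ./ (1 + exp(-z));

    % Hypothesis h_theta(x) = g(theta' * x), read as P(y = 1 | x ; theta)
    theta = [-6; 0.05];          % illustrative parameter values
    x     = [1; 140];            % x0 = 1 as always, x1 = tumourSize (made-up value)
    h     = sigmoid(theta' * x)  % roughly 0.73, i.e. about a 73% chance that y = 1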
Decision boundary
Gives a better sense of what the hypothesis function is computing and what it looks like
One way of using the sigmoid function is:
When the probability of y being 1 is greater than 0.5 then we can predict y = 1
Else we predict y = 0
When exactly is hθ(x) greater than or equal to 0.5?
Look at the sigmoid function
g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
So hθ(x) = g(θ^T x) >= 0.5 exactly when θ^T x >= 0, i.e. we predict y = 1 whenever θ^T x >= 0
With two features, hθ(x) = g(θ0 + θ1 x1 + θ2 x2)
Means we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
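As a concrete illustration (the parameter values are assumed for the example, in the spirit of the lecture figure): if θ0 = -3, θ1 = 1 and θ2 = 1, then hθ(x) = g(-3 + x1 + x2), so we predict y = 1 whenever -3 + x1 + x2 >= 0, i.e. x1 + x2 >= 3. The straight line x1 + x2 = 3 is the decision boundary; points on one side are predicted y = 1, points on the other y = 0. The decision boundary is a property of the hypothesis and its parameters, not of the training set.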
More complex decision boundaries?
By using higher order polynomial terms, we can get even more
complex decision boundaries
Cost function for logistic regression
If we plugged the sigmoid hypothesis into the squared error cost from linear regression, J(θ) would be non-convex, with many local optima
To get around this we need a different, convex Cost() function, which means we can apply gradient descent and be confident of reaching the global minimum
For logistic regression the per-example cost is
Cost(hθ(x), y) = -log(hθ(x)) if y = 1
Cost(hθ(x), y) = -log(1 - hθ(x)) if y = 0
Giving the overall cost function J(θ) = -(1/m) Σ [ y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i))) ]
To make a prediction on a new input x we output hθ(x)
This result is P(y=1 | x ; θ)
Probability y = 1, given x, parameterized by θ
If you had n features, θ would be an (n+1)-dimensional column vector
Minimizing J(θ) with gradient descent gives the update θj := θj - (α/m) Σ (hθ(x^(i)) - y^(i)) xj^(i)
This update equation is the same as the linear regression rule
The only difference is that our definition for the hypothesis hθ(x) has changed
Previously, we spoke about how to monitor gradient descent to check it's working
Can do the same thing here for logistic regression
When implementing logistic regression with gradient descent, we have to update all
the θ values (θ0 to θn ) simultaneously
Could use a for loop
Better would be a vectorized implementation (a sketch is given below)
Feature scaling for gradient descent for logistic regression also applies here
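A rough sketch of the vectorized implementation mentioned above, assuming X is the m x (n+1) design matrix, y the m x 1 label vector, and alpha and num_iters are chosen by hand (these names are assumptions, not from the notes):

    sigmoid = @(z) 1 ./ (1 + exp(-z));   % logistic function, elementwise
    m = length(y);                       % number of training examples

    for iter = 1:num_iters
      h     = sigmoid(X * theta);                    % m x 1 vector of predictions
      theta = theta - (alpha / m) * (X' * (h - y));  % update theta_0 .. theta_n simultaneously
    end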
Advanced optimization
Previously we looked at gradient descent for minimizing the cost function
Here we look at more advanced approaches to minimizing the cost function for logistic regression
Good for large machine learning problems (e.g. huge feature set)
What is gradient descent actually doing?
We have some cost function J( θ), and we want to minimize it
We need to write code which can take θ as input and compute the following
J(θ)
The partial derivative of J(θ) with respect to θj (for j = 0 to j = n)
Example
Suppose we have two parameters, θ1 and θ2
The cost function here is J(θ) = (θ1 - 5)^2 + (θ2 - 5)^2
The derivative of J(θ) with respect to each θi turns out to be 2(θi - 5)
First we need to define our cost function, which should have a signature along the lines of function [jVal, gradient] = costFunction(theta) (a sketch is given below)
Input for the cost function is theta, a vector of the θ parameters
Two return values from costFunction are
jVal
The value of the cost function J(θ) itself (the un-derived cost function)
In this case jVal = (θ1 - 5)^2 + (θ2 - 5)^2
gradient
A 2 by 1 vector here; in general it has one entry per parameter
The two elements are the two partial derivative terms
Each indexed value gives the partial derivative of J(θ) with respect to θi
Where i is the index position in the gradient vector
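A sketch of such a cost function in Octave for this example (it mirrors the description above; treat it as illustrative):

    function [jVal, gradient] = costFunction(theta)
      % Value of the cost function J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2
      % (theta(1) plays the role of theta1 because Octave indexes from 1)
      jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;

      % Gradient: partial derivatives of J with respect to theta1 and theta2
      gradient = zeros(2, 1);
      gradient(1) = 2 * (theta(1) - 5);
      gradient(2) = 2 * (theta(2) - 5);
    end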
With the cost function implemented, we can call the advanced algorithm using
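A sketch of that call in Octave (the output variable names optTheta, functionVal and exitFlag are assumptions; optimset and fminunc are standard Octave functions):

    options = optimset('GradObj', 'on', 'MaxIter', 100);  % we supply the gradient; cap the iterations
    initialTheta = zeros(2, 1);
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);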
Here
options is a data structure giving options for the algorithm
fminunc
The Octave function that minimizes the cost function (finds the minimum of an unconstrained multivariable function)
@costFunction is a function handle (a pointer) to the costFunction to be used
For the Octave implementation
initialTheta must be a vector with at least two elements (fminunc does not work when θ is a single real number)
To apply this to logistic regression
theta is an (n+1)-dimensional column vector
Octave indexes from 1, not 0, so θ0 lives in theta(1)
Write a cost function which returns both the logistic regression cost J(θ) and its gradient (a sketch is given below)
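A sketch of what that cost function might look like in Octave, using the cross-entropy cost J(θ) from earlier (the function name, and passing X and y in via an anonymous function, are choices made here rather than prescribed by the notes):

    function [jVal, gradient] = logisticCostFunction(theta, X, y)
      m = length(y);                         % number of training examples
      h = 1 ./ (1 + exp(-(X * theta)));      % sigmoid of X*theta, an m x 1 vector

      % Cross-entropy cost: J = -(1/m) * sum( y.*log(h) + (1-y).*log(1-h) )
      jVal = -(1 / m) * sum(y .* log(h) + (1 - y) .* log(1 - h));

      % Gradient: (1/m) * X' * (h - y), an (n+1) x 1 vector
      gradient = (1 / m) * (X' * (h - y));
    end

    % Usage sketch: wrap it so fminunc sees a function of theta only
    % options      = optimset('GradObj', 'on', 'MaxIter', 400);
    % initialTheta = zeros(size(X, 2), 1);
    % [optTheta, cost] = fminunc(@(t) logisticCostFunction(t, X, y), initialTheta, options);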
Given a dataset with three classes, how do we get a learning algorithm to work?
Use one vs. all classification to make binary classification work for multiclass classification
One vs. all classification
Split the training set into three separate binary classification problems
i.e. create a new fake training set
Triangles (1) vs crosses and squares (0): train hθ^(1)(x) = P(y = 1 | x ; θ) on this set
Crosses (1) vs triangles and squares (0): train hθ^(2)(x) = P(y = 1 | x ; θ)
Squares (1) vs triangles and crosses (0): train hθ^(3)(x) = P(y = 1 | x ; θ)
Overall
Train a logistic regression classifier hθ^(i)(x) for each class i to predict the probability that y = i
On a new input x, to make a prediction, pick the class i that maximizes hθ^(i)(x)
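A rough Octave sketch of this one vs. all recipe; trainLogistic is a hypothetical helper that fits θ for one relabeled binary problem (e.g. via the fminunc approach above), and X, y, x are assumed to exist:

    sigmoid = @(z) 1 ./ (1 + exp(-z));
    numLabels = 3;                                % e.g. triangles, crosses, squares
    allTheta  = zeros(numLabels, size(X, 2));     % one row of parameters per class

    % Train one binary classifier per class: class i (1) vs everything else (0)
    for i = 1:numLabels
      allTheta(i, :) = trainLogistic(X, (y == i))';   % trainLogistic is hypothetical
    end

    % Prediction: for a new example x (row vector with x0 = 1), pick the class
    % whose classifier gives the highest estimated probability
    probs = sigmoid(allTheta * x');               % numLabels x 1 vector of probabilities
    [maxProb, prediction] = max(probs);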