Logistic Regression
Binary Classification
1 (Cat) vs 0 (non-Cat)
● The goal is to train a classifier whose input is an image represented by a feature vector 𝑥.
● The classifier predicts whether the corresponding label 𝑦 is 1 or 0.
● In this case, it predicts whether the image is a cat image (1) or a non-cat image (0).
Binary Classification
● An image is stored in the computer as three separate matrices corresponding to the Red,
Green, and Blue color channels of the image.
● The three matrices have the same size as the image. For example, if the resolution of the cat
image is 64 × 64 pixels, the three matrices (RGB) are 64 × 64 each.
● The value in each cell is a pixel intensity, and these values are used to build a feature
vector of dimension 𝑛. In pattern recognition and machine learning, a feature vector
represents an object, in this case an image that is either a cat or not a cat.
Binary Classification
[Figure: 64 × 64 image unrolled into a feature vector 𝑥, with label 𝑦 (0 or 1)]
● To create the feature vector 𝑥, the pixel intensity values of each color channel are
“unrolled” (reshaped) into a single column. The dimension of the input feature vector 𝑥 is 𝑛_𝑥 = 64 × 64 × 3 = 12,288.
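The unrolling step can be sketched with NumPy. This is only an illustration: the image here is random data, and the choice to list all red values first, then green, then blue is an assumption about the ordering, which the slides do not specify.

```python
import numpy as np

# Hypothetical 64 x 64 RGB image with random pixel intensities in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Move the color axis first so each channel is unrolled in turn
# (all red values, then all green, then all blue), then flatten
# everything into a single column vector.
x = image.transpose(2, 0, 1).reshape(-1, 1)

n_x = x.shape[0]
print(x.shape)  # (12288, 1)
```

Any fixed ordering works, as long as the same one is used for every image in the training and test sets.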
Logistic Regression
● Logistic regression is a learning algorithm used in supervised learning
problems where the output labels 𝑦 are all either zero or one.
● The goal of logistic regression is to minimize the error between its predictions
and the training data.
● Given an image represented by a feature vector 𝑥, the algorithm estimates
the probability that there is a cat in that image.
Logistic Regression
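● Output: $\hat{y} = \sigma(w^T x + b)$, the predicted probability that 𝑦 = 1.
● Parameters: the weights 𝑤 and the bias 𝑏.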
Logistic Regression
● (𝑤ᵀ𝑥 + 𝑏) is a linear function like (𝑎𝑥 + 𝑏), but since we are looking for a probability,
which must lie in [0, 1], the sigmoid function is applied to it.
● The sigmoid function is bounded between 0 and 1, as shown in the graph below.
[Figure: plot of the sigmoid function Sig(Z) against Z]
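A minimal sketch of the sigmoid function, assuming NumPy and the standard definition 1 / (1 + e^(-z)). The function name `sigmoid` matches the pseudocode used later in the slides.

```python
import numpy as np

def sigmoid(z):
    """Squash a real number (or array) z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and z = 0 sits exactly in the middle.
print(sigmoid(0.0))  # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```

For very large negative z, `np.exp(-z)` can overflow and emit a warning; this sketch does not guard against that.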
Logistic Regression: Cost Function
● To train the parameters 𝑤 and 𝑏, we need to define a cost function.
● The loss function measures the discrepancy between the prediction (𝑦̂(𝑖)) and the desired output (𝑦(𝑖)).
● In other words, the loss function computes the error for a single training example.
Error / Loss Function
Squared Error Function: $L(\hat{y}^{(i)}, y^{(i)}) = \frac{1}{2}(\hat{y}^{(i)} - y^{(i)})^2$
● There is an extra (1/2) on the right side of the equation. Does it matter?
● It is there because when you take the derivative of the cost function, which is used to
update the parameters during gradient descent, the 2 from the exponent cancels
with the (1/2) multiplier.
● This and similar techniques are widely used in math
"to make the derivations mathematically more convenient".
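As a short worked step, assuming the squared error has the usual form with the (1/2) factor, differentiating with respect to the prediction shows the cancellation:

```latex
L(\hat{y}, y) = \tfrac{1}{2}\,(\hat{y} - y)^2
\quad\Longrightarrow\quad
\frac{\partial L}{\partial \hat{y}}
  = \tfrac{1}{2} \cdot 2\,(\hat{y} - y)
  = \hat{y} - y
```

Without the (1/2), the derivative would carry a constant factor of 2, which changes the scale of each step but not the location of the minimum.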
Is the squared error function a good choice?
● The squared error function (commonly used for linear regression) is not very
suitable for logistic regression.
○ In logistic regression the hypothesis / prediction is non-linear (the sigmoid function), which
makes the squared error cost non-convex in the parameters.
○ The logarithmic (cross-entropy) loss, on the other hand, gives a convex cost with no local optima
other than the global one, so gradient descent works well.
● In binary classification, the squared error function generally also penalizes
examples that are correctly classified but still near the decision boundary, thus
creating a "margin."
● Gradient descent then wastes a lot of time pushing predictions very close to {0, 1}.
Logistic Regression: Cross Entropy Loss
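● Loss function
○ $L(\hat{y}, y) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right)$
● Cost function
○ The cost function is the average of the loss function over the entire training set:
$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$. We are
going to find the parameters 𝑤 and 𝑏 that minimize the overall cost function.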
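The cost can be sketched in a few lines of NumPy, assuming the cross-entropy loss used in the gradient-descent pseudocode later in the slides. The function name `cross_entropy_cost` and the sample predictions are made up for illustration.

```python
import numpy as np

def cross_entropy_cost(y_hat, y):
    """Average cross-entropy loss over all training examples."""
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.mean(losses)

y = np.array([1, 0, 1, 0])

# Hypothetical predictions: one set mostly right, one set mostly wrong.
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])

print(cross_entropy_cost(mostly_right, y))  # small cost
print(cross_entropy_cost(mostly_wrong, y))  # much larger cost
```

The sketch assumes every prediction lies strictly between 0 and 1; a prediction of exactly 0 or 1 would make one of the logarithms infinite.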
Gradient Descent
Content Credit: Andrew Ng
We want to find the parameters W, b that minimize J(W, b).
● Our cost function is convex.
● First we initialize w and b, either to 0, 0 or to random values,
and then try to improve the values until they reach the minimum.
● In logistic regression people usually use 0, 0
instead of random values.
● Because this function is convex, no matter where you
initialize you should get to the global optimum
or roughly close to it.
[Figure: surface of J(W, b) with the global optimum marked]
Gradient Descent
● Gradient descent starts at the initial point and takes a step
in the steepest downhill direction at each iteration.
● It tries to reach the global optimum or
somewhere near the global optimum.
[Figure: path of gradient descent on J(W, b) toward the global optimum]
Gradient Descent
Repeat {
    $w := w - \alpha \frac{dJ(w)}{dw}$
}
// Repeatedly do that until the algorithm converges.
The derivative $\frac{dJ(w)}{dw}$ is the update or change you want to make to the parameter w.
Ignore b for now to make it a one-dimensional plot rather than a higher-dimensional plot.
[Figure: J(w) plotted against w; the slope is negative (-) on one side of the minimum and positive (+) on the other]
● 𝜶 = learning rate: how big a step we
take at each iteration of gradient descent.
● Definition of a derivative:
○ The slope of a function at a point.
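The repeat loop can be tried on a one-dimensional example. The cost J(w) = (w - 3)^2 below is a made-up convex function chosen only because its minimum (w = 3) is easy to check; the learning rate and iteration count are likewise arbitrary.

```python
def J(w):
    """Hypothetical convex cost with its minimum at w = 3."""
    return (w - 3.0) ** 2

def dJ_dw(w):
    """Derivative (slope) of J at the point w."""
    return 2.0 * (w - 3.0)

alpha = 0.1  # learning rate: size of each step
w = 0.0      # initial value

for _ in range(100):
    # Step against the slope: downhill on the cost curve.
    w = w - alpha * dJ_dw(w)

print(w, J(w))
```

When w is to the right of the minimum the slope is positive and w decreases; to the left the slope is negative and w increases, so both sides move toward the minimum.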
Gradient Descent : Actual Update Rule
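We want to find the parameters W, b that minimize J(W, b).
$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$
$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$
Each update uses the partial derivative of J(w, b) with respect to that parameter.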
Derivatives : Intuition
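f(a) = 3a
● a = 2: f(a) = 6
  a = 2.001: f(a) = 6.003
  The slope (derivative) of f(a) at a = 2 is 3.
● a = 5: f(a) = 15
  a = 5.001: f(a) = 15.003
  The slope of f(a) at a = 5 is also 3.
● If we shift a by 0.001, then f(a) shifts by 3 times 0.001.
$\frac{d f(a)}{da} = 3$
The slope, or "rate of change", of this function is 3 at every point.
(For a function such as x², by contrast, the slope at any point is 2x.)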
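The "shift a by 0.001" argument can be checked numerically. This sketch assumes the function on the slide is f(a) = 3a, which matches the values shown; `slope` is a helper name introduced here.

```python
def f(a):
    return 3 * a

def g(x):
    return x ** 2

def slope(func, a, eps=0.001):
    """Approximate the derivative: change in output per unit change in input."""
    return (func(a + eps) - func(a)) / eps

print(slope(f, 2.0))  # close to 3
print(slope(f, 5.0))  # close to 3
print(slope(g, 2.0))  # close to 2 * 2 = 4
print(slope(g, 5.0))  # close to 2 * 5 = 10
```

For the straight line the estimate is the same everywhere, while for the curve it depends on where you measure it.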
Do we actually need Gradient Descent?
● Let's pretend that we only have 1 weight. To find the value of the weight
that minimizes our cost, we could try a bunch of values for W, say
1000 values. That doesn't seem so bad; after all, my computer is
pretty fast.
● It takes about 0.04 seconds to
check 1000 different weight
values for our neural network.
[Figure: cost plotted against W1; try all 1000 values and pick the winner]
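The brute-force idea can be sketched as a grid search over one weight. The cost function below is hypothetical and stands in for whatever cost the network would have; the search range is also an assumption.

```python
import numpy as np

def cost(w):
    """Hypothetical one-weight cost, used only for illustration."""
    return (w - 1.5) ** 2

# Try 1000 evenly spaced candidate values and keep the cheapest one.
candidates = np.linspace(-5.0, 5.0, 1000)
costs = cost(candidates)
best_w = candidates[np.argmin(costs)]

print(best_w)
```

The same approach with d weights needs 1000 to the power d evaluations, which is why trying every combination stops being practical as the number of weights grows.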
A Famous Quote
Computation Graph
● The computations of a neural network are organized into a forward pass and a backward pass.
● Forward Pass / Propagation
○ In which we compute the output of the neural network.
● Backward Pass / Propagation
○ In which we compute the gradients / derivatives.
● Computation Graph
○ Explains why the computation is organized in this way.
Computation Graph
J(a, b, c) = 3(a + bc)
3 steps of computation:
1. u = bc
2. v = a + u
3. J = 3v
[Figure: computation graph. Inputs b and c feed u = bc; a and u feed v = a + u; v feeds J = 3v]
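The three steps, and the derivatives obtained by walking the graph backwards, can be sketched as follows. The input values a = 5, b = 3, c = 2 are chosen here for illustration and are not taken from the slide.

```python
def forward(a, b, c):
    """Forward pass: compute J left to right through the graph."""
    u = b * c
    v = a + u
    j = 3 * v
    return u, v, j

def backward(a, b, c):
    """Backward pass: apply the chain rule right to left."""
    dj_dv = 3.0          # J = 3v
    dj_da = dj_dv * 1.0  # v = a + u, so dv/da = 1
    dj_du = dj_dv * 1.0  # v = a + u, so dv/du = 1
    dj_db = dj_du * c    # u = bc, so du/db = c
    dj_dc = dj_du * b    # u = bc, so du/dc = b
    return dj_da, dj_db, dj_dc

print(forward(5, 3, 2))   # (6, 11, 33)
print(backward(5, 3, 2))  # (3.0, 6.0, 9.0)
```

Each derivative in the backward pass reuses one computed just before it, which is why the derivatives are evaluated from the output back toward the inputs.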
Logistic regression : Forward Propagation
Logistic regression : Backward Propagation
Rules for derivatives of logarithmic expressions
● Ignoring the (-) sign for now.
● log(x) refers to the base-e logarithm, i.e. the natural logarithm ln(x), as is the
convention in mathematical analysis, physics, chemistry, statistics, economics, and
some engineering fields.
Applying the Chain Rule
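The equations from these slides did not survive, so the following is a sketch of the standard chain-rule steps, assuming a = σ(z), z = w1·x1 + w2·x2 + b, and the cross-entropy loss with the natural logarithm. The final results agree with the gradient table given later in the slides.

```latex
\begin{aligned}
L(a, y) &= -\bigl(y \log a + (1 - y)\log(1 - a)\bigr) \\
\frac{\partial L}{\partial a} &= -\frac{y}{a} + \frac{1 - y}{1 - a} \\
\frac{da}{dz} &= a\,(1 - a) \\
dz = \frac{\partial L}{\partial z}
  &= \frac{\partial L}{\partial a}\cdot\frac{da}{dz}
   = -y\,(1 - a) + (1 - y)\,a
   = a - y \\
dw_1 = \frac{\partial L}{\partial w_1} &= x_1 \, dz, \qquad
dw_2 = \frac{\partial L}{\partial w_2} = x_2 \, dz, \qquad
db = \frac{\partial L}{\partial b} = dz
\end{aligned}
```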
Updating the Parameters: w1, w2 and b
$w_1 := w_1 - \alpha \, dw_1$
$w_2 := w_2 - \alpha \, dw_2$
$b := b - \alpha \, db$
where 𝜶 is the learning rate.
Logistic regression Gradient descent on m examples
Basic Parameters
x1 Feature
x2 Feature
Logistic regression Gradient descent on m examples
d(z)  = a - y
d(w1) = x1 * d(z)
d(w2) = x2 * d(z)
d(b)  = d(z)
Logistic regression Gradient descent on m examples
J = 0; dw1 = 0; dw2 = 0; db = 0
w1 = 0; w2 = 0; b = 0
for i = 1 to m
    # Forward pass
    z(i) = w1*x1(i) + w2*x2(i) + b
    a(i) = sigmoid(z(i))
    J += -(y(i)*log(a(i)) + (1-y(i))*log(1-a(i)))
    # Backward pass
    dz(i) = a(i) - y(i)
    dw1 += dz(i) * x1(i)
    dw2 += dz(i) * x2(i)
    db += dz(i)
J /= m
dw1 /= m
dw2 /= m
db /= m
# Gradient descent
w1 = w1 - alpha * dw1
w2 = w2 - alpha * dw2
b = b - alpha * db
One iteration of gradient descent, with n = 2 features.
dw1, dw2, and db are accumulators over all m training examples; w1, w2, and b are
single values shared by all m training examples.
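A runnable sketch of the loop-based pseudocode, using only the Python standard library. The function name `gradient_descent_step`, the toy dataset, the learning rate, and the number of repetitions are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent_step(x1, x2, y, w1, w2, b, alpha):
    """One iteration of gradient descent over m examples with 2 features."""
    m = len(y)
    J = dw1 = dw2 = db = 0.0
    for i in range(m):
        # Forward pass
        z = w1 * x1[i] + w2 * x2[i] + b
        a = sigmoid(z)
        J += -(y[i] * math.log(a) + (1 - y[i]) * math.log(1 - a))
        # Backward pass: accumulate the gradients
        dz = a - y[i]
        dw1 += dz * x1[i]
        dw2 += dz * x2[i]
        db += dz
    # Average over the m examples
    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
    # Parameter update
    w1 -= alpha * dw1
    w2 -= alpha * dw2
    b -= alpha * db
    return w1, w2, b, J

# Hypothetical toy dataset: 4 examples, 2 features each.
x1 = [0.5, 1.5, -1.0, -2.0]
x2 = [1.0, 2.0, -0.5, -1.5]
y = [1, 1, 0, 0]

w1 = w2 = b = 0.0
costs = []
for _ in range(100):
    w1, w2, b, J = gradient_descent_step(x1, x2, y, w1, w2, b, alpha=0.1)
    costs.append(J)

print(costs[0], costs[-1])  # the cost falls as the steps are repeated
```

The returned J is the cost measured before the update in that call, so the first value corresponds to w1 = w2 = b = 0.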
Logistic regression Gradient descent on m examples
● The previous slide is just one step of gradient descent; we need to repeat it
multiple times in order to take multiple steps of gradient descent.
● There are weaknesses in the previous implementation. To implement it
we need to write two for loops: one over the m training examples and one over the
features (here dw1 and dw2, since n = 2).
● Having explicit for loops in your code makes your code less efficient.
● Solution: vectorization techniques.
● To train with larger datasets we need vectorization
techniques, which avoid explicit for loops.
LR Gradient descent on m examples (modified)
J = 0; db = 0; dw = np.zeros((nx, 1))    # dw replaces dw1 = 0; dw2 = 0
w1 = 0; w2 = 0; b = 0
for i = 1 to m
    # Forward pass
    z(i) = w1*x1(i) + w2*x2(i) + b
    a(i) = sigmoid(z(i))
    J += -(y(i)*log(a(i)) + (1-y(i))*log(1-a(i)))
    # Backward pass
    dz(i) = a(i) - y(i)
    dw += x(i) * dz(i)    # replaces dw1 += dz(i) * x1(i); dw2 += dz(i) * x2(i) (n = 2)
    db += dz(i)
Logistic regression Gradient descent on m examples
J = J / m
dw = dw / m    # replaces dw1 = dw1 / m; dw2 = dw2 / m
db = db / m
# Gradient descent
w1 = w1 - alpha * dw1    # dw1 and dw2 are the two entries of dw
w2 = w2 - alpha * dw2
b = b - alpha * db
We have gone from 2 for loops to 1 for loop; we still have one for loop
that loops over the individual training examples.
dw and db are accumulators over all m training examples; w1, w2, and b are
single values shared by all m training examples.
Vectorizing Logistic Regression (Forward)
$Z = [z^{(1)} \; z^{(2)} \; \dots \; z^{(m)}] = w^T X + [b \; b \; \dots \; b]$, where each term has dimension (1 x m).
Broadcasting: 𝑏 has dimension (1, 1) and is expanded to dimension (1 x m) when it is added to $w^T X$.
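A sketch of the vectorized forward pass in NumPy, assuming the training examples are stacked as the columns of a matrix X of shape (n_x, m) and w is a column vector of shape (n_x, 1). The sizes and random data are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, m = 3, 5  # hypothetical sizes: 3 features, 5 examples
rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))  # one training example per column
w = rng.standard_normal((n_x, 1))
b = 0.5                            # a single scalar bias

# w.T has shape (1, n_x), so np.dot(w.T, X) has shape (1, m).
# The scalar b is broadcast across all m columns by NumPy.
Z = np.dot(w.T, X) + b
A = sigmoid(Z)

print(Z.shape, A.shape)  # (1, 5) (1, 5)
```

This computes z for all m examples in one matrix product, with no explicit loop over the examples.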
Gradient Computation
Implementing Logistic Regression
Single Iteration of Gradient Descent
[Figure: code for a single iteration, with the gradient computation and the parameter update steps labelled]
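The code from this slide did not survive, so the following is a sketch of what a fully vectorized iteration could look like, assuming X has shape (n_x, m), Y has shape (1, m), and the gradients are the vector forms of dz = a - y, dw += x * dz, and db += dz from the earlier pseudocode. The function name and toy data are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_iteration(w, b, X, Y, alpha):
    """One vectorized iteration: forward pass, gradients, update."""
    m = X.shape[1]
    # Forward pass
    Z = np.dot(w.T, X) + b
    A = sigmoid(Z)
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    # Gradient computation
    dZ = A - Y                    # shape (1, m)
    dw = np.dot(X, dZ.T) / m      # shape (n_x, 1)
    db = np.sum(dZ) / m           # scalar
    # Parameter update
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, J

# Hypothetical toy data: 2 features, 4 examples (one per column).
X = np.array([[0.5, 1.5, -1.0, -2.0],
              [1.0, 2.0, -0.5, -1.5]])
Y = np.array([[1, 1, 0, 0]])

w = np.zeros((2, 1))
b = 0.0
w, b, J = gradient_descent_iteration(w, b, X, Y, alpha=0.1)
print(J)
print(w.ravel(), b)
```

A loop over iterations is still needed to take several gradient-descent steps; what vectorization removes is the loop over training examples and over features inside each step.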
What does this have to do with the brain?
END