Logistic Regression

The document discusses binary classification using logistic regression and neural networks. It explains how logistic regression can be used to classify images into categories like cat vs non-cat by training a model on pixel intensity values as features. The model learns parameters like weights and bias by minimizing a cost function using gradient descent.

Uploaded by

SANJIDA AKTER

Neural Networks Basics

CSE 4237 - Soft Computing


Mir Tafseer Nayeem
Faculty Member, CSE AUST
[email protected]

1
Binary Classification

1 (Cat) vs 0 (non-Cat)

Example: Cat vs Non-Cat

● The goal is to train a classifier with an input image represented by a feature vector 𝑥.
● To predict whether the corresponding label 𝑦 is 1 or 0.
● In this case, whether this is a cat image (1) or a non-cat image (0).

2
Binary Classification

[Figure: 64 x 64 pixel cat image]

● An image is stored in the computer as three separate matrices corresponding to the Red,
Green, and Blue color channels of the image.
● The three matrices have the same size as the image. For example, if the resolution of the cat
image is 64 x 64 pixels, the three matrices (RGB) are 64 x 64 each.

Content Credit: Andrew Ng 3


Binary Classification

[Figure: 64 x 64 pixel image]

● The value in a cell represents the pixel intensity, which will be used to create a feature
vector of n dimensions. In pattern recognition and machine learning, a feature vector
represents an object, in this case, a cat or a non-cat image.

4
Binary Classification

[Figure: 64 x 64 pixel image, with output Y (0 or 1)]

● To create a feature vector, 𝑥, the pixel intensity values are “unrolled” or “reshaped” for each
color. The dimension of the input feature vector 𝑥 is 𝑛_𝑥 = 64 × 64 × 3 = 12,288.
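
The unrolling step can be sketched as follows. This is a minimal illustration that assumes NumPy and uses a randomly generated stand-in for the cat image; the array name `image` and its contents are not from the slides.

```python
import numpy as np

# Stand-in for a 64 x 64 RGB image (height x width x channel) with integer
# pixel intensities in [0, 255]. The pixel data is random (an assumption).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Move the channel axis first so each color is unrolled in turn
# (all red values, then all green, then all blue), then flatten
# everything into a single column vector x.
x = image.transpose(2, 0, 1).reshape(-1, 1)

n_x = x.shape[0]
print(n_x)  # 64 * 64 * 3 = 12288
```

The channel-by-channel ordering is one reasonable reading of "for each color"; any fixed ordering works as long as it is applied consistently to every image.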

5
Logistic Regression
● Logistic regression is a learning algorithm used in supervised learning
problems where every output label 𝑦 is either zero or one.
● The goal of logistic regression is to minimize the error between its predictions
and the training data.
● Given an image represented by a feature vector 𝑥, the algorithm estimates
the probability that a cat is in that image.

6
Logistic Regression

bias
Parameters

Content Credit: Andrew Ng 7


Logistic Regression : Role of bias (b)
● The bias value allows the activation function to be shifted to the left or right, to better
fit the data.
● Changes to the weights alter the steepness of the sigmoid curve, whilst the bias offsets
it, shifting the entire curve so it fits better.
● The bias only shifts the output values; it doesn’t interact with the actual input data.
That’s why it is called bias.
● You can think of the bias as a measure of how easy it is to get a node to fire.
○ For a node with a large bias, the output will tend to be intrinsically high, with small
positive weights and inputs producing large positive outputs (near to 1).
○ Biases can also be negative, leading to sigmoid outputs near 0.
○ If the bias is very small (or 0), the output will be decided by the values of weights
and inputs alone.
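
The effect of the bias on a single node can be sketched as below. The input, weight, and bias values are illustrative assumptions, and `node_output` is a hypothetical helper name rather than anything defined in the slides.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a scalar z."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(x, w, b):
    """Output of a single sigmoid node with one input x, weight w, bias b."""
    return sigmoid(w * x + b)

# A small positive input and weight (illustrative values).
x, w = 0.5, 1.0

# Large negative bias -> output near 0; zero bias -> output decided by w * x;
# large positive bias -> output near 1.
for b in (-5.0, 0.0, 5.0):
    print(f"b = {b:+.1f} -> output = {node_output(x, w, b):.3f}")
```

With the weight held fixed, changing `b` moves the point where the output crosses 0.5 to x = -b / w, which is the left or right shift described above.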

8
Logistic Regression
● (𝑤^𝑇𝑥 + 𝑏) is a linear function like (𝑎𝑥 + 𝑏), but since we are looking for a probability
constrained to lie between 0 and 1, the sigmoid function is applied to it.
● The sigmoid output is bounded between 0 and 1, as shown in the graph below.
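
For reference, the standard definition of the sigmoid and its limiting values can be written as follows. The symbol σ is the usual notation and is assumed here; the slides label the same function Sig(Z).

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^{T}x + b, \qquad
\lim_{z \to -\infty} \sigma(z) = 0, \quad
\sigma(0) = \tfrac{1}{2}, \quad
\lim_{z \to +\infty} \sigma(z) = 1
```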

Sig(Z)

9
Logistic Regression

Sig(Z)

10
Logistic Regression: Cost Function
● To train the parameters 𝑤 and 𝑏, we need to define a cost function.

Loss (error) function:

● Loss function measures the discrepancy between the prediction (𝑦̂(𝑖)) and the desired output (𝑦(𝑖)).
● In other words, the loss function computes the error for a single training example.

11
Error / Loss Function
Squared Error Function:

● We can see an extra (1/2) on the right side of the equation. Does it matter?
● It is there because when you take the derivative of the cost function, which is used to
update the parameters during gradient descent, the 2 from the power cancels
with the (1/2) multiplier: the derivative of (1/2)(𝑦̂ − 𝑦)² with respect to 𝑦̂ is simply (𝑦̂ − 𝑦).
● This and similar techniques are widely used in math in order
"to make the derivations mathematically more convenient".

12
Is squared error function a good choice?
● The squared error function (commonly used for linear regression) is not very
suitable for logistic regression.
○ In logistic regression, the hypothesis / prediction is non-linear (sigmoid function), which
makes the squared error function non-convex.
○ On the other hand, the logarithmic loss is a convex function, so it has no local optima other
than the global one, and gradient descent works well.
● If you are doing binary classification, the squared error function generally also penalizes
examples that are correctly classified but are still near the decision boundary, thus
creating a "margin."
● Gradient descent wastes a lot of time getting predictions very close to {0, 1}.

13
Logistic Regression: Cross Entropy Loss

● Cost function
○ The cost function is the average of the loss function over the entire training set. We are
going to find the parameters 𝑤 and 𝑏 that minimize the overall cost function.
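
The relationship between the per-example loss and the overall cost can be sketched as below. The loss formula matches the one used in the pseudocode later in these slides; the predictions and labels are made-up values for illustration.

```python
import math

def loss(y_hat, y):
    """Cross-entropy loss for one example; y_hat must lie strictly in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cost(y_hats, ys):
    """Cost = average of the per-example losses over the training set."""
    m = len(ys)
    return sum(loss(y_hat, y) for y_hat, y in zip(y_hats, ys)) / m

# Illustrative predictions and true labels (made up for this sketch).
y_hats = [0.9, 0.2, 0.6]
ys = [1, 0, 1]

print(cost(y_hats, ys))
```

A confident correct prediction (0.9 for a label of 1) contributes a small loss, while the same confidence on the wrong label would contribute a large one.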

14
Gradient Descent

We want to find parameters W, b that minimize J(W, b)

15
Content Credit: Andrew Ng
Gradient Descent
We want to find parameters W, b that minimize J(W, b).
● Our cost function is convex.
● First we initialize w and b to 0, 0 or to random values, and then try to improve
the values to reach the minimum of the convex function.
● In logistic regression people usually use 0, 0 instead of random values.
● Because this function is convex, no matter where you initialize, you should get
to the global optimum or roughly close to it.
[Figure: convex cost surface J(W, b) with the global optimum marked]

16
Gradient Descent
We want to find parameters W, b that minimize J(W, b).
● Gradient descent starts at the initial point and takes a step in the steepest
downhill direction at each iteration.
● It will try to reach the global optimum or somewhere near it.
[Figure: cost surface J(W, b) with the global optimum marked]

17
Gradient Descent
Ignore b for now to make it a one-dimensional plot of J(w) against w rather than a
higher-dimensional plot.

Repeat {
    $w := w - \alpha \frac{dJ(w)}{dw}$
}   // Repeatedly do that until the algorithm converges.

● 𝜶 = Learning rate: how big a step we take at each iteration of gradient descent.
● $\frac{dJ(w)}{dw}$: the update or change you want to make to the parameter w.
● Definition of a derivative:
○ Slope of a function at a point.
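
The repeat-until-convergence loop can be sketched on a one-dimensional convex function. The cost J(w) = (w - 3)^2, the learning rate, and the stopping threshold below are illustrative assumptions, not values from the slides.

```python
def J(w):
    """Illustrative convex cost with its minimum at w = 3 (an assumption)."""
    return (w - 3.0) ** 2

def dJ_dw(w):
    """Derivative (slope) of J at the point w."""
    return 2.0 * (w - 3.0)

alpha = 0.1  # learning rate
w = 0.0      # initial value

# Repeat the update until the step size becomes negligible.
for _ in range(1000):
    step = alpha * dJ_dw(w)
    w = w - step
    if abs(step) < 1e-10:
        break

print(w, J(w))
```

Left of the minimum the slope is negative, so the update increases w; right of the minimum the slope is positive, so the update decreases w. Either way the step points downhill.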

18
Gradient Descent : Actual Update Rule

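We want to find parameters W, b that minimize J(W, b).

$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$
$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$

Here $\partial$ denotes a partial derivative of J(w, b).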

19
Derivatives : Intuition
For f(a) = 3a:
● a = 2, f(a) = 6
  a = 2.001, f(a) = 6.003
  Slope (derivative) of f(a) at a = 2 is 3.
● a = 5, f(a) = 15
  a = 5.001, f(a) = 15.003
  Slope (derivative) of f(a) at a = 5 is also 3.

If we shift a by 0.001 then f(a) shifts by 3 times 0.001.

$\frac{d f(a)}{da} = 3$

The slope or "rate of change" at any point is 3.
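
The "shift a by 0.001" argument can be checked numerically. This is a small sketch using f(a) = 3a, the function implied by the values above; the helper name `slope` is hypothetical.

```python
def f(a):
    """The straight line used in the example: f(a) = 3a."""
    return 3 * a

def slope(func, a, h=0.001):
    """Approximate the derivative at a as rise over run for a small shift h."""
    return (func(a + h) - func(a)) / h

print(slope(f, 2))  # close to 3
print(slope(f, 5))  # close to 3
```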

20
Do we actually need Gradient Descent?
● Let's pretend that we only have 1 weight. To find the ideal value of our weight
that will minimize our cost, we need to try a bunch of values for W, let's say
we test 1000 values. That doesn't seem so bad, after all, my computer is
pretty fast.
● It takes about 0.04 seconds to check 1000 different weight values for our
neural network.
● Since we’ve computed the cost for a wide range of values of W, we can just pick
the one with the smallest cost.
[Figure: cost plotted against W for all 1000 tried values, with the lowest-cost "winner" marked]

21
Do we actually need Gradient Descent?
● Let's next consider 2 weights for a moment. To maintain the same precision we now need
to check 1000 times 1000, or one million values. This is a lot of work, even for a fast
computer.
● After our 1 million evaluations we’ve found our solution, but it took an agonizing 40 seconds!
Searching through three weights would take a billion evaluations, or 11 hours!
● Searching through all 9 weights we need for our simple network would take
1,268,391,679,350,583.5 years. (Over a quadrillion years). So for that reason, the "just
try everything" or brute force optimization method is clearly not going to work.
[Figure: grid of W1 against W2, trying all 1000 values of each]

22
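
The running times quoted above follow from simple arithmetic, sketched below. The timing of 0.04 seconds per 1000 evaluations is taken from the slides; the 365-day year is an assumption of this sketch.

```python
SECONDS_PER_1000_EVALS = 0.04       # timing quoted on the slide
VALUES_PER_WEIGHT = 1000            # grid resolution per weight
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year

def brute_force_seconds(num_weights):
    """Time to evaluate the cost on a full grid over num_weights weights."""
    evaluations = VALUES_PER_WEIGHT ** num_weights
    return evaluations * SECONDS_PER_1000_EVALS / 1000

print(brute_force_seconds(1))                     # about 0.04 seconds
print(brute_force_seconds(2))                     # about 40 seconds
print(brute_force_seconds(3) / 3600)              # about 11 hours
print(brute_force_seconds(9) / SECONDS_PER_YEAR)  # about 1.27e15 years
```

The number of evaluations grows as 1000 to the power of the number of weights, which is why the cost of the grid search explodes long before a realistic network size is reached.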
A Famous Quote

23
Computation Graph
● Neural network computations are organized in terms of a forward pass and a backward pass.
● Forward Pass / Propagation
○ In which we compute the output of the neural network
● Backward Pass / Propagation
○ In which we compute gradients / derivatives
● Computation Graph
○ Explains why it is organized in this way.

24
Computation Graph
J(a, b, c) = 3(a + bc)

3 steps of computation:

1. u = bc
2. v = a + u
3. J = 3v

[Figure: computation graph. Inputs b and c feed u = bc; a and u feed v = a + u; v feeds J = 3v.]
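
The three computation steps, and the right-to-left derivative pass over the same graph, can be sketched as follows. The input values a = 5, b = 3, c = 2 are chosen for illustration.

```python
def forward(a, b, c):
    """Forward pass: evaluate the graph left to right."""
    u = b * c
    v = a + u
    J = 3 * v
    return u, v, J

def backward(a, b, c):
    """Backward pass: apply the chain rule right to left."""
    dJ_dv = 3          # J = 3v
    dJ_da = dJ_dv * 1  # v = a + u, so dv/da = 1
    dJ_du = dJ_dv * 1  # v = a + u, so dv/du = 1
    dJ_db = dJ_du * c  # u = bc, so du/db = c
    dJ_dc = dJ_du * b  # u = bc, so du/dc = b
    return dJ_da, dJ_db, dJ_dc

a, b, c = 5, 3, 2
print(forward(a, b, c))   # (6, 11, 33)
print(backward(a, b, c))  # (3, 6, 9)
```

Each derivative reuses the one computed just before it, which is why the backward pass is organized as a single sweep in the reverse direction.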

25
Logistic regression : Forward Propagation

Computing loss of a single training example

Modify the parameters w1, w2 and b in order to minimize the loss

26
Logistic regression : Backward Propagation

27
Rules for derivatives of logarithmic expressions

If you are unsure about your derivative, check this link to generate the derivation steps.

28
Logistic regression : Backward Propagation

Ignoring the (-) sign for now.

log(x) refers to the base-e log, or natural logarithm (ln(x)), as in mathematical analysis,
physics, chemistry, statistics, economics, and some engineering fields.

29
Logistic regression : Backward Propagation

Finally, adding the (-) sign.

30
Logistic regression : Backward Propagation

Applying Chain Rule

31
Logistic regression : Backward Propagation

32
Logistic regression : Backward Propagation

Applying Chain Rule

33
Logistic regression : Backward Propagation

Applying
Chain Rule

34
Logistic regression : Backward Propagation

Applying
Chain Rule

35
Logistic regression : Backward Propagation

Applying
Chain Rule

36
Updating the Parameters: w1, w2 and b

This is one step of Gradient Descent on a single example.

Learning Rate

37
Logistic regression Gradient descent on m examples
Basic Parameters

x1      Feature
x2      Feature
w1      Weight of the first feature.
w2      Weight of the second feature.
b       Logistic Regression parameter (Bias).
m       Number of training examples
y(i)    Expected output of the i-th training example

38
Logistic regression Gradient descent on m examples

For the example 39


Logistic regression Gradient descent on m examples
Derivatives: it all turns out to be simple arithmetic operations

d(a)     -(y/a) + ((1-y) / (1-a))
d(z)     a - y
d(w1)    x1 * d(z)
d(w2)    x2 * d(z)
d(b)     d(z)
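
One way to gain confidence in these formulas is to compare them against numerical derivatives of the loss. The sketch below does this for a single example; the parameter and input values are illustrative assumptions, and the central-difference check is a generic technique rather than something from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Cross-entropy loss of one example, as a function of the parameters."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def gradients(w1, w2, b, x1, x2, y):
    """Analytic derivatives from the table: dz = a - y, dw = x * dz, db = dz."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    dz = a - y
    return x1 * dz, x2 * dz, dz

# Illustrative parameters and one training example (made up).
w1, w2, b = 0.3, -0.2, 0.1
x1, x2, y = 1.5, 2.0, 1

dw1, dw2, db = gradients(w1, w2, b, x1, x2, y)

# Central-difference approximations of the same derivatives.
eps = 1e-6
num_dw1 = (loss(w1 + eps, w2, b, x1, x2, y) - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
num_dw2 = (loss(w1, w2 + eps, b, x1, x2, y) - loss(w1, w2 - eps, b, x1, x2, y)) / (2 * eps)
num_db = (loss(w1, w2, b + eps, x1, x2, y) - loss(w1, w2, b - eps, x1, x2, y)) / (2 * eps)

print(dw1, num_dw1)
print(dw2, num_dw2)
print(db, num_db)
```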

40
Logistic regression Gradient descent on m examples
J = 0; dw1 = 0; dw2 = 0; db = 0;
w1 = 0; w2 = 0; b = 0;
# n = 2 features

for i = 1 to m
    # Forward pass
    z(i) = w1*x1(i) + w2*x2(i) + b
    a(i) = sigmoid(z(i))
    J += -(y(i)*log(a(i)) + (1-y(i))*log(1-a(i)))

    # Backward pass
    dz(i) = a(i) - y(i)
    dw1 += dz(i) * x1(i)
    dw2 += dz(i) * x2(i)
    db += dz(i)

J /= m
dw1 /= m
dw2 /= m
db /= m

# Gradient descent
w1 = w1 - alpha * dw1
w2 = w2 - alpha * dw2
b = b - alpha * db

dw1, dw2, db are the accumulators; w1, w2, b are single values shared by all m training examples.
One iteration of gradient descent
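
A runnable version of this loop-based iteration might look like the sketch below. The function name `gradient_descent_step`, the tiny dataset, the learning rate, and the number of repetitions are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent_step(X, Y, w1, w2, b, alpha):
    """One iteration of gradient descent; X is a list of (x1, x2) pairs."""
    m = len(X)
    J = dw1 = dw2 = db = 0.0
    for (x1, x2), y in zip(X, Y):
        # Forward pass
        z = w1 * x1 + w2 * x2 + b
        a = sigmoid(z)
        J += -(y * math.log(a) + (1 - y) * math.log(1 - a))
        # Backward pass: accumulate the gradients over the examples
        dz = a - y
        dw1 += dz * x1
        dw2 += dz * x2
        db += dz
    # Average over the m examples
    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
    # Gradient descent update
    return w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db, J

# Tiny made-up dataset with two features per example.
X = [(0.5, 1.0), (1.5, 2.0), (-1.0, -0.5), (-2.0, -1.5)]
Y = [1, 1, 0, 0]

w1 = w2 = b = 0.0
for _ in range(100):
    w1, w2, b, J = gradient_descent_step(X, Y, w1, w2, b, alpha=0.1)

print(w1, w2, b, J)
```

The returned cost J is the value measured before the update in that iteration, so it should shrink as the iterations proceed.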
41
Logistic regression Gradient descent on m examples
● The previous slide is just one step of gradient descent; we need to repeat it
multiple times in order to take multiple steps of gradient descent.
● There are weaknesses in the previous implementation. To implement it
we need to write two for loops: one over the m training examples and one over the n features.
● Having explicit for loops in your code makes your code less efficient.
● Solution: vectorization techniques
● To train with larger datasets we need vectorization techniques that avoid
explicit for loops.

42
LR Gradient descent on m examples (modified)
J = 0; db = 0; dw = np.zeros((nx, 1))    # replaces dw1 = 0; dw2 = 0
w1 = 0; w2 = 0; b = 0;

for i = 1 to m
    # Forward pass
    z(i) = w1*x1(i) + w2*x2(i) + b
    a(i) = sigmoid(z(i))
    J += -(y(i)*log(a(i)) + (1-y(i))*log(1-a(i)))

    # Backward pass
    dz(i) = a(i) - y(i)
    dw += x(i) * dz(i)    # replaces dw1 += dz(i) * x1(i); dw2 += dz(i) * x2(i) (n = 2)
    db += dz(i)

43
Logistic regression Gradient descent on m examples
J = J / m
dw = dw / m    # replaces dw1 = dw1 / m; dw2 = dw2 / m
db = db / m

# Gradient descent
w1 = w1 - alpha * dw1
w2 = w2 - alpha * dw2
b = b - alpha * db

We have gone from 2 for loops to 1 for loop; we still have one for loop
that loops over individual training examples.
dw and db are the accumulators; w1, w2, b are single values shared by all m training examples.

44
Vectorizing Logistic Regression (Forward)

1st training example, 2nd training example, 3rd training example, ...

We need to do it m times if you have m training examples.

(1 x m) dimension

45
Vectorizing Logistic Regression (Forward)

Broadcasting

(1, 1) dimension
(1 x m) dimension
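
The vectorized forward pass with broadcasting can be sketched in NumPy as below. The layout (one training example per column of `X`) follows the (1 x m) dimensions noted above; the specific numbers are illustrative assumptions.

```python
import numpy as np

def sigmoid(Z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-Z))

# Three training examples with two features each, one example per column.
X = np.array([[0.5, 1.5, -1.0],
              [1.0, 2.0, -0.5]])  # shape (n_x, m) = (2, 3)
w = np.array([[0.3],
              [-0.2]])            # shape (n_x, 1)
b = 0.1                           # a single number

# (1, n_x) times (n_x, m) gives (1, m); the scalar b is broadcast
# across all m columns, so no loop over examples is needed.
Z = np.dot(w.T, X) + b
A = sigmoid(Z)

print(Z.shape, A.shape)  # (1, 3) (1, 3)
```

Each column of `Z` equals the z(i) that the loop version would compute for that example.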

46
Gradient Computation

47
Gradient Computation


48
Implementing Logistic Regression
Single Iteration of Gradient Descent

Gradient Update

49
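
A single fully vectorized iteration, with no explicit loop over training examples, might be sketched as follows. The equations on this slide did not survive extraction, so the code below uses the standard vectorized form consistent with the loop version earlier (dz = a - y, gradients averaged over m). The function name, data, and learning rate are assumptions.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def gradient_descent_step(w, b, X, Y, alpha):
    """One vectorized iteration. X: (n_x, m), Y: (1, m), w: (n_x, 1)."""
    m = X.shape[1]
    # Forward pass for all m examples at once
    Z = np.dot(w.T, X) + b
    A = sigmoid(Z)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    # Backward pass
    dZ = A - Y                # (1, m)
    dw = np.dot(X, dZ.T) / m  # (n_x, 1)
    db = np.sum(dZ) / m       # scalar
    # Gradient update
    return w - alpha * dw, b - alpha * db, J

# Tiny made-up dataset: 2 features, 4 examples (one per column).
X = np.array([[0.5, 1.5, -1.0, -2.0],
              [1.0, 2.0, -0.5, -1.5]])
Y = np.array([[1, 1, 0, 0]])

w, b = np.zeros((2, 1)), 0.0
for _ in range(100):
    w, b, J = gradient_descent_step(w, b, X, Y, alpha=0.1)

print(w.ravel(), b, J)
```

The outer loop over iterations remains, since each step depends on the parameters produced by the previous one; only the loops over examples and features are replaced by matrix operations.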
What does this have to do with the brain?

50
END

51
