
Introduction to machine learning

Deep Neural Networks


Introduction

1
Introduction to machine learning

Machine Learning (Artificial Neural Network)

1. Artificial Neural Network (ANN) models relationships between a set of input data and
output data.

2. ANN models are based on the observed behaviour of neural nets in our brains

3. Just as the brain uses a network of interconnected neurons to process input signals in parallel
and trigger a response, an ANN also makes use of interconnected neurons to work on input
signals in parallel and produce an output

4. It is interesting to know how biological neurons function. Ref:


1. Resting potential
2. Action potential
3. Threshold for action potential
4. Synapses and synaptic transmission

2
Introduction to machine learning

Machine Learning (Artificial Neural Network)

5. Artificial Neural Network (ANN) models relationships between a set of


input data and output data.

6. Natural / Abstract neuron

https://youtu.be/ySgmZOTkQA8

3
Introduction to machine learning
Machine Learning (Artificial Neural Network) /
Perceptron…

7. History of the artificial neuron


a. Rosenblatt defined a Perceptron as a simple mathematical model of biological neurons
b. It takes as inputs a set of binary values from senses / neighboring neurons.
c. Every input is multiplied by a weight (akin to the synaptic strength to each nearby neuron),
d. A threshold against which the sum of weighted inputs is evaluated: output a 1 (neuron firing) if the
weighted sum is more than the threshold, else a zero (neuron not firing)
e. Other than the inputs from sensors / neighboring neurons, they also get a special ‘bias’
input, which just has a value of 1
f. The bias is like intercept in a linear equation. Useful in generating more functions with the
same inputs
g. This was an improvement on the work of Warren McCulloch and Walter Pitts, the McCulloch-
Pitts neuron.
h. The McCulloch-Pitts neuron had a fixed set of weights associated with its inputs and as a result
did not have the ability to learn. It also had no bias term

4
Introduction to machine learning

Machine Learning (Artificial Neural Network) /


Perceptron…

7. History of the artificial neuron (contd.)


i. Rosenblatt’s neuron was inspired by a rule coined by Donald Hebb
(a researcher studying biological neurons), according to which
learning in the brain is stored in the form of changes to the strength of
the connections between two connected neurons: “Neurons that fire
together, wire together” (Donald Hebb, 1949)
j. Rosenblatt’s neuron can model the basic OR/AND/NOT functions
(building blocks of computing systems)
k. This was a big step towards the belief that making computers able to
perform formal logical reasoning would essentially solve AI.

5
Introduction to machine learning

Warren McCulloch & Walter Pitts Neuron

1. It has an input layer that acts like dendrites
2. It has two parts: the first part, g, is a weighted addition of the inputs. The weights are manually initialized and all have the same value
3. The weighted sum is passed through an activation function f which yields a 1 if the threshold is crossed, else 0

Rosenblatt’s Perceptron

1. It has an input layer that acts as dendrites
2. It has two parts: the first part is a weighted addition. Each input is multiplied with a weight (which is typically initialized with some random value)
3. The sum is then passed through an activation function which yields a 1 if the threshold is crossed
4. The step function can be defined in such a way that the output can range from -1 to +1

6
Introduction to machine learning

Warren McCulloch & Walter Pitts “AND”, “OR” Neuron

[Figure: McCulloch-Pitts “AND” and “OR” neurons, each comparing the weighted sum of its inputs against a threshold]

1. In McCulloch-Pitts neurons, inputs and outputs are binary. There is only one output
but there can be many inputs

2. All inputs have same positive weights (not shown in the figure)

3. The inputs multiplied with corresponding weights are summed up and the result
sent to a step function.

4. The threshold of the step function is fixed (e.g. the weighted sum must exceed 1 for an “AND” gate with two
inputs, each with weight 1)

5. The threshold has to be modified for a gate, e.g. an ‘AND’ gate, depending on the
number of inputs. What would the threshold for the “AND” gate be if there are three
inputs x1, x2, x3?

7
Introduction to machine learning

Boolean Gates and Artificial Neurons

1. The objective of research in artificial neurons was to develop a computing system that could
learn to do tasks on its own without being instructed how to do them

2. Most of the tasks that we do in our day-to-day life are classification tasks. Hence, research focused
on developing an artificial neuron that could classify

3. Since computing systems are based on Boolean gates, which generate two classes, it was
natural to check whether artificial neurons could learn to mimic these gates, such as the
OR and AND gates

8
Introduction to machine learning

Rosenblatt Neuron / Perceptron

1. Weights for the inputs are not the same; they can be positive or negative

2. The output can be -1, 0, 1, unlike the MCP neuron where the output is only 0 or 1

3. The neuron is associated with a learning rule that modifies the weights so that, with the
same threshold, the neuron can behave like an “AND” gate or an “OR” gate with no need for
any threshold modification

4. This neuron learns from the data. It has the intelligence to learn the pattern from the data

9
Introduction to machine learning

Machine Learning (Artificial Neural Network)


McCulloch-Pitts Neuron

AND Gate

from numpy import array, dot

# Fixed weights for the two inputs (both 1 in the McCulloch-Pitts model)
w = array([1, 1])

training_data = [
    (array([0,0]), 0),
    (array([0,1]), 0),
    (array([1,0]), 0),
    (array([1,1]), 1),
]

# Step function with threshold of >1. Anything at or below 1 gives 0
step_function = lambda x: 1 if x > 1 else 0

for x, _ in training_data:
    result = dot(x, w)
    print("{}: {} -> {}".format(x[:2], result, step_function(result)))

The McCulloch-Pitts model of a neuron is simple. However, this model is so simplistic that it only generates a binary output and also the
weight and threshold values are fixed.

10
Introduction to machine learning
Machine Learning (Artificial Neural Network) /
Perceptron…

8. Rosenblatt’s Neuron Functioning


i. The objective of the neuron is to extract the relationship between inputs and output
from an example training set
j. The relationship has to be expressed in a mathematical function form. The
mathematical function consists of a sum of inputs multiplied with their respective weights
k. For each example the weights have to be adjusted (increased or decreased) to
achieve an overall correct result across the dataset
l. Perceptron algorithm –
i. Start with a random set of weights for all the input variables and the bias
ii. For the input data point, compute the output using the weights and the bias
iii. If the calculated output does not match the expected output (from training data), modify the
weights **
iv. Go to the next example in the training set and repeat steps ii - iii until the Perceptron makes
no more mistakes

** The relation between weight adjustment, the errors, and a learning rate is what makes the learning possible.
Learning is the process of finding the right combination of weights that minimizes the errors in the training data set
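A minimal sketch of the Perceptron learning rule described above, assuming NumPy; the learning rate, epoch count and the AND-gate training data are illustrative choices, not taken from the slides.

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    w = np.random.uniform(-0.5, 0.5, X.shape[1])      # random initial weights
    b = np.random.uniform(-0.5, 0.5)                  # bias term
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(xi, w) + b > 0 else 0  # step activation
            error = target - pred
            w += lr * error * xi                      # adjust weights towards the target
            b += lr * error                           # adjust the bias as well
    return w, b

X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([0, 0, 0, 1])                            # AND gate labels
w, b = train_perceptron(X, y)
print([1 if np.dot(xi, w) + b > 0 else 0 for xi in X])   # typically prints [0, 0, 0, 1]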
11
Introduction to machine learning
Machine Learning (Artificial Neural Network) /
Perceptron…

9. Perceptron Limitations -
a. Papert and Minsky demonstrated that the perceptron was incapable of
handling some of the binary gates such as XOR**
b. Given that it cannot represent all possible binary gates, it could not
be used for all possible computations; hence the objective of
computer-based AI looked like a pipe dream!
c. It was subsequently demonstrated that instead of making a neuron
intelligent, a network of neurons can be used to do what a single
neuron could not
d. This was the birth of the Artificial Neural Network

** Ref: https://alan.do/minskys-and-or-theorem-a-single-perceptron-s-limitations-490c63a02e9f

Lab - McCullohPitt_RosenBlat_Neurons.ipynb

12
Introduction to machine learning
Machine Learning (Artificial Neural Network)
10. The processing elements of an ANN are called nodes, each representing an artificial
neuron. Each ANN is composed of a collection of nodes grouped in layers. A typical
structure is shown below. The initial layer is the input layer and the last layer is the output
layer. In between we have the hidden layers

[Figure: input layer (X1, X2, X3), 1st hidden layer, 2nd hidden layer, output layer (Y_Pred), with a bias layer (B1, B2, B3)]

13
Introduction to machine learning
Machine Learning (Artificial Neural Network)
11. Mathematical foundations for artificial neural networks
a. Kolmogorov theorem – any continuous function f defined on the n-dimensional cube is
representable by sums and superpositions of continuous functions of only one variable

[Figure: the same network diagram - input layer (X1, X2, X3), hidden layers, output layer (Y_Pred), bias layer (B1, B2, B3)]
b. Cover’s theorem - states that given a set of training data that is not linearly separable, one
can with high probability transform it into a training set that is linearly separable by
projecting it into a higher-dimensional space via some non-linear transformation.

14
Introduction to machine learning
Machine Learning (Artificial Neural Network)
12. A given node will fire and feed a signal to subsequent nodes in the next layer only if the
non-linear function it implements reaches a threshold. In ANNs, use of the Sigmoid function
is more common than the step function

[Figure: sigmoid activation; the node's output a_i rises and the node fires once the input crosses the threshold]

15
Introduction to machine learning
Machine Learning (Artificial Neural Network)
13. The summation function can be implemented in many ways. It does not have to be
mathematical addition of the inputs

16
Introduction to machine learning
Machine Learning (Artificial Neural Network)
14. The ANN generic architecture

15. A neural net consists of multiple layers. It has two layers at the edges: one is the input layer
and the other is the output layer.

16. In between input and output layer, there can be many other layers. These layers are
called hidden layers

17
Introduction to machine learning
Machine Learning (Artificial Neural Network)
17. The input layer is passive, does no processing, only holds
the input data to supply it to the first hidden layer

18
Introduction to machine learning
Machine Learning (Artificial Neural Network)
18. Each node in the first hidden layer takes all input attributes, multiplies them with the
corresponding weights, adds the bias, and the result is transformed using a non-linear
function

[Figure: hidden layer node 1 receiving inputs X1, X2, X3 with weights W11, W12, W13]

ACC = X1*W11 + X2*W12 + X3*W13
N1Output = Sigmoid(ACC)

19. The weights for a given hidden node are specific to that node, and all the nodes in the hidden layer
have their own weights

20. The output of each node is fed to output layer nodes or another set of hidden nodes in
another hidden layer
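A minimal sketch of this single-node computation, assuming NumPy; the input and weight values below are illustrative, not taken from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])     # inputs X1, X2, X3
w = np.array([0.1, 0.4, -0.2])     # weights W11, W12, W13
bias = 0.05

acc = np.dot(x, w) + bias          # ACC = X1*W11 + X2*W12 + X3*W13 + bias
n1_output = sigmoid(acc)           # N1Output = Sigmoid(ACC)
print(acc, n1_output)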

19
Introduction to machine learning
Machine Learning (Artificial Neural Network)
21. The output value of each hidden node is sent to each output node in the output layer

20
Introduction to machine learning
Machine Learning (Artificial Neural Network)

[Figure: output node 1 receiving inputs X1, X2, X3, X4 with weights WO11, WO12, WO13, WO14]

ACC = X1*WO11 + X2*WO12 + X3*WO13 + X4*WO14
N1Output = Sigmoid(ACC)

21
Introduction to machine learning
Machine Learning (Artificial Neural Network)
22. In a binary output ANN, the output node acts like a perceptron classifying the input
into one of the two classes

23. Examples of such ANN applications would be detecting a fraudulent transaction, or
predicting whether a customer will buy a product given its attributes, etc.

22
Introduction to machine learning
Machine Learning (Artificial Neural Network)
24. We can have an ANN with multiple output nodes, where a given output node may or
may not get triggered depending on the input and the weights.


23
Introduction to machine learning
Machine Learning (Artificial Neural Network)
26. The weights required to make a neural network carry out a particular task are found
by a learning algorithm, together with examples of how the system should operate

27. The examples in vehicle identification could be a large Hadoop file of several million
sample segments such as bicycle, motorcycle, car, bus, etc.

28. The learning algorithms calculate the appropriate weights for each classification for all
nodes at all the levels in the network

29. If we consider each input as a dimension, then the ANN labels different regions in the n-
dimensional space. In our example one region is cars, another region is bicycles

[Figure: regions of the feature space labelled Car and Bicycle]

Image Source: https://en.wikipedia.org/wiki/K-d_tree#/media/File:3dtree.png

24
Introduction to machine learning

Perceptrons

25
Introduction to machine learning
Perceptron
Perceptron Learning Algorithm –

1. Select a random sample from the training set as input. Draw a first random line (green), intended to have the
blue triangles above it and the red circles below it

2. If the classification is correct, do nothing. But the first time, many blue triangles are on the wrong side!

3. If the classification is incorrect, modify the weight vector w and shift the green line

4. Repeat this procedure until the entire training set is classified correctly

5. However many times we run this algorithm, it will find a surface which separates the two classes

[Figure: the separating line after Run 1 and after Run n]

26
Introduction to machine learning

Test

6. Convergence theorem guarantees that when the classes are linearly separable in the
training set, perceptron will find that surface which separates the two classes correctly

7. The perceptron algorithm does not guarantee that the separating surface it finds is a good
one, even when the classes are linearly separable

8. Why? Because it does not look for an optimal plane. It stops the moment it finds the
separator plane (a.k.a dichotomizer).

9. Since the plane may pass very close to the data points in the training set, the perceptron may not
perform well on the test set, where the distribution of the data will be different

27
Introduction to machine learning
Perceptron Weakness

1. Perceptrons fail to handle many data distributions, such as XOR, where they cannot
segregate the classes

2. XOR is an example of a distribution of classes that is not linearly separable in two
dimensions but is easily separable in a higher, 3-dimensional space. Ref:
http://www.mind.ilstu.edu/curriculum/artificial_neural_net/xor_problem_and_solution.php

3. Cover’s theorem – “Formulating a classification problem in a space of higher
dimensionality (than the original problem) increases the probability of the classes
becoming linearly separable”

4. Neural networks help implement this theorem in the form of neurons in a layer (see the sketch below)
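A minimal sketch of Cover's theorem on the XOR case, assuming NumPy; the added feature x1*x2 and the separating plane are illustrative choices, not taken from the slides.

import numpy as np

X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([0, 1, 1, 0])                    # XOR labels: not linearly separable in 2-D

# Non-linear transformation into 3-D: (x1, x2) -> (x1, x2, x1*x2)
X3 = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the 3-D space a single plane now separates the two classes
w, threshold = np.array([1.0, 1.0, -2.0]), 0.5
pred = (X3 @ w > threshold).astype(int)
print(pred)                                   # [0 1 1 0] matches the XOR labels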

28
Introduction to machine learning
Origins of Neural Networks

1. Perceptrons were replaced with artificial neurons which not only had a weighted
summation operation but also included a non-linear function.

2. A multitude of such neurons working together could solve those problems where
an individual Perceptron failed

3. Multiple neurons in a single layer are akin to transforming data from lower
dimensions to higher dimensions! (Cover’s theorem in action)

4. Kolmogorov theorem* states that any continuous function f(x1,x2,…,xn) defined
on [0,1]^n with n ≥ 2 can be expressed in terms of sums and compositions of carefully
chosen continuous functions of one variable

5. This looks similar to a layer of neurons with a non-linear function g acting on a summation
of weighted inputs! No wonder neural nets form the basis of complex processing
* https://en.wikipedia.org/wiki/Universal_approximation_theorem (another mathematical theorem)

29
Introduction to machine learning

Components of Neural Networks

30
Introduction to machine learning

Activation Functions

1. An artificial neuron works in three steps


1. First it multiplies the input signals with corresponding weights
2. Second, adds the weighted signals
3. Third, converts the result to another value using a mathematical transformation
function

2. For the third step, there are multiple mathematical functions available; collectively
they are called the activation function

3. The purpose of the activation function is to act like a switch for the neuron:
should the neuron fire or not? Also…

4. The activation function is critical to the overall functioning of the neural network.
Without it, the whole neural network will mathematically become equivalent to
one single neuron!

5. The activation function is one of the critical components that give neural networks the
ability to deal with complex problems

31
Introduction to machine learning

Activation Functions - why?

1. Let us take a fully connected neural network. Every neuron in


every layer takes multiple inputs
2. The inputs are weighted and summed up at each neuron
3. Without an activation function, the nodes in the second layer simply scale and sum
the outputs of the neurons in the previous layer

4. For e.g. the neuron G takes as input the weighted sums from D, E and
F; G's output is a scaled version of the outputs of D, E and F
5. G_Out = 3D - 2E + 1F
6. = 3(1A + 2B + 3C) - 2(-3A + 2B - 1C) + 1(2A + 4B - 2C)
7. = 11A + 6B + 9C

8. Thus this part of the network is like a single neuron with weights
of 11, 6, 9!!!

9. Same argument holds for other neurons

10. Thus the entire neural network collapses to one neuron!


11. A single neuron is not capable of doing much
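A minimal sketch of this collapse argument, assuming NumPy; the weight matrices reuse the illustrative values above, and the ReLU at the end shows how a non-linearity prevents the collapse.

import numpy as np

W1 = np.array([[1., 2., 3.],       # first layer: inputs A, B, C -> hidden nodes D, E, F
               [-3., 2., -1.],
               [2., 4., -2.]])
W2 = np.array([[3., -2., 1.]])     # second layer: D, E, F -> G

x = np.array([0.7, -1.0, 2.0])     # some input values for A, B, C

two_linear_layers = W2 @ (W1 @ x)          # two purely linear layers in sequence
single_layer = (W2 @ W1) @ x               # one equivalent layer with weights W2·W1 = [11, 6, 9]
print(np.allclose(two_linear_layers, single_layer))   # True: the network has collapsed

relu = lambda z: np.maximum(z, 0)
with_activation = W2 @ relu(W1 @ x)        # a non-linearity between the layers
print(np.allclose(with_activation, single_layer))     # generally False: no collapse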

32
Introduction to machine learning

Activation Functions - What are they?

1. Activation functions are mathematical transformations that prevent the
network from collapsing to a single neuron

2. The collapse can happen when the neurons do simple addition and
multiplications of the inputs. These are called linear operations. Thus linear
operations collapse the network

3. All activation functions are non-linear transformations for exactly this
reason. The non-linear transformation not only prevents collapse, it also
empowers the network to do complex tasks because each neuron contributes
something to the network as a whole

33
Introduction to machine learning

Activation Functions - Types


4. Types of non-linear activation functions include –
a. Piecewise linear functions
I. Step function
II. ReLU – Rectified Linear Units
III. Leaky ReLU
IV. Parametric ReLU
V. Shifted ReLU

b. Smooth functions
I. Smooth ReLU / Exponential ReLU
II. Sigmoid / Logistic functions
III. Hyperbolic Tangent (tanh)
IV. Swish (combination of Sigmoid and ReLU)
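A minimal sketch of a few of the activation functions listed above, assuming NumPy; the parameter alpha for Leaky ReLU is an illustrative default.

import numpy as np

def step(z):                      # piecewise linear: step function
    return np.where(z > 0, 1.0, 0.0)

def relu(z):                      # Rectified Linear Unit
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):    # Leaky ReLU: small slope for negative inputs
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):                   # smooth, squashes values to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                      # smooth, squashes values to (-1, 1)
    return np.tanh(z)

def swish(z):                     # Swish: the input scaled by its own sigmoid
    return z * sigmoid(z)

z = np.linspace(-3, 3, 7)
print(relu(z), sigmoid(z), swish(z), sep="\n")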

34
Introduction to machine learning

Activation Functions -

35
Introduction to machine learning
Neurons stretch the feature space through non-linear functions and achieve
Cover's theorem

Neuron1: ACC = m1*X + C1, N1Output = Sigmoid(ACC)
Neuron2: ACC = m2*X + C2, N2Output = Sigmoid(ACC)

[Figure: the input distribution transformed through the sigmoids of neurons N1, N2, N3; after the transformation the data points become linearly separable]

36
Introduction to machine learning

Neurons stretch the feature space through non-linear functions and
achieve Cover's theorem

Image Source: https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Ref: https://cs.stanford.edu/people/karpathy/convnetjs//demo/classify2d.html

37
Introduction to machine learning

SoftMax Function -
1. A kind of operation applied at the output neurons of a classifier network
2. Used only when we have two or more output neurons and is applied simultaneously to all
the output neurons
3. Turns the raw numbers coming out of the penultimate layer into probability values in the
output layer
4. Suppose the output layer neurons emit (Op1, Op2, …, Opn). The raw numbers may not make
much sense. We convert them into probabilities using Softmax, which is more
meaningful. For e.g. “the input is a cycle” is 30 times more likely than “sailboat” and about 13 times
more likely than “car”
[Figure: the entire network emits raw outputs Op1, Op2, …, Opn, which are converted into class probabilities, e.g. 0.07, 0.9, 0.03]
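A minimal sketch of the SoftMax operation, assuming NumPy; the raw output values are illustrative, and subtracting the maximum before exponentiating is a common numerical-stability choice.

import numpy as np

def softmax(raw_outputs):
    z = raw_outputs - np.max(raw_outputs)    # stabilise before exponentiating
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()               # probabilities that sum to 1

raw = np.array([1.2, 3.8, 0.4])              # raw numbers from the output neurons
probs = softmax(raw)
print(probs, probs.sum())                    # roughly [0.07, 0.90, 0.03], sum = 1.0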

38
Introduction to machine learning

Forward Propagation

1. The directed acyclic path taken by input data from input layer to get
transformed using non-linear functions into final network level outputs

2. Input data is propagated forward from the input layer through the hidden layers till it
reaches the final layer, where predictions are emitted

3. At every layer, data gets transformed non-linearly in every neuron

4. There may be multiple hidden layers with multiple neurons in each layer

5. The last layer is the output layer which may have a softmax function (if the
network is a multi class classifier)

39
Introduction to machine learning

Forward Propagation

6. Forward prop steps –

a. Calculate the weighted input to the hidden layer by multiplying 𝑋 by the hidden
weights 𝑊ℎ
b. Apply the activation function and pass the result to the final layer
c. At the output layer, repeat step b, replacing 𝑋 by the hidden layer's output

[Figure: Input Layer → Hidden Layer → Output Layer]

40
Introduction to machine learning

Forward Propagation and Matrix operations

Note: The diagram shows a step function instead of ReLU in each neuron.
The bias inputs are all set to 1; the bias supplied to a neuron depends on the weight assigned to the
connector linking the bias to that neuron
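A minimal sketch of one forward pass written as matrix operations, assuming NumPy; the layer sizes, random weights and the sigmoid/softmax choices are illustrative, not the exact network in the diagram.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

X = np.array([0.2, 0.7, 0.1])              # one input sample with 3 attributes
Wh = rng.uniform(-1, 1, size=(3, 4))       # input -> hidden weights (3 inputs, 4 hidden nodes)
bh = np.ones(4)                            # hidden-layer bias terms
Wo = rng.uniform(-1, 1, size=(4, 2))       # hidden -> output weights (2 output classes)
bo = np.ones(2)                            # output-layer bias terms

hidden = sigmoid(X @ Wh + bh)              # weighted input to the hidden layer, then activation
output = softmax(hidden @ Wo + bo)         # repeated at the output layer, with softmax
print(hidden, output)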
41
Introduction to machine learning

Bias Term
1. Every neuron in the hidden layers is associated with a bias term. The bias
term helps us control the firing threshold of each neuron

2. It acts like the intercept in a linear equation (y = sum(mx) + c). If sum(mx) does
not cross the threshold but the neuron needs to fire,

3. the bias will be adjusted to lower that neuron's threshold and make it fire! The network
learns a richer set of patterns using the bias

4. The bias term is also considered as input though it does not come from data
42
Introduction to machine learning

Loss function (Mean Square Loss)

1. What is an optimization algorithm and what is its use? - Optimization algorithms help us to minimize
(or maximize) an Objective function (another name for an Error function) E(x), which is simply a
mathematical function dependent on the Model's internal learnable parameters, which are used in
computing the target values (Y) from the set of predictors (X) used in the model

2. C = ½((Σ wi·xi + b) – y)². In this expression xi and y come from the data and are given. What the ML
algorithm learns is the weights wi and the bias b. Thus C = f(wi, b)

3. The optimizer algorithms try to estimate the values of wi and b which, when used, will give the minimum or
maximum C. In ML we look for the minimum
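A minimal sketch of this squared-error cost as a function of the learnable parameters, assuming NumPy; the data points and the trial parameter values are illustrative.

import numpy as np

def cost(w, b, X, y):
    pred = X @ w + b                        # sum(wi * xi) + b for every data point
    return 0.5 * np.mean((pred - y) ** 2)   # mean of ½(prediction - target)²

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])          # data roughly following y = 2x
print(cost(np.array([2.0]), 0.0, X, y))     # small cost near the true parameters
print(cost(np.array([0.5]), 0.0, X, y))     # much larger cost for a poor guess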

43
Introduction to machine learning

Relation between error and change in weights


Since part of the neuron's function is a linear equation (before the non-linear transformation is applied), the error at
each neuron can be expressed in terms of that linear equation.
[Figure: two lines, the prediction Y = w1*x + c and the actual Y = (w1 + dw)*x + c; the gap e1 between the yellow (actual) and red (predicted) points at X1 is the prediction error]

e1 = y_actual - y_pred
   = ((w1 + dw)*x + c) - (w1*x + c)
   = w1*x + c + dw*x - w1*x - c
   = dw*x

Hence dw = e1 / x
The change required in w (dw) is e1/x. However, the change required w.r.t. another data point
may be different. To prevent jumping around with dw, we moderate the change in w by
introducing a learning rate l. Hence dw = l * (e1 / x)
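A minimal sketch of this moderated update rule dw = l * (e1 / x) applied repeatedly to a single weight; the data point, starting parameters and learning rate are illustrative.

# one training point and a current linear model y = w*x + c
x, y_actual = 2.0, 5.0
w, c = 1.0, 0.5
lr = 0.3                          # learning rate l

for step in range(5):
    y_pred = w * x + c
    e1 = y_actual - y_pred        # prediction error for this point
    dw = lr * (e1 / x)            # moderated change in the weight
    w += dw
    print(step, round(w, 4), round(e1, 4))   # the error shrinks towards 0 each step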

44
Introduction to machine learning

Back Propagation
1. Back propagation is the process of learning that the neural network employs to re-
calibrate the weights and bias at every layer and every node to minimize the error in the
output layer

2. During the first pass of forward propagation, the weights and biases are random numbers.
The random numbers are generated within a small range, say 0 – 1

3. Needless to say, the output of the first iteration is almost always incorrect. The
difference between actual value / class and predicted value / class is the error

4. All the nodes in all the preceding layers have contributed to the error and hence need to
get their share of the error and correct their weights

5. This process of allocating a proportion of the error to all the nodes in the previous layers is
back propagation

6. The goal of back propagation is to adjust weights and biases in proportion to the error
contribution and, in an iterative process, identify the optimal combination of weights

7. At each layer, at each node, gradient descent algorithm is applied to adjust the weights

45
Introduction to machine learning

Back Propagation

[Figure: error e1 at the output node, fed back to the nodes of hidden layer 2 through weights w(3,1), w(3,2), w(3,3)]

1. The error in the output node, shown as e1, is contributed by nodes 1, 2 and 3 of hidden layer 2 through weights w(3,1),
w(3,2), w(3,3)

2. The proportionate error assigned back to node 1 of hidden layer 2 is (w(3,1) / (w(3,1) + w(3,2) + w(3,3))) * e1

3. The error assigned to node 1 of hidden layer 2 is proportionately sent back to hidden layer 1 neurons

4. All the nodes in all the layers re-adjust the input weights and bias to address the assigned error (for this
they use gradient descent)

5. The input layer nodes are not neurons; they are like input parameters to a function and hence have no errors
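A minimal sketch of this proportional error allocation, assuming NumPy; the weight values and the output error are illustrative.

import numpy as np

w = np.array([0.6, 0.3, 0.1])   # w(3,1), w(3,2), w(3,3): hidden layer 2 -> output node
e1 = 0.8                        # error observed at the output node

# each hidden node gets a share of e1 proportional to its connecting weight
shares = (w / w.sum()) * e1
print(shares)                   # [0.48 0.24 0.08]; the shares sum back to e1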
46
Introduction to machine learning

Gradient Descent
The challenge is that all the weights on all the inputs of all the neurons need to be adjusted. It is not
practically possible to find the right combination of weights using brute force. Instead, the neural
network algorithm uses a learning procedure called gradient descent

[Figure: a convex error surface plotted over the weight and bias axes, with the error e1 at the starting combination (W1, B1)]

1. Start with a random combination of bias B1 and input weight W1 (showing only one weight, as more than one is not possible to visualize)

2. Each combination of W1 and B1 is one particular linear model in a neuron. That model is associated with a proportionate error e1 (red dashed line).

3. The objective is to drive e1 towards 0, for which we need to find the optimal weight (W_optimal) and bias (B_optimal)

4. The algorithm uses gradient descent to change the bias and weight from the starting values B1 and W1 towards B_optimal and W_optimal

Note: in 3D the error surface can be visualized as shown, but not in more than 3 dimensions

47
Introduction to machine learning

Gradient Descent

[Figure: actual y1 vs predicted y1 for input X1, and the convex error surface over the weight and bias axes; this happens at every node]

The least error e2 is at the global minimum of the convex function, which only one unique combination of weight (W_optimal) and bias (B_optimal) will fetch us.
48
Introduction to machine learning

Gradient Descent

1. Let the target value for a training example X be y, i.e. the data frame used for training
has values (X, y)

2. Let the model (represented by random m and c) predict the value for the training
example X to be yhat

3. The error in prediction is E = yhat – y. If we sum all the errors across all data points,
some will be positive and some negative, and they will cancel out

4. To prevent the sum of errors becoming 0, we square the error i.e. E = (y – yhat)^2.
Note: in squared expression, y – yhat or yhat – y mean the same

5. Sum of (y – yhat)^2 across all the X values is called SSE (Sum of Squared Errors)

6. We use gradient descent (descend towards the global minimum). Gradient descent uses
partial derivatives, i.e. how the SSE changes on slightly modifying the model
parameters m and c, one at a time:
d(E) / d(m) = d(sum((yhat – y)^2)) / d(m)
d(E) / d(c) = d(sum((yhat – y)^2)) / d(c)
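A minimal sketch of gradient descent on the SSE for a line yhat = m*x + c, assuming NumPy; the data, learning rate and iteration count are illustrative. The partial derivatives used are d(SSE)/d(m) = 2*sum((yhat - y)*x) and d(SSE)/d(c) = 2*sum(yhat - y).

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])          # data roughly following y = 2x + 1

m, c, lr = 0.0, 0.0, 0.01                   # starting parameters and a small learning rate
for _ in range(2000):
    yhat = m * X + c
    dE_dm = 2 * np.sum((yhat - y) * X)      # partial derivative of the SSE w.r.t. m
    dE_dc = 2 * np.sum(yhat - y)            # partial derivative of the SSE w.r.t. c
    m -= lr * dE_dm                         # step opposite to the gradient direction
    c -= lr * dE_dc
print(round(m, 3), round(c, 3))             # approaches m ≈ 2, c ≈ 1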

49
Introduction to machine learning

Gradient Descent
1. Gradient descent is a way to minimize an objective function / cost function such as Sum of
Squared Errors (SSE) that is dependent on model parameters of weight / slope and bias

2. The parameters are updated in the direction opposite to the direction of the gradient
(direction of maximum increase) of the objective function

3. In other words, we change the values of weight and bias following the direction of the slope
of the error surface down the hill until we reach the minimum

4. This movement from starting weight and bias to optimal weight and bias may not happen in
one shot. It is likely to happen in multiple iterations. The values change in steps

5. The step size can be influenced using a parameter called the Learning Rate. It decides the size
of the steps, i.e. the amount by which the parameters are updated. Too small a learning rate
will slow down the entire process, while too large a rate may cause the updates to overshoot
and never converge
6. The mathematical expression of gradient descent:

   updated model parameter = old model parameter − (learning rate × gradient of the error w.r.t. that parameter)

50
Introduction to machine learning

Gradient Descent

We transform our error function (which is a quadratic / convex function) into a
contour graph. The gradient is always taken with respect to the input model parameters only

1. Every ring on the error function represents a combination of coefficients (m1 and m2 in the
image) which result in the same quantum of error, i.e. SSE

2. Let us convert that to a 2d contour plot. In the contour plot, every ring represents one quantum
of error.

3. The innermost ring / bull's eye is the combination of the coefficients that gives the
least SSE

[Figure: the 3-D error surface over the weight and bias axes and its 2-D contour plot]

51
Introduction to machine learning
About the contour graph –

[Figure: contour plot of the error over the weight and bias axes, with a randomly selected starting point (W1, B1) and the gradient vector]

1. The outermost circle is the highest error while the innermost is the least-error circle

2. A circle represents combinations of parameters which result in the same error. Moving along a circle will not reduce the error.

3. The objective is to start from anywhere but reach the innermost circle

Gradient Descent Steps –

1. First evaluate d(error)/d(weight) to find the direction of highest increase in error given a unit change in weight (blue arrow). This is the partial derivative w.r.t. weight

2. Next find d(error)/d(bias) to find the direction of highest increase in error given a unit change in bias (green arrow). This is the partial derivative w.r.t. bias

3. Partial derivatives give the gradient along the given axis, and the gradient is a vector

4. Add the two vectors to get the direction of the gradient (black arrow), i.e. the direction of maximum increase in error

5. We want to decrease the error, so we take the negative of the gradient, i.e. opposite to the black arrow (orange arrow). The arrow tip is the new value of bias and weight.

6. Recalculate the error at this combination and iterate from step 1 till movement in any direction only increases the error
52
Introduction to machine learning

Thank You

53
