Deep Neural Networks
Introduction to machine learning
1. An Artificial Neural Network (ANN) models relationships between a set of input data and output data.
2. ANN models are based on the observed behaviour of the biological neural networks in our brains.
https://youtu.be/ySgmZOTkQA8
Machine Learning (Artificial Neural Network) / Perceptron…
[Figure: McCulloch-Pitts neurons for the AND gate (fires when the weighted sum > threshold) and the OR gate (fires when the weighted sum >= threshold)]
1. In McCulloch-Pitts neurons, inputs and outputs are binary. There is only one output, but there can be many inputs.
2. All inputs have the same positive weight (not shown in the figure).
3. The inputs multiplied by the corresponding weights are summed up, and the result is sent to a step function.
4. The threshold of the step function is fixed (e.g. 1 for an "AND" gate with two inputs, each with weight 1).
5. The threshold has to be modified per gate, e.g. for the "AND" gate, depending on the number of inputs. What would the threshold of the "AND" gate be if there are three inputs x1, x2, x3?
1. The objective of research into artificial neurons was to develop a computing system that could learn to do tasks on its own, without being instructed how to do them.
2. Most of the tasks we perform in day-to-day life are classification tasks. Hence, research focused on developing an artificial neuron that could classify.
3. Since computing systems are built on Boolean gates, which generate two classes, it was natural to check whether artificial neurons could learn to mimic gates such as OR and AND.
1. The weights for the inputs are not the same; they can be positive or negative.
2. The output can be -1, 0 or 1, unlike the MCP neuron where the output is only 0 or 1.
3. The neuron is associated with a learning rule that modifies the weights so that, with the same threshold, the neuron can behave like an "AND" gate or an "OR" gate with no need for any threshold modification.
4. This neuron learns from the data: it has the ability to learn the pattern from the data.
AND Gate

from numpy import array, dot

def step_function(result, threshold=1):
    # fires only when the weighted sum exceeds the fixed threshold
    # (threshold 1 for a two-input AND gate with weights of 1, as noted above)
    return 1 if result > threshold else 0

# fixed, equal positive weights: w[0] = 1, w[1] = 1
w = array([1, 1])

training_data = [
    (array([0,0]), 0),
    (array([0,1]), 0),
    (array([1,0]), 0),
    (array([1,1]), 1),
]

for x, _ in training_data:
    result = dot(x, w)
    print("{}: {} -> {}".format(x[:2], result, step_function(result)))
The McCulloch-Pitts model of a neuron is simple. However, it is so simplistic that it only generates a binary output, and the weight and threshold values are fixed.
Machine Learning (Artificial Neural Network) /
Perceptron…
** The relationship between weight adjustment, the errors, and a learning rate is what makes learning possible. Learning is the process of finding the combination of weights that minimizes the errors on the training data set, as sketched below.
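A minimal sketch (the function and variable names are illustrative, not from the slides) of how weight adjustment, error and the learning rate fit together for a single update:

from numpy import array

learning_rate = 0.1

def update_weights(w, x, target, prediction):
    # the error is the difference between the desired and the predicted output
    error = target - prediction
    # each weight moves in proportion to the error, the input and the learning rate
    return w + learning_rate * error * x

w = array([0.0, 0.0])
w = update_weights(w, array([1, 1]), target=1, prediction=0)
print(w)   # -> [0.1 0.1]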
Machine Learning (Artificial Neural Network) / Perceptron…
9. Perceptron Limitations -
a. Minsky and Papert demonstrated that the perceptron was incapable of handling some binary gates, such as XOR**
b. Given that it cannot represent all possible binary gates, it could not be used for all possible computations; hence the objective of computer-based AI looked like a pipe dream!
c. It was subsequently demonstrated that, instead of making a single neuron intelligent, a network of neurons could be used to do what a single neuron could not
d. This was the birth of the Artificial Neural Network
** Ref: https://alan.do/minskys-and-or-theorem-a-single-perceptron-s-limitations-490c63a02e9f
Lab - McCullohPitt_RosenBlat_Neurons.ipynb
Machine Learning (Artificial Neural Network)
10. The processing element of an ANN is called a node, representing the artificial neuron. Each ANN is composed of a collection of nodes grouped in layers; a typical structure is shown below. The initial layer is the input layer and the last layer is the output layer. In between we have the hidden layers.
[Figure: inputs X1, X2, X3 and a bias layer (B1, B2, B3) feeding the network that produces Y_Pred]
Machine Learning (Artificial Neural Network)
11. Mathematical foundations for artificial neural networks
a. Kolmogorov's theorem – any continuous function f defined on an n-dimensional cube can be represented by sums and superpositions of continuous functions of only one variable
b. Cover's theorem – a set of training data that is not linearly separable can, with high probability, be transformed into a linearly separable training set by projecting it into a higher-dimensional space via some non-linear transformation.
Machine Learning (Artificial Neural Network)
12. A given node will fire and feed a signal to nodes in the next layer only if the non-linear function it implements reaches a threshold. In ANNs, the sigmoid function is used more commonly than the step function.
[Figure: node output vs. input; the node fires once the input reaches the threshold]
Machine Learning (Artificial Neural Network)
13. The summation function can be implemented in many ways. It does not have to be a plain mathematical addition of the inputs.
Machine Learning (Artificial Neural Network)
14. The generic ANN architecture
15. A neural net consists of multiple layers. It has two layers at the edges: one is the input layer and the other is the output layer.
16. Between the input and output layers there can be many other layers. These layers are called hidden layers.
Machine Learning (Artificial Neural Network)
17. The input layer is passive and does no processing; it only holds the input data and supplies it to the first hidden layer.
Machine Learning (Artificial Neural Network)
18. Each node in the first hidden layer takes all the input attributes, multiplies them by the corresponding weights, adds a bias, and transforms the result using a non-linear function, e.g. N1Output = Sigmoid(ACC) (see the sketch below).
19. The weights for a given hidden node are fixed during a forward pass, and all the nodes in the hidden layer have their own weights.
20. The output of each node is fed to the output-layer nodes, or to another set of hidden nodes in the next hidden layer.
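A minimal sketch of what a single hidden node computes, assuming three input attributes and a sigmoid non-linearity (the numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])    # input attributes X1, X2, X3
w = np.array([0.4, 0.1, -0.7])    # this node's own weights
b = 0.2                           # this node's bias

acc = np.dot(x, w) + b            # weighted sum of inputs plus bias (ACC)
n1_output = sigmoid(acc)          # non-linear transformation of ACC
print(n1_output)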
Machine Learning (Artificial Neural Network)
21. The output value of each hidden node is sent to each output node in the output layer
Machine Learning (Artificial Neural Network)
[Figure: inputs X1, X2, X3, X4 feeding Output Node 1 through weights WO11, WO12, WO13, WO14]
ACC = X1*WO11 + X2*WO12 + X3*WO13 + X4*WO14
N1Output = Sigmoid(ACC)
Machine Learning (Artificial Neural Network)
22. In a binary-output ANN, the output node acts like a perceptron, classifying the input into one of the two classes.
Machine Learning (Artificial Neural Network)
24. We can have an ANN with multiple output nodes, where a given output node may or may not get triggered, depending on the input and the weights.
Machine Learning (Artificial Neural Network)
26. The weights required to make a neural network carry out a particular task are found by a learning algorithm, together with examples of how the system should operate.
27. In vehicle identification, the examples could be a large Hadoop file of several million sample segments such as bicycle, motorcycle, car, bus, etc.
28. The learning algorithm calculates the appropriate weights for each classification, for all nodes at all levels in the network.
29. If we consider each input as a dimension, then the ANN labels different regions in the n-dimensional space. In our example, one region is cars, another region is bicycles.
[Figure: feature-space regions labelled Car and Bicycle]
Perceptrons
Perceptron
Perceptron Learning Algorithm (a code sketch follows below) –
1. Select a random sample from the training set as input. Draw the first random line (green) such that the blue triangles lie above it and the red circles below it.
2. If the classification is correct, do nothing. But on the first pass many blue triangles are on the wrong side!
3. If the classification is incorrect, modify the weight vector w and shift the green line.
4. Repeat this procedure until the entire training set is classified correctly.
5. However many times we run this algorithm, it will find a surface that separates the two classes.
[Figure: separating line after Run 1 and after Run n]
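A compact sketch of the procedure above on a toy, linearly separable data set (labels +1 for blue triangles, -1 for red circles; all names and numbers are illustrative):

import numpy as np

def step(z):
    # a ±1 step output, so the update direction follows the sign of the label
    return 1 if z >= 0 else -1

data = [(np.array([2.0, 3.0]),  1), (np.array([1.0, 2.5]),  1),
        (np.array([0.5, 0.2]), -1), (np.array([1.5, 0.4]), -1)]

w, b, lr = np.zeros(2), 0.0, 0.1
converged = False
while not converged:                       # repeat until every sample is classified correctly
    converged = True
    for x, y in data:
        if step(np.dot(w, x) + b) != y:    # misclassified: shift the separating line
            w += lr * y * x
            b += lr * y
            converged = False
print(w, b)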
Test
6. The convergence theorem guarantees that, when the classes are linearly separable in the training set, the perceptron will find a surface that separates the two classes correctly.
7. However, the perceptron algorithm does not guarantee that it will separate the two classes correctly on new data, even when the classes are linearly separable.
8. Why? Because it does not look for an optimal plane. It stops the moment it finds a separator plane (a.k.a. a dichotomizer).
9. Since these planes pass very close to the data points in the training set, the perceptron may not perform well on the test set, where the distribution of the data will be different.
Perceptron Weakness
1. Perceptrons fail on many data distributions, such as XOR, where a single linear boundary cannot segregate the classes (see the check below).
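A small illustrative check (not taken from the slides): a brute-force search over a grid of weights and biases finds a single step-function unit for AND, but none for XOR, because no straight line separates the XOR classes.

import itertools
import numpy as np

def realizable(truth_table):
    # try every weight/bias combination on a coarse grid for one linear threshold unit
    grid = np.linspace(-2, 2, 21)
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((1 if w1 * x1 + w2 * x2 + b > 0 else 0) == y
               for (x1, x2), y in truth_table):
            return True
    return False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print("AND realizable:", realizable(AND))   # True
print("XOR realizable:", realizable(XOR))   # False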
Origins of Neural Networks
1. Perceptrons were replaced with artificial neurons which not only performed a weighted summation but also included a non-linear function.
2. A multitude of such neurons working together could solve problems where an individual perceptron failed.
Activation Functions
2. For the third step, there are multiple mathematical functions available; collectively they are called activation functions.
3. The purpose of the activation function is to act like a switch for the neuron: should the neuron fire or not. Also…
4. The activation function is critical to the overall functioning of the neural network. Without it, the whole neural network mathematically becomes equivalent to one single neuron!
5. The activation function is one of the critical components that give neural networks the ability to deal with complex problems.
4. For example, the neuron G takes as input the weighted sums from D, E and F; G's output is a scaled version of the outputs of D, E and F:
5. G_Out = 3D - 2E + 1F
6.       = 3(1A + 2B + 3C) - 2(-3A + 2B - 1C) + 1(2A + 4B - 2C)
7.       = 11A + 6B + 9C
8. Thus this part of the network is like a single neuron with weights of 11, 6 and 9!!! (A quick numeric check follows below.)
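A quick numeric check of the collapse above, using the weights shown in the expression (a sketch, not part of the original slide):

import numpy as np

W1 = np.array([[ 1, 2,  3],    # D's weights on A, B, C
               [-3, 2, -1],    # E's weights on A, B, C
               [ 2, 4, -2]])   # F's weights on A, B, C
w2 = np.array([3, -2, 1])      # G's weights on D, E, F

# composing two purely linear layers yields one set of effective weights on A, B, C
print(w2 @ W1)                 # -> [11  6  9], i.e. G_Out = 11A + 6B + 9C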
2. This collapse happens when the neurons only do simple additions and multiplications of the inputs. These are called linear operations; linear operations collapse the network.
3. All activation functions are non-linear transformations for exactly this reason. The non-linear transformation not only prevents the collapse, it also empowers the network to do complex tasks, because each neuron then contributes something distinct to the network as a whole.
b. Smooth functions (a code sketch follows this list)
I. Smooth ReLU / Exponential ReLU
II. Sigmoid / Logistic functions
III. Hyperbolic Tangent (tanh)
IV. Swish (combination of Sigmoid and ReLU)
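A minimal sketch of some of these functions (Softplus is used here as the "smooth ReLU"; the definitions are the standard ones, not taken from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    # a smooth approximation of ReLU
    return np.log1p(np.exp(z))

def swish(z):
    # the input scaled by its own sigmoid
    return z * sigmoid(z)

z = np.linspace(-3, 3, 7)
for fn in (sigmoid, np.tanh, softplus, swish):
    print(fn.__name__, np.round(fn(z), 3))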
Activation Functions -
Neurons stretch the feature space through non-linear functions, realizing Cover's theorem.
[Figure: Neuron1 computes ACC = m1·X + C1 and N1Output = Sigmoid(ACC); neurons N1, N2, N3 map the input X into a stretched feature space]
Ref: https://cs.stanford.edu/people/karpathy/convnetjs//demo/classify2d.html
SoftMax Function -
1. A kind of operation applied at the output neurons of a classifier network.
2. Used only when we have two or more output neurons, and applied simultaneously to all the output neurons.
3. Turns the raw numbers coming out of the penultimate layer into probability values in the output layer.
4. Suppose the output-layer neurons emit (Op1, Op2, …, Opn). The raw numbers may not make much sense. We convert them into probabilities using softmax, which is more meaningful; e.g. the input being a cycle is 30 times more likely than a sailboat and about 13 times more likely than a car. A code sketch follows the figure below.
[Figure: the entire network emits raw outputs Op1, Op2, …, Opn in the output layer; softmax converts them into the class probabilities 0.07, 0.9 and 0.03]
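A minimal sketch of the softmax operation on illustrative raw scores (chosen so that they reproduce the probabilities shown in the figure):

import numpy as np

def softmax(scores):
    # subtract the max for numerical stability, exponentiate, then normalise to sum to 1
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

raw_outputs = np.array([1.2, 3.8, 0.4])    # Op1, Op2, Op3 from the penultimate layer
print(np.round(softmax(raw_outputs), 2))   # -> [0.07 0.9  0.03]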
Forward Propagation
1. Forward propagation is the directed acyclic path taken by the input data from the input layer, being transformed by non-linear functions into the final network-level outputs.
2. Input data is propagated forward from the input layer through the hidden layers until it reaches the final layer, where predictions are emitted.
4. There may be multiple hidden layers, with multiple neurons in each layer.
5. The last layer is the output layer, which may have a softmax function (if the network is a multi-class classifier).
Forward Propagation
6. Forward prop steps – (a code sketch follows below)
a. Calculate the weighted input to the hidden layer by multiplying X by the hidden weight Wh
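A compact sketch of the forward prop steps for one hidden layer and a single output, following the notation above (Wh for the hidden weights; the sizes and values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X  = np.array([[0.2, 0.7, 1.0]])            # one sample with three input attributes
Wh = np.random.uniform(size=(3, 4))         # input-to-hidden weights
bh = np.ones(4)                             # hidden-layer biases (set to 1, as in the diagram)
Wo = np.random.uniform(size=(4, 1))         # hidden-to-output weights
bo = np.ones(1)                             # output bias

hidden_in  = X @ Wh + bh                    # a. weighted input to the hidden layer (X times Wh)
hidden_out = sigmoid(hidden_in)             # non-linear transformation at each hidden node
y_pred     = sigmoid(hidden_out @ Wo + bo)  # final layer emits the prediction
print(y_pred)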
Note: The diagram shows a step function instead of ReLU in each neuron. The biases are all set to 1; the bias actually supplied to a neuron depends on the weight assigned to the connector connecting the bias to that neuron.
Bias Term
1. Every neuron in the hidden layers is associated with a bias term. The bias term helps us control the firing threshold of each neuron.
3. The bias can be adjusted to lower a neuron's threshold and make it fire! Using bias, the network learns a richer set of patterns.
4. The bias term is also treated as an input, even though it does not come from the data.
1. What is an optimization algorithm and what is its use? Optimization algorithms help us minimize (or maximize) an objective function (another name for the error function) E(x), which is simply a mathematical function of the model's internal learnable parameters, the parameters used to compute the target values (Y) from the set of predictors (X) in the model.
2. C = ½((Σ wi·xi + b) – y)^2. In this expression, the xi and y come from the data and are given. What the ML algorithm learns is the weights wi and the bias b. Thus C = f(wi, b).
3. The optimizer algorithms try to estimate the values of wi and b which, when used, give the minimum or maximum C. In ML we look for the minimum.
[Figure: the fitted line Y = w1·x + c, the actual value y1 at X1, and the error e1 in the prediction]
Hence dw = e1 / x
The change required in the weight w (dw) is e1/x. However, the change required w.r.t. another data point may be different. To prevent jumping around with dw, we moderate the change in w by introducing a learning rate l. Hence dw = l·(e1/x).
Back Propagation
1. Back propagation is the learning process that the neural network employs to re-calibrate the weights and biases at every layer and every node, so as to minimize the error at the output layer.
2. During the first pass of forward propagation, the weights and biases are random numbers. The random numbers are generated within a small range, say 0 – 1.
3. Needless to say, the output of the first iteration is almost always incorrect. The difference between the actual value / class and the predicted value / class is the error.
4. All the nodes in all the preceding layers have contributed to the error, and hence need to get their share of the error and correct their weights.
5. This process of allocating a proportion of the error to all the nodes in the previous layers is back propagation.
6. The goal of back propagation is to adjust the weights and biases in proportion to their error contribution and, through an iterative process, identify the optimal combination of weights.
7. At each layer, at each node, the gradient descent algorithm is applied to adjust the weights.
Back Propagation
[Figure: output error e1 flowing back through weights w(3,1), w(3,2), w(3,3)]
1. The error at the output node, shown as e1, is contributed to by nodes 1, 2 and 3 of layer 2 through the weights w(3,1), w(3,2), w(3,3).
2. The proportionate error assigned back to node 1 of hidden layer 2 is (w(3,1) / (w(3,1) + w(3,2) + w(3,3))) * e1, as sketched below.
3. The error assigned to node 1 of hidden layer 2 is, in turn, proportionately sent back to the hidden layer 1 neurons.
4. All the nodes in all the layers re-adjust their input weights and biases to address the assigned error (for this they use gradient descent).
5. The input layer nodes are not neurons; they are like input parameters to a function and hence have no errors.
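A small sketch of this proportional allocation (the weight values and e1 are illustrative numbers):

e1 = 0.8                                   # error observed at the output node
weights = {"w(3,1)": 2.0, "w(3,2)": 1.0, "w(3,3)": 1.0}

total = sum(weights.values())
for name, w in weights.items():
    # each node of hidden layer 2 receives a share of e1 in proportion to its connecting weight
    print(name, "is assigned", w / total * e1)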
Gradient Descent
The challenge is that all the weights on all the inputs of all the neurons need to be adjusted. It is not possible to find the right combination of weights manually, using brute force. Instead, the neural network algorithm uses a learning procedure called gradient descent.
[Figure: predicted y1 for X1, with prediction error e1 and weight w1]
Gradient Descent
1. Let the target value for a training example X be y, i.e. the data frame used for training has values (X, y).
2. Let the model (represented by random m and c) predict the value for the training example X to be yhat.
3. The error in prediction is E = yhat – y. If we sum these errors across all data points, some will be positive and some negative, and they will cancel out.
4. To prevent the sum of errors from becoming 0, we square the error, i.e. E = (y – yhat)^2. Note: in the squared expression, y – yhat and yhat – y mean the same thing.
5. The sum of (y – yhat)^2 across all the X values is called the SSE (Sum of Squared Errors).
6. We minimize the SSE using gradient descent (descending towards the global minimum). Gradient descent uses partial derivatives, i.e. how the SSE changes on slightly modifying the model parameters m and c, one at a time (expanded below):
d(E) / d(m) = d(sum(yhat – y)^2) / d(m)
d(E) / d(c) = d(sum(yhat – y)^2) / d(c)
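Expanding these with yhat = m·x + c gives the standard expressions (stated here for reference, not on the original slide):
d(E) / d(m) = sum( 2 · (yhat – y) · x )
d(E) / d(c) = sum( 2 · (yhat – y) )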
Gradient Descent
1. Gradient descent is a way to minimize an objective / cost function, such as the Sum of Squared Errors (SSE), that depends on the model parameters: weight / slope and bias.
2. The parameters are updated in the direction opposite to the direction of the gradient (the direction of maximum increase) of the objective function.
3. In other words, we change the values of weight and bias following the direction of the slope of the error surface, down the hill, until we reach a minimum.
4. This movement from the starting weight and bias to the optimal weight and bias may not happen in one shot. It is likely to happen over multiple iterations; the values change in steps.
5. The step size can be influenced using a parameter called the Learning Rate. It decides the size of the steps, i.e. the amount by which the parameters are updated. Too small a learning step will slow down the entire process, while too large a step may overshoot the minimum and never settle.
6. The mathematical expression of gradient descent (a code sketch follows below)
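A minimal sketch of that update rule in code, fitting a straight line y = m·x + c by gradient descent (the data, learning rate and iteration count are illustrative assumptions):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])           # roughly y = 2x + 1, with a little noise

m, c, lr = 0.0, 0.0, 0.01                    # arbitrary starting parameters and learning rate
for _ in range(2000):
    yhat = m * X + c
    dm = np.sum(2 * (yhat - y) * X)          # d(SSE)/d(m)
    dc = np.sum(2 * (yhat - y))              # d(SSE)/d(c)
    m, c = m - lr * dm, c - lr * dc          # step opposite to the gradient
print(round(m, 2), round(c, 2))              # converges near the least-squares fit (~1.9, ~1.2)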
Gradient Descent
[Figure: error contours over weight and bias, with a randomly selected starting point]
About the contour graph –
1. First find d(error)/d(weight) to get the direction of highest increase in error given a unit change in weight. This is the partial derivative w.r.t. the weight.
2. Next find d(error)/d(bias) to get the direction of highest increase in error given a unit change in bias (green arrow). This is the partial derivative w.r.t. the bias.
3. Partial derivatives give the gradient along the given axis, and the gradient is a vector.
4. Add the two vectors to get the direction of the gradient (black arrow), i.e. the direction of maximum increase in error.
Thank You