Unit 2
Neurons
Machine Learning
Deep Learning
Traditional ML Vs DL
Artificial neuron vs Biological neuron
• The most fundamental unit of a deep neural network is called an artificial neuron
• Why is it called a neuron? Where does the inspiration come from?
• The inspiration comes from biology (more specifically, from the brain)
• biological neurons = neural cells = neural processing units
• We will first see what a biological neuron looks like ...
[Figure: an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, aggregation σ, and output y]
• dendrite: receives signals from other neurons
• synapse: point of connection to other neurons
• soma: processes the information
• axon: transmits the output of this neuron
Biological Neurons∗
∗ Image adapted from https://cdn.vectorstock.com/i/composite/12,25/neuron-cell-vector-81225.jpg
• Of course, in reality, it is not just a single neuron which does all this
• There is a massively parallel interconnected network of neurons
• The sense organs relay information to the lowest layer of neurons
• Some of these neurons may fire (in red) in response to this information and in turn relay information to other neurons they are connected to
• These neurons may also fire (again, in red) and the process continues, eventually resulting in a response (laughter in this case)
• This massively parallel network also ensures that there is division of work
• Each neuron may perform a certain role or respond to a certain stimulus
A simplified illustration
• The neurons in the brain are arranged in a hierarchy
• We illustrate this with the help of the visual cortex (part of the brain) which deals with processing visual information
• Starting from the retina, the information is relayed to several layers (follow the arrows)
• We observe that the layers V1, V2 up to AIT form a hierarchy (from identifying simple visual forms to high level objects)
Sample illustration of hierarchical processing∗
Neurons
• ANNs are built upon simple signal processing elements (neurons) that are connected together into a large mesh.
• You might be surprised to see how simple the calculations inside a neuron actually are. We can identify three processing steps:
1. Each input gets scaled up or down
• When a signal comes in, it gets multiplied by a weight value that is assigned to this particular input. That is, if a neuron has three inputs, then it has three weights that can be adjusted individually. During the learning phase, the neural network can adjust the weights based on the error of the last test result.
2. All signals are summed up
• In the next step, the modified input signals are summed up to a single value. In this step, an offset is
also added to the sum. This offset is called bias. The neural network also adjusts the bias during the
learning phase.
• This is where the magic happens! At the start, all the neurons have random weights and random biases.
After each learning iteration, weights and biases are gradually shifted so that the next result is a bit
closer to the desired output. This way, the neural network gradually moves towards a state where the
desired patterns are “learned”.
3. Activation
• Finally, the result of the neuron’s calculation is turned into an output signal. This is done by feeding the
result to an activation function (also called transfer function).
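To make the three steps concrete, here is a minimal Python sketch of a single neuron; the input, weight, and bias values are made up for illustration, and a sigmoid (introduced in the next section) stands in for the activation function.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # three hypothetical input signals
w = np.array([0.8, 0.2, -0.5])   # one adjustable weight per input
b = 0.1                          # bias (the offset added to the sum)

scaled = w * x                   # step 1: each input gets scaled by its weight
z = scaled.sum() + b             # step 2: all signals are summed up, plus the bias
y = sigmoid(z)                   # step 3: activation turns the sum into the output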
Activation Functions
• The activation function is used as a decision making body at the output of a neuron. The neuron learns linear or non-linear decision boundaries based on the activation function. It also has a normalizing effect on the neuron output, which prevents the outputs of neurons after several layers from becoming very large due to the cascading effect. The three most widely used activation functions are:
• Sigmoid
It maps the input (x axis) to values between 0 and 1.
• Tanh
It is similar to the sigmoid function but maps the input to values between -1 and 1.
• Rectified Linear Unit (ReLU)
It allows only positive values to pass through it. The negative values are mapped to zero.
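All three functions are one-liners in NumPy; a minimal sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output always lies in (0, 1)

def tanh(z):
    return np.tanh(z)                 # output always lies in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # positive values pass, negatives become 0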
Input Layer
• This is the first layer of a neural network. It is used to provide the input data or features to the network.
Output Layer
• This is the layer which gives out the predictions. The activation function to be used in this layer is different for different problems. For a binary classification problem, we want the output to be either 0 or 1, so a sigmoid activation function is used. For a multiclass classification problem, a softmax (think of it as a generalization of sigmoid to multiple classes) is used. For a regression problem, where the output is not a predefined category, we can simply use a linear unit.
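For the multiclass case, a minimal sketch of softmax: it exponentiates a vector of raw scores and normalizes them so the outputs are positive and sum to 1, i.e. they can be read as class probabilities.

import numpy as np

def softmax(z):
    z = z - np.max(z)      # shift by the max for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()     # positive values that sum to 1

softmax(np.array([2.0, 1.0, 0.1]))   # -> approximately [0.659, 0.242, 0.099]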
Commonly used Terminologies in Neural Networks
The input vector: all the input values of each perceptron are collectively called the input vector of that perceptron.
The weight vector: similarly, all the weight values of each perceptron are collectively called the weight vector of that perceptron.
McCulloch-Pitts Neurons
• McCulloch (neuroscientist) and Pitts (logician) proposed a highly simplified computational model of the neuron (1943)
• g aggregates the inputs and the function f takes a decision based on this aggregation
• The inputs can be excitatory or inhibitory
• g(x1, x2, ..., xn) = g(x) = Σ (i = 1 to n) xi
• y = f(g(x)) = 1 if g(x) ≥ θ, and 0 if g(x) < θ
• y = 0 if any xi is inhibitory
• The inputs x1, x2, ..., xn ∈ {0, 1} and the output y ∈ {0, 1}
McCulloch-Pitts Neurons with Boolean functions
[Figure: McCulloch-Pitts neurons with binary inputs (x1, x2, x3 or x1, x2) and outputs y ∈ {0, 1}, using different thresholds θ (e.g. 3, 1, 0) to realize different Boolean functions]
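A minimal sketch of the McCulloch-Pitts neuron with purely excitatory binary inputs; choosing the threshold θ makes it compute different Boolean functions (θ = n gives AND over n inputs, θ = 1 gives OR):

def mp_neuron(x, theta):
    # fire (output 1) iff the number of active binary inputs reaches the threshold
    return 1 if sum(x) >= theta else 0

mp_neuron([1, 1, 1], theta=3)   # AND of three inputs -> 1
mp_neuron([1, 0, 1], theta=3)   # -> 0
mp_neuron([0, 0, 1], theta=1)   # OR of three inputs -> 1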
• The most basic form of an activation function is a simple binary function that has
only two possible results.
• This function returns 1 if the input is positive or zero, and 0 for any negative input. A
neuron whose activation function is a function like this is called a perceptron.
Threshold Logic Unit (TLU)
• In a Threshold Logic Unit (TLU) the output of the unit y in response to
a particular input pattern is calculated in two stages.
• First the activation is calculated.
• The activation a is the weighted sum of the inputs:
[Figure: a TLU with inputs x1 ... xn, weights w1 ... wn, threshold θ, and output y]
a = Σ (i = 1 to n) wi xi
y = 1 if a ≥ θ
y = 0 if a < θ
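In code, the two stages are a weighted sum followed by a threshold test; the input, weight, and threshold values below are hypothetical:

def tlu(x, w, theta):
    a = sum(wi * xi for wi, xi in zip(w, x))   # stage 1: the activation (weighted sum)
    return 1 if a >= theta else 0              # stage 2: compare against the threshold

# a = 0.5*1 + 0.5*0 + 0.5*1 = 1.0 >= 0.8, so the unit outputs 1
tlu([1, 0, 1], [0.5, 0.5, 0.5], theta=0.8)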
Linear Unit
• Perceptron is a machine learning algorithm for supervised learning of various binary classification tasks.
• Further, a perceptron is also understood as an artificial neuron or neural network unit that helps to detect certain input data computations in business intelligence.
• The perceptron model is also treated as one of the best and simplest types of artificial neural networks. However, it is a supervised learning algorithm of binary classifiers.
• Hence, we can consider it as a single-layer neural network with four main parameters, i.e., input values, weights and bias, net sum, and an activation function.
[Figure: a linear unit with inputs x1 ... xn, weights w1 ... wn, and output y]
a = Σ (i = 1 to n) wi xi
y = a = Σ (i = 1 to n) wi xi (the output is the activation itself; there is no threshold)
Training ANNs
• Training set S of examples {x,t}
• x is an input vector and
• t the desired target vector
• Example: logical AND
S = { ((0,0),0), ((0,1),0), ((1,0),0), ((1,1),1) }
• Iterative process
• Present a training example x, compute network output y, compare output y with target t, adjust weights and thresholds
• Learning rule
• Specifies how to change the weights w and thresholds θ of the network as a function of the inputs x, output y and target t.
Perceptron Learning Rule
• w' = w + α (t − y) x
Or in components
• w'i = wi + Δwi = wi + α (t − y) xi   (i = 1..n+1)
with wn+1 = θ and xn+1 = −1
• The parameter α is called the learning rate. It determines the magnitude of the weight updates Δwi.
• If the output is correct (t = y) the weights are not changed (Δwi = 0).
• If the output is incorrect (t ≠ y) the weights wi are changed such that the new weight vector w' moves closer to the input x (when t = 1) or further from it (when t = 0).
Perceptron Training Algorithm
Repeat
  for each training vector pair (x, t)
    evaluate the output y when x is the input
    if y ≠ t then
      form a new weight vector w' according to w' = w + α (t − y) x
    else
      do nothing
    end if
  end for
Until y = t for all training vector pairs
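A direct sketch of this algorithm in Python, learning the logical AND training set S from above; the learning rate and the zero initialization are arbitrary choices, and the threshold is folded into the weight vector via wn+1 = θ, xn+1 = −1:

import numpy as np

S = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # logical AND
alpha = 0.1                     # learning rate
w = np.zeros(3)                 # w[0], w[1] plus w[2] = theta

converged = False
while not converged:
    converged = True
    for x, t in S:
        xa = np.array([x[0], x[1], -1.0])   # augmented input, x_{n+1} = -1
        y = 1 if w @ xa >= 0 else 0         # TLU output (a >= theta)
        if y != t:
            w += alpha * (t - y) * xa       # perceptron learning rule
            converged = False               # at least one error: repeat the pass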
Perceptron Convergence Theorem
The algorithm converges to the correct classification
• if the training data is linearly separable
• and α is sufficiently small
• If two classes of vectors X1 and X2 are linearly separable, the application of the perceptron training algorithm will eventually result in a weight vector w0, such that w0 defines a TLU whose decision hyper-plane separates X1 and X2 (Rosenblatt 1962).
• The solution w0 is not unique, since if w0 · x = 0 defines a hyper-plane, so does w'0 = k w0.
• To regularize means to make things regular or acceptable
• Regularization refers to a set of different techniques that lower the complexity of a neural network model during training, and thus prevent overfitting
• Regularization penalizes the weight matrices of the nodes
What is Overfitting
• The training data contains information about the
regularities in the mapping from input to output. But it
also contains sampling error.
• There will be accidental regularities because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
• So it fits both kinds of regularity. If the model is very
flexible it can model the sampling error really well.
• This means the model will not generalize well to unseen data
Diagnosing Overfitting
Regularization Techniques
• L2 Regularization / Ridge Regularization
• L1 Regularization / Lasso Regularization
• Dropout
• Early Stopping
[Plot: Salary vs. Experience]
L1 vs L2 Regularization Methods
• L1 Regularization, also called a Lasso regression, adds the “absolute value of
magnitude” of the coefficient as a penalty term to the loss function.
• L2 Regularization, also called a Ridge regression, adds the “squared magnitude” of the
coefficient as the penalty term to the loss function.
• The key difference between these two is the penalty term.
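In code the difference is a single term added to the loss; a minimal sketch, where the coefficient values, the regularization strength lam, and the zero data loss are all hypothetical:

import numpy as np

w = np.array([1.3, -0.4])              # hypothetical model coefficients
lam = 1.0                              # regularization strength (lambda)
data_loss = 0.0                        # placeholder for the unregularized loss

l1_penalty = lam * np.abs(w).sum()     # Lasso: absolute value of magnitude
l2_penalty = lam * (w ** 2).sum()      # Ridge: squared magnitude
loss_l1 = data_loss + l1_penalty
loss_l2 = data_loss + l2_penalty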
Ridge Regularization
[Plot: Salary vs. Experience]
Steep slope
[Plot: Salary vs. Experience]
• Assume lambda = 1
• Slope = 1.3
• Then cost = 0 + 1·(1.3)² = 1.69
• Assume lambda = 1
• Slope = 1.1
• Then cost = 0 + 1·(1.1)² = 1.21
Lasso Regression
• This helps in feature selection too, since the L1 penalty can shrink some coefficients all the way to zero
Dropout
• This is one of the most interesting types of regularization techniques.
• It also produces very good results and is consequently the most frequently used regularization technique in the field of deep learning
• To understand dropout, consider the following neural network structure
• At every iteration, it randomly selects some nodes and removes them along with all of their incoming and outgoing connections
• So each iteration has a different set of nodes and this
results in a different set of outputs. It can also be
thought of as an ensemble technique in machine
learning.
• Ensemble models usually perform better than a single
model as they capture more randomness. Similarly,
dropout also performs better than a normal neural
network model
• Due to these reasons, dropout is usually preferred
when we have a large neural network structure in
order to introduce more randomness.
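A minimal sketch of what dropout does to one layer's activations during training ("inverted dropout", which rescales the surviving nodes so the expected activation stays the same); the drop probability is a hyperparameter:

import numpy as np

def dropout(activations, drop_prob=0.5):
    keep_prob = 1.0 - drop_prob
    # randomly select the nodes to keep at this iteration
    mask = np.random.rand(*activations.shape) < keep_prob
    # dropped nodes output 0; survivors are rescaled by 1/keep_prob
    return activations * mask / keep_prob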
Early stopping
• In the above image (a plot of training and validation error over epochs), we will stop training at the dotted line, since after that our model will start overfitting on the training data
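A minimal sketch of early stopping with a patience counter: training stops once the validation loss has not improved for a fixed number of consecutive epochs. The per-epoch losses here are made-up numbers standing in for a real training loop:

val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.64]  # hypothetical validation losses
best_loss, wait, patience = float("inf"), 0, 3

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, wait = val_loss, 0   # improvement: remember it, reset the counter
    else:
        wait += 1                       # no improvement this epoch
        if wait >= patience:
            print(f"stopping early at epoch {epoch}")
            break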
Why Training a Neural Network Is Hard
• Fitting a neural network involves using a training dataset to update
the model weights to create a good mapping of inputs to outputs.
• This training process is solved using an optimization algorithm that
searches through a space of possible values for the neural network
model weights for a set of weights that results in good performance
on the training dataset.
• We will discover the challenge of training a neural network framed as an optimization problem.
Session Outcome
• Training a neural network involves using an optimization algorithm to
find a set of weights to best map inputs to outputs.
• The problem is hard, not least because the error surface is non-convex and contains local minima, flat spots, and is highly multidimensional.
• The stochastic gradient descent algorithm is the best general
algorithm to address this challenging problem.
Learning as Optimization
• Deep learning neural network models learn to map inputs to outputs given a
training dataset of examples.
• The training process involves finding a set of weights in the network that
proves to be good, or good enough, at solving the specific problem.
• This training process is iterative, meaning that it progresses step by step with
small updates to the model weights each iteration and, in turn, a change in the
performance of the model each iteration.
• The iterative training process of neural networks solves an optimization problem that searches for parameters (model weights) that result in a minimum error or loss when evaluating the examples in the training dataset.
• Optimization is a directed search procedure and the optimization problem that
we wish to solve when training a neural network model is very challenging.
Optimization problems
• The optimization algorithm iteratively steps across
this landscape, updating the weights and seeking out
good or low elevation areas.
• For simple optimization problems, the shape of the
landscape is a big bowl and finding the bottom is easy,
• So easy that very efficient algorithms can be designed
to find the best solution.
• These types of optimization problems are referred to
mathematically as convex.
Optimization problems
• The error surface we wish to navigate when
optimizing the weights of a neural network is not a
bowl shape.
• It is a landscape with many hills and valleys.
• These types of optimization problems are referred to mathematically as non-convex.
Local Minima
• As the number of hidden layers is increased, the amount of error information propagated back to earlier layers is dramatically reduced.
• Weights in hidden layers close to the output layer are
updated normally, whereas weights in hidden layers
close to the input layer are updated minimally or not
at all.
• Generally, this problem prevented the training of very
deep neural networks and was referred to as the
vanishing gradient problem
• Pretraining:
• Add a new hidden layer to a model.
• Allow the newly added layer to learn the inputs from the existing hidden layer, keeping the weights for the existing hidden layers fixed.
• This gives the technique the name "layer-wise", as the model is trained one layer at a time.
• Greedy algorithm:
• Breaks the problem into many components, then solves for the optimal version of each component in isolation
• Pretraining is based on the assumption that it is easier to train a shallow network instead of a deep network, and contrives a layer-wise training process in which we are always only ever fitting a shallow model
Pre-training and fine tuning
• Using dataset A, train model M
• Pre-training:
– You have a dataset B
– Before training on B, initialize some of the parameters of M with the model trained on A
• Fine-tuning:
– You train M on B
• This is one form of transfer learning
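One possible sketch of this recipe using Keras; the saved model file, the dataset arrays x_B and y_B, and the choice to freeze all but the last layer are placeholders for illustration:

import tensorflow as tf

# pre-training: M's parameters come from a model already trained on dataset A
M = tf.keras.models.load_model("model_trained_on_A.h5")

# optionally freeze the earlier layers so only the top of the network adapts
for layer in M.layers[:-1]:
    layer.trainable = False

# fine-tuning: train M on dataset B
M.compile(optimizer="adam", loss="binary_crossentropy")
M.fit(x_B, y_B, epochs=5)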
• Training a deep structure is difficult due to high dependencies across layers' parameters, i.e. the relation between parts of pictures and pixels.
• To resolve this problem, two things are suggested:
• Adapting lower layers to feed good input to the upper layers
• Adjusting upper layers to make use of that end setting of lower layers
Greedy Algorithm
• Greedy algorithms break a problem into many
components, then solve for the optimal version of
each component in isolation
Gradient Descent Algorithm
• A gradient measures how much the output of a function changes if you change the inputs a little bit.
• In mathematical terms, a gradient is a partial
derivative with respect to its inputs.
• Gradient Descent is an optimization algorithm for
finding a local minimum of a differentiable function.
• Gradient descent is simply used to find the values of
a function's parameters (coefficients) that minimize a
cost function as far as possible.
• The lowest point on the parabola occurs at x = 1.
• The objective of the gradient descent algorithm is to find the value of "x" such that "y" is minimum
• "y" here is termed the objective function that the gradient descent algorithm operates upon, to descend to the lowest point
• Find the slope of the objective function with respect to
each parameter/feature. In other words, compute the
gradient of the function.
• Pick a random initial value for the parameters.
• Update the gradient function by plugging in the
parameter values.
• Calculate the step sizes for each feature as : step size =
gradient * learning rate.
• delta = - learning_rate * gradient
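Putting these steps together on a concrete objective; a minimal sketch assuming the parabola from the earlier slide is y = (x − 1)², whose gradient is dy/dx = 2(x − 1) and whose minimum is at x = 1:

x = 3.0                                 # pick a random initial value for the parameter
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (x - 1)              # slope of the objective at the current x
    delta = -learning_rate * gradient   # step size, opposite the gradient
    x += delta                          # x converges toward the minimum at 1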
The learning rate should be neither too high nor too low: too high and the steps overshoot the minimum, too low and convergence takes far too many iterations.
Downsides of the gradient descent algorithm
• Consider we have 10,000 data points and 10 features.
• We need to compute the derivative with respect to each feature for every data point: 10,000 * 10 = 100,000 computations per iteration.
• It is common to take 1,000 iterations, so in effect we have 100,000 * 1,000 = 100,000,000 computations to complete the algorithm.
• Hence gradient descent is slow on huge data
Stochastic Gradient Descent (SGD)
• SGD randomly picks one data point from the whole data set at each iteration to reduce the computations enormously.
• Mini-batch gradient descent tries to strike a balance between the goodness of gradient descent and the speed of SGD
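The three variants differ only in how many examples each update averages the gradient over; a minimal mini-batch sketch on made-up linear-regression data (batch_size = 1 would give SGD, batch_size = len(X) full gradient descent):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 10))        # 10,000 hypothetical points, 10 features
y = rng.normal(size=10000)
w = np.zeros(10)
learning_rate, batch_size = 0.01, 32

def grad(Xb, yb, w):
    # gradient of the mean squared error on one mini-batch
    return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

idx = rng.permutation(len(X))           # one epoch: visit the data in random order
for start in range(0, len(X), batch_size):
    batch = idx[start:start + batch_size]
    w -= learning_rate * grad(X[batch], y[batch], w)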
Momentum
• Momentum does some additional processing of the gradients to be faster and better
• In addition to the regular gradient, it also adds on the movement from the previous step
• sum_of_gradient = gradient + previous_sum_of_gradient * decay_rate
• delta = -learning_rate * sum_of_gradient
• theta += delta
• Momentum simply moves faster
• Momentum has a shot at escaping local minima (because the momentum may propel it out of a local minimum)
Adaptive Gradient Descent (AdaGrad)
• One disadvantage of the optimizers above is that the learning rate is constant for all parameters and for each cycle.
• AdaGrad changes the learning rate 'η' for each parameter and at every time step 't'.
• Instead of keeping track of the sum of gradients, AdaGrad keeps track of the sum of gradients squared and uses that to adapt the gradient in different directions.
• sum_of_gradient_squared = previous_sum_of_gradient_squared + gradient²
• delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
• theta += delta
Written out, the per-parameter update is
θ_{t+1} = θ_t − (η / sqrt(G_t + εI)) · g_t
where
• θ is the parameter to be updated,
• η is the initial learning rate,
• ε is some small quantity used to avoid division by zero,
• I is the identity matrix,
• G_t is the sum of the squares of the gradients up to time-step t, and
• g_t is the gradient estimate in time-step t, g_t = ∇_θ J(θ_t).
Root Mean Square Propagation (RMSProp)
• AdaGrad is incredibly slow, because the sum of gradients squared only grows and never shrinks.
• RMSProp fixes this by adding a decay factor.
• sum_of_gradient_squared = previous_sum_of_gradient_squared *
decay_rate+ gradient² * (1- decay_rate)
• delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
• theta += delta
Adam (Adaptive Moment Estimation)
• In addition to adapting the parameter learning rates based on the average of the second moments of the gradients (the uncentered variance), as in RMSProp, Adam also makes use of the average first moment (the mean), as in momentum.
• Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.
• sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 - beta1) [Momentum]
• sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1 - beta2) [RMSProp]
• delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)
• theta += delta
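The same update in runnable form; note that the published Adam algorithm also bias-corrects the two moving averages in the early steps, which the pseudocode above leaves out. The quadratic objective in the driver loop is a made-up example:

import numpy as np

def adam_step(theta, gradient, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * gradient         # moving average of the gradient (momentum)
    v = beta2 * v + (1 - beta2) * gradient ** 2    # moving average of the squared gradient (RMSProp)
    m_hat = m / (1 - beta1 ** t)                   # bias correction for early time steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):                            # minimize (theta - 1)^2
    g = 2 * (theta - 1)
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.1)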