The Introduction To Neural Networks 10 4 24

Uploaded by

ssn_cse
UNIT V

Introduction to Neural
Networks
MOTIVATION
• Our brain uses the extremely large interconnected network of neurons for information
processing and to model the world around us. Simply put, a neuron collects inputs from other
neurons using dendrites. The neuron sums all the inputs and if the resulting value is greater than
a threshold, it fires. The fired signal is then sent to other connected neurons through the axon.
Biological Networks
1. The majority of neurons encode their
outputs or activations as a series of brief
electrical pulses (i.e. spikes or action
potentials).
2. Dendrites are the receptive zones that
receive activation from other neurons.
3. The cell body (soma) of the neuron
processes the incoming activations and
converts them into output activations.
4. Axons are transmission lines that send
activation to other neurons.
5. Synapses allow weighted transmission of
signals (using neurotransmitters) between
axons and dendrites to build up large neural
networks.
Networks of McCulloch-Pitts Neurons
• Artificial neurons have the same basic components as biological
neurons. The simplest ANNs consist of a set of McCulloch-Pitts
neurons labelled by indices k, i, j and activation flows between them
via synapses with strengths wki, wij:
MOTIVATION
• Humans are incredible pattern-recognition machines. Our brains
process ‘inputs’ from the world, categorize them (that’s a spider;
that’s ice-cream), and then generate an ‘output’ (run away from the
spider; taste the ice-cream). And we do this automatically and
quickly, with little or no effort.
MOTIVATION
• Neural networks loosely mimic the way our brains solve the problem:
by taking in inputs, processing them and generating an output. Like
us, they learn to recognize patterns, but they do this by training on
labelled datasets. Before we get to the learning part, let’s take a look
at the most basic of artificial neurons: the perceptron, and how it
processes inputs and produces an output.
THE PERCEPTRON
• Perceptrons were developed way back in the 1950s-60s by the scientist Frank Rosenblatt, inspired by
earlier work from Warren McCulloch and Walter Pitts. While today we use other models of artificial
neurons, they follow the general principles set by the perceptron.

Model of an artificial neuron

• As you can see, the network of nodes sends signals in one direction. This is called a feed-forward
network.
• The figure depicts a neuron connected with n other neurons and thus receives n inputs (x1, x2, ….. xn).
This configuration is called a Perceptron.
THE PERCEPTRON
• Let’s understand this better with an example. Say you bike to work. You
have two factors to make your decision to go to work: the weather must
not be bad, and it must be a weekday. The weather’s not that big a deal,
but working on weekends is a big no-no. The inputs have to be binary, so
let’s propose the conditions as yes or no questions. Weather is fine? 1 for
yes, 0 for no. Is it a weekday? 1 yes, 0 no.
• Remember, I cannot tell the neural network these conditions; it has to
learn them for itself. How will it know which information will be most
important in making its decision? It does with something called weights.
Remember when I said that weather’s not a big deal, but the weekend is?
Weights are just a numerical representation of these preferences. A higher
weight means the neural network considers that input more important
compared to other inputs.
THE PERCEPTRON
• For our example, let’s deliberately set the weights to 2 for weather and 6 for
weekday. How do we calculate the output? We multiply each input by its
respective weight and sum the values we get for all the inputs. For
example, if it’s a nice, sunny (1) weekday (1), the calculation is:
(2 × 1) + (6 × 1) = 8
• This calculation is known as a linear combination. Now what does an 8 mean?
We first need to define the threshold value. The neural network’s output, 0 or 1
(stay home or go to work), is determined by whether the value of the linear
combination reaches the threshold value. Let’s say the threshold value is 5, which
means that if the calculation gives you a number less than 5, you can stay at home,
but if it’s equal to or more than 5, then you have to go to work. Here 8 ≥ 5, so the
output is 1.
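The bike-to-work decision can be sketched in a few lines of Python. This is only an illustration of the weighted sum and threshold described above; the function name `bike_to_work` and its parameter names are made up for this sketch.

```python
def bike_to_work(weather_ok, is_weekday, w_weather=2, w_weekday=6, threshold=5):
    """Return 1 (go to work) or 0 (stay home) for binary inputs."""
    # Linear combination: each input times its weight, then summed.
    total = w_weather * weather_ok + w_weekday * is_weekday
    # The neuron fires when the sum is equal to or more than the threshold.
    return 1 if total >= threshold else 0


# Try all four combinations of the two yes/no questions.
for weather_ok in (0, 1):
    for is_weekday in (0, 1):
        print(weather_ok, is_weekday, "->", bike_to_work(weather_ok, is_weekday))
```

With these weights the weekday input alone decides the outcome, which matches the idea that the weather is "not that big a deal".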
THE PERCEPTRON
• You have just seen how weights are influential in determining the
output. In this example, we set the weights to particular numbers
that make the example work, but in reality, we set the weights to
random values, and then the network adjusts those weights based on
the output errors it made using the previous weights. This is
called training the neural network.
TRAINING IN PERCEPTRONS
• How would you teach a child to recognize a bus?
• You show her examples, telling her, “This is a bus. That is not a bus,”
until the child learns the concept of what a bus is. Furthermore, if the
child sees new objects that she hasn’t seen before, we could expect
her to recognize correctly whether the new object is a bus or not.
• This is exactly the idea behind the perceptron.
TRAINING IN PERCEPTRONS
• Input vectors from a training set are presented to the perceptron one
after the other and weights are modified according to the following
equation,
• For all inputs i,
W(i) = W(i) + a*g’(sum of all inputs)*(T-A)*P(i),
where g’ is the derivative of the activation function, and a is the
learning rate
• Here, W is the weight vector. P is the input vector. T is the correct
output that the perceptron should have produced and A is the output
actually given by the perceptron.
TRAINING IN PERCEPTRONS
• When an entire pass through all of the input training vectors is
completed without an error, the perceptron has learnt!
• At this time, if an input vector P (already in the training set) is given to
the perceptron, it will output the correct value. If P is not in the
training set, the network will respond with an output similar to other
training vectors close to P.
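A minimal sketch of this training loop, under some stated assumptions: the activation is a step function, so g’ is simply taken as 1 (the step function has no useful derivative); the weights start at zero rather than at random values so that the run is repeatable; and the bias is trained like a weight whose input is always 1. The training set is the bike-to-work example, where the target equals the weekday input. All names are hypothetical.

```python
def step(total):
    return 1 if total >= 0 else 0


def train_perceptron(samples, learning_rate=0.1, max_epochs=100):
    """Return (weights, bias, epochs_used); epochs_used is None if no
    error-free pass was reached within max_epochs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, target in samples:
            total = sum(w * p for w, p in zip(weights, inputs)) + bias
            actual = step(total)
            if actual != target:
                errors += 1
                # W(i) = W(i) + a * (T - A) * P(i), with g' taken as 1.
                for i, p in enumerate(inputs):
                    weights[i] += learning_rate * (target - actual) * p
                bias += learning_rate * (target - actual)
        if errors == 0:  # an entire pass without an error: training is done
            return weights, bias, epoch
    return weights, bias, None


def predict(inputs, weights, bias):
    return step(sum(w * p for w, p in zip(weights, inputs)) + bias)


# (weather_ok, is_weekday) -> go to work?
SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 0), ((1, 1), 1)]
weights, bias, epochs_used = train_perceptron(SAMPLES)
print(weights, bias, epochs_used)
```

The learned weights need not equal the hand-picked 2 and 6; any weights that put the same inputs on the same side of the threshold are equally good.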
WHAT IS THE PERCEPTRON
ACTUALLY DOING?
• The perceptron is adding all the inputs and separating them into 2
categories, those that cause it to fire and those that don’t. That is, it
is drawing the line:
w1x1 + w2x2 = t,
where t is the threshold.
WHAT IS THE PERCEPTRON
ACTUALLY DOING?
• To make things a little simpler for training later, let’s make a small
readjustment to the above formula. Let’s move the threshold to the
other side of the inequality, and replace it with what’s known as the
neuron’s bias. Now we can rewrite the firing condition as:
w1x1 + w2x2 + bias ≥ 0
• Effectively, bias = -threshold. You can think of bias as how easy it is
to get the neuron to output a 1: with a really big bias, it’s very easy
for the neuron to output a 1, but if the bias is very negative, then it’s
difficult.
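That the two formulations agree can be checked directly. The sketch below reuses the weights (2, 6) and threshold 5 from the bike-to-work example and assumes the neuron fires when the sum is equal to or more than the threshold; the function names are illustrative only.

```python
def fires_with_threshold(inputs, weights, threshold):
    # Original form: weighted sum compared against the threshold.
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0


def fires_with_bias(inputs, weights, bias):
    # Rewritten form: threshold moved to the left-hand side as a bias.
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0


WEIGHTS = (2, 6)
THRESHOLD = 5
BIAS = -THRESHOLD  # bias = -threshold

ALL_INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
agree = all(
    fires_with_threshold(x, WEIGHTS, THRESHOLD) == fires_with_bias(x, WEIGHTS, BIAS)
    for x in ALL_INPUTS
)
print("Both forms agree on every input:", agree)
```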
LIMITATION OF PERCEPTRONS
• Not every set of inputs can be divided by a line like this. Those that
can be are called linearly separable. If the vectors are not linearly
separable, learning will never reach a point where all vectors are
classified properly.
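The classic example of a set that is not linearly separable is XOR. The sketch below trains the same kind of step-activation perceptron on AND (separable) and on XOR (not separable). The learning rate of 1, the zero initial weights and the cap of 1000 epochs are assumptions made for the illustration.

```python
def epochs_until_error_free(samples, max_epochs=1000):
    """Train a step-activation perceptron; return the number of the first
    error-free epoch, or None if none occurs within max_epochs."""
    weights = [0, 0]
    bias = 0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, target in samples:
            total = sum(w * x for w, x in zip(weights, inputs)) + bias
            actual = 1 if total >= 0 else 0
            if actual != target:
                errors += 1
                # Perceptron update with learning rate 1 (an assumption).
                for i, x in enumerate(inputs):
                    weights[i] += (target - actual) * x
                bias += target - actual
        if errors == 0:
            return epoch
    return None


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_result = epochs_until_error_free(AND_DATA)
xor_result = epochs_until_error_free(XOR_DATA)
print("AND converged at epoch:", and_result)
print("XOR converged at epoch:", xor_result)  # None: no line separates XOR
```

Raising `max_epochs` does not help for XOR: no single line puts (0,1) and (1,0) on one side and (0,0) and (1,1) on the other, so an error-free pass cannot occur.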
ACTIVATION FUNCTION
• A function that transforms the values or states the conditions for the
decision of the output neuron is known as an activation function.
• What does an artificial neuron do? Simply, it calculates a “weighted
sum” of its input, adds a bias and then decides whether it should be
“fired” or not.
• So consider a neuron whose value before activation is
Y = sum(weight * input) + bias.
ACTIVATION FUNCTION
• The value of Y can be anything ranging from -inf to +inf. The neuron
really doesn’t know the bounds of the value. So how do we decide
whether the neuron should fire or not? (Why this firing pattern?
Because we learnt from biology that this is the way the brain works, and
the brain is working proof of an intelligent system.)
• We add “activation functions” for this purpose: to check
the Y value produced by a neuron and decide whether outside
connections should consider this neuron as “fired” or not. Or rather,
let’s say, “activated” or not.
ACTIVATION FUNCTION
• If we do not apply an activation function, then the output signal would simply be a
linear function of the input. A linear function is just a polynomial of degree one.
• A linear equation is easy to solve, but linear functions are limited in their complexity
and have less power to learn complex functional mappings from data.
• A neural network without an activation function would simply be a linear
regression model, which has limited power and does not perform well most of
the time.
• We want our neural network to not just learn and compute a linear function but
something more complicated than that.
• Also, without an activation function our neural network would not be able to learn
and model other complicated kinds of data such as images, video, audio and speech.
That is why we use artificial neural network techniques such as deep learning
to make sense of complicated, high-dimensional, non-linear big
datasets, where the model has many hidden layers in between and a
complicated architecture that helps us to extract
knowledge from such datasets.
ACTIVATION FUNCTION
• Activation functions are really important for an Artificial Neural
Network to learn and make sense of complicated,
non-linear functional mappings between the inputs and the
response variable. They introduce non-linear properties to our
network.
• Their main purpose is to convert the input signal of a node in an ANN
to an output signal. That output signal is then used as an input to the
next layer in the stack.
WHY DO WE NEED NON-
LINEARITIES?
• Non-linear functions are those whose output is not a straight-line (degree-one)
function of the input; they show curvature when plotted. We need a
neural network model that can learn and represent almost any
arbitrarily complex function that maps inputs to outputs. Neural networks
are considered universal function approximators: given enough hidden units
and a non-linear activation, they can approximate a very wide class of functions
(for example, any continuous function on a bounded domain) as closely as we like.
Almost any process we can think of can be represented as a functional
computation in a neural network.
• Hence it all comes down to this: we need to apply an activation function
f(x) to make the network more powerful and give it the ability to learn
something complex from data and to represent non-linear,
arbitrary functional mappings between inputs and outputs. By
using a non-linear activation, we are able to generate non-linear mappings
from inputs to outputs.
TYPES OF ACTIVATION FUNCTIONS*
Step function
•Activation function A = “activated” if Y > threshold else not
•Alternatively, A = 1 if Y > threshold, 0 otherwise
•Well, what we just did is a “step function”, see the below figure.

•DRAWBACK: Suppose you are creating a binary classifier, something which should say
“yes” or “no” (activate or not activate). A step function can do that for you! That’s
exactly what it does: say 1 or 0. Now think about the use case where you want
multiple such neurons connected to bring in more classes: class1, class2, class3, etc.
What happens if more than one neuron is “activated”? All of those neurons output a 1 (from
the step function), so there is no way to decide which class it is.
* https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f
TYPES OF ACTIVATION FUNCTIONS
Linear function
•A = cx
•A straight line function where the activation is proportional to the input (which is
the weighted sum from the neuron).
•This way, it gives a range of activations, so it is not a binary activation. We
can connect a few neurons together and, if more than one fires, we
can take the max and decide based on that. So that is fine too. Then what is
the problem with this?
•A = cx, so the derivative with respect to x is c. That means the gradient has no
relationship with x. It is a constant gradient, and the descent is going to be
on a constant gradient. If there is an error in prediction, the changes made by
back propagation are constant and do not depend on the input.
TYPES OF ACTIVATION FUNCTIONS
Sigmoid function

The sigmoid function, A = 1 / (1 + e^(-x)), looks smooth and “step function like”.
What are the benefits of this? It is nonlinear in nature.
Combinations of this function are also nonlinear!
Great. Now we can stack layers. What about non-binary
activations? Yes, that too! It gives an
analog activation, unlike the step function. It has a
smooth gradient too.
Notice that between X values -2 and 2, the curve is very steep. This means that any small change in the
value of X in that region causes the value of Y to change significantly. So this function has a tendency
to bring the Y values to either end of the curve.
• Does that make it good for a classifier? Yes! It tends to bring
the activations to either side of the curve (above x = 2 and below x = -2, for
example), making clear distinctions in prediction.
• Another advantage of this activation function is that, unlike the linear function, its
output is always in the range (0, 1), compared to
(-inf, inf) for the linear function. So our activations are bounded and will not
blow up.
• The sigmoid has long been one of the most widely used activation functions.
Then what are the problems with it?
• Towards either end of the sigmoid function, the Y values
respond very little to changes in X. What does that mean? The gradient in that
region is going to be small. This gives rise to the problem of “vanishing gradients”. So
what happens when the activations reach the “near-horizontal” part of the
curve on either side?
• The gradient is small or has vanished (it cannot make a significant change because of its
extremely small value). The network stops learning or learns drastically
more slowly. There are ways to work around this problem, and sigmoid is still very
popular in classification problems.
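The vanishing gradient can be seen numerically. The sketch below uses the standard sigmoid, 1 / (1 + e^(-x)), and its derivative, sigmoid(x) * (1 - sigmoid(x)); the function names and the sample points are chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_gradient(x):
    # Derivative of the sigmoid, expressed through its own output.
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 (where it equals 0.25) and shrinks
# rapidly towards both ends of the curve.
for x in (-10, -2, 0, 2, 10):
    print(f"x={x:>4}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_gradient(x):.6f}")
```

A weight update is proportional to this gradient, so a neuron sitting far out on either tail barely changes from one training step to the next.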
TYPES OF ACTIVATION FUNCTIONS
Tanh Function
•Another activation function that is used is the tanh function.
This looks very similar to sigmoid. In fact, it is a
scaled and shifted sigmoid function: tanh(x) = 2*sigmoid(2x) - 1.
• This has characteristics similar to the sigmoid discussed above. It
is nonlinear in nature, so we can stack layers. It is bounded to the
range (-1, 1), so there is no worry of activations blowing up. One point to
mention is that the gradient is stronger for tanh than for sigmoid
(the derivatives are steeper). Deciding between sigmoid and tanh will
depend on your requirement of gradient strength. Like sigmoid, tanh
also has the vanishing gradient problem.
• Tanh is also a very popular and widely used activation function,
especially for time series data.
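The relationship between tanh and sigmoid, and the claim about gradient strength, can be checked with a short sketch. It relies on the identity tanh(x) = 2*sigmoid(2x) - 1 and on the standard derivatives of both functions; the function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    # tanh written as a scaled and shifted sigmoid.
    return 2.0 * sigmoid(2.0 * x) - 1.0


def sigmoid_gradient(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_gradient(x):
    return 1.0 - math.tanh(x) ** 2


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={x:>5}  tanh={math.tanh(x):.6f}  via sigmoid={tanh_from_sigmoid(x):.6f}")

# At x = 0 the tanh gradient is 1.0, four times the sigmoid's 0.25.
print(tanh_gradient(0.0), sigmoid_gradient(0.0))
```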
TYPES OF ACTIVATION FUNCTIONS
ReLu
•Later came the ReLu function,
A(x) = max(0,x)
The ReLu function gives an output x if x is positive
and 0 otherwise.
• At first look, this would seem to have the same problems as the linear function,
as it is linear on the positive axis. But ReLu is nonlinear in nature, and
combinations of ReLu are also nonlinear! (In fact it is a good approximator:
any function can be approximated with combinations of ReLu.) Great, so this
means we can stack layers. It is not bounded, though. The range of ReLu is [0,
inf). This means it can blow up the activation.
• Another point to discuss here is the sparsity of the activation. Imagine a big
neural network with a lot of neurons. Using a sigmoid or tanh will cause
almost all neurons to fire in an analog way. That means almost all activations
will be processed to describe the output of the network. In other words, the
activation is dense. This is costly. We would ideally want some neurons in the
network not to activate, thereby making the activations sparse and
efficient.
• ReLu gives us this benefit. Imagine a network with randomly initialized (or
normalized) weights: almost 50% of the network yields 0 activation because of
the characteristic of ReLu (output 0 for negative values of x). This means
fewer neurons are firing (sparse activation) and the network is lighter. ReLu
seems to be awesome! Yes it is, but nothing is flawless, not even ReLu.
• Because of the horizontal line in ReLu (for negative X), the gradient can go
towards 0. For activations in that region of ReLu, the gradient will be 0, because
of which the weights will not get adjusted during descent. That means
those neurons which go into that state will stop responding to variations in
error/input (simply because the gradient is 0, nothing changes). This is called
the dying ReLu problem. This problem can cause several neurons to just die
and not respond, making a substantial part of the network passive. There
are variations of ReLu that mitigate this issue by simply making the
horizontal line into a non-horizontal component. For example, y = 0.01x for
x<0 will make it a slightly inclined line rather than a horizontal line. This is
leaky ReLu. There are other variations too. The main idea is to let the
gradient be non-zero so that the neuron can eventually recover during training.
• ReLu is less computationally expensive than tanh and sigmoid because it
involves simpler mathematical operations. That is a good point to consider
when we are designing deep neural nets.
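A small sketch of ReLu, leaky ReLu and their gradients. The slope 0.01 comes from the example above; the random pre-activations (drawn from a standard normal distribution with a fixed seed) are an assumption used only to illustrate the "almost 50% zero" sparsity claim, and all names are hypothetical.

```python
import random


def relu(x):
    return max(0.0, x)


def relu_gradient(x):
    # Zero for negative inputs: this is what lets a neuron "die".
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def leaky_relu_gradient(x, slope=0.01):
    # Small but non-zero for negative inputs, so the weights keep updating.
    return 1.0 if x > 0 else slope


print(relu(-3.0), relu_gradient(-3.0))              # 0.0 0.0
print(leaky_relu(-3.0), leaky_relu_gradient(-3.0))  # -0.03 0.01

# Sparsity: for pre-activations centred on zero, roughly half of the
# ReLu outputs are exactly 0.
random.seed(0)
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
zero_fraction = sum(1 for v in pre_activations if relu(v) == 0.0) / len(pre_activations)
print(f"fraction of zero activations: {zero_fraction:.3f}")
```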
NOW WHICH ONE DO WE USE?
• Does that mean we just use ReLu for everything we do? Or sigmoid or
tanh? Well, yes and no.
• When you know that the function you are trying to approximate has certain
characteristics, you can choose an activation function which will
approximate it more quickly, leading to a faster training process. For
example, a sigmoid works well for a classifier, because approximating a
classifier function as a combination of sigmoids is easier than with, say, ReLu.
This leads to a faster training process and convergence.
You can use your own custom functions too! If you don’t know the nature
of the function you are trying to learn, then you can start with
ReLu and work backwards. ReLu works most of the time as a general
approximator!
MULTI-LAYERED NEURAL NETWORKS
• Once a training sample is given as an input to the network, each
output node of the single-layered neural network (also
called a Perceptron) takes a weighted sum of all the inputs, passes
it through an activation function and comes up with an output.
The weights are then corrected using the following equation,
For all inputs i,
W(i) = W(i) + a*g’(sum of all inputs)*(T-A)*P(i),
where a is the learning rate and g’ is the derivative of the activation
function.
MULTI-LAYERED NEURAL NETWORKS
• This process is repeated by feeding the whole training set several times until the
network responds with a correct output for all the samples. The training is
possible only for inputs that are linearly separable. This is where multi-layered
neural networks come into picture.
MULTI-LAYERED NEURAL NETWORKS
• Each input from the input layer is fed up to each node in the hidden
layer, and from there to each node on the output layer. We should
note that there can be any number of nodes per layer and there are
usually multiple hidden layers to pass through before ultimately
reaching the output layer.
• But to train this network we need a learning algorithm which should
be able to tune not only the weights between the output layer and
the hidden layer but also the weights between the hidden layer and
the input layer.
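To see why a hidden layer helps, here is a sketch of a tiny two-layer network that computes XOR, a function no single perceptron can represent. The weights and biases are hand-picked for the illustration (they are not learned and are not from the slides): one hidden neuron acts as OR, the other as AND, and the output neuron fires for "OR but not AND".

```python
def step(total):
    return 1 if total >= 0 else 0


def xor_network(x1, x2):
    # Hidden layer: two step-activation neurons with hand-picked parameters.
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)   # fires if at least one input is 1
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)  # fires only if both inputs are 1
    # Output layer: fires when OR is on and AND is off.
    return step(1.0 * h_or - 1.0 * h_and - 0.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_network(x1, x2))
```

Setting such weights by hand does not scale, which is why a learning algorithm that can tune both layers is needed.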
BACK PROPAGATION (BACKWARD
PROPAGATION OF ERRORS)
• Backpropagation is a common method for training a neural network.
• The goal of backpropagation is to optimize the weights so that the
neural network can learn how to correctly map arbitrary inputs to
outputs.
• To tune the weights between the hidden layer and the input layer, we
need to know the error at the hidden layer, but we know the error only
at the output layer (We know the correct output from the training
sample and we also know the output predicted by the network.).
• So, the method that was suggested was to take the errors at the output
layer and proportionally propagate them backwards to the hidden
layer.
BACK PROPAGATION
• For a particular neuron i in the output layer
for all j { Wj,i = Wj,i + a*g’(sum of all inputs)*(T-A)*P(j) }
where P(j) is the output of hidden neuron j.
• This equation tunes the weights between the output layer and the hidden
layer.
• For a particular neuron j in the hidden layer, we propagate the error
backwards from the output layer, thus
Error = Wj,1 * E1 + Wj,2 * E2 + …..
for all the neurons in the output layer, where Ei = g’(sum of all inputs)*(T-A)
is the error term of output neuron i.
• Thus, for a particular neuron j in the hidden layer
for all k { Wk,j = Wk,j + a*g’(sum of all inputs)*Error*P(k) }
where P(k) is the k-th input. Error takes the place of (T-A) because no
target value is known for a hidden neuron.
• This equation tunes the weights between the hidden layer and the input
layer.
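One update of this kind can be sketched as follows for a network with a single hidden layer. The sketch assumes sigmoid activations (so g’ equals output * (1 - output)), omits biases for brevity, and uses the propagated error in place of (T-A) for the hidden neurons. The starting weights, the learning rate of 0.5 and all names are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_output):
    # w_hidden[j][k]: weight from input k to hidden neuron j.
    # w_output[i][j]: weight from hidden neuron j to output neuron i.
    hidden = [sigmoid(sum(w * xk for w, xk in zip(row, x))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_output]
    return hidden, output


def squared_error(x, target, w_hidden, w_output):
    _, output = forward(x, w_hidden, w_output)
    return sum(0.5 * (t - a) ** 2 for t, a in zip(target, output))


def backprop_step(x, target, w_hidden, w_output, rate=0.5):
    hidden, output = forward(x, w_hidden, w_output)
    # Output error terms: E_i = g'(sum) * (T - A).
    e_out = [a * (1 - a) * (t - a) for a, t in zip(output, target)]
    # Hidden error terms: g'(sum) * (W_j,1 * E_1 + W_j,2 * E_2 + ...).
    e_hidden = [
        h * (1 - h) * sum(w_output[i][j] * e_out[i] for i in range(len(e_out)))
        for j, h in enumerate(hidden)
    ]
    new_w_output = [
        [w + rate * e_out[i] * hidden[j] for j, w in enumerate(row)]
        for i, row in enumerate(w_output)
    ]
    new_w_hidden = [
        [w + rate * e_hidden[j] * x[k] for k, w in enumerate(row)]
        for j, row in enumerate(w_hidden)
    ]
    return new_w_hidden, new_w_output


x, target = [0.05, 0.10], [0.01, 0.99]
w_hidden = [[0.15, 0.20], [0.25, 0.30]]
w_output = [[0.40, 0.45], [0.50, 0.55]]

error_before = squared_error(x, target, w_hidden, w_output)
w_hidden, w_output = backprop_step(x, target, w_hidden, w_output)
error_after = squared_error(x, target, w_hidden, w_output)
print(error_before, error_after)
```

The hidden error terms are computed from the old output weights before any weight is changed, which is why both new weight matrices are built first and returned together.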
Learning by Gradient Descent Error
Minimization
Practical Considerations for Gradient Descent Learning
There are a number of important practical/implementational considerations
that must be taken into account when training neural networks:
1. Do we need to pre-process the training data? If so, how?
2. How many hidden units do we need?
3. Are some activation functions better than others?
4. How do we choose the initial weights from which we start the training?
5. Should we have different learning rates for the different layers?
6. How do we choose the learning rates?
7. Do we change the weights after each training pattern, or after the whole set?
8. How do we avoid flat spots in the error function?
9. How do we avoid local minima in the error function?
10. When do we stop training?
In general, the answers to these questions are highly problem dependent.
Multi-Layer Perceptrons (MLPs)
• To deal with non-linearly separable problems we can use non-monotonic
activation functions. More conveniently, we can instead extend the simple
Perceptron to a Multi-Layer Perceptron, which includes at least one hidden layer
of neurons with non-linear activation functions f(x) (such as sigmoids).
• Note that if the activation on the hidden layer were linear, the network would
be equivalent to a single-layer network, and wouldn’t be able to cope with
non-linearly separable problems.
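The equivalence in the last point can be verified with a few lines of arithmetic: applying two weight matrices in sequence with no non-linearity in between gives the same result as applying their product once. The matrices and input below are arbitrary illustrative values, and biases are left out to keep the sketch short.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


W_HIDDEN = [[1.0, -2.0], [0.5, 3.0]]  # 2 inputs -> 2 hidden neurons (linear)
W_OUTPUT = [[2.0, 1.0]]               # 2 hidden neurons -> 1 output
x = [0.3, -0.7]

# Two linear layers applied one after the other...
two_layer = matvec(W_OUTPUT, matvec(W_HIDDEN, x))
# ...equal one layer whose weights are the matrix product.
single_layer = matvec(matmul(W_OUTPUT, W_HIDDEN), x)

print(two_layer, single_layer)
```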
BACK PROPAGATION BY EXAMPLE
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

• Consider a neural network with two inputs, two hidden neurons, two
output neurons. Additionally, the hidden and output neurons will
include a bias.
In order to have some numbers to work with, here are the initial
weights, the biases, and training inputs/outputs: given inputs 0.05 and
0.10, we want the neural network to output 0.01 and 0.99.
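The figure with the initial weights is not reproduced in this text. The sketch below assumes the values used in the cited article (weights 0.15 to 0.30 and bias 0.35 for the hidden layer, weights 0.40 to 0.55 and bias 0.60 for the output layer) together with sigmoid activations; treat these numbers as assumptions. It shows only the forward pass and the total squared error that backpropagation then tries to reduce.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


inputs = [0.05, 0.10]
targets = [0.01, 0.99]

# ASSUMED initial values, taken from the cited article rather than this text.
hidden_weights = [[0.15, 0.20], [0.25, 0.30]]  # one row per hidden neuron
hidden_bias = 0.35
output_weights = [[0.40, 0.45], [0.50, 0.55]]  # one row per output neuron
output_bias = 0.60

# Forward pass: weighted sum plus bias, then the sigmoid, layer by layer.
hidden = [
    sigmoid(sum(w * x for w, x in zip(row, inputs)) + hidden_bias)
    for row in hidden_weights
]
outputs = [
    sigmoid(sum(w * h for w, h in zip(row, hidden)) + output_bias)
    for row in output_weights
]

# Total squared error over both output neurons.
total_error = sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))

print("outputs:", outputs)
print("total error:", total_error)
```

With these assumed values the outputs come out at about 0.751 and 0.773, far from the targets 0.01 and 0.99, so there is a clear error for training to work on.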