Neural Network (Basics)


Neural Networks

• Neural Networks are networks of interconnected neurons, for example in human brains.
• In Artificial Neural Networks, neurons are highly connected to other neurons and perform computations by combining signals from other neurons.
• Outputs of these computations may be transmitted to one or more other neurons.
• The neurons are connected together in a specific way to perform a particular task.
Artificial Neural Networks (High-Level Overview)
• A neural network is a function.
• It consists basically of:
  a. Neurons: which pass input values through functions and output the result.
  b. Weights: which carry (real-number) values between neurons.
• Neurons can be categorized into layers:
  a. Input layer
  b. Hidden layer
  c. Output layer
Neurophysiology

• The human nervous system can be divided into three stages:


a. Receptors:
■ Convert stimuli from the external environment into electrical impulses
■ Rods and Cones of eyes,
■ Pain, touch, hot and cold receptors of skin.
b. Neural Net:
■ Receive information, process it and make appropriate decisions.
■ Brain
c. Effectors:
■ Convert electrical impulses generated by the neural net (brain) into responses to the external environment.
■ Muscles and glands, speech generators.
Basic Components of Biological Neurons
The basic components of a
biological neuron are:
● Cell Body (Soma) processes the incoming
activations and converts them into output
activations.
● Neuron Nucleus contains the genetic material
(DNA).
● Dendrites: form a fine filamentary bush, each fiber thinner than an axon.
● Axon: a long thin cylinder carrying impulses from the soma to other cells.
● Synapses: the junctions that allow signal transmission between the axons and dendrites.
Computation in Biological Neurons
● Incoming signals from
synapses are summed up
at the soma.
● On crossing a threshold, the
cell fires generating an action
potential in the axon hillock
region.

How do ANNs work?

An artificial neuron is an imitation of a human neuron.

• Now, let us have a look at the model of an artificial neuron.
• In the simplest model, the inputs x1, x2, …, xm are simply summed by the processing unit:
  ∑ = x1 + x2 + … + xm = y
• Not all inputs are equal: each input is first multiplied by a weight, and the weighted inputs are summed:
  ∑ = x1w1 + x2w2 + … + xmwm = y
• The signal is not passed down to the next neuron verbatim: the weighted sum vk is passed through a transfer function (activation function) f(vk) to produce the output y.
• The output is therefore a function of the inputs, affected by the weights and the transfer function.
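As a rough illustration of the three steps above (inputs, weighted sum, transfer function), a single artificial neuron can be written in a few lines of NumPy. This is only a sketch; the function name artificial_neuron and the choice of tanh as transfer function are ours, not from the slides.

import numpy as np

def artificial_neuron(x, w, transfer=np.tanh):
    # processing: weighted sum v = x1*w1 + x2*w2 + ... + xm*wm
    v = np.dot(x, w)
    # the transfer (activation) function decides what gets passed on
    return transfer(v)

x = np.array([0.2, 0.7, 1.0])     # inputs x1..xm
w = np.array([0.5, -1.0, 0.25])   # weights w1..wm
print(artificial_neuron(x, w))    # output y = f(v)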
The Perceptron Model
● Motivated by the biological neuron.
● A perceptron is a computing element where the inputs x1, x2, …, xn are associated with weights w1, w2, …, wn, and the cell has a threshold value, producing an output y.
Neural network architectures

There are three fundamental classes of ANN architectures:

● Single-layer feed-forward architecture
● Multilayer feed-forward architecture
● Recurrent network architecture

Before discussing these architectures, we first discuss the mathematical details of a neuron at a single level. To do this, let us first consider the AND problem and its possible solution with a neural network.


The AND problem and its neural network
The simple Boolean AND operation with two input variables x1 and x2 is shown in the truth table below.
Here, we have four input patterns: 00, 01, 10 and 11.
For the first three patterns the output is 0, and for the last pattern the output is 1.

x1  x2 | y
 0   0 | 0
 0   1 | 0
 1   0 | 0
 1   1 | 1


The AND problem and its neural network
Alternatively, the AND problem can be thought of as a perception problem where we have to receive four different patterns as input and perceive the result as 0 or 1: the patterns 00, 01 and 10 map to 0, and the pattern 11 maps to 1.

(Figure: a single neuron with inputs x1, x2 and weights w1, w2.)


The AND problem and its neural network
A possible neuron specification to solve the AND problem is given in the following. In this solution, when the input is 11, the weighted sum exceeds the threshold (θ = 0.9), leading to the output 1; otherwise it gives the output 0.

Here, y = Σi wi xi − θ, with w1 = 0.5, w2 = 0.5 and θ = 0.9.
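The slide's solution (w1 = w2 = 0.5, θ = 0.9) can be checked with a few lines of Python. This is only a verification sketch of y = Σ wi·xi − θ followed by a step output; the loop and variable names are our own.

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
w1, w2, theta = 0.5, 0.5, 0.9

for x1, x2 in inputs:
    net = w1 * x1 + w2 * x2 - theta   # y = sum(wi * xi) - theta
    output = 1 if net > 0 else 0      # fire only when the weighted sum exceeds the threshold
    print(x1, x2, "->", output)       # prints 0, 0, 0, 1, matching the truth table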
The Perceptron Model
● Rewrite Σ wi xi as w·x.
● Replace the threshold θ with a bias b = −θ, treated as the weight on an extra input fixed at 1.
● b: bias, a prior inclination towards some decision.
Activation Functions
● Activation functions decide whether a neuron should be activated or not.
● They help the network use the useful information and suppress the irrelevant information.
● Usually a nonlinear function.
  ○ What if we choose a linear one?
  ○ The network reduces to a linear classifier.
  ○ Limited capacity to solve complex problems.
Activation Functions (cont’d)
● Sigmoid
  ○ continuously differentiable
  ○ ranges from 0 to 1
  ○ not symmetric around the origin
● Tanh
  ○ scaled version of the sigmoid
  ○ symmetric around the origin
  ○ vanishing gradient
Activation Functions (cont’d)
● ReLU
  ○ Also called a piecewise linear function, because the rectified function is linear for half of the input domain and nonlinear for the other half.
  ○ trivial to implement
  ○ sparse representation
  ○ avoids the problem of vanishing gradients
  ○ can suffer from dead neurons
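For reference, the three activation functions above are one-liners in NumPy. This is a generic illustration rather than code from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output in (0, 1), not symmetric around 0

def tanh(z):
    return np.tanh(z)                 # scaled sigmoid, output in (-1, 1), symmetric around 0

def relu(z):
    return np.maximum(0.0, z)         # linear for z > 0, zero otherwise

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")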
Representation Power
● A neural network with at least one hidden layer can approximate any function. [1]
● The representation power of a network increases with more hidden units and more hidden layers.
● But, “with great power comes great overfitting.”

[1] Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2.4 (1989): 303-314.
Feed-forward Neural Network

(Figure: a feed-forward network with an input layer x1–x3, a hidden layer a1–a4, and an output layer y1–y2.)
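To make the figure concrete, here is an illustrative NumPy sketch of a forward pass through a 3–4–2 network like the one above. The random weight values and the names sigmoid, W1, W2 are our own choices, not taken from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
W1 = rng.normal(size=(4, 3))   # input (x1..x3) -> hidden (a1..a4)
W2 = rng.normal(size=(2, 4))   # hidden (a1..a4) -> output (y1, y2)

x = np.array([0.1, 0.5, -0.3])        # input layer
a = sigmoid(W1 @ x)                   # hidden layer activations
y = sigmoid(W2 @ a)                   # output layer
print(a.shape, y.shape)               # (4,) (2,)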
Objective Function
● The function we want to minimize or maximize is called the
objective function or criterion.
● When we are minimizing it, we may also call it the cost
function, loss function, or error function.
● A loss function tells how good our current classifier is.
● Given a dataset of examples (xi, yi), i = 1, …, N, the overall loss is the average of the per-example losses.
Objective Function (cont’d)
● Mean Squared Error:
○ Mean Squared Error (MSE), or quadratic, loss function is
widely used in linear regression as the performance
measure.
○ It measures the average of the squares of the errors—that
is, the average squared difference between the
estimated values and the actual value.
○ It is always non-negative, and values closer to zero are
better.

Objective Function (cont’d)
● Mean Absolute Error:
○ Mean Absolute Error (MAE) is a quantity used to
measure how close forecasts or predictions are to the
eventual outcomes.
○ Both MSE and MAE are used in predictive modeling.
○ MSE has nice mathematical properties which makes it
easier to compute the gradient.

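Both losses are easy to state in code. The snippet below is a generic illustration (not from the slides) comparing MSE and MAE on the same made-up predictions.

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)     # average of squared errors
mae = np.mean(np.abs(y_true - y_pred))    # average of absolute errors
print(mse, mae)                           # 0.375 0.5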
Objective Function (cont’d)
● Cross-entropy:
  ○ Cross-entropy comes from the field of information theory and has the unit of “bits.”
  ○ The cross-entropy between a “true” distribution p and an estimated distribution q is defined as:
    H(p, q) = − Σx p(x) log q(x)
  ○ Cross-entropy can be re-written in terms of the entropy and the Kullback-Leibler divergence between the two distributions.
Objective Function (cont’d)
● Cross-entropy:
  ○ Assuming a ground-truth probability distribution that is 1 at the right class and 0 everywhere else, p = [0, …, 0, 1, 0, …, 0], and our computed probability is q.
  ○ The Kullback-Leibler divergence can then be written as:
    DKL(p ‖ q) = Σx p(x) log (p(x) / q(x)) = − log q(correct class)
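With a one-hot p as described above, the cross-entropy reduces to the negative log-probability of the correct class. Here is a small illustrative computation; the distributions are made-up numbers, not an example from the slides.

import numpy as np

p = np.array([0.0, 1.0, 0.0, 0.0])        # one-hot "true" distribution
q = np.array([0.1, 0.7, 0.1, 0.1])        # model's estimated distribution

# full definition: H(p, q) = -sum_x p(x) * log q(x)
cross_entropy = -np.sum(p * np.log(q))
# for a one-hot p this is just -log of the probability at the right class
print(cross_entropy, -np.log(q[1]))       # both ~0.3567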
Optimization
● The goal of optimization is to find the parameters (weights) that minimize the loss function.
● How to find such weights?
  ○ Random Search
    ■ Very bad idea.
  ○ Random Local Search
    ■ Start with a random weight vector w, generate random perturbations Δw, and if the loss at the perturbed w + Δw is lower, perform an update (see the sketch below).
    ■ Computationally expensive.
  ○ Follow the Gradient
    ■ No need to search for a good direction.
    ■ We can compute the best direction along which we should change our weight vector, which is mathematically guaranteed to be the direction of steepest descent.
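As a contrast to following the gradient, this is roughly what the random local search idea above might look like on a toy 1-D loss. It is an illustrative sketch under our own assumptions (the quadratic loss and step size are invented), not an algorithm given on the slides.

import numpy as np

def loss(w):
    return (w - 3.0) ** 2          # toy error function with its minimum at w = 3

rng = np.random.default_rng(0)
w = rng.normal()                   # start from a random weight
for _ in range(1000):
    dw = 0.1 * rng.normal()        # random perturbation
    if loss(w + dw) < loss(w):     # keep the step only if it lowers the loss
        w = w + dw
print(w)                           # close to 3, but only after many wasted evaluations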
Optimization (cont’d)
Find the w which minimizes the chosen error function E(w).

● wA : a local minimum
● wB : the global minimum
● At a point wC the local gradient is given by the vector ∇E(w).
● It points in the direction of the greatest rate of increase of E(w).
● The negative gradient points in the direction of the greatest rate of decrease of E(w).
Gradient and Hessian
● The first derivative of a scalar function E(w) with respect to a vector w = [w1, w2]T is a vector called the gradient of E(w):
  ∇E(w) = [∂E/∂w1, ∂E/∂w2]T
● The second derivative of a scalar function E(w) with respect to a vector w = [w1, w2]T is a matrix called the Hessian of E(w), with entries Hmn = ∂²E/(∂wm ∂wn).
Gradient Descent Optimization
● Determine the weights w from a labeled set of training samples.
● Take a small step in the direction of the negative gradient:

  wnew = wold − η ∇E(wold)

● After each update, the gradient is re-evaluated for the new weight vector and the process is repeated.
● The size of the steps η taken to reach the minimum is called the learning rate.
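The update rule wnew = wold − η ∇E(wold) in runnable form, on the same toy quadratic used earlier. This is our own illustration of the rule, not code from the slides.

def E(w):
    return (w - 3.0) ** 2      # error function

def grad_E(w):
    return 2.0 * (w - 3.0)     # its gradient

w = -5.0                       # initial weight
eta = 0.1                      # learning rate (step size)
for _ in range(100):
    w = w - eta * grad_E(w)    # step in the direction of the negative gradient
print(w, E(w))                 # w is near 3, E(w) is near 0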
Gradient Descent Variants
● Batch gradient descent:
  ○ Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters w for the entire training dataset:
  ○ wnew = wold − η ∇E(wold)
  ○ Guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
  ○ Needs to calculate the gradients for the whole dataset to perform just one update.
  ○ Very slow, and intractable for datasets that don't fit in memory.
Gradient Descent Variants
● Stochastic gradient descent:
  ○ Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example (xi, yi):
  ○ wnew = wold − η ∇E(wold; xi; yi)
  ○ Much faster (avoids the redundancy present in batch gradient descent).
  ○ When the learning rate is slowly decreased, SGD shows the same convergence behaviour as batch gradient descent.
  ○ It performs frequent updates with high variance.
Gradient Descent Variants
● Mini-batch gradient descent:
  ○ Performs an update for every mini-batch of n examples:
  ○ wnew = wold − η ∇E(wold; xi:i+n; yi:i+n)
  ○ Reduces the variance of the updates.
  ○ The algorithm of choice in practice.
  ○ The mini-batch size is a hyperparameter; common sizes are 50–256.
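A typical mini-batch loop, sketched for a linear model with the MSE loss. All names, shapes, and hyperparameter values here are illustrative assumptions, not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # training inputs
y = X @ np.array([1.0, -2.0, 0.5])        # targets from a known linear rule

w = np.zeros(3)
eta, batch_size = 0.1, 64
for epoch in range(20):
    order = rng.permutation(len(X))       # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]           # one mini-batch of examples
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(Xb)   # gradient of MSE on the batch
        w = w - eta * grad                # one parameter update per mini-batch
print(w)                                  # close to [1, -2, 0.5]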
Backpropagation Algorithm
● The backpropagation algorithm is used to train artificial neural networks; it can update the weights very efficiently.
● It is a computationally efficient approach to compute the derivatives of a complex cost function.
● The goal is to use those derivatives to learn the weight coefficients that parameterize a multi-layer artificial neural network.
● It computes the gradient of the cost function with respect to all the weights in the network, and this gradient is fed to the gradient descent method, which in turn uses it to update the weights in order to minimize the cost function.
Backpropagation Algorithm (cont’d)
● Chain Rule:
  ○ Single path: x → y → z
    dz/dx = (dz/dy) · (dy/dx)
  ○ Multiple paths: x → y1 → z and x → y2 → z
    ∂z/∂x = (∂z/∂y1) · (∂y1/∂x) + (∂z/∂y2) · (∂y2/∂x)
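The two cases can be checked numerically. Below is an illustrative sketch with example functions of our own choosing, confirming that the multi-path chain rule sums the contributions along each path.

import numpy as np

# single path: x -> y -> z with y = x**2, z = sin(y)
x = 0.7
dz_dx_chain = np.cos(x**2) * 2*x                   # dz/dy * dy/dx

# multiple paths: x -> y1 and x -> y2, z = y1 * y2 with y1 = 2x, y2 = x**2
y1, y2 = 2*x, x**2
dz_dx_multi = y2 * 2 + y1 * 2*x                    # dz/dy1*dy1/dx + dz/dy2*dy2/dx

# compare with finite-difference derivatives
eps = 1e-6
num_single = (np.sin((x+eps)**2) - np.sin((x-eps)**2)) / (2*eps)
num_multi = (2*(x+eps)*(x+eps)**2 - 2*(x-eps)*(x-eps)**2) / (2*eps)
print(dz_dx_chain, num_single)   # agree
print(dz_dx_multi, num_multi)    # agree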
Backpropagation Algorithm (cont’d)
● The total error in the network for a single input is given by the error function E over the output units.

(Figure on this and the following slides: weights wij connect the input layer (i) to the hidden layer (j), and weights wjk connect the hidden layer (j) to the output layer (k).)
Backpropagation Algorithm (cont’d)
● There are two sets of weights in our network:
  ○ wij : from the input layer to the hidden layer.
  ○ wjk : from the hidden layer to the output layer.
● We want to adjust the network’s weights to reduce this overall error.
Backpropagation Algorithm (cont’d)
● Backpropagation for the outermost layer:
  ○ The outermost-layer parameters directly affect the value of the error function.
  ○ Only one term of the summation in E will have a non-zero derivative: the one associated with the particular weight we are considering.
Backpropagation Algorithm (cont’d)
● Backpropagation for the outermost layer: the derivative of E with respect to wjk, derived on the slides for the sigmoid activation function.
● Backpropagation for the hidden layer: the derivative of E with respect to wij, derived on the slides by propagating the output-layer error terms back through the weights wjk.
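The derivations on the preceding slides are shown as images. As a stand-in, here is a compact backpropagation sketch for one hidden layer with sigmoid units and a squared-error loss E = ½ Σ (o − t)². The names W_ij, W_jk, delta_j, delta_k mirror the slides' notation, but the code and the loss choice are our own illustration, not the slides' derivation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_ij = rng.normal(scale=0.5, size=(4, 3))   # input (i) -> hidden (j)
W_jk = rng.normal(scale=0.5, size=(2, 4))   # hidden (j) -> output (k)

x = np.array([0.5, -0.1, 0.8])              # single training input
t = np.array([1.0, 0.0])                    # its target

eta = 0.5
for _ in range(1000):
    # forward pass
    h = sigmoid(W_ij @ x)                   # hidden activations
    o = sigmoid(W_jk @ h)                   # output activations
    # backward pass (chain rule) for E = 0.5 * sum((o - t)**2)
    delta_k = (o - t) * o * (1 - o)                 # error signal at the output layer
    delta_j = (W_jk.T @ delta_k) * h * (1 - h)      # error propagated back to the hidden layer
    # gradient descent updates
    W_jk -= eta * np.outer(delta_k, h)
    W_ij -= eta * np.outer(delta_j, x)
print(o)                                    # close to the target [1, 0]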
References
● https://www.cs.swarthmore.edu/~meeden/cs81/s10/BackPropDeriv.pdf
● https://cs224d.stanford.edu/lecture_notes/notes3.pdf
● http://www.cs.cmu.edu/~ninamf/courses/315sp19/lectures/3_29-NNs.pdf
● https://cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.1-FeedFor.pdf
● http://ruder.io/optimizing-gradient-descent/
● http://www.cs.cmu.edu/~ninamf/courses/315sp19/lectures/Perceptron-01-25-2019.pdf
● http://www.cs.cornell.edu/courses/cs5740/2016sp/resources/backprop.pdf
Thank You!
