Neural Network (Basics)
Neural Networks
• Neural Networks are networks of interconnected neurons, for example in human brains.
• In Artificial Neural Networks, neurons are highly connected to other neurons and perform computations by combining signals from other neurons.
• Outputs of these computations may be transmitted to one or more other neurons.
• The neurons are connected together in a specific way to perform a particular task.
Artificial Neural Networks (High-Level Overview)
• A neural network is a function.
• It basically consists of:
  a. Neurons: which pass input values through functions and output the result.
  b. Weights: which carry values (real numbers) between neurons.
• Neurons can be categorized into layers:
  a. Input Layer
  b. Hidden Layer
  c. Output Layer
Neurophysiology
How do ANNs work?
Input: x1, x2, …, xm
Processing: Σ = x1 + x2 + … + xm = y
Output: y
How do ANNs work?
Not all inputs are equal.
Input: x1, x2, …, xm
Weights: w1, w2, …, wm
Processing: Σ = x1w1 + x2w2 + … + xmwm = y
Output: y
How do ANNs work?
The signal is not passed down to the next neuron verbatim.
Input: x1, x2, …, xm
Weights: w1, w2, …, wm
Processing: vk = x1w1 + x2w2 + … + xmwm
Transfer function (activation function): y = f(vk)
Output: y
The output is a function of the input, shaped by the weights and the transfer function.
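Putting the pieces above together, here is a minimal Python sketch of a single neuron: it combines its inputs with their weights and passes the result through a transfer function (sigmoid is used purely as an example choice; the slides have not fixed one at this point).

```python
import math

def neuron_output(inputs, weights):
    """Weighted sum of the inputs passed through a sigmoid transfer function."""
    v = sum(x * w for x, w in zip(inputs, weights))  # v = x1*w1 + x2*w2 + ... + xm*wm
    return 1.0 / (1.0 + math.exp(-v))                # y = f(v), with sigmoid as the example f

# Example: three inputs with their associated weights
print(neuron_output([1.0, 0.5, -0.2], [0.4, 0.3, 0.9]))
```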
The Perceptron Model
● Motivated by the biological neuron.
● A perceptron is a computing element whose inputs x1, x2, … are associated with weights w1, w2, … and combined to produce an output y.
Neural network architectures
● Example: a perceptron with inputs x1, x2, weights w1, w2, and a summation unit Σ.
● Here y = Σ wi xi − θ, with w1 = 0.5, w2 = 0.5 and θ = 0.9.
● With a step activation (output 1 if Σ wi xi ≥ θ, else 0), these values implement the logical AND function:

  x1  x2 | output
   0   0 |   0
   1   0 |   0
   0   1 |   0
   1   1 |   1
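A small Python sketch of this AND perceptron, assuming the step-activation reading above; the weights 0.5, 0.5 and the threshold 0.9 are the values given on the slide.

```python
def perceptron(x1, x2, w1=0.5, w2=0.5, theta=0.9):
    """Fire (output 1) only if the weighted sum of the inputs reaches the threshold."""
    return 1 if (w1 * x1 + w2 * x2) >= theta else 0

# Enumerate the truth table: only (1, 1) reaches the 0.9 threshold, so this is AND.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(x1, x2))
```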
The Perceptron Model
● Rewrite Σ wi xi as w·x.
● Replace the threshold by an extra input fixed at 1 whose weight is −b.
● b: Bias, a prior inclination towards some decision.
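One way to read the threshold-to-bias rewriting, assuming b denotes the original threshold (which would match a figure weight of −b on the constant input 1):

```latex
\sum_i w_i x_i \ge b
\quad\Longleftrightarrow\quad
\mathbf{w}\cdot\mathbf{x} + (-b)\cdot 1 \ge 0
```

so the threshold becomes just another weight, attached to a constant input of 1, and can be learned like the other weights.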
Activation Functions
● The activation function decides whether a neuron should be activated or not.
● It helps the network use the useful information and suppress the irrelevant information.
● Usually a nonlinear function.
  ○ What if we choose a linear one?
  ○ The network reduces to a linear classifier.
  ○ Limited capacity to solve complex problems.
Activation Functions (cont’d)
● Sigmoid (see the sketch below)
  ○ continuously differentiable
  ○ ranges from 0 to 1
  ○ not symmetric around the origin
● Tanh
  ○ a scaled version of the sigmoid
  ○ symmetric around the origin
  ○ suffers from vanishing gradients
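For reference, both functions written out in NumPy; the NumPy choice is mine, and the comments restate the bullets above.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: smooth and differentiable, output in (0, 1), not centered at 0."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Tanh: a scaled sigmoid, output in (-1, 1), symmetric around the origin."""
    return np.tanh(x)  # equivalently: 2 * sigmoid(2 * x) - 1

x = np.array([-3.0, 0.0, 3.0])
print(sigmoid(x))  # saturates toward 0 and 1 at the extremes, where gradients vanish
print(tanh(x))     # saturates toward -1 and 1 at the extremes
```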
Activation Functions (cont’d)
● ReLU
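ReLU (Rectified Linear Unit) is defined as

```latex
\mathrm{ReLU}(x) = \max(0, x)
```

Unlike sigmoid and tanh it does not saturate for positive inputs, which helps with the vanishing-gradient issue noted above.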
Representation Power
● A neural network with at least one hidden layer can approximate any function. [1]
● The representation power of the network increases with more hidden units and more hidden layers.
● But, “with great power comes great overfitting.”
[1] Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2.4 (1989): 303-314.
Feed-forward Neural Network
(figure: inputs x1, x2, x3 feed hidden units a1, a2, a3, a4, which feed outputs y1, y2)
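A minimal forward-pass sketch matching the figure's layer sizes (3 inputs, 4 hidden units, 2 outputs); the random weights and the sigmoid activation are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # input layer (3) -> hidden layer (4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # hidden layer (4) -> output layer (2)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x):
    a = sigmoid(W1 @ x + b1)   # hidden activations a1..a4
    y = sigmoid(W2 @ a + b2)   # outputs y1, y2
    return y

print(forward(np.array([0.5, -1.0, 2.0])))
```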
Objective Function (cont’d)
● Mean Squared Error:
○ Mean Squared Error (MSE), or quadratic, loss function is
widely used in linear regression as the performance
measure.
○ It measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values.
○ It is always non-negative, and values closer to zero are better; the formula is given below.
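Written as a formula, with y_i the actual value, ŷ_i the estimated value, and n samples:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```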
Objective Function (cont’d)
● Mean Absolute Error:
○ Mean Absolute Error (MAE) is a quantity used to
measure how close forecasts or predictions are to the
eventual outcomes.
○ Both MSE and MAE are used in predictive modeling.
○ MSE has nice mathematical properties which make it easier to compute the gradient.
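MAE in the same notation as the MSE formula above:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
```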
Objective Function (cont’d)
● Cross-entropy:
○ Cross-entropy comes from the field of information theory and has the unit of “bits.”
○ The cross-entropy between a “true” distribution p and an estimated distribution q is defined as:
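In its standard form (measured in bits when the logarithm is base 2):

```latex
H(p, q) = -\sum_{x} p(x)\,\log q(x)
```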
Objective Function (cont’d)
● Cross-entropy:
○ Assuming a ground truth probability distribution that
is 1 at the right class and 0 everywhere else p = [0,
…,0,1,0,…0] and our computed probability is q
○ Kullback-Leibler divergence can be written as:
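Reconstructed in its standard form; note that for a one-hot p, H(p) = 0, so minimizing the cross-entropy is equivalent to minimizing the KL divergence:

```latex
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} = H(p, q) - H(p)
```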
Optimization
● The goal of optimization is to find the parameters (weights) that minimize the loss function.
● How to find such weights?
  ○ Random Search
    ■ Very bad idea.
  ○ Random Local Search (sketched below)
    ■ Start with a random weight w and generate random perturbations Δw to it; if the loss at the perturbed w + Δw is lower, perform an update.
    ■ Computationally expensive.
  ○ Follow the Gradient
    ■ No need to search for a good direction.
    ■ We can compute the best direction along which to change our weight vector, which is mathematically guaranteed to be the direction of steepest descent.
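A toy Python sketch of the random local search idea on a one-dimensional quadratic loss; the loss function, perturbation scale, and iteration count are illustrative choices, not from the slides.

```python
import random

def loss(w):
    return (w - 3.0) ** 2  # toy loss with its minimum at w = 3

w = random.uniform(-10, 10)        # start from a random weight
for _ in range(1000):
    dw = random.gauss(0.0, 0.1)    # random perturbation Δw
    if loss(w + dw) < loss(w):     # keep the step only if it lowers the loss
        w += dw
print(w)  # ends up close to 3.0
```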
Optimization (cont’d)
Find the w which minimizes the chosen error function E(w).
● wA : a local minimum
● wB : the global minimum
● At a point wC the local gradient is given by the vector ∇E(w).
● It points in the direction of greatest rate of increase of E(w).
● The negative gradient therefore points in the direction of greatest rate of decrease of E(w).
Gradient and Hessian
● The first derivative of a scalar function E(w) with respect to a vector w = [w1, w2]T is a vector called the gradient of E(w).
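Written out for the two-dimensional w = [w1, w2]T used on the slide, together with the Hessian named in the title (the matrix of second derivatives):

```latex
\nabla E(\mathbf{w}) =
\begin{bmatrix}
\dfrac{\partial E}{\partial w_1} \\[4pt]
\dfrac{\partial E}{\partial w_2}
\end{bmatrix},
\qquad
\mathbf{H} = \nabla^{2} E(\mathbf{w}) =
\begin{bmatrix}
\dfrac{\partial^{2} E}{\partial w_1^{2}} & \dfrac{\partial^{2} E}{\partial w_1\,\partial w_2} \\[4pt]
\dfrac{\partial^{2} E}{\partial w_2\,\partial w_1} & \dfrac{\partial^{2} E}{\partial w_2^{2}}
\end{bmatrix}
```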
Gradient Descent Optimization
● Determine the weights w from a labeled set of training samples.
● Take a small step in the direction of the negative gradient.
Gradient Descent Variants
● Batch gradient descent:
  ○ Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters w for the entire training dataset.
  ○ wnew = wold − η ∇E(wold)
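A compact Python sketch of the batch update wnew = wold − η ∇E(wold) on a small least-squares problem; the data, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

# Toy dataset: inputs X and targets t generated from a known weight vector
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
t = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)

w = np.zeros(2)   # initial weights
eta = 0.1         # learning rate η
for _ in range(200):
    error = X @ w - t
    grad = X.T @ error / len(t)   # gradient of E(w) = (1/2n) * sum(error**2)
    w = w - eta * grad            # w_new = w_old - η ∇E(w_old)
print(w)  # approaches [2.0, -1.0]
```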
Backpropagation Algorithm
● The backpropagation algorithm is used to train artificial neural networks; it can update the weights very efficiently.
● It is a computationally efficient approach to computing the derivatives of a complex cost function.
● The goal is to use those derivatives to learn the weight coefficients that parameterize a multi-layer artificial neural network.
● It computes the gradient of a cost function with respect to all the weights in the network, and this gradient is fed to the gradient descent method, which in turn uses it to update the weights in order to minimize the cost function.
Backpropagation Algorithm (cont’d)
● Chain Rule:
  ○ Single path: x → y → z
  ○ Multiple paths: x → y1, …, yn → z
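Written out, these are the two standard forms of the chain rule; in the multi-path case the sum runs over the intermediate variables y_i:

```latex
\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}
\quad\text{(single path)}
\qquad\qquad
\frac{\partial z}{\partial x} = \sum_{i}\frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}
\quad\text{(multiple paths)}
```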
Backpropagation Algorithm (cont’d)
● The total error in the network for a single input, with weights wij between the input and hidden layers and wjk between the hidden and output layers, is given by the equation below.
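A common choice for this error, consistent with the sigmoid-based updates on the following slides, is the sum-of-squares error over the output units k, with t_k the target and y_k the network output; this is an assumption on my part, since the slide's own equation is not reproduced in the text.

```latex
E = \frac{1}{2}\sum_{k}\left(t_k - y_k\right)^{2}
```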
Backpropagation Algorithm (cont’d)
● Backpropagation for the outermost layer, for a sigmoid activation function.
● Layer indices: input (i), hidden (j), output (k); wij connects input to hidden, wjk connects hidden to output.
● The resulting update rule is sketched below.
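For that sum-of-squares error and a sigmoid output unit, the standard textbook result (again reconstructing what the slide's figure presumably showed) is, with η the learning rate and a_j the activation of hidden unit j:

```latex
\delta_k = (t_k - y_k)\,y_k(1 - y_k),
\qquad
\Delta w_{jk} = \eta\,\delta_k\,a_j
```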
Backpropagation Algorithm (cont’d)
● Backpropagation for the hidden layer, with the same layer indices: input (i), hidden (j), output (k), and weights wij and wjk.
● The resulting update rule is sketched below.
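The corresponding standard result for a sigmoid hidden unit, where the hidden error term gathers the output-layer terms δ_k through the weights w_jk and x_i is the input feeding w_ij; as above, this reconstructs the usual derivation rather than quoting the slide:

```latex
\delta_j = a_j(1 - a_j)\sum_{k}\delta_k\,w_{jk},
\qquad
\Delta w_{ij} = \eta\,\delta_j\,x_i
```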
Thank You!