Artificial Neural Networks

Uploaded by

Raahil Rai

Artificial Neural Networks (ANN)

• Topics
• Perceptron Model to Neural Networks
• Activation Functions
• Cost Functions
• Feed Forward Networks
• Backpropagation
Perceptron Model
Perceptron model

• To begin understanding deep learning, we will build up through the following:
• Single Biological Neuron
• Perceptron
• Multi-layer Perceptron Model
• Deep Learning Neural Network
Perceptron model

• Stained Neurons in cerebral cortex


Perceptron model
• Illustration of biological neurons
Perceptron model

• The perceptron was a form of neural network introduced in 1958 by Frank Rosenblatt.
• "...perceptron may eventually be able to learn, make
decisions, and translate languages."
Perceptron model

• In 1969, Marvin Minsky and Seymour Papert published their book Perceptrons.
• It suggested that there were severe limitations to what perceptrons could do.
• This is often cited as marking the beginning of an AI winter.
Perceptron model

[Figure: a perceptron with inputs x1 and x2 feeding a function f(X) that produces the output y]

• If f(X) is just a sum, then y = x1 + x2.
• To let the perceptron “learn”, we need some parameter we can adjust.
• So we add an adjustable weight to each input: y = x1w1 + x2w2.
• We can now update the weights to affect y.
• But what if an x is zero? Then its w won’t change anything!
• So we add a bias term b to each input: y = (x1w1 + b) + (x2w2 + b).
• We can expand this to a generalization over n inputs x1, x2, ..., xn, each multiplied by its weight and offset by b.
Perceptron model

• We have modeled a biological neuron as a simple perceptron.
• Mathematically, the generalization is: $y = \sum_{i=1}^{n} (x_i w_i + b)$
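
The generalized perceptron above can be sketched in a few lines of Python. This is a minimal illustration, assuming f(X) is a plain sum and that the same bias b is added to every weighted input, as in y = (x1w1 + b) + (x2w2 + b). The function name and the numeric values are hypothetical.

```python
def perceptron_output(inputs, weights, bias):
    """Return the sum over i of (x_i * w_i + bias)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w + bias for x, w in zip(inputs, weights))


# Hypothetical values: the second input is zero, so its weight has no
# effect, but its bias term still contributes to the output.
y = perceptron_output(inputs=[1.0, 0.0], weights=[0.5, 2.0], bias=0.1)
print(y)
```

Changing the second weight leaves y unchanged here, which is the situation the bias term was introduced to address.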
Neural Networks
Neural Networks

• To build a network of perceptrons, we can connect layers of perceptrons, using a multi-layer perceptron model.
• The outputs of one perceptron are directly fed as inputs to
another perceptron.
Neural Networks

• The first layer is the input layer


Neural Networks

• The last layer is the output layer.


• Note: This last layer can be more than one neuron
Neural Networks
• Layers in between the input and output layers are the hidden layers.
• Hidden layers are difficult to interpret.
• Neural networks become “deep neural networks” if they contain 2 or more hidden layers.
Neural Networks

• In theory, the neural network framework can be used to approximate any continuous function on a bounded input range to arbitrary accuracy, given enough neurons.
• Zhou Lu and, later, Boris Hanin proved mathematically that neural networks can approximate any convex continuous function.
Activation Functions
Neural Networks

• Recall the form x*w + b
• w is how much weight or strength to give the incoming input
• b is an offset value: x*w has to reach a certain threshold before it has an effect
Neural Networks

• For example, if b = -10:
• x*w + b
• Then the effect of x*w won’t really start to overcome the bias until the product surpasses 10.
• After that, the effect is based solely on the value of w.
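
A small numeric sketch of this threshold effect, assuming b = -10 as in the example and a hypothetical weight w = 2 (so x*w passes 10 once x exceeds 5):

```python
def pre_activation(x, w, b):
    """Return z = x*w + b for a single input."""
    return x * w + b


b = -10.0  # bias from the example above
w = 2.0    # hypothetical weight, chosen only for illustration

for x in [1.0, 4.0, 5.0, 6.0, 10.0]:
    z = pre_activation(x, w, b)
    print(f"x={x:>4}  x*w={x * w:>5}  z={z:>5}")
```

The printed z stays negative until x*w exceeds 10, and from there it grows at a rate set by w.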
Neural Networks

• Next we want to set boundaries for the output value:
• z = x*w + b
• Pass z through some activation function to limit its value.
Perceptron model
• Recall our simple perceptron has an f(X).
• If we had a binary classification problem, we would want an output of either 0 or 1.
• z = wx + b
• In this context, we’ll refer to activation functions as f(z).
• You will often see these variables capitalized, as f(Z) or X, to denote a tensor input consisting of multiple values.

[Figure: perceptron with inputs x1 and x2, weights w1 and w2, bias b, function f(X), and output y]
Deep Learning
• The simplest networks rely on a basic step function that outputs 0 or 1.
• Regardless of the input value, this always outputs exactly 0 or 1.
• Useful for classification (0 or 1 class).
• It is a very “strong” function: an immediate cut-off that splits between 0 and 1.

[Figure: step function, output (0 to 1) plotted against z = wx + b]
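Deep Learning

• Sigmoid function
• Outputs between 0 and 1, with a smooth transition rather than an immediate cut-off

[Figure: sigmoid curve, output (0 to 1) plotted against z = wx + b]

Deep Learning

• Hyperbolic Tangent: tanh(z)
• Outputs between -1 and 1 instead of 0 to 1

[Figure: tanh curve, output (-1 to 1) plotted against z = wx + b]

Deep Learning

• Rectified Linear Unit (ReLU): this is actually a relatively simple function: max(0, z)
• ReLU often gives good performance in practice.

[Figure: ReLU, output plotted against z = wx + b]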
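
The four activation functions covered so far can be written as follows. This is a minimal sketch using only the standard library. The choice to output 0 exactly at z = 0 in the step function is an assumption, since conventions differ.

```python
import math


def step(z):
    """Basic step function: 1 when z is positive, otherwise 0 (assumed convention at z = 0)."""
    return 1 if z > 0 else 0


def sigmoid(z):
    """Smooth curve between 0 and 1. Fine for moderate z; very large negative z would overflow math.exp."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Hyperbolic tangent, with outputs between -1 and 1."""
    return math.tanh(z)


def relu(z):
    """Rectified Linear Unit: max(0, z)."""
    return max(0.0, z)


for z in [-2.0, 0.0, 2.0]:
    print(z, step(z), round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```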
Multi-Class Activation Functions
Deep Learning

• There are 2 main types of multi-class situations


§ Non-Exclusive Classes
o A data point can have multiple classes/categories assigned to it
o Photos can have multiple tags (e.g. beach, family, vacation, etc…)
§ Mutually Exclusive Classes
o Only one class per data point.
o Photos can be categorized as being in grayscale (black and
white) or full color photos
• Organizing Multiple Classes
§ 1 output node per class.
Neural Networks

• A single output node could output a continuous regression value or a binary classification (0 or 1).
Multiclass Classification

● Organizing for Multiple Classes

[Figure: hidden layers feeding one output node per class: Class One, Class Two, ..., Class N]
Deep Learning

• Organizing Multiple Classes
• We can’t just have categories like “red”, “blue”, “green”, etc.
• Instead we use one-hot encoding

Data Point      Label
Data Point 1    RED
Data Point 2    GREEN
Data Point 3    BLUE
...             ...
Data Point N    RED


Deep Learning

• Mutually Exclusive Classes

Data Point      Label    RED   GREEN   BLUE
Data Point 1    RED      1     0       0
Data Point 2    GREEN    0     1       0
Data Point 3    BLUE     0     0       1
...             ...      ...   ...     ...
Data Point N    RED      1     0       0
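
The one-hot table above can be reproduced with a short helper. This is a minimal sketch. The function name `one_hot` is hypothetical, and the column order is assumed to be RED, GREEN, BLUE as in the table.

```python
def one_hot(label, classes):
    """Return a list with 1 in the position of `label` and 0 elsewhere."""
    if label not in classes:
        raise ValueError(f"unknown class: {label!r}")
    return [1 if c == label else 0 for c in classes]


classes = ["RED", "GREEN", "BLUE"]
labels = ["RED", "GREEN", "BLUE", "RED"]  # data points 1, 2, 3 and N

encoded = [one_hot(label, classes) for label in labels]
for label, row in zip(labels, encoded):
    print(label, row)
```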
Deep Learning

• Non-Exclusive Classes

Data Point      Labels   A     B     C
Data Point 1    A,B      1     1     0
Data Point 2    A        1     0     0
Data Point 3    C,B      0     1     1
...             ...      ...   ...   ...
Data Point N    B        0     1     0
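
For non-exclusive classes the same idea applies, except that a row may contain more than one 1. A minimal sketch, with a hypothetical helper name `multi_hot`:

```python
def multi_hot(labels, classes):
    """Return a list with 1 for every class present in `labels` and 0 elsewhere."""
    unknown = set(labels) - set(classes)
    if unknown:
        raise ValueError(f"unknown classes: {sorted(unknown)}")
    return [1 if c in labels else 0 for c in classes]


classes = ["A", "B", "C"]
rows = [["A", "B"], ["A"], ["C", "B"], ["B"]]  # data points 1, 2, 3 and N

encoded = [multi_hot(labels, classes) for labels in rows]
for labels, row in zip(rows, encoded):
    print(labels, row)
```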
Deep Learning

• Non-exclusive
• Sigmoid function
• Each neuron will output a value between 0 and 1,
indicating the probability of having that class
assigned to it.
Multiclass Classification

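• Sigmoid Function for Non-Exclusive Classes

[Figure: hidden layers feeding one output node per class; each node passes through its own sigmoid (range 0 to 1), e.g. Class One = 0.8, Class Two = 0.2, Class N = 0.3]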
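
A sketch of how independent sigmoid outputs can be turned into a set of tags. The raw scores are hypothetical values chosen so the probabilities come out near 0.8, 0.2 and 0.3, and the 0.5 threshold is an assumed convention rather than something stated in the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_tags(scores, class_names, threshold=0.5):
    """Apply a sigmoid to each class score and keep the classes at or above the threshold."""
    probs = {name: sigmoid(z) for name, z in zip(class_names, scores)}
    tags = [name for name, p in probs.items() if p >= threshold]
    return probs, tags


# Hypothetical raw outputs (z values) of the three output neurons.
probs, tags = predict_tags([1.386, -1.386, -0.847], ["Class One", "Class Two", "Class N"])
print(probs)
print(tags)
```

Because each neuron is squashed independently, the probabilities do not have to sum to one.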
Deep Learning

• Mutually Exclusive Classes


• But what do we do when each data point can only have a single class assigned to it?
• We can use the softmax function for this.
Deep Learning

• Mutually Exclusive Classes


• The softmax function calculates the probability distribution of an event over K different events.
• In other words, it calculates the probability of each target class over all possible target classes.
• Each probability is in the range 0 to 1, and the probabilities sum to one.
• The model returns the probability of each class, and the predicted class is the one with the highest probability.
Deep Learning

• Mutually Exclusive Classes


• If we use softmax for multi-class problems, we get this type of output:

[Red, Green, Blue]
[0.1, 0.6, 0.3]
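
The standard softmax definition is softmax(z)_i = exp(z_i) / sum_k exp(z_k). A minimal sketch follows. The raw scores are hypothetical, chosen as the logarithms of 0.1, 0.6 and 0.3 so that the output matches the example above. Subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["Red", "Green", "Blue"]
scores = [math.log(0.1), math.log(0.6), math.log(0.3)]  # hypothetical raw outputs

probs = softmax(scores)
predicted = class_names[probs.index(max(probs))]
print(probs, predicted)
```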
Cost Functions and Gradient Descent
Deep Learning

• The output ŷ is the model’s estimate of the label.
• So after the network creates its prediction, how do we evaluate it?
• And after the evaluation, how can we update the network’s weights and biases?
Deep Learning

• First question
• We need to take the estimated outputs of the network
and then compare them to the real values of the label.
• The cost function (often referred to as a loss function)
must be an average so it can output a single value.
Terminology

• We’ll use the following variables:
• y to represent the true value
• a to represent the neuron’s prediction
• In terms of weights and bias:
• w*x + b = z
• Pass z into the activation function: σ(z) = a
Deep Learning

• One very common cost function is the quadratic cost function.
• It calculates the difference between the real values y(x) and our predicted values a(x), and squares it.
• Squaring does 2 useful things for us: it keeps everything positive and punishes large errors.
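
A minimal sketch of a quadratic cost for scalar predictions. The 1/(2n) normalization is an assumption (a common convention that makes the derivative tidy), and the sample values are hypothetical.

```python
def quadratic_cost(y_true, y_pred):
    """Average of squared differences, with an assumed 1/(2n) normalization."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - a) ** 2 for y, a in zip(y_true, y_pred)) / (2 * n)


# Hypothetical labels and predictions.
example = quadratic_cost([1.0, 0.0, 1.0], [0.9, 0.2, 0.4])
print(example)

# Doubling an error quadruples its contribution to the cost.
print(quadratic_cost([0.0], [1.0]), quadratic_cost([0.0], [2.0]))
```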
Deep Learning

• Cost function: $C(W, B, S^r, E^r)$
• W is our neural network’s weights, B is our neural network’s biases, $S^r$ is the input of a single training sample, and $E^r$ is the desired output of that training sample.
Deep Learning

• Notice how much information was encoded in the simplified notation.
• The a(x) holds information about weights and biases.
Deep Learning

• In a real case, this means we have some cost function C dependent on lots of weights!
• C(w1, w2, w3, ..., wn)
• How do we find out which weights lead us to the lowest cost?
Deep Learning

• For simplicity, let’s imagine we only had one weight w in our cost function.
• We want to minimize our loss/cost (overall error).
• This means we need to figure out what value of w results in the minimum of C(w).
Deep Learning
• Consider a “simple” function C(w).
• What value of w minimizes the cost?
• For a simple function, we could take the derivative and solve for where it equals 0.
• The real cost function will be very complex!
• It will be n-dimensional.
• Instead, we use gradient descent to solve this problem.

[Figure: plot of C(w) against w, with the minimum marked at wmin]
Gradient Descent
• Calculate the slope at a point.
• Move in the downward direction of the slope.
• Repeat until the slope converges to zero, indicating a minimum.
• We could have chosen a different step size to find the next point!
• Smaller step sizes take longer to find the minimum.
• Larger steps are faster, but we risk overshooting the minimum!
• This step size is known as the learning rate.

[Figure: plot of C(w) against w, with successive points stepping down the curve towards wmin]
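
The effect of the learning rate can be seen on a one-weight example. This is a minimal sketch, assuming a hypothetical convex cost C(w) = (w - 3)^2 whose minimum is at w = 3. The three learning rates are illustrative values only.

```python
def cost(w):
    """Hypothetical cost with its minimum at w = 3."""
    return (w - 3.0) ** 2


def slope(w):
    """Derivative of the hypothetical cost: dC/dw = 2 * (w - 3)."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate, steps):
    """Repeatedly step in the downward direction of the slope."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate * slope(w)
    return w


w_small = gradient_descent(0.0, learning_rate=0.01, steps=50)  # slow progress
w_good = gradient_descent(0.0, learning_rate=0.1, steps=50)    # close to the minimum
w_big = gradient_descent(0.0, learning_rate=1.1, steps=50)     # overshoots further each step

print(w_small, w_good, w_big)
```

With the small rate the weight is still far from 3 after 50 steps, and with the large rate every step overshoots by more than the last, so the weight moves away from the minimum.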
Deep Learning

• The learning rate shown in the illustrations was constant (each step size was equal).
• But we can adapt the step size as we go.
• We could start with larger steps, then take smaller ones as the slope gets closer to zero.
• This is known as adaptive gradient descent.
Deep Learning
• In 2015, Kingma and Ba published their paper: “Adam: A Method for Stochastic Optimization”.
• Adam is often a much more efficient way of searching for these minimums.
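
For reference, here is a sketch of the Adam update rule for a single weight. The defaults for beta1, beta2 and eps follow the values suggested in the paper. The cost function, starting point, learning rate and step count in the usage lines are hypothetical.

```python
import math


def adam_minimize(grad, w, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-weight cost given its gradient function, using Adam updates."""
    m = 0.0  # running average of the gradient
    v = 0.0  # running average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Hypothetical cost C(w) = (w - 3)^2, so the gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0, steps=2000, lr=0.05)
print(w_final)
```

The step size for each update is scaled by the recent gradient history, which is what makes the method adaptive.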
Deep Learning

• Realistically, we’re calculating this descent in an n-dimensional space for all our weights.
• When dealing with these n-dimensional vectors (tensors), the notation changes from derivative to gradient.
• This means we calculate $\nabla C(w_1, w_2, \ldots, w_n)$
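
The gradient is simply the list of partial derivatives, one per weight. A minimal sketch that estimates it numerically with central differences, assuming a hypothetical three-weight cost function:

```python
def cost(w):
    """Hypothetical cost over three weights."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2 + w[2] ** 2


def numerical_gradient(f, w, h=1e-6):
    """Estimate each partial derivative of f at w with a central difference."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += h
        down = list(w)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


g = numerical_gradient(cost, [0.0, 0.0, 0.0])
print(g)  # close to the analytic gradient [-2, 2, 0]
```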
Deep Learning
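• For classification problems, we often use the cross-entropy loss function.
• The assumption is that your model predicts a probability distribution p(y=i) for each class i = 1, 2, …, C.
• For binary classification, with true label y and predicted probability p, this results in: $-(y \log(p) + (1 - y) \log(1 - p))$
• For M > 2 classes, with $y_{o,c}$ equal to 1 if observation o belongs to class c (and 0 otherwise) and $p_{o,c}$ the predicted probability of that class: $-\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$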
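
Both forms of cross-entropy can be sketched as follows, using the standard definitions -(y log p + (1 - y) log(1 - p)) for the binary case and -sum_c y_c log p_c for a one-hot label. The clipping constant eps is an assumption added to avoid log(0), and the example values are hypothetical.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one sample with true label y (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def categorical_cross_entropy(y_one_hot, probs, eps=1e-12):
    """Loss for one sample with a one-hot label and one predicted probability per class."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_one_hot, probs))


print(binary_cross_entropy(1, 0.9), binary_cross_entropy(1, 0.1))
print(categorical_cross_entropy([0, 1, 0], [0.1, 0.6, 0.3]))
```

A confident wrong prediction is penalized far more heavily than a confident correct one.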


Backpropagation
Backpropagation

• Let’s begin with a very simple network, where each layer only has 1 neuron.
Backpropagation

• Each input will receive a weight and a bias.

[Figure: chain of single-neuron layers with connections labeled w1 + b1, w2 + b2, w3 + b3]

Backpropagation

• This means we have a cost function C(w1, b1, w2, b2, w3, b3).
• We’ve already seen how this process propagates forward.
• Let’s start at the end to see the backpropagation.
Backpropagation

• Let’s say we have L layers; then our notation becomes:

[Figure: chain of layers labeled L-n, ..., L-2, L-1, L]


Backpropagation

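• Focusing on these last two layers, let’s define z = wx + b.
• Then, applying an activation function, we’ll state: a = σ(z)

[Figure: the last two layers, L-1 and L]

Backpropagation

• This means we have:
• $z^L = w^L a^{L-1} + b^L$
• $a^L = \sigma(z^L)$
• $C_0(\ldots) = (a^L - y)^2$

Backpropagation

• We want to understand how sensitive the cost function is to changes in w:
• $\frac{\partial C_0}{\partial w^L}$

Backpropagation

• Using the relationships we already know along with the chain rule:
• $\frac{\partial C_0}{\partial w^L} = \frac{\partial z^L}{\partial w^L} \frac{\partial a^L}{\partial z^L} \frac{\partial C_0}{\partial a^L}$

Backpropagation

• We can calculate the same for the bias terms:
• $\frac{\partial C_0}{\partial b^L} = \frac{\partial z^L}{\partial b^L} \frac{\partial a^L}{\partial z^L} \frac{\partial C_0}{\partial a^L}$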
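
The chain-rule products for the last layer can be checked numerically. This is a minimal sketch, assuming a sigmoid activation and the cost C0 = (aL - y)^2 used above. All numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def cost(w, b, a_prev, y):
    """C0 = (a - y)^2 with a = sigmoid(w * a_prev + b)."""
    return (sigmoid(w * a_prev + b) - y) ** 2


def last_layer_gradients(w, b, a_prev, y):
    """Return (dC0/dw, dC0/db) as products of the three chain-rule factors."""
    z = w * a_prev + b
    a = sigmoid(z)
    dC_da = 2.0 * (a - y)      # derivative of (a - y)^2 with respect to a
    da_dz = sigmoid_prime(z)   # derivative of the activation
    dz_dw = a_prev             # derivative of z = w * a_prev + b with respect to w
    dz_db = 1.0                # derivative of z with respect to b
    return dz_dw * da_dz * dC_da, dz_db * da_dz * dC_da


# Hypothetical values for the last layer.
w, b, a_prev, y = 0.5, -0.2, 0.8, 1.0
dC_dw, dC_db = last_layer_gradients(w, b, a_prev, y)
print(dC_dw, dC_db)
```

Comparing these values against a finite-difference estimate of `cost` is a quick way to confirm that the chain-rule factors were multiplied correctly.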
Backpropagation
• Partial derivative: $\frac{\partial C}{\partial w}$
• How quickly the cost changes when we change the weights
• $w^l_{jk}$: weight for the connection from the $k^{th}$ neuron in layer $l-1$ to the $j^{th}$ neuron in layer $l$
Backpropagation
• The activation $a^l_j$ of the $j^{th}$ neuron in layer $l$ is related to the activations in layer $l-1$ by the following equation:
• $a^l_j = \sigma\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right)$
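Assumptions about the cost function
• First assumption
• The cost function can be written as an average over the costs of individual training examples.
• Second assumption
• The cost can be written as a function of the outputs from the neural network.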
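Four Fundamental Equations
Equation 1: Error in the output layer
$\delta^L = \nabla_a C \odot \sigma'(z^L)$
Equation 2: Error in terms of the error in the next layer
$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
Equation 3: Rate of change of the cost w.r.t. any bias in the network
$\frac{\partial C}{\partial b^l_j} = \delta^l_j$
Equation 4: Rate of change of the cost w.r.t. any weight in the network
$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$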
Learning Process

• Step 1: Using input x, set the activation a for the input layer.
• z = w x + b
• a = σ(z)
• The resulting a then feeds into the next layer (and so on).
• Step 2: For each layer, compute:
• $z^L = w^L a^{L-1} + b^L$
• $a^L = \sigma(z^L)$
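
Steps 1 and 2 amount to a feed-forward pass. The sketch below uses plain lists instead of a tensor library, and assumes a sigmoid activation in every layer. The layer sizes and weight values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(a_prev, W, b):
    """Compute a = sigmoid(W a_prev + b) for one layer, where W is a list of rows."""
    return [
        sigmoid(sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j)
        for row, b_j in zip(W, b)
    ]


def feed_forward(x, layers):
    """Pass the input through each (W, b) pair in turn."""
    a = x
    for W, b in layers:
        a = dense_layer(a, W, b)
    return a


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]], [0.0, 0.1, -0.1]),
    ([[0.6, -0.2, 0.9]], [0.05]),
]
output = feed_forward([1.0, 0.5], layers)
print(output)
```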
Deep Learning

• Step 3: We compute our error vector:
• $\delta^L = \nabla_a C \odot \sigma'(z^L)$, where ⊙ is the Hadamard (element-wise) product
• For the quadratic cost with the conventional factor of 1/2, $\nabla_a C = (a^L - y)$
• This expresses the rate of change of C with respect to the output activations.
• Now let’s write out our error term for a layer in terms of the error of the next layer (since we’re moving backwards).
• Step 4: Backpropagate the error:
• For each layer l = L−1, L−2, …, we compute
• $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$, the generalized error for any layer l
• $(w^{l+1})^T$ is the transpose of the weight matrix of layer l+1
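
Steps 1 to 4 can be put together for the simple chain network introduced earlier, where each layer has a single neuron. With one neuron per layer, the transpose and the Hadamard product both reduce to ordinary multiplication. The sketch assumes sigmoid activations and the cost C = (aL - y)^2 / 2, so that the gradient with respect to the output activation is (aL - y). All numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def forward(x, weights, biases):
    """Steps 1-2: return all z values and all activations (activations[0] is the input)."""
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = w * activations[-1] + b
        zs.append(z)
        activations.append(sigmoid(z))
    return zs, activations


def cost(x, y, weights, biases):
    _, activations = forward(x, weights, biases)
    return 0.5 * (activations[-1] - y) ** 2


def backprop(x, y, weights, biases):
    """Steps 3-4: return (dC/dw, dC/db) for every layer."""
    zs, activations = forward(x, weights, biases)
    n = len(weights)
    deltas = [0.0] * n
    # Step 3: error in the output layer.
    deltas[-1] = (activations[-1] - y) * sigmoid_prime(zs[-1])
    # Step 4: propagate the error backwards, one layer at a time.
    for l in range(n - 2, -1, -1):
        deltas[l] = weights[l + 1] * deltas[l + 1] * sigmoid_prime(zs[l])
    grad_b = list(deltas)                                   # dC/db equals the layer error
    grad_w = [activations[l] * deltas[l] for l in range(n)] # dC/dw = input activation * error
    return grad_w, grad_b


# Hypothetical three-layer chain.
weights = [0.5, -0.3, 0.8]
biases = [0.1, 0.2, -0.1]
x, y = 0.7, 1.0
grad_w, grad_b = backprop(x, y, weights, biases)
print(grad_w, grad_b)
```

The returned gradients are what a gradient descent step would multiply by the learning rate and subtract from each weight and bias.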
