UNIT1 Perceptron MLP


Perceptron

 Proposed by Frank Rosenblatt; later refined and carefully analyzed by Minsky and Papert.
 A more general computational model than the McCulloch-Pitts neuron.
 It overcomes some of the limitations of the M-P neuron by introducing the concept of numerical weights (a measure of importance) for inputs, and a mechanism for learning those weights.
 Inputs are no longer limited to Boolean values as in the M-P neuron; the perceptron supports real-valued inputs as well, which makes it more useful and general.
Now, this is very similar to an M-P neuron, but we take a weighted sum of the inputs and set the output to 1 only when the sum is more than an arbitrary threshold (theta).

 Instead of hand-coding the thresholding parameter theta, we add it as one of the inputs, with a fixed value of 1 and weight -theta, which makes it learnable.

EXAMPLE: Consider the task of predicting whether you would watch a random game of football on TV or not
using the behavioral data available. And let's assume your decision is solely dependent on 3 binary inputs
(binary for simplicity).
Here,
 w_0 is called the bias because it represents the prior
(prejudice).
A football freak may have a very low threshold and may watch any football game irrespective of the league, club or importance of the game [theta = 0]. On the other hand, a selective viewer may only watch a game that is a Premier League game, features Man United, and is not a friendly [theta = 2].
 So, weights and the bias will depend on the data.
Based on the data, if needed the model may have to give a lot of
importance (high weight) to the isManUnitedPlaying input and
penalize the weights of other inputs.
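A minimal sketch of this example in Python; the input names and weights below are hypothetical and hand-picked, whereas in practice the weights and the bias (w0 = -theta) would be learned from the data:

```python
# Minimal sketch of the football example. Input names and weights are assumed
# for illustration; the bias corresponds to w0 = -theta.
def perceptron(inputs, weights, bias):
    # weighted sum of inputs plus bias, followed by hard thresholding at 0
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0

weights = [1, 1, -1]   # [isPremierLeagueOn, isManUnitedPlaying, isFriendlyGame]
bias = -2              # selective viewer: theta = 2

print(perceptron([1, 1, 0], weights, bias))  # Premier League, Man United, not a friendly -> 1 (watch)
print(perceptron([0, 1, 0], weights, bias))  # Man United but not Premier League -> 0 (skip)
```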
Perceptron vs McCulloch-Pitts Neuron
What kind of functions can be implemented using a perceptron? How different is it from McCulloch-Pitts
neurons?

 A perceptron also separates the input space into two halves,


positive and negative.
 All the inputs that produce an output 1 lie on one side
(positive half space) and all the inputs that produce an output
0 lie on the other side (negative half space).
 In other words, a single perceptron can only be used to
implement linearly separable functions, just like the M-P
neuron.
 The weights, including the threshold can be learned and the
inputs can be real values.

Perceptron for Binary Classification


With this discrete output, controlled by the activation function, the perceptron can be used as a binary classification model, defining a linear decision boundary. It finds a separating hyperplane that minimizes the distance between misclassified points and the decision boundary.
EXAMPLE: OR Function
A possible solution can be obtained by solving the linear system of inequalities given by the four input combinations of the OR function. It is clear that the solution separates the input space into the negative and positive half spaces.
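Since the worked solution is not reproduced here, below is one possible set of weights (assumed for illustration) that satisfies the OR conditions, verified in Python:

```python
# One possible solution for the OR function (weights assumed for illustration):
# bias w0 = -1 and w1 = w2 = 2, so the weighted sum is >= 0 exactly when x1 OR x2 = 1.
def perceptron_or(x1, x2):
    return 1 if (-1 + 2 * x1 + 2 * x2) >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron_or(x1, x2))  # matches the OR truth table
```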

Perceptron Limitation: The perceptron's harsh thresholding logic


results in sudden binary decisions (e.g., 0.49 -> "No", 0.51 ->
"Yes"), which may not suit real-world applications requiring
smoother decision transitions.

Sigmoid Function Advantage: Sigmoid neurons introduce a


smoother, continuous, and differentiable "S"-shaped output that
maps inputs to probabilities between 0 and 1, providing gradual
decision changes instead of sharp transitions.
REAL TIME EXAMPLE:
Setting Up The Problem

We are going to use a perceptron to estimate whether I will watch a movie, based on historical data with the above-mentioned inputs. The data has positive and negative examples, the positive examples being the movies I watched (label 1). Based on the data, we are going to learn the weights using the perceptron learning algorithm. For visual simplicity, we will assume two-dimensional inputs.
Perceptron Learning Algorithm
Our goal is to find the w vector that can perfectly classify the positive inputs and the negative inputs in our data.

 We initialize w with some random vector.
 Iterate over all the examples in the data, P ∪ N (both positive and negative examples).
 If an input x belongs to P, we want w.x >= 0.
 If x belongs to N, we want w.x < 0.
Case 1: When x belongs to P and its dot product w.x < 0 (update: w = w + x)
Case 2: When x belongs to N and its dot product w.x >= 0 (update: w = w - x)
Only for these cases do we update our randomly initialized w. Otherwise, we don't touch w at all, because Cases 1 and 2 violate the very rule of a perceptron.
Why Would The Specified Update Rule Work?
We have already established that when x belongs to P, we want w.x >= 0 (the basic perceptron rule). What we also mean by that is that when x belongs to P, the angle between w and x should be _____ than 90 degrees. Fill in the blank.
Answer: The angle between w and x should be less than 90 degrees, because cos(angle) = w.x / (|w||x|), so the dot product and the cosine of the angle always have the same sign.

So any w vector will do, as long as it makes an angle of less than 90 degrees with the positive example vectors (x ∈ P) and an angle of more than 90 degrees with the negative example vectors (x ∈ N).

So we now strongly believe that the angle between w and x should be less than 90 degrees when x belongs to the P class, and more than 90 degrees when x belongs to the N class. Here's why the update works:
When we add x to w (which we do when x belongs to P and w.x < 0), the new vector is w_new = w + x, and w_new.x = w.x + x.x. Since x.x > 0, the dot product increases, so cos(α) increases and α moves towards being less than 90 degrees. Subtracting x has the opposite effect.
In short,
Case 1: positive input, x belongs to P
w = w + x
cos(α) increases
i.e., α moves below 90 degrees
Case 2: negative input, x belongs to N
w = w - x
cos(α) decreases
i.e., α moves above 90 degrees
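Putting the rule together, here is a sketch of the perceptron learning algorithm in Python; the two-dimensional data at the bottom is hypothetical, and a constant 1 is appended to each input so the bias is learned as w0:

```python
import random

def train_perceptron(P, N, max_epochs=1000):
    """Perceptron learning algorithm: P holds positive examples, N negative ones."""
    dim = len(P[0])
    w = [random.uniform(-1, 1) for _ in range(dim)]   # random initialisation
    for _ in range(max_epochs):
        converged = True
        for x in P + N:
            dot = sum(wi * xi for wi, xi in zip(w, x))
            if x in P and dot < 0:                    # Case 1: add x to w
                w = [wi + xi for wi, xi in zip(w, x)]
                converged = False
            elif x in N and dot >= 0:                 # Case 2: subtract x from w
                w = [wi - xi for wi, xi in zip(w, x)]
                converged = False
        if converged:                                  # every example satisfies the rule
            return w
    return w

# Hypothetical 2-D data, with a constant 1 appended so w0 acts as the bias.
P = [[1.0, 2.0, 1.0], [2.0, 3.0, 1.0]]      # positive examples (watched)
N = [[-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]]  # negative examples (skipped)
print(train_perceptron(P, N))
```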
XOR Function — Can't Do!
 XOR is a non-linearly separable Boolean function, i.e., you cannot draw a single line to separate the positive inputs from the negative ones.
 Writing out the perceptron conditions for the four XOR inputs (bias w0, weights w1 and w2):
w0 < 0 (XOR(0,0) = 0)
w0 + w2 >= 0 (XOR(0,1) = 1)
w0 + w1 >= 0 (XOR(1,0) = 1)
w0 + w1 + w2 < 0 (XOR(1,1) = 0)
 The fourth condition contradicts the second and the third (together with the first): adding the second and third and using w0 < 0 gives w0 + w1 + w2 >= -w0 > 0.
 i.e., there is no perceptron solution for non-linearly separable data.
 So, a single perceptron cannot learn to separate data that are non-linear in nature.
MULTILAYER PERCEPTRON
 The Multilayer Perceptron was developed to tackle this limitation.
 It is a neural network where the mapping between inputs and output is non-linear.
 A Multilayer Perceptron has input and output layers, and one or more hidden layers with many neurons stacked together.
 While the perceptron's neuron uses a hard thresholding activation, neurons in a Multilayer Perceptron can use arbitrary activation functions, such as sigmoid or ReLU.
EXAMPLE:
XOR:
 XOR(A,B) = (A+B)*(AB)’
 Complex relations can be broken into simpler functions and
combined.
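A sketch of this decomposition with single perceptrons stacked into two layers; the gate weights below are assumed, chosen by hand:

```python
# XOR(A, B) = (A OR B) AND NOT(A AND B), built from hand-weighted perceptrons.
def step(z):
    return 1 if z >= 0 else 0

def or_gate(a, b):
    return step(-1 + 2 * a + 2 * b)    # fires unless both inputs are 0

def nand_gate(a, b):
    return step(3 - 2 * a - 2 * b)     # NOT(A AND B)

def and_gate(a, b):
    return step(-3 + 2 * a + 2 * b)    # fires only when both inputs are 1

def xor_gate(a, b):
    # hidden layer: OR and NAND; output layer: AND of the two hidden outputs
    return and_gate(or_gate(a, b), nand_gate(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_gate(a, b))   # 0 0->0, 0 1->1, 1 0->1, 1 1->0
```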
ACTIVATION FUNCTIONS
 An extremely important feature of an Artificial Neural Network.
 Decides whether a neuron should be activated or not.
 Limits the output signal to a finite value.
 The activation function applies a non-linear transformation to the input, making the network capable of learning more complex relations between input and output.
 It makes the network capable of learning more complex patterns.
 Without an activation function, the neural network is just a linear regression model, as it only computes a sum of products of inputs and weights.
E.g., in the figure below, image 2 requires a complex (curved) relation, unlike the simple linear relation in image 1.

Fig. Illustrating the need for an activation function for a complex problem.
An activation function must also be efficient and reduce computation time, because a neural network is sometimes trained on millions of data points.
Types of AF:
The Activation Functions can be basically divided into 3 types-
1. Binary step Activation Function
2. Linear Activation Function
3. Non-linear Activation Functions
1. Binary Step Function
 A binary step function is a threshold-based activation function.
 If the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; if it is below the threshold, the neuron is deactivated.
 We choose a threshold value that decides whether the neuron should be activated or deactivated.
 It is very simple and useful for binary classification problems.
E.g., f(x) = 1 if x > 0, else 0 (for x <= 0)
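As a quick sketch of this definition (threshold fixed at 0 here):

```python
def binary_step(x, threshold=0.0):
    # outputs the same signal (1) whenever the input exceeds the threshold
    return 1 if x > threshold else 0

print(binary_step(0.3), binary_step(-0.7))   # -> 1 0
```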

2. Linear or Identity Activation Function

 The function is a straight line (linear).
 The output of the function is not confined to any range.

Fig: Linear Activation Function

Equation: f(x) = x
Range: (-infinity, +infinity)
 It doesn't help with the complexity of the typical data that is fed to neural networks.
3. Non-linear Activation Function
 The most used activation functions.
 Non-linearity makes the graph of the function a curve, as shown below.

Fig: Non-linear Activation Function

Derivative or differential: the change along the y-axis with respect to the change along the x-axis; it is also known as the slope.
Monotonic function: a function which is either entirely non-increasing or entirely non-decreasing.
 The non-linear activation functions are mainly divided on the basis of their range or curves.
Advantages of non-linear functions over the linear function:
 Differentiation is possible for all the commonly used non-linear functions.
 Stacking of layers is possible, which helps us in creating deep neural networks.
 It makes it easier for the model to generalize.

Sigmoid (Logistic AF) (σ):

 The main reason we use the sigmoid function is that its output lies between 0 and 1: σ(x) = 1 / (1 + e^(-x)).
 It is especially used for models where we have to predict a probability as the output.
 Since the probability of anything exists only in the range 0 to 1, sigmoid is the right choice.

Fig: Sigmoid Function (S-shaped Curve)

 The function is differentiable and monotonic, but its derivative is not monotonic.
 The logistic sigmoid function can cause a neural network to get stuck during training.
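A short numeric sketch of the sigmoid and its derivative; the saturation for large |x| is what leads to the vanishing gradient problem listed in the disadvantages below:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0

for x in (-10, -2, 0, 2, 10):
    print(x, round(sigmoid(x), 4), round(sigmoid_derivative(x), 6))
# at x = +/-10 the derivative is ~0.000045, so the gradient effectively vanishes
```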
Advantages
1. Easy to understand and apply
2. Easy to train on small dataset
3. Smooth gradient, preventing “jumps” in output values.
4. Output values bound between 0 and 1, normalizing the
output of each neuron.
Disadvantages:
 Vanishing gradient—for very high or very low values of X,
there is almost no change to the prediction, causing a
vanishing gradient problem.
 This can result in the network refusing to learn further, or
being too slow to reach an accurate prediction.
 Outputs not zero centered.
 Computationally expensive
TanH (Hyperbolic Tangent AF):
 TanH is like the logistic sigmoid, but often works better.
 The range of the TanH function is from -1 to +1.
 TanH is often preferred over the sigmoid neuron because it is zero-centred.
 The advantage is that negative inputs are mapped strongly negative and zero inputs are mapped near zero on the tanh graph.

tanh(x) = 2 * sigmoid(2x) - 1

Fig. Sigmoid Vs Tanh
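A quick check of the identity above and of tanh's zero-centred output:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    # both columns agree, confirming tanh(x) = 2*sigmoid(2x) - 1
    print(x, round(math.tanh(x), 6), round(2.0 * sigmoid(2.0 * x) - 1.0, 6))
```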

The function is differentiable and monotonic, but its derivative is not monotonic.
Advantages
 Zero centered—making it easier to model inputs that have
strongly negative, neutral, and strongly positive values.
Disadvantages
 Like the sigmoid function, it also suffers from the vanishing gradient problem.
 Hard to train on small datasets.
ReLU (Rectified Linear Unit):
 ReLU is the most used activation function.
 Used in almost all convolutional neural networks, in the hidden layers only.
 ReLU is half-rectified (from the bottom).
f(z) = 0, if z < 0
     = z, otherwise
R(z) = max(0, z)
 The range is [0, inf).

Advantages
 Avoids vanishing gradient problem.
 Computationally efficient—allows the network to converge
very quickly
 Non-linear—although it looks like a linear function, ReLU
has a derivative function and allows for backpropagation
Disadvantages
 Can only be used within hidden layers.
 Hard to train on small datasets; needs a lot of data to learn non-linear behaviour.
 The dying ReLU problem: for negative inputs the gradient of the function is zero, so the corresponding neurons stop contributing to backpropagation and cannot learn.
 The function and its derivative are both monotonic.
 All negative values are converted to zero immediately, so the function can neither map nor fit negative-valued data properly, which creates a problem.
Leaky ReLU Activation Function
 The Leaky ReLU activation function was needed to solve the 'dying ReLU' problem.
 With Leaky ReLU we do not set all negative inputs to zero but to a value near zero, which solves the major issue of the ReLU activation function.
R(z) = max(0.1*z, z)

Advantages
 Prevents dying ReLU problem—this variation of ReLU has a
small positive slope in the negative area, so it does enable
backpropagation, even for negative input values
 Otherwise like ReLU
Disadvantages
 Results not consistent—leaky ReLU does not provide
consistent predictions for negative input values.
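A small sketch comparing ReLU and Leaky ReLU on negative inputs; ReLU's gradient there is zero (the dying ReLU problem), while Leaky ReLU keeps a small slope:

```python
def relu(z):
    return max(0.0, z)

def leaky_relu(z, slope=0.1):          # slope 0.1 matches R(z) = max(0.1*z, z) above
    return max(slope * z, z)

def relu_grad(z):
    return 0.0 if z < 0 else 1.0       # zero gradient for negative inputs

def leaky_relu_grad(z, slope=0.1):
    return slope if z < 0 else 1.0     # small but non-zero gradient for negative inputs

for z in (-2.0, -0.5, 0.0, 1.5):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```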
Softmax:

 Sigmoid is not able to handle more than two cases (class labels).
 Softmax can handle multiple classes.
 The softmax function squeezes the output for each class to a value between 0 and 1, with the outputs summing to 1.
 It is ideally used in the final output layer of a classifier, where we are actually trying to obtain class probabilities.
 Softmax produces multiple outputs for an input array.
 For this reason, we can build neural network models that classify more than two classes, instead of only binary solutions.

σ(z)_i = e^(z_i) / Σ_{j=1..K} e^(z_j)
where:
σ = softmax
z = input vector
e^(z_i) = standard exponential of the i-th element of the input vector
K = number of classes in the multi-class classifier
e^(z_j) = standard exponential of the j-th element, summed over in the denominator
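A minimal sketch of the formula above, with hypothetical scores for three classes:

```python
import math

def softmax(z):
    m = max(z)                                   # subtract the max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]                         # hypothetical logits for 3 classes
probs = softmax(scores)
print([round(p, 3) for p in probs], round(sum(probs), 3))   # probabilities sum to 1
```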
Advantages
 Able to handle multiple classes, whereas the other activation functions handle only one class: it normalizes the output for each class to between 0 and 1 by dividing by the sum of the exponentials, giving the probability of the input value belonging to a specific class.
 Useful for output neurons: Softmax is typically used only in the output layer, for neural networks that need to classify inputs into multiple categories.
