Deep Learning
In this case, the decision boundary equation is x_1 + x_2 = 2. Here, the only
input point that lies ON or ABOVE the line is (1,1), so it is the only input
that outputs 1 when passed through the AND function M-P neuron.
OR Function With 3 Inputs
Let's generalize this by looking at a 3-input OR function M-P unit. In this
case, the possible inputs are 8 points: (0,0,0), (0,0,1), (0,1,0), (1,0,0),
(1,0,1),… you got the point(s). We can map these on a 3D graph, and this time
we draw a decision boundary in 3 dimensions.
The plane that satisfies the decision boundary equation x_1 + x_2 + x_3 = 1
is shown below:
Take your time and convince yourself by looking at the above plot that all the points that
lie ON or ABOVE that plane (positive half space) will result in output 1 when passed
through the OR function M-P unit and all the points that lie BELOW that plane
(negative half space) will result in output 0.
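The same check can be done in code. The following is a minimal sketch, assuming the M-P unit fires when the sum of its binary inputs is at least the threshold (here 1, matching the plane x_1 + x_2 + x_3 = 1); the function name mp_neuron is chosen for illustration only.

```python
from itertools import product

def mp_neuron(inputs, threshold):
    """McCulloch-Pitts unit: output 1 when the sum of binary inputs reaches the threshold."""
    return 1 if sum(inputs) >= threshold else 0

# 3-input OR: threshold 1, i.e. points ON or ABOVE the plane x_1 + x_2 + x_3 = 1 give 1.
or_outputs = {x: mp_neuron(x, threshold=1) for x in product([0, 1], repeat=3)}

for point, out in or_outputs.items():
    print(point, "->", out)
```

Only (0,0,0) lies below the plane, so it is the only one of the 8 points that maps to 0. Using threshold=2 with two inputs gives the AND unit discussed earlier.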
Limitations Of M-P Neuron
1. What about non-boolean (say, real) inputs?
2. Do we always need to hand code the threshold?
3. Are all inputs equal? What if we want to assign more importance to some inputs?
4. What about functions which are not linearly separable? Say, the XOR function.
Conclusion
1. In this topic, we briefly looked at biological neurons.
2. We then established the concept of the McCulloch-Pitts neuron, the first-ever mathematical
model of a biological neuron.
3. We represented a bunch of Boolean functions using the M-P neuron.
4. We also tried to get a geometric intuition of what is going on with the model, using 3D plots.
5. In the end, we also established a motivation for a more generalized model, the one and only
artificial neuron/perceptron model.
Perceptron model
A Perceptron is also understood as an artificial neuron, or neural network unit,
that performs computations on input data to detect patterns, for example in business intelligence.
Basic Components of Perceptron
Input Nodes or Input Layer:
This is the primary component of Perceptron which accepts the initial data into the system
for further processing. Each input node contains a real numerical value.
Weight and Bias:
The weight parameter represents the strength of the connection between units and is another
important component of the Perceptron. The larger the magnitude of a weight, the more strongly
the associated input influences the output. Bias can be thought of as the intercept in a
linear equation.
Activation Function:
This is the final, important component that determines whether the neuron
will fire or not. The activation function can be considered primarily as a step function.
How does Perceptron work?
In Machine Learning, the Perceptron is considered a single-layer neural network that
consists of four main parts: input values (input nodes), weights and bias, net
sum, and an activation function.
The perceptron model begins by multiplying each input value by its weight,
then adds these products together, along with the bias, to create the weighted sum.
This weighted sum is then passed to the activation function 'f' to obtain the desired
output.
In the basic perceptron, this activation function 'f' is a step function.
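The computation described above can be sketched in a few lines. This is an illustrative example only: the weights (1, 1) and bias (-1.5) are assumed values picked so that the unit behaves like AND, and the step function is assumed to fire at a net input of 0 or more.

```python
def step(z):
    """Step activation: 1 when the net input is non-negative, else 0."""
    return 1 if z >= 0 else 0

def perceptron_output(inputs, weights, bias):
    # Multiply each input by its weight, add the products, then add the bias once.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(weighted_sum)

# Assumed illustrative parameters: with these values the unit computes AND.
weights, bias = [1.0, 1.0], -1.5
and_table = {(a, b): perceptron_output((a, b), weights, bias)
             for a in (0, 1) for b in (0, 1)}
print(and_table)
```

Unlike the M-P neuron, the inputs and weights here may be any real numbers, and the threshold is absorbed into the bias.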
Types of Perceptron Models
Based on the layers, Perceptron models are divided into two types. These are as
follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model
1. Single Layer Perceptron Model:
This is one of the simplest types of artificial neural network (ANN). A single-layer
perceptron model consists of a feed-forward network and includes a threshold transfer
function inside the model. The main objective of the single-layer perceptron model is to
classify linearly separable objects with binary outcomes. In a single-layer perceptron
model, the algorithm has no previously recorded data, so it begins with randomly allocated
values for the weight parameters. It then sums up all the weighted inputs. If the total
weighted sum is more than a pre-determined value, the model gets activated
and shows the output value as +1.
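How the weights move away from their random starting values is not spelled out above. A minimal sketch follows, assuming the classic perceptron learning rule (add lr * (target - prediction) * input to each weight) and using 0/1 outputs instead of +1 for simplicity; the names train_perceptron, lr, and max_epochs are hypothetical.

```python
import random

def predict(x, weights, bias):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z >= 0 else 0

def train_perceptron(samples, lr=0.1, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    # Start from randomly allocated weights and bias.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(len(samples[0][0]))]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            delta = target - predict(x, weights, bias)
            if delta != 0:
                # Assumed perceptron rule: nudge the weights toward the correct output.
                errors += 1
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
        if errors == 0:  # every sample classified correctly in this pass
            break
    return weights, bias

# AND is linearly separable, so the rule is expected to find a separating line.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(x, weights, bias) for x, _ in and_samples]
print(predictions)
```

The same loop would never reach zero errors on XOR, which is the linearly non-separable case listed among the M-P neuron limitations.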
Multi-Layered Perceptron Model
Like a single-layer perceptron model, a multi-layer perceptron model has the same
basic structure but includes one or more hidden layers. It is trained in two stages:
1. Forward Stage: Activations are computed starting from the input layer and
terminating at the output layer.
2. Backward Stage: Weight and bias values are modified as per the model's requirement.
In this stage, the error between the actual output and the desired output is
propagated backward, starting at the output layer and ending at the input layer.
Advantages of Multi-Layer Perceptron
Linear and non-linear separability. (A) shows two classes of data that can be separated by
a single straight line (i.e. they are 'linearly separable'). (B) shows two classes of data that
require a more complex classification topology (i.e. they are not linearly separable).
Neural networks are capable of solving the complex topography in (B), whereas more
traditional statistical methods (such as binary logistic regression, the obvious comparator in
this case) are not.
What is Deep Learning?
It is a technique in which machines are trained to
mimic the behavior of the human brain, i.e. to act or
behave like a human being.
History of deep learning
1943 - Threshold logic, the McCulloch-Pitts neuron (Warren McCulloch and Walter Pitts)
1985 - Hinton, Rumelhart, and Williams
Hidden Layer
As the name suggests, the nodes of this layer are not exposed. They provide an abstraction
to the neural network. The hidden layer performs all kinds of computation on the features
entered through the input layer and transfers the result to the output layer.
Output Layer
It’s the final layer of the network that brings the information learned through the hidden
layer and delivers the final value as a result.
Note: All hidden layers usually use the same activation function. However, the output layer
will typically use a different activation function from the hidden layers. The choice
depends on the goal or type of prediction made by the model.
Feedforward vs. Backpropagation
When learning about neural networks, you will come across two essential terms describing
the movement of information—feedforward and backpropagation.
Feedforward Propagation –
The flow of information occurs in the forward direction. The input is used to calculate
some intermediate function in the hidden layer, which is then used to calculate an output.
In the feedforward propagation, the Activation Function is a mathematical “gate” in
between the input feeding the current neuron and its output going to the next layer.
Backpropagation -
The weights of the network connections are repeatedly adjusted to minimize the difference
between the actual output vector of the net and the desired output vector.
To put it simply:
Backpropagation aims to minimize the cost function by adjusting the network's weights
and biases. The gradients of the cost function with respect to these parameters
determine how much each one is adjusted.
Why do Neural Networks Need an Activation Function?
So we know what Activation Function is and what it does, but—
Mathematically,
net input = sum(weight_i * input_i) + bias
Types of Neural Networks Activation Functions
1. Binary Step Function
Binary step function depends on a threshold value that decides whether a neuron should be
activated or not.
The input fed to the activation function is compared to a certain threshold; if the input is
greater than it, then the neuron is activated, else it is deactivated, meaning that its output is
not passed on to the next hidden layer.
Mathematically, with threshold θ, it can be represented as f(x) = 1 if x > θ, and f(x) = 0 otherwise.
The limitations of binary step function:
2. The gradient of the step function is zero, which causes a problem in the
backpropagation process.
Non-Linear Neural Networks Activation Functions
The larger the input (more positive), the closer the output value will be to 1.0, whereas
the smaller the input (more negative), the closer the output will be to 0.0, as shown
below.
Mathematically, the sigmoid/logistic function can be represented as: f(x) = 1 / (1 + e^(-x))
Why is the sigmoid/logistic activation function one of the most widely used functions?
These are its advantages:
I. It is commonly used for models where we have to predict a probability as an output.
Since a probability lies only in the range 0 to 1, sigmoid is the right choice because
of its range.
II. The function is differentiable and provides a smooth gradient, i.e., it prevents jumps
in output values. This is reflected in the S-shape of the sigmoid activation function.
The limitations of sigmoid function are discussed below:
The output of the tanh activation function is Zero centered; hence we can easily map the
output values as strongly negative, neutral, or strongly positive.
It is usually used in hidden layers of a neural network, as its values lie between -1 and 1;
therefore, the mean for the hidden layer comes out to be 0 or very close to it. This helps in
centering the data and makes learning for the next layer much easier.
Have a look at the gradient of the tanh activation function to understand its limitations.
Although it gives an impression of a linear function, ReLU has a derivative function and
allows for backpropagation while simultaneously making it computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same
time. The neurons will only be deactivated if the output of the linear transformation is less
than 0.
Mathematically it can be represented as f(x) = max(0, x).
By making this minor modification for negative input values, the gradient of the left
side of the graph comes out to be a non-zero value. Therefore, we would no longer
encounter dead neurons in that region.
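The activation functions discussed so far can be written side by side for comparison. This is a minimal sketch using standard textbook definitions; the default threshold of 0 for the step function and the negative-side slope of 0.01 for the leaky variant of ReLU are assumptions, not values given in the text.

```python
import math

def binary_step(x, threshold=0.0):
    # Fires only when the input is greater than the threshold.
    return 1.0 if x > threshold else 0.0

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Zero-centered, with outputs in the range (-1, 1).
    return math.tanh(x)

def relu(x):
    # Passes positive inputs through and outputs 0 for negative inputs.
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Assumed slope of 0.01: negative inputs keep a small non-zero gradient.
    return x if x > 0 else slope * x

for x in (-2.0, 0.0, 2.0):
    print(x, binary_step(x), round(sigmoid(x), 3), round(tanh(x), 3),
          relu(x), leaky_relu(x))
```

For a negative input such as -2.0, relu returns 0 while leaky_relu returns a small negative value, which is the "minor modification" that keeps the gradient non-zero on the left side of the graph.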
But—
Let’s suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How
can we move forward with it?
The above values don’t make sense as the sum of all the classes/output probabilities should
be equal to 1.
You see, the Softmax function is described as a combination of multiple sigmoids.
It is most commonly used as an activation function for the last layer of the neural
network in the case of multi-class classification.
Assume that you have three classes, meaning that there would be three neurons in the
output layer. Now, suppose that your output from the neurons is [1.8, 0.9, 0.68].
Applying the Softmax function over these values to give a probabilistic view will result in
the following outcome: [0.58, 0.23, 0.19].
The predicted class is the index with the largest probability: taking the argmax gives 1
for that index and 0 for the other two array indexes.
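The three-class example above can be reproduced with a short sketch, assuming the usual definition of Softmax (exponentiate each value and divide by the sum of the exponentials). Subtracting the maximum first is a common numerical-stability step and does not change the result.

```python
import math

def softmax(values):
    # Shift by the maximum so that exp() never sees a large positive number.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.8, 0.9, 0.68])
rounded = [round(p, 2) for p in probs]
print(rounded)  # [0.58, 0.23, 0.19]

# The predicted class is the index with the largest probability.
predicted_class = probs.index(max(probs))
print(predicted_class)  # 0
```

The outputs now sum to 1, so they can be read as class probabilities, unlike the five independent values in the earlier example.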
Why are deep neural networks hard to train?
There are two challenges you might encounter when training your deep neural networks.
1. Vanishing Gradients
Certain activation functions, like the sigmoid function, squash a large input space into a
small output space between 0 and 1. Therefore, a large change in the input of the sigmoid
function will cause only a small change in the output. Hence, the derivative becomes small.
For shallow networks with only a few layers that use these activations, this isn’t a big
problem. However, when more layers are used, it can cause the gradient to be too small
for training to work effectively.
2. Exploding Gradients
Exploding gradients are problems where significant error gradients accumulate and
result in very large updates to neural network model weights during training. An
unstable network can result when there are exploding gradients, and the learning
cannot be completed. The values of the weights can also become so large that they
overflow and result in NaN values.
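Both problems come from multiplying many per-layer factors together during backpropagation. The sketch below is a simplified illustration, assuming every layer contributes the same factor: 0.25 (the largest possible slope of the sigmoid) for the vanishing case, and an arbitrary factor of 1.5 for the exploding case.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def chained_gradient(per_layer_factor, num_layers):
    """Product of identical per-layer factors, as the chain rule would multiply them."""
    return per_layer_factor ** num_layers

max_slope = sigmoid_derivative(0.0)  # the sigmoid is steepest at 0, where the slope is 0.25

for depth in (2, 10, 30):
    vanishing = chained_gradient(max_slope, depth)
    exploding = chained_gradient(1.5, depth)  # 1.5 is an assumed illustrative factor
    print(depth, vanishing, exploding)
```

With 2 layers the difference is modest; with 30 layers the first product is practically zero and the second is in the hundreds of thousands, which is why depth makes both problems worse.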
How to choose the right Activation Function?
You need to match the activation function of your output layer to the type of
prediction problem that you are solving, specifically the type of predicted variable.
Here's what you should keep in mind.
As a rule of thumb, you can begin with the ReLU activation function in the hidden layers
and then move over to other activation functions if ReLU doesn't provide optimum results.
The activation function used in hidden layers is typically chosen based on the type of neural
network architecture.
Advantage
I. Easy to interpret.
II. Always differentiable, because of the square.
III. Only one local minimum.
Disadvantage
▪ The error unit is squared, so it is not in the same unit as the output and is harder to interpret.
▪ Not robust to outliers.
Advantage
1. Intuitive and easy.
2. The error unit is the same as that of the output column.
3. Robust to outliers.
Disadvantage
1. The graph is not differentiable everywhere, so we cannot use gradient descent directly;
sub-gradient calculation can be used instead.
Disadvantage
Its main disadvantage is the associated complexity. In order to maximize model accuracy,
the hyperparameter δ will also need to be optimized which increases the training
requirements.
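The properties listed above match three common regression losses: a squared-error loss, an absolute-error loss, and a loss with a hyperparameter δ that blends the two. The sketch below assumes these are mean squared error, mean absolute error, and Huber loss (the names are an inference from the listed properties), with δ = 1.0 as an arbitrary default.

```python
def mse(y_true, y_pred):
    # Mean squared error: large errors are squared, so outliers dominate.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # Mean absolute error: same unit as the output, robust to outliers.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for small errors, linear for errors larger than delta.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]  # the last prediction is an outlier (error of 10)

print(mse(y_true, y_pred))    # 25.0
print(mae(y_true, y_pred))    # 2.5
print(huber(y_true, y_pred))  # 2.375
```

A single outlier pushes the squared-error loss to 25.0, while the other two stay near 2.5, which illustrates the "not robust to outliers" point.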
B. Classification Loss
Binary cross entropy is used in binary (two-class) classification problems, for example whether
a person has COVID or not, or whether an article becomes popular or not. Binary cross entropy
compares each of the predicted probabilities to the actual class output, which can be either 0
or 1. It then calculates a score that penalizes the probabilities based on their distance from
the expected value, that is, how close or far they are from the actual value.
yi – actual values
yihat – Neural Network prediction
Advantage –
The cost function is differentiable.
Disadvantage –
Multiple local minima
Not intuitive
Note – In binary classification, use a sigmoid activation function at the last neuron.
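A minimal sketch of binary cross entropy follows, assuming the standard formula: the negative average of y * log(yhat) + (1 - y) * log(1 - yhat), with yi and yihat as defined above. The eps clipping is an added safeguard against log(0), not something stated in the text.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Confident and correct predictions give a low loss ...
good = binary_cross_entropy([1, 0], [0.9, 0.1])
# ... while confident and wrong predictions are penalized heavily.
bad = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(good, 4), round(bad, 4))
```

The predicted probabilities passed in here would typically come from the sigmoid at the last neuron.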
2. Categorical Cross Entropy
Categorical cross entropy is used for multi-class classification, where the model outputs a
probability for each class.
CE = distance of the predicted probability from the actual probability (actual output vs.
predicted output)
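As a sketch, assuming the standard definition (the negative sum of actual_i * log(predicted_i) over the classes, with the actual output one-hot encoded), categorical cross entropy for a single sample can be written as follows. The probabilities reuse the Softmax example from earlier; eps is an added guard against log(0).

```python
import math

def categorical_cross_entropy(y_true_one_hot, y_prob, eps=1e-12):
    # Only the term for the true class contributes, because the others are multiplied by 0.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true_one_hot, y_prob))

# The true class is the first one; the model assigns it probability 0.58.
loss = categorical_cross_entropy([1, 0, 0], [0.58, 0.23, 0.19])
print(round(loss, 4))

# If the true class were the third one (probability 0.19), the loss would be larger.
worse = categorical_cross_entropy([0, 0, 1], [0.58, 0.23, 0.19])
print(round(worse, 4))
```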
1. The cost function (J) indicates how accurately the model performs.
2. It tells us how far-off our predicted output values are from our actual values.
3. It is also known as the error. Because the cost function quantifies the error, we aim to
minimize the cost function.
What we want is to reduce the output error.
Since the weights affect the error, we will need
to readjust the weights. We have to adjust the
weights such that we have a combination of
weights that minimizes the cost function.
Backpropagation uses the chain rule to calculate the gradient of the cost function. The
chain rule involves taking derivatives, in this case the partial derivative of the cost
with respect to each parameter. These derivatives are calculated by differentiating with
respect to one weight while treating the other(s) as constant. The result is the gradient.
Since we have calculated the gradients, we will be able to adjust the weights.
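The chain rule can be made concrete on the smallest possible network. The sketch below assumes a single neuron with a sigmoid activation and a squared-error cost of 0.5 * (a - y)^2; all numeric values are arbitrary. It computes the gradient by the chain rule and compares it with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

def gradients(w, b, x, y):
    a = sigmoid(w * x + b)
    dL_da = a - y           # derivative of the cost with respect to the activation
    da_dz = a * (1.0 - a)   # derivative of the sigmoid with respect to its input
    dL_dz = dL_da * da_dz   # chain rule
    return dL_dz * x, dL_dz # dz/dw = x and dz/db = 1

w, b, x, y = 0.5, -0.2, 1.5, 1.0  # assumed illustrative values
dw, db = gradients(w, b, x, y)

# Numerical check: vary w slightly while holding b constant.
h = 1e-6
numeric_dw = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
print(dw, numeric_dw)
```

The two numbers should agree closely, which is a common way to sanity-check a hand-written backpropagation step.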
Significance of Gradient Descent
The weights are adjusted using a process called gradient descent. Gradient descent is
an optimization algorithm that is used to find the weights that minimize the cost
function. Minimizing the cost function means getting to the minimum point of the cost
function. So, gradient descent aims to find a weight corresponding to the cost function’s
minimum point. To find this weight, we must navigate down the cost function until we
find its minimum point.
But first, to navigate the cost function, we need two things:
1. The direction in which to navigate and
2. The size of the steps for navigating.
The Direction
The direction for navigating the cost function is found using the gradient.
The Gradient
To know in which direction to navigate, gradient descent uses backpropagation. More
specifically, it uses the gradients calculated through backpropagation. These gradients are used
for determining the direction to navigate to find the minimum point. Specifically, we aim to
find the negative gradient. This is because a negative gradient indicates a decreasing slope. A
decreasing slope means that moving downward will lead us to the minimum point. For
example:
The Step Size
The step size for navigating the cost function is
determined using the learning rate.
Learning Rate
The learning rate is a tuning parameter that
determines the step size at each iteration of
gradient descent. It determines the speed at
which we move down the slope.
The step size plays an important part in ensuring a balance between optimization time and
accuracy. The step size is controlled by a parameter alpha (α). A small α means a small
step size, and a large α means a large step size.
1. If the step sizes are too large, we could miss the minimum point completely. This can
yield inaccurate results.
2. If the step size is too small, the optimization process could take too much time. This
will lead to a waste of computational power.
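The effect of the step size can be seen on a toy cost function. This is a minimal sketch, assuming the cost J(w) = w^2 (whose gradient is 2w and whose minimum is at w = 0), a starting point of 5.0, and 20 update steps; the three α values are chosen only to show the three regimes.

```python
def gradient_descent(alpha, w0=5.0, steps=20):
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # gradient of the assumed cost J(w) = w^2
        w = w - alpha * grad  # step against the gradient
    return w

results = {alpha: gradient_descent(alpha) for alpha in (0.01, 0.1, 1.1)}
for alpha, w in results.items():
    print(alpha, w)
```

With α = 0.01 the weight has barely moved toward 0 after 20 steps, with α = 0.1 it is close to the minimum, and with α = 1.1 each step overshoots so far that the weight moves away from the minimum.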
The size of each update depends on both the learning rate and the
behavior of the cost function. The higher the gradient of the
cost function, the steeper the slope and the larger the update,
so the faster a model can learn. A high learning rate results in a
higher step value, and a lower learning rate results in a lower
step value. If the gradient of the cost function is zero, the
model stops learning.
Descending the Cost Function
Navigating the cost function consists of adjusting the weights.
The weights are adjusted using the following formula:
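A common form of this update, stated here as an assumption, uses w for a weight, α for the learning rate introduced above, and J for the cost function:

```latex
w_{\text{new}} = w_{\text{old}} - \alpha \, \frac{\partial J}{\partial w}
```

The minus sign reflects moving against the gradient, that is, down the slope of the cost function.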
•Batch – It denotes the number of samples used for each update of the model
parameters.
•Learning rate – It is a parameter that sets the scale of how much the
model weights should be updated.
•Cost Function/Loss Function – A cost function is used to calculate the cost, which
is the difference between the predicted value and the actual value.
•Weights/ Bias – The learnable parameters in a model that control the signal
between two neurons.
Gradient Descent
Gradient Descent is an optimization algorithm used in Machine/Deep Learning algorithms.
The goal of Gradient Descent is to minimize the convex objective function f(x) using
iteration. In simple words, it is a way to reach the global minimum iteratively.
Gradient Descent on Cost function.
Steps to implement Gradient Descent
1. Randomly initialize values.
2. Update values.
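The two steps above can be sketched on a small regression problem. This is an illustrative example only: it assumes a linear model w * x + b, a mean squared error cost, a learning rate of 0.05, and toy data generated from y = 2x + 1. The whole dataset is used for every update (batch gradient descent).

```python
import random

def batch_gradient_descent(xs, ys, alpha=0.05, iterations=2000, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly initialize values.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    for _ in range(iterations):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step 2: update values, moving against the gradient.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # assumed toy data on the line y = 2x + 1
w, b = batch_gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

In practice the loop is usually stopped once the gradient is close to zero rather than after a fixed number of iterations.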
➢ On massive datasets, we need a better option than plain gradient descent.
➢ To tackle the challenges of large datasets, we have stochastic gradient descent, a popular
approach among optimizers in deep learning.
➢ The term stochastic denotes the element of randomness upon which the algorithm
relies.
➢ In stochastic gradient descent, instead of processing the entire dataset during each
iteration, we randomly select batches of data.
➢ This implies that only a few samples from the dataset are considered at a time,
allowing for more efficient and computationally feasible optimization in deep learning
models.
➢ The procedure is first to select the initial parameters w and the learning rate η. Then
randomly shuffle the data at each iteration and update the parameters until an approximate
minimum is reached.
➢ Since we are not using the whole dataset but batches of it for each iteration, the path
taken by the algorithm is noisy compared to that of the gradient descent algorithm.
Thus, SGD needs a higher number of iterations to reach a local minimum.
➢ Due to an increase in the number of iterations, the overall computation time increases.
But even after increasing the number of iterations, the computation cost is still less than
that of the gradient descent optimizer.
➢ So the conclusion is if the data is enormous and computational time is an essential factor,
stochastic gradient descent should be preferred over batch gradient descent algorithm.
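The procedure described above can be sketched as follows. This is a minimal example under stated assumptions: a linear model w * x + b with a squared-error cost, one randomly ordered sample per update (the strictest form of SGD), a learning rate eta of 0.01, and toy data from y = 2x + 1.

```python
import random

def stochastic_gradient_descent(xs, ys, eta=0.01, epochs=1000, seed=0):
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # initial parameters
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)          # randomly shuffle the data at each pass
        for x, y in data:          # update on one sample at a time
            error = (w * x + b) - y
            w -= eta * 2.0 * error * x
            b -= eta * 2.0 * error
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]   # assumed toy data on the line y = 2x + 1
w, b = stochastic_gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

Each update touches a single sample, so it is cheap, but many more updates are needed than with full-batch gradient descent.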
Mini Batch Gradient Descent
➢ In this variant of gradient descent, instead of taking all the training data, only a subset of
the dataset is used for calculating the loss function.
➢ Since we are using a batch of data instead of taking the whole dataset, fewer iterations
are needed. That is why the mini-batch gradient descent algorithm is faster than both
stochastic gradient descent and batch gradient descent algorithms.
➢ This algorithm is more efficient and robust than the earlier variants of gradient descent.
As the algorithm uses batching, all the training data need not be loaded in the memory,
thus making the process more efficient to implement.
➢ Moreover, the cost function in mini-batch gradient descent is noisier than the batch
gradient descent algorithm but smoother than that of the stochastic gradient descent
algorithm. Because of this, mini-batch gradient descent is ideal and provides a good
balance between speed and accuracy.
➢ Despite all that, the mini-batch gradient descent algorithm has some downsides too.
It needs a hyperparameter that is “mini-batch-size”, which needs to be tuned to
achieve the required accuracy.
➢ That said, a batch size of 32 is often considered a reasonable default.
Also, in some cases, mini-batch gradient descent results in poor final accuracy,
which motivates looking for other alternatives too.
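A minimal sketch of mini-batch gradient descent follows, assuming a linear model with a mean squared error cost, a batch size of 32, a learning rate of 0.3, and 100 toy samples from y = 2x + 1. The helper name minibatches is hypothetical.

```python
import random

def minibatches(data, batch_size, rng):
    # Shuffle a copy, then yield consecutive slices of at most batch_size samples.
    shuffled = data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

def minibatch_gradient_descent(data, batch_size=32, alpha=0.3, epochs=500, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            errors = [((w * x + b) - y, x) for x, y in batch]
            # Gradient of the mean squared error over this mini-batch only.
            grad_w = 2.0 * sum(e * x for e, x in errors) / len(batch)
            grad_b = 2.0 * sum(e for e, _ in errors) / len(batch)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b

# Assumed toy data: 100 points on the line y = 2x + 1, with x in [0, 1).
data = [(i / 100.0, 2.0 * (i / 100.0) + 1.0) for i in range(100)]
w, b = minibatch_gradient_descent(data)
print(round(w, 3), round(b, 3))
```

Setting batch_size to len(data) recovers batch gradient descent and setting it to 1 recovers single-sample stochastic gradient descent, which is why the mini-batch size is the hyperparameter that needs tuning.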