Unit-5: Introduction To Deep Learning: Artificial Neural Networks
In the majority of neural networks, units are interconnected from one layer to another. Each
of these connections has weights that determine the influence of one unit on another unit. As
the data passes from one unit to another, the neural network learns more and more about the
data, eventually producing an output from the output layer.
Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks,
cell nucleus represents Nodes, synapse represents Weights, and Axon represents Output.
Relationship between Biological Neural Network and Artificial Neural Network:
Biological Neural Network        Artificial Neural Network
Dendrites                        Inputs
Cell nucleus                     Nodes
Synapse                          Weights
Axon                             Output
An artificial neural network consists of a large number of artificial neurons, termed units,
arranged in a sequence of layers. Let us look at the various types of layers available in an
artificial neural network.
Artificial Neural Network primarily consists of three layers:
• Input Layer: As the name suggests, it accepts inputs in several different formats
provided by the programmer.
• Hidden Layer: The hidden layer lies between the input and output layers. It
performs all the calculations needed to find hidden features and patterns.
• Output Layer: The input goes through a series of transformations using the hidden
layer, which finally results in output that is conveyed using this layer.
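As a rough illustration of this three-layer structure, the following sketch builds a small fully
connected network with the Keras API. The layer sizes (4 inputs, 8 hidden units, 3 outputs) and
the activation choices are assumed example values, not part of these notes.

# A minimal sketch of the input / hidden / output structure using Keras.
# All sizes and activations below are assumed values for illustration only.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(4,)),               # input layer: 4 features
    keras.layers.Dense(8, activation="relu"),     # hidden layer: learns patterns
    keras.layers.Dense(3, activation="softmax"),  # output layer: 3 classes
])
model.summary()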
Perceptron EX-OR problem:
The XOR, or "exclusive OR", problem is a classic problem in the field of artificial
intelligence and machine learning. It is a problem that cannot be solved by a single layer
perceptron, and therefore requires a multi-layer perceptron or a deep learning model. This
section aims to provide a comprehensive understanding of the XOR problem and how it can
be solved using a neural network.
The XOR function
The XOR function is a binary function that takes two binary inputs and returns a binary output.
The output is true if the number of true inputs is odd, and false otherwise. In other words, it
returns true if exactly one of the inputs is true, and false otherwise.
The following table shows the truth table for the XOR function:
x    y    XOR(x, y)
0    0    0
0    1    1
1    0    1
1    1    0
If these four points are plotted, with a circle drawn when x and y are the same and a diamond
when they are different, no single straight line can separate the circles from the diamonds.
Hence the XOR function is not linearly separable. This is where the XOR problem in neural
networks arises: a single-layer perceptron, due to its linear nature, fails to model the XOR
function.
Overcoming the XOR problem
The XOR problem can be overcome by using a multi-layer perceptron (MLP), also known as
a neural network. An MLP consists of multiple layers of perceptrons, allowing it to model
more complex, non-linear functions. In such a structure, the first layer is the input layer.
The second layer (hidden layer) transforms the original non-linearly separable problem into a
linearly separable one, which the third layer (output layer) can then solve.
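As a hedged sketch of this idea, the snippet below trains a small multi-layer perceptron on the
four XOR examples using scikit-learn (which is not part of these notes and is used here only for
brevity); the hidden-layer size, solver, and random seed are assumed example choices.

import numpy as np
from sklearn.neural_network import MLPClassifier

# The four XOR examples and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# A single perceptron (no hidden layer) cannot fit this data; the hidden layer
# remaps it into a space where the output layer can separate it linearly.
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="logistic",
                    solver="lbfgs", random_state=1, max_iter=1000)
mlp.fit(X, y)
print(mlp.predict(X))  # expected [0 1 1 0]; another seed may be needed if training stalls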
Feedforward Propagation:
Feedforward propagation is the process in a neural network where the input data is passed
through the network's layers, from the input layer through the hidden layers to the output layer,
to generate a prediction or output. Each layer in the network performs a transformation on the
input data using weights and biases associated with the connections between neurons in
adjacent layers, typically followed by an activation function.
Here's a basic outline of feedforward propagation in a neural network:
Input Layer: The input data is fed into the input layer neurons. Each neuron in the input layer
corresponds to a feature in the input data.
Hidden Layers: The input data is then passed through one or more hidden layers. Each neuron
in a hidden layer receives input from all neurons in the previous layer, applies a weighted sum
of inputs, adds a bias term, and then applies an activation function (like ReLU, sigmoid, or
tanh) to produce an output.
Output Layer: The final hidden layer output is passed to the output layer, which processes
the inputs in a similar way as the hidden layers but typically uses a different activation function
(e.g., softmax for classification or linear for regression) to produce the final output of the
network.
Output: The output of the output layer is the prediction or output of the neural network for
the given input data. During the training phase, the weights and biases in the network are
adjusted based on the error between the predicted output and the actual target output, using
techniques like backpropagation and optimization algorithms like gradient descent, to
minimize the error and improve the network's performance.
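The following NumPy sketch shows a single forward pass through a network with one hidden
layer; the layer sizes, random weights, and the ReLU/softmax pairing are assumed values chosen
only to make the steps above concrete.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

# Assumed sizes: 3 input features, 5 hidden units, 2 output classes.
rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)   # hidden -> output weights and biases

x = np.array([[0.2, -1.0, 0.5]])                # one input example

h = relu(x @ W1 + b1)          # hidden layer: weighted sum + bias, then activation
y_hat = softmax(h @ W2 + b2)   # output layer: weighted sum + bias, then softmax
print(y_hat)                   # two class probabilities summing to 1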
Back Propagation:
Backpropagation is the process used in training neural networks to update the weights of the
network in order to minimize the error between the predicted output and the actual target
output. It is essentially a way of calculating the gradient of the loss function with respect to
the weights of the network, which is then used to update the weights using an optimization
algorithm like gradient descent.
Here's a basic outline of the backpropagation process:
Forward Pass: During the forward pass (feedforward propagation), the input data is passed
through the network, and the output is calculated.
Calculate Loss: The output of the network is compared to the actual target output, and a loss
function is calculated to measure the difference between them. Common loss functions
include mean squared error (MSE) for regression problems and cross-entropy loss for
classification problems.
Backward Pass (Backpropagation): The goal of backpropagation is to calculate the gradient
of the loss function with respect to each weight in the network. This is done using the chain
rule of calculus to propagate the error backwards through the network.
Update Weights: Once the gradients have been calculated, the weights of the network are
updated using an optimization algorithm like gradient descent. The weights are updated in the
opposite direction of the gradient in order to minimize the loss function.
Repeat: Steps 1-4 are repeated for each batch of training data until the model converges and
the weights have been optimized. Backpropagation is a key component of training neural
networks and allows them to learn complex patterns in data by iteratively adjusting the
weights of the network to minimize the error.
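The sketch below runs these four steps for a tiny one-hidden-layer regression network in NumPy
(sigmoid hidden layer, linear output, mean squared error); the data, layer sizes, and learning
rate are all assumed toy values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))     # 8 toy examples with 3 features
y = rng.normal(size=(8, 1))     # toy regression targets

W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lr = 0.1                        # learning rate (assumed)

for step in range(200):
    # 1. Forward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = h @ W2 + b2                          # linear output for regression
    # 2. Calculate loss (mean squared error)
    loss = np.mean((y_hat - y) ** 2)
    # 3. Backward pass: chain rule, propagating the error backwards
    d_yhat = 2 * (y_hat - y) / len(X)            # dLoss / dy_hat
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0, keepdims=True)
    d_h = (d_yhat @ W2.T) * h * (1 - h)          # through the sigmoid derivative
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0, keepdims=True)
    # 4. Update weights in the opposite direction of the gradient
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(round(loss, 4))  # the loss typically decreases over the iterations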
Losses:
In artificial neural networks (ANNs), losses are used to measure the difference between the
predicted output and the actual target. The choice of loss function depends on the task at hand
(e.g., regression, classification) and the network's output (e.g., scalar, vector, matrix).
• Mean Squared Error (MSE): Commonly used for regression tasks, MSE calculates
the average of the squared differences between predicted and actual values. It
penalizes large errors more than small ones.
• Binary Cross-Entropy: Used for binary classification tasks, this loss function
measures the difference between two probability distributions (predicted and actual)
for a binary outcome.
• Categorical Cross-Entropy: Used for multiclass classification tasks, categorical
cross-entropy calculates the difference between predicted and actual class
probabilities across all classes.
• Sparse Categorical Cross-Entropy: Similar to categorical cross-entropy but used
when the target labels are integers (e.g., 0, 1, 2) instead of one-hot encoded vectors.
• Kullback-Leibler Divergence (KL Divergence): Measures how one probability
distribution diverges from a second, expected probability distribution. It's often used
in scenarios where you have a target distribution and want to measure how well your
model captures it.
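As a sketch, the functions below compute three of these losses directly in NumPy; predictions
are assumed to already be probabilities, and a small epsilon guards the logarithms.

import numpy as np

EPS = 1e-12  # guards against log(0)

def mse(y_true, y_pred):
    # Mean of squared differences; large errors are penalized quadratically.
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred):
    # y_true in {0, 1}; p_pred is the predicted probability of class 1.
    p = np.clip(p_pred, EPS, 1 - EPS)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, p_pred):
    # Rows of y_true_onehot are one-hot vectors; rows of p_pred sum to 1.
    p = np.clip(p_pred, EPS, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

# Toy usage with made-up numbers:
print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.0])))                    # 0.625
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))
print(categorical_cross_entropy(np.array([[0, 1, 0]]), np.array([[0.1, 0.7, 0.2]])))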
Activation Function:
1. Sigmoid / Logistic Activation Function
This function takes any real value as input and outputs values in the range of 0 to 1. The larger
the input (more positive), the closer the output value will be to 1.0, whereas the smaller the
input (more negative), the closer the output will be to 0.0, as shown below.
Mathematically it can be represented as:
f(x) = 1 / (1 + e^(-x))
It is commonly used for models where we have to predict the probability as an output. Since
probability of anything exists only between the range of 0 and 1, sigmoid is the right choice
because of its range.
The function is differentiable and provides a smooth gradient, preventing abrupt jumps in
output values. This smoothness is reflected in the characteristic S-shape of the sigmoid curve.
2. ReLU Function
ReLU stands for Rectified Linear Unit. Although it gives an impression of a linear function,
ReLU has a derivative function and allows for backpropagation while simultaneously making
it computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same
time: a neuron is deactivated only if the output of its linear transformation is less than 0.
Mathematically it can be represented as:
f(x) = max(0, x)
The advantages of using ReLU as an activation function are as follows:
• Since only a certain number of neurons are activated, the ReLU function is far more
computationally efficient when compared to the sigmoid and tanh functions.
• ReLU accelerates the convergence of gradient descent towards the global minimum
of the loss function due to its linear, non-saturating property.
Leaky ReLU is an improved version of ReLU function to solve the Dying ReLU problem as
it has a small positive slope in the negative area.
Parametric ReLU is another variant of ReLU that aims to solve the problem of the gradient
becoming zero for the left half of the axis.
This function provides the slope of the negative part of the function as an argument a. By
performing backpropagation, the most appropriate value of a is learnt.
Exponential Linear Unit, or ELU for short, is also a variant of ReLU that modifies the slope
of the negative part of the function.
ELU uses a log curve to define the negative values, unlike the Leaky ReLU and Parametric
ReLU functions, which use a straight line.
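The sketch below implements the activation functions discussed above in NumPy; the slope a for
Leaky/Parametric ReLU and alpha for ELU are assumed example values (in a real Parametric ReLU
the slope would be learned during training).

import numpy as np

def sigmoid(x):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Outputs x for positive inputs and 0 otherwise.
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    # Small positive slope 'a' on the negative side avoids the dying ReLU problem.
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    # Smooth exponential curve on the negative side instead of a straight line.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x))
print(relu(x))         # [0.  0.  0.  1.5]
print(leaky_relu(x))   # [-0.02  -0.005  0.  1.5]
print(elu(x))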
Hyperparameters:
In artificial neural networks (ANNs), hyperparameters are parameters that are set before the
learning process begins. They control aspects of the learning process such as the network
architecture, the optimization algorithm, and the training process. Here are some basic
hyperparameters in ANNs:
1. Number of hidden layers: The number of layers in the neural network, not including the
input and output layers. More layers can potentially capture more complex patterns in the data
but can also lead to overfitting.
2. Number of neurons per hidden layer: The number of neurons (nodes) in each hidden layer.
A larger number of neurons can increase the model's capacity to learn complex patterns but
can also lead to overfitting.
3. Activation function: The function applied to the output of each neuron in the network.
Common activation functions include ReLU (Rectified Linear Unit), sigmoid, and tanh. The
choice of activation function can affect the model's ability to learn and the speed of
convergence during training.
4. Learning rate: The size of the step taken during the optimization process (e.g., gradient
descent) to update the weights of the network. A larger learning rate can lead to faster
convergence but may cause the model to overshoot the optimal weights, while a smaller
learning rate can lead to slower convergence but may result in more stable training.
5. Batch size: The number of data points used in each iteration of the training process. Larger
batch sizes can lead to faster training and smoother, less noisy gradient estimates, while
smaller batch sizes produce noisier gradient estimates and slower training, though the added
noise can sometimes help the model escape poor local minima.
6. Epochs: The number of times the entire dataset is passed through the network during
training. One epoch is completed when the model has seen all the training data once. Training
for more epochs can lead to better performance but may also increase the risk of overfitting.
7. Optimizer: The algorithm used to update the weights of the network during training.
Common optimizers include stochastic gradient descent (SGD), Adam, and RMSprop.
8. Regularization: Techniques used to prevent overfitting, such as L1 or L2 regularization,
dropout, or early stopping. These techniques introduce additional hyperparameters that
control the strength of regularization.
These are just a few examples of hyperparameters in ANNs. The choice of hyperparameters
can significantly impact the performance of the model, and it often requires experimentation
and tuning to find the optimal set of hyperparameters for a specific problem.
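As a sketch of where these hyperparameters appear in practice, the Keras snippet below wires
several of them together; every specific value here (two hidden layers of 64 units, learning
rate 0.001, batch size 32, 20 epochs, dropout rate 0.2, 10 input features) is an assumed example
choice, not a recommendation.

from tensorflow import keras

# Hyperparameters (assumed example values)
HIDDEN_LAYERS = 2
UNITS_PER_LAYER = 64
ACTIVATION = "relu"
LEARNING_RATE = 1e-3
BATCH_SIZE = 32
EPOCHS = 20
DROPOUT_RATE = 0.2          # regularization strength

model = keras.Sequential([keras.layers.Input(shape=(10,))])  # 10 input features (assumed)
for _ in range(HIDDEN_LAYERS):
    model.add(keras.layers.Dense(UNITS_PER_LAYER, activation=ACTIVATION))
    model.add(keras.layers.Dropout(DROPOUT_RATE))
model.add(keras.layers.Dense(1, activation="sigmoid"))        # binary classification output

model.compile(optimizer=keras.optimizers.Adam(learning_rate=LEARNING_RATE),
              loss="binary_crossentropy", metrics=["accuracy"])

# Training would then use the batch size and epoch count, e.g.:
# model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS,
#           validation_split=0.2)   # X_train / y_train are placeholders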