Unit-5: Introduction To Deep Learning: Artificial Neural Networks

Uploaded by

yuvraj120555

UNIT-5: INTRODUCTION TO DEEP LEARNING

➢ Artificial Neural Networks:


Artificial Neural Networks (ANNs) are built from artificial neurons, which are called units.
These units are arranged in a series of layers that together make up the network. A layer can
have anywhere from a dozen units to millions of units, depending on how complex the hidden
patterns are that the network must learn from the dataset.
Commonly, an Artificial Neural Network has an input layer, an output layer, and one or more
hidden layers. The input layer receives data from the outside world, which the neural network
needs to analyze or learn about. This data then passes through one or more hidden layers that
transform the input into a representation that is useful for the output layer. Finally, the
output layer produces the network's response to the input data provided.

In the majority of neural networks, the units of one layer are connected to the units of the
next layer. Each of these connections has a weight that determines the influence of one unit on
another. As data passes from one layer to the next, each unit combines its weighted inputs into
a new value, and the values that reach the output layer form the network's output.
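
As a minimal sketch of how weights determine the influence of one unit on another, the snippet below computes the value a single unit receives from three incoming connections. The input values, weights, and the bias term are made-up numbers chosen for illustration; they are assumptions, not values from these notes.

```python
def unit_output(inputs, weights, bias):
    """Weighted sum of the incoming values plus a bias (no activation applied)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Hypothetical values: three incoming connections into one unit.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]   # a negative weight makes that input inhibitory
bias = 0.5

z = unit_output(inputs, weights, bias)
print(z)
```

A larger absolute weight means the corresponding input has more influence on the unit's value.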

The parts of a biological neuron map onto an artificial neural network as follows: dendrites
correspond to inputs, the cell nucleus to nodes, synapses to weights, and the axon to the output.
Relationship between biological neural network and artificial neural network:

Biological Neural Network    Artificial Neural Network
Dendrites                    Inputs
Cell nucleus                 Nodes
Synapse                      Weights
Axon                         Output

➢ The architecture of an artificial neural network:

A neural network consists of a large number of artificial neurons, termed units, arranged in a
sequence of layers. Let us look at the types of layers available in an artificial neural
network.
An Artificial Neural Network primarily consists of three types of layers:
• Input Layer: As the name suggests, it accepts inputs, which the programmer may provide in
several different formats.
• Hidden Layer: The hidden layer sits between the input and output layers. It performs the
calculations that find hidden features and patterns.
• Output Layer: The input goes through a series of transformations in the hidden layers,
which finally produce the output that is conveyed by this layer.
➢ Perceptron EX-OR problem:
The XOR, or "exclusive OR", problem is a classic problem in the field of artificial
intelligence and machine learning. It cannot be solved by a single-layer perceptron, and
therefore requires a multi-layer perceptron or a deep learning model. This section explains
the XOR problem and how it can be solved using a neural network.
The XOR function
The XOR function is a binary function that takes two binary inputs and returns a binary output.
The output is true if the number of true inputs is odd, and false otherwise. In other words, it
returns true if exactly one of the inputs is true, and false otherwise.
The following table shows the truth table for the XOR function:

Input A    Input B    Output (A XOR B)
0          0          0
0          1          1
1          0          1
1          1          0

Linear separability of points


Linear separability is a concept in machine learning that refers to the ability to distinguish
between classes of data points using a straight line (in two dimensions) or a hyperplane (in
higher dimensions). If two classes of points can be perfectly separated by such a line or
hyperplane, they are considered linearly separable. This concept is fundamental to
understanding the limitations of single-layer perceptrons, which can only model linearly
separable functions.
The XOR problem
The XOR function is not linearly separable, which means we cannot draw a single straight
line to separate the inputs that yield different outputs.
To see this, plot the four input points in the plane, drawing a circle where x and y are the
same ((0,0) and (1,1), output 0) and a diamond where they are different ((0,1) and (1,0),
output 1). The circles lie on one diagonal of the unit square and the diamonds on the other, so
no single line can put both circles on one side and both diamonds on the other. Hence the XOR
function is not linearly separable. This is where the XOR problem in neural networks arises:
a single-layer perceptron, which can only draw a linear decision boundary, fails to model the
XOR function.
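
The limitation can be checked empirically. The sketch below trains a single-layer perceptron with the classic perceptron learning rule on AND (linearly separable) and on XOR. The function names, the learning rate of 1.0, and the cap of 100 epochs are illustrative assumptions, not part of these notes.

```python
def step(z):
    """Threshold activation: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def train_perceptron(samples, epochs=100, lr=1.0):
    """Perceptron learning rule on 2-input samples.

    Returns (weights, bias, converged). converged is True as soon as one
    full pass over the data produces no misclassification.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            pred = step(w[0] * x1 + w[1] * x2 + b)
            delta = target - pred
            if delta != 0:
                errors += 1
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
        if errors == 0:
            return w, b, True
    return w, b, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
print("AND learned:", and_converged)
print("XOR learned:", xor_converged)
```

The perceptron finds a separating line for AND within a few passes, while on XOR every pass contains at least one mistake, no matter how many epochs are allowed.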
Overcoming the XOR problem
The XOR problem can be overcome by using a multi-layer perceptron (MLP), a type of feedforward
neural network. An MLP consists of multiple layers of perceptrons, allowing it to model
more complex, non-linear functions. In a three-layer structure, the first layer is the input
layer. The second layer (hidden layer) transforms the original non-linearly separable problem
into a linearly separable one, which the third layer (output layer) can then solve.
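
One way to see how a hidden layer makes XOR solvable is to set the weights by hand rather than learn them. In the sketch below, the two hidden units compute OR and AND of the inputs, and the output unit fires when OR is true but AND is not. The specific weights and thresholds are hand-picked assumptions for illustration; a trained network may arrive at different values.

```python
def step(z):
    """Threshold activation used by each perceptron."""
    return 1 if z >= 0 else 0


def xor_mlp(x1, x2):
    """Two-layer perceptron for XOR with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)       # hidden unit 1: fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2: fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # output: OR and not AND


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))
```

In the hidden space (h_or, h_and) the four inputs map to (0,0), (1,0), (1,0), and (1,1), which a single line can separate.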
➢ Feedforward Propagation:
Feedforward propagation is the process in a neural network where the input data is passed
through the network's layers, from the input layer through the hidden layers to the output layer,
to generate a prediction or output. Each layer in the network performs a transformation on the
input data using weights and biases associated with the connections between neurons in
adjacent layers, typically followed by an activation function.
Here's a basic outline of feedforward propagation in a neural network:
Input Layer: The input data is fed into the input layer neurons. Each neuron in the input layer
corresponds to a feature in the input data.
Hidden Layers: The input data is then passed through one or more hidden layers. Each neuron
in a hidden layer receives input from all neurons in the previous layer, applies a weighted sum
of inputs, adds a bias term, and then applies an activation function (like ReLU, sigmoid, or
tanh) to produce an output.
Output Layer: The final hidden layer output is passed to the output layer, which processes
the inputs in a similar way as the hidden layers but typically uses a different activation function
(e.g., softmax for classification or linear for regression) to produce the final output of the
network.
Output: The output of the output layer is the prediction or output of the neural network for
the given input data. During the training phase, the weights and biases in the network are
adjusted based on the error between the predicted output and the actual target output, using
techniques like backpropagation and optimization algorithms like gradient descent, to
minimize the error and improve the network's performance.
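
The outline above can be sketched in a few lines of plain Python. The layer sizes (2 inputs, 3 hidden neurons, 1 output), the weights, and the biases below are hypothetical values chosen only to make the example runnable.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer; weights[j] holds the weights of neuron j."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]


def feedforward(x, layers):
    """Pass x through each (weights, biases, activation) layer in order."""
    a = x
    for weights, biases, activation in layers:
        a = dense(a, weights, biases, activation)
    return a


# Hypothetical network: 2 inputs -> 3 hidden neurons (ReLU) -> 1 output (sigmoid).
layers = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, -1.0, 0.5], relu),
    ([[1.0, -1.0, 0.5]], [0.0], sigmoid),
]

prediction = feedforward([1.0, 2.0], layers)
print(prediction)
```

Each layer only needs the outputs of the layer before it, which is why the computation runs strictly from input to output.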

➢ Back Propagation:
Backpropagation is the process used in training neural networks to update the weights of the
network in order to minimize the error between the predicted output and the actual target
output. It is essentially a way of calculating the gradient of the loss function with respect to
the weights of the network, which is then used to update the weights using an optimization
algorithm like gradient descent.
Here's a basic outline of the backpropagation process:
1. Forward Pass: During the forward pass (feedforward propagation), the input data is passed
through the network, and the output is calculated.
2. Calculate Loss: The output of the network is compared to the actual target output, and a
loss function is calculated to measure the difference between them. Common loss functions
include mean squared error (MSE) for regression problems and cross-entropy loss for
classification problems.
3. Backward Pass (Backpropagation): The goal of backpropagation is to calculate the gradient
of the loss function with respect to each weight in the network. This is done using the chain
rule of calculus to propagate the error backwards through the network.
4. Update Weights: Once the gradients have been calculated, the weights of the network are
updated using an optimization algorithm like gradient descent. The weights are updated in the
opposite direction of the gradient in order to minimize the loss function.
5. Repeat: Steps 1-4 are repeated for each batch of training data until the model converges,
that is, until the loss stops improving.
Backpropagation is a key component of training neural networks and allows them to learn
complex patterns in data by iteratively adjusting the weights of the network to minimize the
error.
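
The steps above can be sketched for the smallest possible case: one sigmoid neuron trained on one example with a squared-error loss. The starting weights, the learning rate of 0.5, and the 200 update steps are assumptions made for the illustration; a real network repeats the same chain-rule computation layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Forward pass of one sigmoid neuron."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, x, target):
    """Squared error for a single example: L = 0.5 * (y - t)^2."""
    return 0.5 * (forward(w, b, x) - target) ** 2


def gradients(w, b, x, target):
    """Chain rule: dL/dz = (y - t) * y * (1 - y), dL/dw_i = dL/dz * x_i, dL/db = dL/dz."""
    y = forward(w, b, x)
    dL_dz = (y - target) * y * (1.0 - y)
    return [dL_dz * xi for xi in x], dL_dz


def train(w, b, x, target, lr=0.5, steps=200):
    """Repeat forward pass, loss, backward pass, and weight update."""
    losses = []
    for _ in range(steps):
        losses.append(loss(w, b, x, target))
        grad_w, grad_b = gradients(w, b, x, target)
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]  # step against the gradient
        b = b - lr * grad_b
    return w, b, losses


x, target = [1.0, 2.0], 1.0
w0, b0 = [0.1, -0.2], 0.0
w, b, losses = train(w0, b0, x, target)
print(losses[0], losses[-1])
```

Because each update moves the weights against the gradient, the recorded loss shrinks as training proceeds.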

➢ LOSSES:
In artificial neural networks (ANNs), losses are used to measure the difference between the
predicted output and the actual target. The choice of loss function depends on the task at hand
(e.g., regression, classification) and the network's output (e.g., scalar, vector, matrix).
• Mean Squared Error (MSE): Commonly used for regression tasks, MSE calculates
the average of the squared differences between predicted and actual values. It
penalizes large errors more than small ones.
• Binary Cross-Entropy: Used for binary classification tasks, this loss function
measures the difference between two probability distributions (predicted and actual)
for a binary outcome.
• Categorical Cross-Entropy: Used for multiclass classification tasks, categorical
cross-entropy calculates the difference between predicted and actual class
probabilities across all classes.
• Sparse Categorical Cross-Entropy: Similar to categorical cross-entropy but used
when the target labels are integers (e.g., 0, 1, 2) instead of one-hot encoded vectors.
• Kullback-Leibler Divergence (KL Divergence): Measures how one probability
distribution diverges from a second, expected probability distribution. It's often used
in scenarios where you have a target distribution and want to measure how well your
model captures it.
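
The losses listed above can be written out directly. The following is a minimal sketch in plain Python; the small constant EPS that guards against log(0) and the function names are assumptions for this example, and library implementations differ in details such as averaging and clipping.

```python
import math

EPS = 1e-12  # assumed guard against log(0)


def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred):
    """y_true holds 0/1 labels, y_pred the predicted probability of class 1."""
    total = sum(
        t * math.log(p + EPS) + (1 - t) * math.log(1 - p + EPS)
        for t, p in zip(y_true, y_pred)
    )
    return -total / len(y_true)


def categorical_cross_entropy(one_hot, probs):
    """Single example with a one-hot target vector."""
    return -sum(t * math.log(p + EPS) for t, p in zip(one_hot, probs))


def sparse_categorical_cross_entropy(label, probs):
    """Same loss as above, but the target is an integer class index."""
    return -math.log(probs[label] + EPS)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q) if pi > 0)


probs = [0.2, 0.7, 0.1]
print(mse([1.0, 2.0], [1.0, 4.0]))
print(categorical_cross_entropy([0, 1, 0], probs))
print(sparse_categorical_cross_entropy(1, probs))
```

The categorical and sparse categorical versions give the same value for the same example; they differ only in how the target is encoded.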
➢ Activation Function:
1. Sigmoid / Logistic Activation Function
This function takes any real value as input and outputs values in the range of 0 to 1. The larger
the input (more positive), the closer the output value will be to 1.0, whereas the smaller the
input (more negative), the closer the output will be to 0.0.
Mathematically it can be represented as:
$$f(x) = \frac{1}{1 + e^{-x}}$$
It is commonly used for models that have to predict a probability as an output. Since a
probability always lies between 0 and 1, sigmoid is a natural choice because of its range.
The function is differentiable and provides a smooth gradient, which prevents jumps in output
values. This is reflected in the S-shape of the sigmoid activation function.

2. Tanh Function (Hyperbolic Tangent)


The tanh function is very similar to the sigmoid/logistic activation function, and even has the
same S-shape, with the difference that its output range is -1 to 1. In tanh, the larger the input
(more positive), the closer the output value will be to 1.0, whereas the smaller the input (more
negative), the closer the output will be to -1.0.
Mathematically it can be represented as:
$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Advantages of using this activation function are:
• The output of the tanh activation function is zero-centered; hence we can easily map
the output values as strongly negative, neutral, or strongly positive.
• It is usually used in hidden layers of a neural network because its values lie between -1
and 1; therefore, the mean for the hidden layer comes out to be 0 or very close to it. This
helps in centering the data and makes learning for the next layer much easier.

3. ReLU Function
ReLU stands for Rectified Linear Unit. Although it gives the impression of a linear function,
ReLU is non-linear; it has a derivative for every input except exactly 0 and so allows for
backpropagation, while remaining computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same
time.
A neuron is deactivated (outputs 0) only if the output of its linear transformation is less
than 0.
Mathematically it can be represented as:
$$f(x) = \max(0, x)$$
The advantages of using ReLU as an activation function are as follows:
• Since only a certain number of neurons are activated, the ReLU function is far more
computationally efficient when compared to the sigmoid and tanh functions.
• ReLU accelerates the convergence of gradient descent towards a minimum of the loss
function because it is linear and non-saturating for positive inputs.

4. Leaky ReLU Function:

Leaky ReLU is an improved version of the ReLU function that addresses the dying ReLU problem,
in which neurons that only receive negative inputs output 0 and stop learning. It has a small
positive slope in the negative area.

Mathematically it can be represented as:
$$f(x) = \max(\alpha x, x)$$
where α is a small fixed constant (0.01 is a common choice).

The advantages of Leaky ReLU are the same as those of ReLU, with the addition that it enables
backpropagation even for negative input values.
By making this minor modification for negative input values, the gradient of the left side of
the graph comes out to be a non-zero value. Therefore, we would no longer encounter dead
neurons in that region.

5. Parametric ReLU Function:

Parametric ReLU is another variant of ReLU that aims to solve the problem of the gradient
becoming zero for the left half of the axis.
This function treats the slope of the negative part of the function as a parameter a. The most
appropriate value of a is learnt through backpropagation.

Mathematically it can be represented as:
$$f(x) = \max(a x, x)$$
where "a" is the slope parameter for negative values.

The parametric ReLU function is used when the leaky ReLU function still fails at solving
the problem of dead neurons, and the relevant information is not successfully passed to the
next layer.
This function's limitation is that it may perform differently for different problems depending
upon the value of the slope parameter a.
6. Exponential Linear Units (ELUs) Function:

Exponential Linear Unit, or ELU for short, is also a variant of ReLU that modifies the slope
of the negative part of the function.
ELU uses an exponential curve to define the negative values, unlike the Leaky ReLU and
Parametric ReLU functions, which use a straight line.

Mathematically it can be represented as:
$$f(x) = \begin{cases} x & \text{if } x \ge 0 \\ \alpha \left(e^{x} - 1\right) & \text{if } x < 0 \end{cases}$$

ELU is a strong alternative to ReLU because of the following advantages:
• ELU saturates smoothly towards -α for large negative inputs, whereas ReLU has a sharp
corner at 0.
• It avoids the dead ReLU problem by using an exponential curve for negative input values.
This helps the network nudge weights and biases in the right direction.
The limitations of the ELU function are as follows:
• It increases the computational time because of the exponential operation involved.
• No learning of the α value takes place.
• It can suffer from the exploding gradient problem.
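
For reference, the six activation functions described in this section can be sketched in plain Python as below. The default slopes (alpha=0.01 for Leaky ReLU, alpha=1.0 for ELU) are common choices assumed for the example, and the branch in sigmoid is only there to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Output in (0, 1); computed in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    """Output in (-1, 1), zero-centered."""
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Small fixed slope alpha for negative inputs."""
    return x if x >= 0 else alpha * x


def prelu(x, a):
    """Same shape as Leaky ReLU, but the slope a is a learned parameter."""
    return x if x >= 0 else a * x


def elu(x, alpha=1.0):
    """Exponential curve for negative inputs, approaching -alpha."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x), prelu(x, 0.25), elu(x))
```

Comparing the rows for x = -2.0 shows the differences between the ReLU variants: ReLU outputs 0, while the leaky, parametric, and exponential versions keep a non-zero output and gradient.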
➢ GPU Training:
Graphics processing units (GPUs), originally developed for accelerating graphics processing,
can dramatically speed up computational processes for deep learning. They are an essential
part of a modern artificial intelligence infrastructure, and new GPUs have been developed and
optimized specifically for deep learning.
Training machine learning models on GPUs can significantly speed up the process, especially
for deep learning models that require intensive computations. To train models on GPUs, you
can use libraries like TensorFlow or PyTorch, which provide GPU support out of the box.
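
As a small sketch of what GPU support looks like in practice, the PyTorch snippet below picks a GPU when one is available and otherwise falls back to the CPU. It assumes PyTorch is installed; the layer size and batch shape are arbitrary values for illustration.

```python
import torch

# Use the GPU if PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical tiny model and batch; both must live on the same device.
model = torch.nn.Linear(4, 2).to(device)
batch = torch.randn(8, 4, device=device)

output = model(batch)
print(output.shape, output.device)
```

The same code runs unchanged on either device, which is why the device is usually chosen once at the top of a training script.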

➢ BASIC HYPERPARAMETERS IN ANN:

In artificial neural networks (ANNs), hyperparameters are parameters that are set before the
learning process begins. They control aspects of the learning process such as the network
architecture, the optimization algorithm, and the training process. Here are some basic
hyperparameters in ANNs:
1. Number of hidden layers: The number of layers in the neural network, not including the
input and output layers. More layers can potentially capture more complex patterns in the data
but can also lead to overfitting.
2. Number of neurons per hidden layer: The number of neurons (nodes) in each hidden layer.
A larger number of neurons can increase the model's capacity to learn complex patterns but
can also lead to overfitting.
3. Activation function: The function applied to the output of each neuron in the network.
Common activation functions include ReLU (Rectified Linear Unit), sigmoid, and tanh. The
choice of activation function can affect the model's ability to learn and the speed of
convergence during training.
4. Learning rate: The size of the step taken during the optimization process (e.g., gradient
descent) to update the weights of the network. A larger learning rate can lead to faster
convergence but may cause the model to overshoot the optimal weights, while a smaller
learning rate can lead to slower convergence but may result in more stable training.
5. Batch size: The number of data points used in each iteration of the training process. Larger
batch sizes give less noisy, more accurate gradient estimates and make better use of parallel
hardware, but need more memory. Smaller batch sizes give noisier gradient estimates and need
more iterations per epoch, although the noise can sometimes help the model generalize.
6. Epochs: The number of times the entire dataset is passed through the network during
training. One epoch is completed when the model has seen all the training data once. Training
for more epochs can lead to better performance but may also increase the risk of overfitting.
7. Optimizer: The algorithm used to update the weights of the network during training.
Common optimizers include stochastic gradient descent (SGD), Adam, and RMSprop.
8. Regularization: Techniques used to prevent overfitting, such as L1 or L2 regularization,
dropout, or early stopping. These techniques introduce additional hyperparameters that
control the strength of regularization.
These are just a few examples of hyperparameters in ANNs. The choice of hyperparameters
can significantly impact the performance of the model, and it often requires experimentation
and tuning to find a good set of hyperparameters for a specific problem.
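
One convenient way to keep these settings together is a small configuration object. The sketch below is an assumption about how such a container might look; the class name, field names, and default values are hypothetical and not tied to any particular framework. It also shows how batch size and epochs determine the number of weight updates.

```python
import math
from dataclasses import dataclass


@dataclass
class HyperParams:
    """Hypothetical container for the hyperparameters listed above."""
    hidden_layers: int = 2
    neurons_per_layer: int = 64
    activation: str = "relu"
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 10
    optimizer: str = "adam"
    l2_strength: float = 0.0   # regularization strength (0 disables it)
    dropout_rate: float = 0.0


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


hp = HyperParams()
steps = iterations_per_epoch(1000, hp.batch_size)
total_updates = steps * hp.epochs
print(steps, total_updates)
```

With 1000 training samples, a batch size of 32 gives 32 iterations per epoch, so 10 epochs correspond to 320 weight updates.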

➢ SELECTION OF NEURONS IN ANN:


Selecting the number of neurons in each layer of an artificial neural network (ANN) is a crucial
step in designing an effective model. The number of neurons can affect the model's capacity
to learn complex patterns, its ability to generalize to unseen data, and its computational
efficiency. Here are some guidelines for selecting the number of neurons in an ANN:
Input Layer: The number of neurons in the input layer should match the dimensionality of
your input data. Each neuron in the input layer represents a feature or attribute of the input
data.
Output Layer: The number of neurons in the output layer depends on the type of problem
you are trying to solve. For binary classification problems, a single neuron with a sigmoid
activation function is often used. For multi-class classification, you can use one neuron per
class with a softmax activation function. For regression problems, a single neuron without an
activation function can be used.
Hidden Layers: The number of neurons in the hidden layers is a more complex decision. Too
few neurons may result in underfitting, where the model cannot capture the complexity of the
data. Too many neurons may result in overfitting, where the model learns noise in the training
data instead of the underlying patterns.
One common approach is to start with a single hidden layer and gradually increase the number
of neurons until the model's performance on a validation set stops improving or starts to get
worse.
Another approach is to use a "rule of thumb," which suggests using a number of hidden neurons
between the number of input neurons and output neurons, or a multiple of this number.
For more complex problems, such as image recognition or natural language processing, deeper
architectures with multiple hidden layers and varying numbers of neurons in each layer are
often used.
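
The input-layer and output-layer guidelines above are mechanical enough to express as code. The helper names and task labels below are hypothetical, chosen for this sketch; the hidden-layer sizes still have to be supplied by the user.

```python
def output_layer_spec(task, n_classes=None):
    """Return (number of output neurons, activation name) for a task type."""
    if task == "binary_classification":
        return 1, "sigmoid"
    if task == "multiclass_classification":
        if n_classes is None or n_classes < 2:
            raise ValueError("multiclass classification needs n_classes >= 2")
        return n_classes, "softmax"
    if task == "regression":
        return 1, None  # no activation (linear output)
    raise ValueError(f"unknown task: {task}")


def layer_sizes(n_features, hidden_sizes, task, n_classes=None):
    """Input size matches the feature count; output size follows the task."""
    out_units, _ = output_layer_spec(task, n_classes)
    return [n_features, *hidden_sizes, out_units]


print(layer_sizes(4, [8], "multiclass_classification", n_classes=3))
print(output_layer_spec("binary_classification"))
```

Only the middle entries of the returned list are a design decision; the first and last are fixed by the data and the task.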

➢ GREEDY SEARCH IN ANN:


Greedy search is a simple optimization algorithm that makes the locally best choice at each
step in the hope of finding a global optimum, although it is not guaranteed to find one. In the
context of optimizing the number of layers in an Artificial Neural Network (ANN), a greedy
search approach would involve iteratively adding or removing layers from a base model and
evaluating the performance of each modified model based on a chosen metric, such as accuracy
or loss.
Here's a basic outline of how a greedy search for optimizing the number of layers in an ANN
might work:
• Define a Base Model: Start with a simple base model with a fixed number of layers
and nodes per layer.
• Define a Greedy Search Criterion: Choose a metric to evaluate the performance of
your model, such as accuracy or loss.
• Iterative Process:
1. Initialize the current best model with the base model.
2. For each iteration, consider adding or removing a layer from the current best
model and evaluate its performance using the chosen metric.
3. If the performance improves, update the current best model.
4. Repeat this process for a fixed number of iterations or until a stopping criterion
is met.
• Finalize Model: Once the iterative process is complete, finalize the model with the
best configuration found during the search.
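
The iterative process above can be sketched as follows. In a real search, evaluate would build and train a model with the given number of layers and return its validation score; here a made-up scoring function (toy_validation_accuracy, which peaks at 3 layers) stands in for that step so the example runs on its own. All names are hypothetical.

```python
def greedy_layer_search(evaluate, base_layers=1, max_iterations=10, min_layers=1):
    """Greedily add or remove one layer at a time while the score improves."""
    best_layers = base_layers
    best_score = evaluate(best_layers)
    for _ in range(max_iterations):
        # Candidate moves: remove one layer or add one layer.
        candidates = [n for n in (best_layers - 1, best_layers + 1) if n >= min_layers]
        top_score, top_layers = max((evaluate(n), n) for n in candidates)
        if top_score <= best_score:
            break  # stopping criterion: no neighbouring model is better
        best_score, best_layers = top_score, top_layers
    return best_layers, best_score


def toy_validation_accuracy(n_layers):
    """Stand-in for training and validating a model (assumed, not real data)."""
    return 0.9 - 0.05 * abs(n_layers - 3)


best_layers, best_score = greedy_layer_search(toy_validation_accuracy)
print(best_layers, best_score)
```

Because the search only looks one step ahead, it can stop at a local optimum if the score does not improve smoothly with depth.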

➢ Random Access in ANN:

Randomly accessing layers in an Artificial Neural Network (ANN) typically means accessing or
modifying layers in a non-sequential manner. In most neural network frameworks, layers are
added sequentially, and accessing or modifying them in arbitrary order is not the usual way of
working with them. However, you can achieve similar functionality by maintaining a list or
dictionary of layers and accessing them by index or name.
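
A minimal sketch of this idea in plain Python is shown below. Each layer is represented by a small dictionary of settings; the layer names and fields are hypothetical, and a real framework would store layer objects instead.

```python
# Hypothetical layer descriptions, kept in insertion (sequential) order.
layers = {
    "hidden_1": {"units": 8, "activation": "relu"},
    "hidden_2": {"units": 4, "activation": "relu"},
    "output": {"units": 1, "activation": "sigmoid"},
}

# Access by name, in any order, and modify in place.
layers["hidden_2"]["units"] = 16

# Access by position: Python dictionaries preserve insertion order.
layer_names = list(layers)
second_layer = layers[layer_names[1]]
print(layer_names[1], second_layer)
```

Keeping the layers in a dictionary gives both views at once: the keys allow lookup by name, and the insertion order preserves the sequential structure of the network.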
