Introduction to Deep Learning
Learning is “a process that leads to change, which occurs as a result of experience and increases the potential for
improved performance and future learning”.
Intelligence has been defined in many ways: the capacity for abstraction, logic, understanding, self-
awareness, learning, emotional knowledge, reasoning, planning, creativity, critical thinking, and problem-
solving. More generally, it can be described as the ability to perceive or infer information, and to retain it
as knowledge to be applied towards adaptive behaviours within an environment or context.
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to
think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits
associated with a human mind such as learning and problem-solving.
The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best
chance of achieving a specific goal.
A subset of artificial intelligence is machine learning (ML), which refers to the concept that computer programs
can automatically learn from and adapt to new data without being assisted by humans.
Deep learning techniques enable this automatic learning through the absorption of huge amounts of unstructured
data such as text, images, or video.
A machine learning system learns from historical data, builds prediction models, and predicts the output whenever
it receives new data. The accuracy of the predicted output depends in part on the amount of data: more data
generally helps to build a better model, which predicts the output more accurately.
Suppose we have a complex problem that requires predictions. Instead of writing code for it by hand, we feed the
data to generic algorithms; with the help of these algorithms, the machine builds the logic from the data and
predicts the output. Machine learning has changed our way of thinking about such problems. The block diagram
below explains the working of a machine learning algorithm:
3. It is a data-driven technology.
4. Machine learning is similar to data mining, as it also deals with huge amounts of data.
These ML algorithms help to solve different business problems like Regression, Classification, Forecasting,
Clustering, and Associations, etc.
Based on the methods and way of learning, machine learning is divided into four main types:
1. Supervised Learning
2. Unsupervised Learning
3. Semi-Supervised Learning
4. Reinforcement Learning
Supervised machine learning is based on supervision. It means in the supervised learning technique, we train the
machines using the "labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled
data specifies that some of the inputs are already mapped to the output. More precisely, we first train the
machine with the inputs and the corresponding outputs, and then we ask the machine to predict the output
using the test dataset.
Unsupervised learning is different from the Supervised learning technique; as its name suggests, there is no need
for supervision. It means, in unsupervised machine learning, the machine is trained using the unlabeled dataset,
and the machine predicts the output without any supervision. In unsupervised learning, the models are trained
with the data that is neither classified nor labelled, and the model acts on that data without any supervision.
Semi-Supervised learning is a type of Machine Learning algorithm that lies between Supervised and
Unsupervised machine learning. It represents the intermediate ground between Supervised (With Labelled
training data) and Unsupervised learning (with no labelled training data) algorithms and uses the combination of
labelled and unlabeled datasets during the training period.
Although semi-supervised learning is the middle ground between supervised and unsupervised learning and
operates on data that includes a few labels, the data is mostly unlabeled. Labels are costly to obtain, so in
practice (for example, in corporate settings) only a few labelled examples may be available. It differs from both
supervised and unsupervised learning, which are defined by the presence or absence of labels.
In reinforcement learning, there is no labelled data like supervised learning, and agents learn from their
experiences only.
The limitations of machine learning models depend on the particular model, the problem being solved, and the data
set used to train the model. Generally speaking, machine learning models can be limited by their accuracy, by the
types of problems they can solve, and by the quality of the data used to train them.
There are many supervised learning algorithms, but all of them have limitations. One of the biggest limitations is
that the algorithms can only learn so much from the data that is provided. In addition, the algorithms are also very
reliant on the data being correctly labeled. If the data is not correctly labeled, the algorithms will not produce
accurate results.
Unsupervised learning is a type of machine learning where the algorithm is not provided with a set of known
inputs and outputs, and must learn from the data itself. The main limitation of unsupervised learning is that it is
more difficult for the algorithm to learn from the data, and often produces poorer results.
1. The quality of the results depends on the quality of the training data. If the training data is poor, the
results will also be poor.
2. Semi-supervised learning is less accurate than supervised learning.
3. It is more difficult to use semi-supervised learning than supervised learning.
Reinforcement learning is a machine learning technique that allows agents to learn how to achieve a goal or satisfy
a condition by interacting with an environment.
1. It can be difficult to determine the appropriate reinforcement learning algorithm to use for a given
problem.
2. It can be difficult to find a good learning rate and other optimization parameters for a reinforcement
learning algorithm.
3. Reinforcement learning can be slow to learn, especially in complex environments.
4. Reinforcement learning can be susceptible to “catastrophic forgetting,” where previously learned knowledge is
lost as the agent learns from new experience.
5. Reinforcement learning can be sensitive to changes in the environment, which can lead to unstable
or unpredictable behaviour.
More broadly, one limitation is that machine learning cannot always accurately
predict outcomes for certain situations. For example, a machine may be able to predict that a customer is likely to
purchase a product, but may not be able to accurately predict which product the customer will purchase.
1. Machine learning models are often opaque, making it difficult to understand why a particular
prediction was made.
2. Machine learning models are often unstable, meaning that they can produce different results when
trained on different data sets.
3. Machine learning models are often biased, meaning that they can produce inaccurate results when
applied to data sets that don’t match the data set on which the model was trained.
4. Machine learning models are often difficult to customize, meaning that it can be hard to change their
parameters or to adapt them to new data sets.
5. Machine learning models are often expensive to train, meaning that it can take a lot of time and
computational resources to build a model that is accurate.
6. Machine learning models are often vulnerable to learning from noise in the data, which can lead to
inaccurate predictions.
7. Machine learning models are often sensitive to the order in which the data is presented to them,
meaning that they can produce different results if the data is rearranged.
8. Machine learning models are often sensitive to the scale of the data, meaning that they can produce
different results if the data is aggregated or disaggregated.
9. Machine learning models are often sensitive to the distribution of the data, meaning that they can
produce different results if the data is sorted in a different way.
10. Machine learning models are often sensitive to the selection of training data, meaning that the results
of the model can be biased if the training data is not representative of the data set that will be used to
make predictions.
Machine learning has transformed big data and its potential applications, and it is growing day by day. It has the
ability to learn from past experience and make predictions on future events. Despite these impressive capabilities,
machine learning has limitations. One of its key limitations is its inability to account for unstructured data.
Additionally, machine learning is only as good as the data it is trained on. If the data is inaccurate or biased, the
machine learning algorithm will produce inaccurate results. Lastly, machine learning can be bypassed by human
beings who are better at understanding natural language and recognizing patterns.
The classic machine learning methods were initially incredibly efficient and successful when
there was little data. However, as the volume of data reaches the millions, their performance
approaches a plateau and stays at the same level even as the data volume grows. With a bigger
data set, the performance of conventional neural networks improves, but eventually reaches a
plateau. Only deep learning neural networks continue to perform better as the size of the data
grows. This is why deep learning neural networks are receiving a lot of research interest.
Weights control the signal (or the strength of the connection) between two neurons. In other words, a weight
decides how much influence the input will have on the output. A bias is a learned constant; it is often modelled
as the weight on an additional input into the next layer that always has the value 1. A bias can be positive or
negative, increasing or decreasing a neuron's output.
In the context of neural networks and deep learning, weights and biases are fundamental components that play a
crucial role in the functioning and training of neurons.
A neuron in a neural network is a mathematical function that takes multiple inputs, applies weights to these inputs,
sums them up, adds a bias, and then applies an activation function to produce an output. This output is then passed
on to the next layer of neurons in the network.
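The computation of a single neuron described above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name neuron_output and the example numbers are assumptions made here, and tanh is used only as one possible activation function.

```python
import math

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs, plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only (not taken from the notes).
inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.5

# Weighted sum: 0.5*1.0 + (-0.25)*2.0 + 0.5 = 0.5, then tanh is applied.
y = neuron_output(inputs, weights, bias, math.tanh)
```

With these numbers the two weighted inputs cancel out, so the output is driven entirely by the bias.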
Weights:
Weights are the parameters that the neural network learns during the training process. Each input to a neuron is
associated with a weight. The weight represents the strength of the connection between the input and the neuron.
A higher weight means the input has more influence on the neuron's output, and a lower weight means less
influence.
During training, these weights are adjusted using optimization algorithms (e.g., gradient descent) to minimize a
defined loss function, effectively tuning the network to make accurate predictions.
Bias:
The bias is an additional parameter associated with each neuron. It allows the activation function to shift left or
right, providing the model with more flexibility. The bias helps the neuron to activate even when the weighted
sum of inputs is zero.
In summary, weights determine the strength of connections between neurons and are adjusted during training to
optimize performance. The bias helps in adjusting the activation function and provides the neuron with the ability
to activate even for small input values. Together, weights and biases enable the neural network to learn and
generalize from input data to produce meaningful output predictions.
A neuron calculates a weighted total of its inputs and adds a bias to it; the activation function is then applied
to this value to decide whether, and how strongly, the neuron should be activated. The activation function’s goal
is to introduce non-linearity into a neuron’s output.
A neural network without an activation function is essentially a linear regression model, however many layers
it has. Activation functions perform non-linear computations on the input of a neural network, enabling it to
learn and do more complex tasks.
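To see why layers without an activation function add no expressive power, the sketch below stacks two linear layers and compares them with a single linear layer. It is a simplified, hedged example: each layer has one scalar input and one scalar output, and the names linear_layer, two_linear_layers and collapsed_layer are assumptions made for illustration.

```python
def linear_layer(x, w, b):
    """A layer with no activation function: output = w * x + b."""
    return w * x + b

def two_linear_layers(x, w1, b1, w2, b2):
    """Feed the output of one linear layer into a second linear layer."""
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

def collapsed_layer(x, w1, b1, w2, b2):
    """The same mapping written as ONE linear layer.

    w2 * (w1 * x + b1) + b2 == (w2 * w1) * x + (w2 * b1 + b2)
    """
    return linear_layer(x, w2 * w1, w2 * b1 + b2)

stacked = two_linear_layers(3.0, 2.0, 1.0, -0.5, 4.0)
single = collapsed_layer(3.0, 2.0, 1.0, -0.5, 4.0)
```

Both calls give the same value, so the second layer has not added anything that a single linear layer could not already represent.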
Activation Functions
An activation function in a neural network defines how the weighted sum of the input is transformed into an output
from a node or nodes in a layer of the network.
Sometimes the activation function is called a “transfer function.” If the output range of the activation function is
limited, then it may be called a “squashing function.” Many activation functions are nonlinear and may be referred
to as the “nonlinearity” in the layer or the network design.
The choice of activation function has a large impact on the capability and performance of the neural network, and
different activation functions may be used in different parts of the model.
Technically, the activation function is used within or after the internal processing of each node in the network,
although networks are designed to use the same activation function for all nodes in a layer.
A network may have three types of layers: input layers that take raw input from the domain, hidden layers that
take input from another layer and pass output to another layer, and output layers that make a prediction.
All hidden layers typically use the same activation function. The output layer will typically use a different
activation function from the hidden layers and is dependent upon the type of prediction required by the model.
Activation functions are also typically differentiable, meaning the first-order derivative can be calculated for a
given input value. This is required given that neural networks are typically trained using the backpropagation of
error algorithm that requires the derivative of prediction error in order to update the weights of the model.
The gradient descent algorithm iteratively calculates the next point using the gradient at the current position:
it scales the gradient (by a learning rate) and subtracts the obtained value from the current position (makes a
step). It subtracts the value because we want to minimise the function (to maximise it, we would add instead).
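The update step can be sketched as follows. This is a minimal example under stated assumptions: the function being minimised, f(x) = (x - 3)^2 with gradient 2(x - 3), is chosen here purely for illustration, as are the learning rate and the number of steps.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # subtract because we are minimising
    return x

# Illustrative function: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
```

Starting from 0, each step moves the point closer to 3, where the gradient is zero and the function is at its minimum.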
Backpropagation, short for "backward propagation of errors," is an algorithm for supervised learning of
artificial neural networks using gradient descent. Given an artificial neural network and an error function, the
method calculates the gradient of the error function with respect to the neural network's weights.
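For a single sigmoid neuron, the gradient calculation can be written out by hand using the chain rule. The sketch below is an illustration only: it assumes one input, a squared-error function 0.5 * (y - target)^2, and hypothetical names (forward, loss, gradients) that do not come from any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Output of one sigmoid neuron with a single input."""
    return sigmoid(w * x + b)

def loss(w, b, x, target):
    """Squared error (an assumption made for this sketch)."""
    return 0.5 * (forward(w, b, x) - target) ** 2

def gradients(w, b, x, target):
    """Chain rule: dL/dw = (y - target) * y * (1 - y) * x, dL/db = same without x."""
    y = forward(w, b, x)
    delta = (y - target) * y * (1.0 - y)
    return delta * x, delta

# Illustrative values only.
w, b, x, target = 0.5, -0.2, 1.5, 1.0
dw, db = gradients(w, b, x, target)
```

A useful sanity check is to compare these values with a numerical estimate obtained by nudging w or b slightly and measuring the change in the error.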
There are many different types of activation functions used in neural networks, although perhaps only a small
number of functions are used in practice for hidden and output layers.
A hidden layer in a neural network is a layer that receives input from another layer (such as another hidden layer
or an input layer) and provides output to another layer (such as another hidden layer or an output layer).
A hidden layer does not directly contact input data or produce outputs for a model, at least in general.
Typically, a differentiable nonlinear activation function is used in the hidden layers of a neural network. This
allows the model to learn more complex functions than a network trained using a linear activation function.
In order to get access to a much richer hypothesis space that would benefit from deep representations, you need
a non-linearity, or activation function.
There are perhaps three activation functions you may want to consider for use in hidden layers; they are:
• Rectified Linear Activation (ReLU)
• Logistic (Sigmoid)
• Hyperbolic Tangent (Tanh)
The rectified linear activation function, or ReLU activation function, is perhaps the most common function used
for hidden layers.
It is common because it is both simple to implement and effective at overcoming the limitations of other previously
popular activation functions, such as Sigmoid and Tanh. Specifically, it is less susceptible to vanishing
gradients that prevent deep models from being trained, although it can suffer from other problems like saturated
or “dead” units.
The ReLU function is calculated as follows:
• max(0.0, x)
This means that if the input value (x) is negative, then a value 0.0 is returned, otherwise, the value is returned.
When using the ReLU function for hidden layers, it is a good practice to use a “He Normal” or “He Uniform”
weight initialization and scale input data to the range 0-1 (normalize) prior to training.
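The ReLU calculation is short enough to write directly. The sketch below also includes its gradient, which helps explain the “dead” unit problem mentioned above; the name relu_grad and the choice of gradient 0 at exactly x = 0 are assumptions made here.

```python
def relu(x):
    """Rectified linear activation: max(0.0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise.

    The value at exactly x == 0 is a convention; 0 is assumed here.
    """
    return 1.0 if x > 0 else 0.0

outputs = [relu(v) for v in [-2.0, 0.0, 3.5]]
```

A unit whose input stays negative always has gradient 0, so its weights stop being updated; this is what is meant by a “dead” unit.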
The sigmoid (logistic) function takes any real value as input and outputs values in the range 0 to 1. The larger the input (more
positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the
output will be to 0.0.
When using the Sigmoid function for hidden layers, it is a good practice to use a “Xavier Normal” or “Xavier
Uniform” weight initialization (also referred to as Glorot initialization, named for Xavier Glorot) and scale input
data to the range 0-1 (i.e. the range of the activation function) prior to training.
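The sigmoid function is usually written as 1.0 / (1.0 + e^-x). A minimal sketch follows; splitting the calculation into two branches is an implementation choice made here to avoid numeric overflow for very negative inputs, and it does not change the result.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x).
    e = math.exp(x)
    return e / (1.0 + e)

middle = sigmoid(0.0)      # exactly halfway between 0 and 1
high = sigmoid(10.0)       # close to 1.0
low = sigmoid(-10.0)       # close to 0.0
```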
Tanh Hidden Layer Activation Function
The hyperbolic tangent activation function is also referred to simply as the Tanh (also “tanh” and “TanH”)
function.
It is very similar to the sigmoid activation function and even has the same S-shape.
The function takes any real value as input and outputs values in the range -1 to 1. The larger the input (more
positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the
output will be to -1.0.
When using the TanH function for hidden layers, it is a good practice to use a “Xavier Normal” or “Xavier
Uniform” weight initialization (also referred to as Glorot initialization, named for Xavier Glorot) and scale input
data to the range -1 to 1 (i.e. the range of the activation function) prior to training.
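The Tanh function is usually written as (e^x - e^-x) / (e^x + e^-x). The sketch below writes this out and compares it with math.tanh from the Python standard library; the name tanh_from_exp is an assumption made for illustration, and in practice the library function would normally be used directly.

```python
import math

def tanh_from_exp(x):
    """Hyperbolic tangent from its definition (fine for moderate x)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

samples = [-2.0, 0.0, 2.0]
by_definition = [tanh_from_exp(v) for v in samples]
by_library = [math.tanh(v) for v in samples]
```

Unlike the sigmoid, the output is centred on zero: an input of 0 gives an output of 0, and negative inputs give negative outputs.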
A neural network will almost always have the same activation function in all hidden layers.
Traditionally, the sigmoid activation function was the default activation function in the 1990s. From perhaps
the mid to late 1990s through the 2010s, the Tanh function was the default activation function for hidden layers.
Both the sigmoid and Tanh functions can make the model more susceptible to problems during training, via the
so-called vanishing gradients problem.
The activation function used in hidden layers is typically chosen based on the type of neural network architecture.
Modern neural network models with common architectures, such as MLP and CNN, will make use of the ReLU
activation function, or extensions.
Recurrent networks still commonly use Tanh or sigmoid activation functions, or even both. For example, the
LSTM commonly uses the Sigmoid activation for recurrent connections and the Tanh activation for output.
The output layer is the layer in a neural network model that directly outputs a prediction.
There are perhaps three activation functions you may want to consider for use in the output layer; they are:
• Linear
• Logistic (Sigmoid)
• Softmax
This is not an exhaustive list of activation functions used for output layers, but they are the most commonly used.
The linear activation function is also called “identity” (multiplied by 1.0) or “no activation.”
This is because the linear activation function does not change the weighted sum of the input in any way and instead
returns the value directly.
max(2, 3) = 3
A = [2, 3]
A[0] = 2
A[1] = 3
argmax(A) = 1 (the index of the largest value), written as the vector [0, 1]
The softmax function outputs a vector of values that sum to 1.0 that can be interpreted as probabilities of class
membership.
It is related to the argmax function that outputs a 0 for all options and 1 for the chosen option. Softmax is a “softer”
version of argmax that allows a probability-like output of a winner-take-all function.
As such, the input to the function is a vector of real values and the output is a vector of the same length with values
that sum to 1.0 like probabilities.
• e^x / sum(e^x)
Where x is a vector of outputs and e is a mathematical constant that is the base of the natural logarithm.
Target labels used to train a model with the softmax activation function in the output layer will be vectors with 1
for the target class and 0 for all other classes.
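The softmax calculation can be sketched as below. Subtracting the largest input before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result, because the same factor cancels in the numerator and denominator.

```python
import math

def softmax(x):
    """e^x / sum(e^x) for a list of real values."""
    shift = max(x)  # stability step; cancels out in the ratio
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative inputs only.
probs = softmax([1.0, 2.0, 3.0])
predicted_class = probs.index(max(probs))  # argmax of the probabilities
```

The outputs sum to 1.0, and the largest input receives the largest probability, which is why softmax can be read as a “softer” argmax.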
You must choose the activation function for your output layer based on the type of prediction problem that you
are solving.
For example, you may divide prediction problems into two main groups, predicting a categorical variable
(classification) and predicting a numerical variable (regression).
If your problem is a regression problem, you should use a linear activation function.
Predicting a probability is not a regression problem; it is classification. In all cases of classification, your model
will predict the probability of class membership (e.g. probability that an example belongs to each class) that you
can convert to a crisp class label by rounding (for sigmoid) or argmax (for softmax).
If there are two mutually exclusive classes (binary classification), then your output layer will have one node and
a sigmoid activation function should be used. If there are more than two mutually exclusive classes (multiclass
classification), then your output layer will have one node per class and a softmax activation should be used. If
there are two or more mutually inclusive classes (multilabel classification), then your output layer will have one
node for each class and a sigmoid activation function is used.
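The conversion from predicted probabilities to crisp class labels can be sketched for the three cases above. The function names and the 0.5 threshold are assumptions made for illustration; a threshold is used in place of rounding so that the behaviour at exactly 0.5 is explicit.

```python
def binary_label(probability, threshold=0.5):
    """One sigmoid output node: class 1 if the probability reaches the threshold."""
    return 1 if probability >= threshold else 0

def multiclass_label(probabilities):
    """Softmax output: argmax picks the single most likely class."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

def multilabel_labels(probabilities, threshold=0.5):
    """One sigmoid node per class: each class is decided independently."""
    return [1 if p >= threshold else 0 for p in probabilities]

binary = binary_label(0.7)
multiclass = multiclass_label([0.1, 0.7, 0.2])
multilabel = multilabel_labels([0.9, 0.4, 0.6])
```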
Biological Neurons
Typical biological neurons are individual cells, each composed of the main body of the cell along with many tendrils
that extend from that body. The body, or soma, houses the machinery for maintaining basic cell functions and
energy processing (e.g., the DNA-containing nucleus, and organelles for building proteins and processing sugar
and oxygen). There are two types of tendrils: dendrites, which receive information from other neurons and bring it
to the cell body, and axons, which send information from the cell body to other neurons.
Information transmission from a transmitting neuron to a receiving neuron is roughly composed of three stages.
First, the transmitting neuron generates a spatially- and temporally-confined electrical burst, or spike, that travels
along the neuron’s axon (and axonal branches) from the cell body to the terminal ends of the axon. An axon terminal
of the transmitting neuron is “connected” to a dendrite of a receiving neuron by a synapse. The spike causes the
transmitting neuron’s synapse to release chemicals, or neurotransmitters, that travel the short distance between the
two neurons. A number of cellular events (most of which are ignored here) occur when neurotransmitter molecules bind
to receptors on the receiving neuron. One of those events is the opening of cellular channels which initiate another
electrical wave, this time propagating through the receiving neuron’s dendrite toward its cell body (this may be in
the form of a spike, but typically the wave is more spatially diffuse than spike-based transmission along axons).
Thus, information from one neuron can be transmitted to another. When a neuron receives
multiple excitatory spikes from multiple transmitting neurons, that electrical energy is accumulated at the neuron’s
cell body, and if enough energy is accumulated in a short period of time, the neuron will generate outgoing spikes.
There are three remaining aspects to discuss in order to understand the modeling that takes us from biological
neurons to deep learning neurons:
• Rate-coding
• Synaptic strength
• Excitatory and inhibitory neurotransmission
Rate-coding
A neuron that receives only a small number of excitatory spikes will produce and send few spikes of its own, if
any. If that same neuron receives many excitatory spikes it will (typically) send many spikes of its own. Although
spikes in biological neurons have a distinctly temporal characteristic, the temporal resolution is “blurred” in deep
learning neurons. For a given unit of time, the spiking activity of the deep learning neuron is represented as a
number of spikes (an integer) or more typically, an average spiking rate (a floating-point number).
In this contrived example, three neurons in the visual system receive indirect input from one of three groups of
color-sensitive cone cells in the eye. Each neuron is therefore maximally responsive to a particular wavelength
of light, and spiking activity is reported as the average spike rate (normalized to [0,1]). Thus, the input wavelength
is “encoded” by the collective spike rates of the three neurons.
Note, however, that in biological neurons, information is encoded in the relative timing of spikes in individual or
multiple neurons, not just in the individual neuron spiking rates. Thus, this type of information coding and
transmission is absent in deep learning neurons. The impact of this will be discussed further below.
Synaptic strength
Not all spikes are equal. When a propagating spike reaches an axonal terminal, the amount of electrical energy that
ultimately arises in the dendrite of the receiving neuron depends on the strength of the intervening synapse. This
strength is reflective of a number of underlying physiological factors including the amount of neurotransmitter
available for release in the transmitting neuron and the number of neurotransmitter receptors on the receiving
neuron.
Regardless, in deep learning neurons, synaptic strength is represented by a single floating-point number, and is
commonly referred to as the weight of the synapse.
Excitatory and inhibitory neurotransmission
Up until now, we have only considered excitatory neurotransmission. In that case, spikes received from a
transmitting neuron increase the likelihood that a receiving neuron will also spike. This is due to the particular
properties of the activated receptors on the receiving neuron. Although an oversimplification, one can group
neurotransmitters and their receptors into an excitatory class and an inhibitory class. When an inhibitory
neurotransmitter binds to an inhibitory receptor, the electrical energy at the dendrite in the receiving neuron is
reduced rather than increased. In general, neurons have receptors for both excitatory and inhibitory
neurotransmitters, but can release (transmit) only one class or the other. In the mammalian cortex, there are many
more excitatory neurons (which release the neurotransmitter glutamate with each spike) than inhibitory neurons
(which release the neurotransmitter GABA with each spike). Nonetheless, these inhibitory neurons are important
for increasing information selectivity in receiving neurons, gating neurons off and thus contributing to information
routing, and preventing epileptic activity (chaotic firing of many neurons in the network).
In deep learning networks, no distinction is made between excitatory and inhibitory neurons (those having only an
excitatory or inhibitory neurotransmitter, respectively). All neurons have output activity that is greater than zero,
and it is the synapses that model inhibition. The weights of the synapses are allowed to be negative, in which case
inputs from transmitting neurons cause the output of the receiving neuron to be reduced.
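How a negative weight models inhibition can be sketched with a single receiving neuron. This is an illustration under stated assumptions: the spike rates and weights are made-up numbers, the function names are hypothetical, and ReLU is used so that output activity never falls below zero.

```python
def relu(z):
    return max(0.0, z)

def receiving_neuron(input_rates, weights, bias=0.0):
    """Weighted sum of the input rates, kept non-negative by ReLU."""
    return relu(sum(w * r for w, r in zip(weights, input_rates)) + bias)

# One excitatory input (positive weight).
excitatory_only = receiving_neuron([0.8], [1.0])

# The same input plus a second one arriving through a negative weight.
with_inhibition = receiving_neuron([0.8, 0.5], [1.0, -1.0])  # about 0.3
```

Both transmitting neurons have positive activity; it is the sign of the weight alone that decides whether an input raises or lowers the output of the receiving neuron.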
Definitions
1. Biological Neuron:
A cell in the nervous system that transmits information using electrical and chemical signals. It consists
of dendrites (input), a soma (cell body), and an axon (output).
2. Linear Perceptron:
An artificial neuron model that performs binary classification by computing a weighted sum of inputs
and applying a step activation function to produce either 0 or 1 as output.
3. Expressing Linear Perceptrons as Neurons:
Linear perceptrons can be modeled as artificial neurons with inputs, weights, bias, and an activation
function that mimics the input-output behavior of biological neurons.
4. Perceptron Learning Algorithm:
A method for training a perceptron by iteratively updating its weights based on errors in predictions,
ensuring the perceptron can classify linearly separable data.
5. Sigmoid Neurons:
Neurons that use the sigmoid activation function, mapping inputs to a range between 0 and 1, often
used in binary classification tasks.
6. Tanh Neurons:
Neurons that use the hyperbolic tangent (tanh) activation function, mapping inputs to a range between
-1 and 1, helping in faster optimization during training.
7. ReLU Neurons:
Neurons that use the Rectified Linear Unit (ReLU) activation function, which outputs the input directly
if it is positive, otherwise outputs zero, widely used in deep learning.
1. Explain the structure and components of a biological neuron. How do artificial neurons model them?
Biological Neuron: A biological neuron consists of three key components:
• Cell Body (Soma): The central part of the neuron that processes incoming signals and integrates
information.
• Axon: A long, slender projection that transmits electrical impulses to other neurons, muscles, or glands.
• Synapse: The junctions between neurons through which signals are passed from one neuron to another.
Artificial Neuron: An artificial neuron mimics the behavior of a biological neuron. It consists of:
• Weights: Analogous to synaptic weights, these control the strength of the input signals.
• Bias: A parameter that shifts the activation function, helping adjust the neuron's output.
• Activation Function: Similar to the soma's decision-making process, it determines the neuron's output
based on the weighted sum of inputs.
• Output: Represents the signal that is passed on to other neurons in the network.
• The perceptron sums the weighted inputs, adds a bias, and then applies an activation function (typically
a step function) to produce an output. If the output exceeds a certain threshold, it is classified as 1;
otherwise, it's classified as 0.
Limitations:
• The perceptron can only solve linearly separable problems. It cannot solve problems where the data
cannot be separated by a straight line (like the XOR problem).
• It is unable to handle more complex decision boundaries, which is why more advanced models, such as
multi-layer neural networks, are necessary.
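A perceptron with hand-picked weights can be sketched as below. The weights and bias for the logical AND function are chosen here by hand purely for illustration; they are one of many settings that work.

```python
def perceptron(inputs, weights, bias):
    """Weighted sum plus bias, followed by a step function (threshold at 0)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

# Hand-picked parameters for logical AND: only (1, 1) gives 1 + 1 - 1.5 > 0.
AND_WEIGHTS, AND_BIAS = [1.0, 1.0], -1.5

POINTS = [[0, 0], [0, 1], [1, 0], [1, 1]]
and_outputs = [perceptron(p, AND_WEIGHTS, AND_BIAS) for p in POINTS]
```

AND is linearly separable, so a single line separates the one positive point from the other three. For XOR no such line exists, so no choice of weights and bias gives the correct output for all four points.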
4. Derive the perceptron learning algorithm and explain its weight update rule.
The perceptron learning algorithm is used to adjust the weights based on the error between the predicted output
and the actual label. It works by updating the weights and bias whenever there is a misclassification. If the
perceptron makes an incorrect prediction, the weights are adjusted according to the formula:
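A common statement of the perceptron update rule is w <- w + eta * (y - y_hat) * x and b <- b + eta * (y - y_hat), where y is the actual label, y_hat is the predicted output and eta is the learning rate. The sketch below applies it to the logical OR function; the names, the learning rate and the number of epochs are assumptions made for illustration.

```python
def predict(weights, bias, x):
    """Step-function perceptron: 1 if the weighted sum plus bias is positive."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, labels, learning_rate=0.1, epochs=20):
    """Update weights and bias only when a sample is misclassified."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # -1, 0 or +1
            if error != 0:
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
    return weights, bias

# Logical OR is linearly separable, so the algorithm can converge.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_or = [0, 1, 1, 1]
weights, bias = train_perceptron(X, y_or)
predictions = [predict(weights, bias, x) for x in X]
```

When the prediction is correct the error is 0 and nothing changes; when it is wrong, the weights move towards the input (error +1) or away from it (error -1).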
5. What are activation functions, and why are they essential in neural networks?
Activation functions are mathematical functions applied to the weighted sum of inputs to a neuron. They introduce
non-linearity into the network, allowing it to learn complex patterns. Without activation functions, a neural
network would simply behave like a linear regression model, regardless of how many layers it has.
Why essential?
• They allow deep networks to learn complex patterns and make more accurate predictions.
• Without non-linear activation functions, neural networks would be limited in their ability to model
complex relationships.
6. Compare and contrast sigmoid, tanh, and ReLU activation functions in terms of their mathematical
properties and practical use cases.
• Sigmoid:
o Output range: (0, 1)
o Smooth gradient and differentiable, making it useful in probabilistic models.
o Issues: Can cause vanishing gradients for large positive or negative inputs, which leads to slow
convergence in deep networks.
7. What is the "vanishing gradient problem," and how do activation functions like ReLU address it?
Vanishing Gradient Problem: The vanishing gradient problem occurs when gradients become very small during
backpropagation in deep networks. This happens especially with activation functions like sigmoid or tanh, which
squash large inputs into a narrow output range, so their derivatives are close to zero for large positive or
negative inputs (the sigmoid derivative is never larger than 0.25). Because backpropagation multiplies these
derivatives layer by layer, the gradients approach zero as they are propagated back. This leads to slower
learning or the network failing to learn altogether.
How ReLU addresses it: ReLU does not saturate for positive inputs: it simply outputs the input, so its derivative
there is 1. Gradients passing through active units are therefore not shrunk, which helps prevent vanishing
gradients and typically speeds up convergence in deep networks.
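A rough illustration of the effect: during backpropagation the gradient is multiplied by one activation derivative per layer. The sketch below multiplies those derivatives over ten layers. It is a simplification that ignores the weight matrices, and the pre-activation value of 0.5 at every layer is an arbitrary assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def gradient_factor(grad_fn, pre_activations):
    """Product of activation derivatives along one path through the layers."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor


# Assumed for illustration: ten layers, each with pre-activation 0.5.
pre_activations = [0.5] * 10

sigmoid_factor = gradient_factor(sigmoid_grad, pre_activations)
relu_factor = gradient_factor(relu_grad, pre_activations)

print(f"sigmoid: {sigmoid_factor:.2e}")  # shrinks towards zero
print(f"relu:    {relu_factor:.1f}")     # stays at 1.0 for positive inputs
```

With sigmoid the factor falls below one millionth after ten layers, whereas with ReLU it stays at 1 as long as the units are active.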
8. Describe how a perceptron can be geometrically interpreted in terms of decision boundaries.
A perceptron can be geometrically interpreted as a linear classifier. In a 2D space, the perceptron draws a straight
line (or hyperplane in higher dimensions) that separates the two classes. This line is the decision boundary, where
one side corresponds to one class, and the other side corresponds to the other. The perceptron adjusts the weights
to move this boundary to correctly classify the data points.
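As a small sketch of this geometric view in 2D, the boundary is the set of points where w1*x1 + w2*x2 + b = 0, and the sign of that expression tells which side a point lies on. The weights, bias, and sample points below are hypothetical values chosen for illustration.

```python
def side_of_boundary(weights, bias, point):
    """Return 1 for points on or above the line w.x + b = 0, else 0."""
    score = sum(w * p for w, p in zip(weights, point)) + bias
    return 1 if score >= 0 else 0


def boundary_x2(weights, bias, x1):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (requires w2 != 0)."""
    w1, w2 = weights
    return -(w1 * x1 + bias) / w2


# Hypothetical perceptron: the boundary is the line x1 + x2 = 1.5.
weights = [1.0, 1.0]
bias = -1.5

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
classes = [side_of_boundary(weights, bias, p) for p in points]

print(classes)                          # [0, 0, 0, 1, 1]
print(boundary_x2(weights, bias, 0.0))  # 1.5
```

Changing the weights rotates the line and changing the bias shifts it, which is what the learning rule does when it moves the boundary.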
10. Discuss scenarios where sigmoid or tanh activation functions are preferred over ReLU.
• ReLU is generally used in hidden layers, where the ability to learn sparse representations and avoid
vanishing gradients outweighs the preference for bounded outputs.
Feed forward neural networks are artificial neural networks in which the connections between nodes
do not form loops. When such a network has several layers it is also known as a multi-layer neural
network, and all information is only passed forward.
During data flow, input nodes receive data, which travels through the hidden layers and exits
through the output nodes. The network contains no links that could send information back from
the output nodes.
Feed forward neural networks serve as a basis for object detection in photos, as in the
Google Photos app.
The single-layer perceptron multiplies inputs by weights as they enter the layer. The weighted
input values are then added together to get a sum. If the sum rises above a certain threshold,
usually set at zero, the output value is typically 1; if it falls below the threshold, the output
is typically -1.
As a feed forward neural network model, the single-layer perceptron is often used for
classification. Single-layer perceptrons can also learn from data. During training, the network
compares its outputs with the intended values and adjusts its weights according to the delta
rule.
• Input layer:
The neurons of this layer receive input and pass it on to the other layers of the network. The
number of neurons in the input layer must match the number of features or attributes in the dataset.
• Output layer:
This layer represents the predicted feature; its form depends on the type of model being built.
• Hidden layer:
Hidden layers separate the input and output layers. Depending on the type of model, there
may be several hidden layers.
Each hidden layer contains several neurons that transform the input before passing it to the
next layer. The weights of the network are updated continually during training to improve its
predictions.
• Neuron weights:
Each connection between neurons carries a weight, which measures its strength or magnitude.
Input weights can be compared with one another, much like linear regression coefficients.
• Neurons:
Feed forward networks are built from artificial neurons, which are modeled on biological
neurons.
A neuron does two things: first, it computes the weighted sum of its inputs, and second, it
applies an activation function to that sum to normalize it.
Activation functions can be either linear or nonlinear. Each neuron has a weight for each of
its inputs. During the learning phase, the network learns these weights.
• Activation Function:
The activation function determines whether a neuron makes a linear or nonlinear decision.
Because a signal passes through many layers, the activation function can also keep neuron
outputs from growing uncontrollably through this cascading effect.
Three commonly used activation functions are sigmoid, Tanh, and Rectified Linear Unit (ReLU).
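• Sigmoid:
Input values are mapped to the range between 0 and 1.
• Tanh:
Input values are mapped to the range between -1 and 1.
• Rectified Linear Unit (ReLU):
Only positive values are allowed to flow through this function. Negative values get mapped to
0.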
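The components listed above can be put together in a few lines. The following sketch computes the output of a tiny feed forward network in which each neuron forms a weighted sum of its inputs and then applies an activation function. The layer sizes, weights, biases, and function names are assumptions made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation):
    """Step 1: weighted sum of the inputs. Step 2: apply the activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


def layer_output(inputs, weight_rows, biases, activation):
    """One row of weights and one bias per neuron in the layer."""
    return [
        neuron_output(inputs, weights, bias, activation)
        for weights, bias in zip(weight_rows, biases)
    ]


# Hypothetical network: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid).
inputs = [1.0, 2.0]
hidden = layer_output(inputs, [[0.5, -0.25], [1.0, 1.0]], [0.0, -1.0], relu)
output = neuron_output(hidden, [1.0, 0.5], 0.0, sigmoid)

print(hidden)            # [0.0, 2.0]
print(round(output, 4))  # 0.7311
```

Data only moves forward here: the hidden values are computed from the inputs, the output from the hidden values, and nothing is fed back.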
Cost function
In a feed forward neural network, the cost function plays an important role. Minor adjustments
to weights and biases have little effect on how many data points are classified correctly, so the
classification count alone gives little guidance on how to improve the network.
Instead, a smooth cost function is used to determine how to adjust the weights and biases to
improve performance.
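A common choice, consistent with the symbols defined below, is the mean squared error
(quadratic) cost:
C(w, b) = (1 / 2n) Σ_x ||y(x) − a||²
Where,
w = weights
b = biases
n = number of training inputs
a = output vectors
x = input
y(x) = desired output for input x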
Loss function
The loss function of a neural network is used to determine whether an adjustment needs to be
made in the learning process.
For classification, the number of neurons in the output layer equals the number of classes, and
the cross-entropy loss measures the difference between the predicted and actual probability
distributions. The cross-entropy loss for binary classification is:
L = −[y log(p) + (1 − y) log(1 − p)]
where y is the actual label (0 or 1) and p is the predicted probability of class 1.
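A minimal sketch of the binary cross-entropy loss, L = −[y log(p) + (1 − y) log(1 − p)], is given below. The clipping constant eps and the sample labels and probabilities are assumptions added to keep the example runnable and numerically safe.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one example; y_true is 0 or 1, y_pred is the probability of class 1."""
    p = min(max(y_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def mean_binary_cross_entropy(labels, probabilities):
    """Average the per-example loss over a batch."""
    losses = [binary_cross_entropy(y, p) for y, p in zip(labels, probabilities)]
    return sum(losses) / len(losses)


# Hypothetical labels and predicted probabilities.
labels = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
confident_and_wrong = [0.1, 0.9, 0.2, 0.8]

print(round(mean_binary_cross_entropy(labels, confident_and_right), 4))
print(round(mean_binary_cross_entropy(labels, confident_and_wrong), 4))
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the network is confidently wrong.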
In the gradient descent algorithm, the next point is calculated by scaling the gradient at the
current position by a learning rate and then subtracting the result from the current position.
The scaled gradient is subtracted in order to decrease the function (to increase it, the value
would be added). This procedure can be written as:
p_(n+1) = p_n − η ∇f(p_n)
The parameter η, the learning rate, scales the gradient and thereby determines the step size.
Performance is significantly affected by the learning rate in machine learning.
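The update step can be sketched on a one-dimensional function. The example below minimizes f(p) = (p − 3)², whose gradient is 2(p − 3). The function, starting point, learning rates, and step counts are arbitrary choices for illustration.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeat p <- p - learning_rate * grad(p) for a fixed number of steps."""
    p = start
    for _ in range(steps):
        p = p - learning_rate * grad(p)
    return p


def grad_f(p):
    """Gradient of f(p) = (p - 3)^2, which has its minimum at p = 3."""
    return 2.0 * (p - 3.0)


converged = gradient_descent(grad_f, start=0.0, learning_rate=0.1, steps=100)
diverged = gradient_descent(grad_f, start=0.0, learning_rate=1.1, steps=20)

print(round(converged, 6))  # close to 3.0
print(diverged)             # far from 3.0: the step size is too large
```

With a learning rate of 0.1 each step shrinks the distance to the minimum, while with 1.1 each step overshoots by more than it corrects, which is one way the learning rate affects performance.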
Output units
In the output layer, output units are those units that provide the desired output or prediction,
thereby fulfilling the task that the neural network needs to complete.
There is a close relationship between the choice of output units and the cost function. Any unit
that can serve as a hidden unit can also serve as an output unit in a neural network.
• The simplified architecture of feed forward neural networks can benefit machine learning.
• Multiple networks in a feed forward arrangement operate independently, with a moderated
intermediary.
• Complex tasks need several neurons in the network.
• Neural networks can handle and process nonlinear data more easily than single perceptrons
and sigmoid neurons, for which such data is otherwise difficult.
• A neural network can handle complicated decision boundaries.
• Depending on the data, the neural network architecture can vary. For example,
convolutional neural networks (CNNs) perform exceptionally well in image processing,
whereas recurrent neural networks (RNNs) perform well in text and voice processing.
• Neural networks often rely on graphics processing units (GPUs) to handle large datasets,
which demand massive computational and hardware performance. GPU access is widely
available through platforms such as Kaggle Notebooks and Google Colab Notebooks.
Feed-forward neural networks pass information in one direction, from input to output, with no
feedback loops. A deep feed-forward network, commonly known as a deep neural network, consists
of multiple hidden layers between the input and output layers, enabling the network to learn
complex hierarchical features and patterns. Feed-forward networks process data without feedback
loops, making them suitable for tasks like pattern recognition and classification. Feedback
neural networks, on the other hand, incorporate feedback connections, allowing output to affect
subsequent processing. Recurrent Neural Networks (RNNs) are a common type of feedback
network, useful for sequential data tasks like language modeling.
The following example shows a simple feedforward neural network problem and illustrates how
the backpropagation algorithm updates the weights. Assume a basic feedforward network for a
binary classification problem with two input features, one hidden layer with two neurons, and
one output neuron.
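**Problem**:
Suppose you have a feedforward neural network with the following architecture:
• Input layer: 2 features (x1, x2)
• Hidden layer: 2 neurons with sigmoid activation
• Output layer: 1 neuron with sigmoid activation
The network is trained to perform binary classification. Given a set of training data, we'll
calculate the weights' updates for one training example using the backpropagation algorithm.
**Training Data**:
Let's consider one training example with the following values:
x1 = 0.6, x2 = 0.9
**Initial Weights**:
We'll start with some initial weights for the connections:
x1 -> Hidden Neuron 1: 0.1, x2 -> Hidden Neuron 1: -0.2
x1 -> Hidden Neuron 2: 0.3, x2 -> Hidden Neuron 2: 0.4
w5 (Hidden Neuron 1 -> Output Neuron) = -0.5, w6 (Hidden Neuron 2 -> Output Neuron) = 0.6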
**Forward Pass**:
1. Calculate the weighted sum and apply the activation function for the hidden layer:
Hidden Neuron 1:
z1 = (0.6 * 0.1) + (0.9 * (-0.2)) = 0.06 - 0.18 = -0.12
a1 = sigmoid(z1) = sigmoid(-0.12) ≈ 0.4700
Hidden Neuron 2:
z2 = (0.6 * 0.3) + (0.9 * 0.4) = 0.18 + 0.36 = 0.54
a2 = sigmoid(z2) = sigmoid(0.54) ≈ 0.6318
2. Calculate the weighted sum and apply the activation function for the output layer:
Output Neuron:
z3 = (a1 * (-0.5)) + (a2 * 0.6) = (0.4700 * (-0.5)) + (0.6318 * 0.6) = -0.2350 + 0.3791 = 0.1441
a3 = sigmoid(z3) = sigmoid(0.1441) ≈ 0.5360
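The forward pass can be checked with a few lines of code. This sketch uses the inputs and weights that appear in the calculations above; note that the output neuron receives the activations a1 and a2 (the sigmoid outputs), not the raw sums z1 and z2. The variable names are chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x1, x2 = 0.6, 0.9

# Hidden layer: weighted sums, then sigmoid activations.
z1 = x1 * 0.1 + x2 * (-0.2)
a1 = sigmoid(z1)
z2 = x1 * 0.3 + x2 * 0.4
a2 = sigmoid(z2)

# Output layer: weighted sum of the hidden activations a1 and a2.
z3 = a1 * (-0.5) + a2 * 0.6
a3 = sigmoid(z3)

print(f"z1={z1:.4f} a1={a1:.4f}")
print(f"z2={z2:.4f} a2={a2:.4f}")
print(f"z3={z3:.4f} a3={a3:.4f}")
```

Rounded to four decimals, this gives a1 ≈ 0.4700, a2 ≈ 0.6318, z3 ≈ 0.1441, and a3 ≈ 0.5360.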
**Backpropagation**:
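4. Update the weights between the hidden and output layer using the backpropagation formula:
Δw5 = learning_rate * delta_output * a1
Δw6 = learning_rate * delta_output * a2
Here delta_output is the error signal at the output neuron. For a squared-error loss with a
sigmoid output, delta_output = (target - a3) * a3 * (1 - a3), and each weight is updated as
w = w + Δw.
6. Calculate the derivative of the sigmoid activation function for the hidden layer:
sigmoid_derivative_hidden1 = a1 * (1 - a1)
sigmoid_derivative_hidden2 = a2 * (1 - a2)
These derivatives scale the error passed back to each hidden neuron before the input-to-hidden
weights are updated.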
You can repeat these steps for each training example in your dataset and update the weights
iteratively. This is the basic idea of backpropagation in a feedforward neural network. The
learning rate is a hyperparameter that controls the size of weight updates and should be tuned
during training.
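To tie the steps together, here is a hedged sketch of one complete backpropagation update for the 2-2-1 network in this example. Several things are assumptions rather than values from the text: the target label of 1.0, the learning rate of 0.5, the squared-error loss 0.5 * (target - a3)^2, the absence of bias terms, and the dictionary keys used to name the input-to-hidden weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, x1, x2):
    """Return the hidden activations and the output of the 2-2-1 network."""
    a1 = sigmoid(x1 * w["x1_h1"] + x2 * w["x2_h1"])
    a2 = sigmoid(x1 * w["x1_h2"] + x2 * w["x2_h2"])
    a3 = sigmoid(a1 * w["w5"] + a2 * w["w6"])
    return a1, a2, a3


x1, x2 = 0.6, 0.9
target = 1.0         # assumed label, not given in the text
learning_rate = 0.5  # assumed value, not given in the text
w = {"x1_h1": 0.1, "x2_h1": -0.2, "x1_h2": 0.3, "x2_h2": 0.4, "w5": -0.5, "w6": 0.6}

a1, a2, a3 = forward(w, x1, x2)
loss_before = 0.5 * (target - a3) ** 2

# Output error signal: (target - output) times the sigmoid derivative at the output.
delta_output = (target - a3) * a3 * (1 - a3)

# Hidden error signals: pass delta_output back through the old hidden-to-output
# weights and scale by each hidden neuron's sigmoid derivative.
delta_h1 = delta_output * w["w5"] * a1 * (1 - a1)
delta_h2 = delta_output * w["w6"] * a2 * (1 - a2)

updated = {
    "w5": w["w5"] + learning_rate * delta_output * a1,
    "w6": w["w6"] + learning_rate * delta_output * a2,
    "x1_h1": w["x1_h1"] + learning_rate * delta_h1 * x1,
    "x2_h1": w["x2_h1"] + learning_rate * delta_h1 * x2,
    "x1_h2": w["x1_h2"] + learning_rate * delta_h2 * x1,
    "x2_h2": w["x2_h2"] + learning_rate * delta_h2 * x2,
}

_, _, a3_after = forward(updated, x1, x2)
loss_after = 0.5 * (target - a3_after) ** 2

print(f"output before: {a3:.4f}, after: {a3_after:.4f}")
print(f"loss before:   {loss_before:.4f}, after: {loss_after:.4f}")
```

After the update the output moves towards the assumed target and the loss decreases; repeating this over all training examples is the iterative procedure described above.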