
Unit-1 Notes Complete

Uploaded by

Poranki Anusha
Copyright
© © All Rights Reserved

Deep learning is a branch of machine learning based on artificial neural networks (ANNs);
networks with many layers are also known as deep neural networks (DNNs). It is capable of
learning complex patterns and relationships within data, so we don’t need to explicitly program
everything. It has become increasingly popular in recent years due to advances in processing
power and the availability of large datasets. These neural networks are inspired by the structure
and function of the human brain’s biological neurons, and they are designed to learn from large
amounts of data.

1. Deep Learning is a subfield of Machine Learning that involves the use of neural networks to
model and solve complex problems. Neural networks are modeled after the structure and function
of the human brain and consist of layers of interconnected nodes that process and transform data.

2. The key characteristic of Deep Learning is the use of deep neural networks, which have
multiple layers of interconnected nodes. These networks can learn complex representations of
data by discovering hierarchical patterns and features in the data. Deep Learning algorithms can
automatically learn and improve from data without the need for manual feature engineering.

3. Deep Learning has achieved significant success in various fields, including image recognition,
natural language processing, speech recognition, and recommendation systems. Some of the
popular Deep Learning architectures include Convolutional Neural Networks (CNNs), Recurrent
Neural Networks (RNNs), and Deep Belief Networks (DBNs).

4. Training deep neural networks typically requires a large amount of data and computational
resources. However, the availability of cloud computing and the development of specialized
hardware, such as Graphics Processing Units (GPUs), has made it easier to train deep neural
networks.

In summary, Deep Learning is a subfield of Machine Learning that involves the use of deep
neural networks to model and solve complex problems. Deep Learning has achieved significant
success in various fields, and its use is expected to continue to grow as more data and more
powerful computing resources become available.

Difference between Machine Learning and Deep Learning:

Machine learning and deep learning are both subsets of artificial intelligence, and there are many
similarities and differences between them.

Types of neural networks:

Deep Learning models are able to automatically learn features from the data, which makes them
well-suited for tasks such as image recognition, speech recognition, and natural language
processing. The most widely used architectures in deep learning are feedforward neural networks,
convolutional neural networks (CNNs), and recurrent neural networks (RNNs).

Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow of
information through the network. FNNs have been widely used for tasks such as image
classification, speech recognition, and natural language processing.

Convolutional Neural Networks (CNNs) are designed specifically for image and video recognition tasks.
CNNs are able to automatically learn features from the images, which makes them well-suited
for tasks such as image classification, object detection, and image segmentation.

Recurrent Neural Networks (RNNs) are a type of neural network that is able to process
sequential data, such as time series and natural language. RNNs are able to maintain an internal
state that captures information about the previous inputs, which makes them well-suited for tasks
such as speech recognition, natural language processing, and language translation.

Applications of Deep Learning:

The main applications of deep learning can be divided into computer vision, natural language
processing (NLP), and reinforcement learning.

The Neural Network

Building Intelligent Machines:

The brain is the most incredible organ in the human body. It dictates the way we perceive every
sight, sound, smell, taste, and touch. It enables us to store memories, experience emotions, and
even dream. Without it, we would be primitive organisms, incapable of anything other than the
simplest of reflexes. The brain is, inherently, what makes us intelligent.

The infant brain only weighs a single pound, but somehow it solves problems that even our
biggest, most powerful supercomputers find impossible. Within a matter of months after birth,
infants can recognize the faces of their parents, discern discrete objects from their backgrounds,
and even tell apart voices. Within a year, they’ve already developed an intuition for natural
physics, can track objects even when they become partially or completely blocked, and can
associate sounds with specific meanings. And by early childhood, they have a sophisticated
understanding of grammar and thousands of words in their vocabularies.

The Biological Neuron


The human brain consists of a large number (more than a billion) of neural cells that process
information. Each cell works like a simple processor. Only the massive interaction between all
cells and their parallel processing makes the brain’s abilities possible. Figure 1 represents a
human biological nervous unit. The various parts of a biological neural network (BNN) are marked
in Figure 1.

Figure 1: Biological Neural Network

• Dendrites are branching fibres that extend from the cell body or soma.
• The soma or cell body of a neuron contains the nucleus and other structures that support
chemical processing and the production of neurotransmitters.
• The axon is a single fibre that carries information away from the soma to the synaptic sites of
other neurons (dendrites and somas), muscles, or glands. The axon hillock is the site of
summation for incoming information. At any moment, the collective influence of all neurons
that conduct impulses to a given neuron will determine whether or not an action potential will
be initiated at the axon hillock and propagated along the axon.
• The myelin sheath consists of fat-containing cells that insulate the axon from electrical
activity. This insulation acts to increase the rate of transmission of signals. A gap exists
between each myelin sheath cell along the axon. Since fat inhibits the propagation of
electricity, the signals jump from one gap to the next.
• Nodes of Ranvier are the gaps (about 1 μm) between myelin sheath cells. Since fat serves
as a good insulator, the myelin sheaths speed the rate of transmission of an electrical
impulse along the axon.
• A synapse is the point of connection between two neurons or a neuron and a muscle or a
gland. Electrochemical communication between neurons takes place at these junctions.
• Terminal buttons of a neuron are the small knobs at the end of an axon that release
chemicals called neurotransmitters.

For decades, we’ve dreamed of building intelligent machines with brains like ours—robotic
assistants to clean our homes, cars that drive themselves, microscopes that automatically detect
diseases. But building these artificially intelligent machines requires us to solve some of the most
complex computational problems we have ever grappled with; problems that our brains can
already solve in a matter of microseconds. To tackle these problems, we’ll have to develop a
radically different way of programming a computer using techniques largely developed over the
past decade. This is an extremely active field of artificial computer intelligence often referred to
as deep learning.

The limits of Traditional Computer Programs :

Why exactly are certain problems so difficult for computers to solve? Well, it turns out that
traditional computer programs are designed to be very good at two things: 1) performing
arithmetic really fast and 2) explicitly following a list of instructions. So if you want to do some
heavy financial number crunching, you’re in luck. Traditional computer programs can do the
trick. But let’s say we want to do something slightly more interesting, like write a program to
automatically read someone’s handwriting. Figure 1-1 will serve as a starting point.

Figure 1-1. Image from MNIST handwritten digit dataset

Although every digit in Figure 1-1 is written in a slightly different way, we can easily recognize
every digit in the first row as a zero, every digit in the second row as a one, etc. Let’s try to write
a computer program to crack this task. What rules could we use to tell one digit from another?

Well, we can start simple! For example, we might state that we have a zero if our image only has
a single, closed loop. All the examples in Figure 1-1 seem to fit this bill, but this isn’t really a
sufficient condition. What if someone doesn’t perfectly close the loop on their zero? And, as
in Figure 1-2, how do you distinguish a messy zero from a six?

Figure 1-2. A zero that’s algorithmically difficult to distinguish from a six

You could potentially establish some sort of cutoff for the distance between the starting point of
the loop and the ending point, but it’s not exactly clear where we should be drawing the line. But
this dilemma is only the beginning of our worries. How do we distinguish between threes and
fives? Or between fours and nines? We can add more and more rules, or features, through careful
observation and months of trial and error, but it’s quite clear that this isn’t going to be an easy
process.

Many other classes of problems fall into this same category: object recognition, speech
comprehension, automated translation, etc. We don’t know what program to write because we
don’t know how it’s done by our brains. And even if we did know how to do it, the program
might be horrendously complicated.

The Mechanics of Machine Learning :

Deep learning is a subset of a more general field of artificial intelligence called machine
learning, which is predicated on this idea of learning from example. In machine learning, instead
of teaching a computer a massive list of rules to solve the problem, we give it a model with
which it can evaluate examples, and a small set of instructions to modify the model when it
makes a mistake. We expect that, over time, a well-suited model would be able to solve the
problem extremely accurately.

Let’s be a little bit more rigorous about what this means so we can formulate this idea
mathematically. Let’s define our model to be a function $h(\mathbf{x}, \theta)$. The input
$\mathbf{x}$ is an example expressed in vector form. For example, if $\mathbf{x}$ were a
grayscale image, the vector’s components would be pixel intensities at each position, as shown
in Figure 1-3.

Figure 1-3. The process of vectorizing an image for a machine learning algorithm
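
As a rough illustration of the vectorizing step, the sketch below flattens a small grid of pixel intensities into a single vector, row by row. The 3x3 image and the helper name `vectorize` are made up for this example; a real MNIST digit would be 28x28 and give a 784-component vector.

```python
# A tiny 3x3 grayscale "image": each entry is a pixel intensity in [0, 255].
# The values are invented purely for illustration.
image = [
    [0, 128, 255],
    [64, 192, 32],
    [16, 8, 4],
]


def vectorize(img):
    """Flatten a 2-D grid of pixel intensities into one vector, row by row."""
    return [pixel for row in img for pixel in row]


x = vectorize(image)
print(len(x))  # one component per pixel
print(x)
```

Row-by-row order is only one possible convention; what matters is that every image is flattened the same way.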

The input θ is a vector of the parameters that our model uses. Our machine learning program
tries to perfect the values of these parameters as it is exposed to more and more examples. We’ll
see this in action and in more detail in Chapter 2.

To develop a more intuitive understanding for machine learning models, let’s walk through a
quick example. Let’s say we wanted to determine how to predict exam performance based on the
number of hours of sleep we get and the number of hours we study the previous day. We collect
a lot of data, and for each data point $\mathbf{x} = [x_1 \; x_2]^T$, we record the number of
hours of sleep we got ($x_1$), the number of hours we spent studying ($x_2$), and whether we
performed above or below the class average. Our goal, then, might be to learn a model
$h(\mathbf{x}, \theta)$ with parameter vector $\theta = [\theta_0 \; \theta_1 \; \theta_2]^T$
such that:

$$
h(\mathbf{x}, \theta) =
\begin{cases}
-1 & \text{if } \mathbf{x}^T \cdot \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} + \theta_0 < 0 \\
1 & \text{if } \mathbf{x}^T \cdot \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} + \theta_0 \ge 0
\end{cases}
$$

In other words, we guess that the blueprint for our model h ( 𝐱 , θ ) is as described
above (geometrically, this particular blueprint describes a linear classifier that divides the
coordinate plane into two halves). Then, we want to learn a parameter vector θ such that our
model makes the right predictions (−1 if we perform below average, and 1 otherwise) given an
input example x. This model is called a linear perceptron, and it’s a model that’s been used since
the 1950s. Let’s assume our data is as shown in Figure 1-4.

Figure 1-4. Sample data for our exam predictor algorithm and a potential classifier

Then it turns out, by selecting $\theta = [-24 \; 3 \; 4]^T$, our machine learning model makes
the correct prediction on every data point:

$$
h(\mathbf{x}, \theta) =
\begin{cases}
-1 & \text{if } 3x_1 + 4x_2 - 24 < 0 \\
1 & \text{if } 3x_1 + 4x_2 - 24 \ge 0
\end{cases}
$$
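
The classifier above can be written out directly. This is a minimal sketch using θ0 = −24, θ1 = 3, θ2 = 4 from the text; the three sample points are hypothetical and are not taken from Figure 1-4.

```python
def h(x, theta):
    """Linear perceptron: x = (x1, x2), theta = (theta0, theta1, theta2)."""
    theta0, theta1, theta2 = theta
    x1, x2 = x
    z = theta1 * x1 + theta2 * x2 + theta0
    return -1 if z < 0 else 1


theta = (-24, 3, 4)

# Hypothetical data points: (hours of sleep, hours of study)
print(h((8, 1), theta))  # 3*8 + 4*1 - 24 = 4   -> 1
print(h((2, 3), theta))  # 3*2 + 4*3 - 24 = -6  -> -1
print(h((4, 3), theta))  # 3*4 + 4*3 - 24 = 0   -> 1 (exactly on the boundary)
```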

An optimal parameter vector θ positions the classifier so that we make as many correct
predictions as possible. In most cases, there are many (or even infinitely many) possible choices
for θ that are optimal. Fortunately for us, most of the time these alternatives are so close to one
another that the difference is negligible. If this is not the case, we may want to collect more data
to narrow our choice of θ .

While the setup seems reasonable, there are still some pretty significant questions that remain.
First off, how do we even come up with an optimal value for the parameter vector θ in the first
place? Solving this problem requires a technique commonly known as optimization. An
optimizer aims to maximize the performance of a machine learning model by iteratively
tweaking its parameters until the error is minimized. We’ll begin to tackle this question of
learning parameter vectors in more detail in Chapter 2, when we describe the process of gradient
descent. In later chapters, we’ll try to find ways to make this process even more efficient.

Second, it’s quite clear that this particular model (the linear perceptron model) is quite limited in
the relationships it can learn. For example, the distributions of data shown in Figure 1-5 cannot
be described well by a linear perceptron.

Figure 1-5. As our data takes on more complex forms, we need more complex models to
describe them

But these situations are only the tip of the iceberg. As we move on to much more complex
problems, such as object recognition and text analysis, our data becomes extremely high
dimensional, and the relationships we want to capture become highly nonlinear. To
accommodate this complexity, recent research in machine learning has attempted to build models
that resemble the structures utilized by our brains. It’s essentially this body of research,
commonly referred to as deep learning, that has had spectacular success in tackling problems in
computer vision and natural language processing. These algorithms not only far surpass other
kinds of machine learning algorithms, but also rival (or even exceed!) the accuracies achieved by
humans.

In “The Mechanics of Machine Learning”, we talked about using machine learning models to
capture the relationship between success on exams and time spent studying and sleeping. To
tackle this problem, we constructed a linear perceptron classifier that divided the Cartesian
coordinate plane into two halves:

$$
h(\mathbf{x}, \theta) =
\begin{cases}
-1 & \text{if } 3x_1 + 4x_2 - 24 < 0 \\
1 & \text{if } 3x_1 + 4x_2 - 24 \ge 0
\end{cases}
$$

As shown in Figure 1-4, this is an optimal choice for θ because it correctly classifies every
sample in our dataset. Here, we show that our model h is easily expressed using a neuron.
Consider the neuron depicted in Figure 1-8. The neuron has two inputs, a bias, and uses the
function:

$$
f(z) =
\begin{cases}
-1 & \text{if } z < 0 \\
1 & \text{if } z \ge 0
\end{cases}
$$

It’s very easy to show that our linear perceptron and the neuronal model are perfectly equivalent.
And in general, it’s quite simple to show that single neurons are strictly more expressive than
linear perceptrons. In other words, every linear perceptron can be expressed as a single neuron,
but single neurons can also express models that cannot be expressed by any linear perceptron.

Figure 1-8. Expressing our exam performance perceptron as a neuron

Neuron :

What is a Neuron in Biology?

Neurons in deep learning were inspired by neurons in the human brain. Here is a diagram of the
anatomy of a brain neuron:

Groups of neurons work together inside the human brain to perform the functionality that we
require in our day-to-day lives. A neuron by itself is useless. Instead, you require networks of
neurons to generate any meaningful functionality.

This is because neurons function by receiving and sending signals. More specifically, the
neuron’s dendrites receive signals and pass along those signals through the axon.

The dendrites of one neuron are connected to the axon of another neuron. These connections are
called synapses, which is a concept that has been generalized to the field of deep learning.

What is a Neuron in Deep Learning?

Neurons in deep learning models are nodes through which data and computations flow.

Neurons work like this:

• They receive one or more input signals. These input signals can come from either the raw
data set or from neurons positioned at a previous layer of the neural net.
• They perform some calculations.
• They send some output signals to neurons deeper in the neural net through a synapse.

Here is a diagram of the functionality of a neuron in a deep learning neural net:


Neurons in a deep learning model are capable of having synapses that connect to more than one
neuron in the preceding layer. Each synapse has an associated weight, which impacts the
preceding neuron’s importance in the overall neural network.

Weights are a very important topic in the field of deep learning because adjusting a model’s
weights is the primary way through which deep learning models are trained.

Once a neuron receives its inputs from the neurons in the preceding layer of the model, it adds up
each signal multiplied by its corresponding weight and passes them on to an activation function,
like this:

The activation function calculates the output value for the neuron. This output value is then
passed on to the next layer of the neural network through another synapse.
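
The weighted-sum-then-activation computation described above can be sketched as follows. The input values, weights, bias term, and the choice of a sigmoid activation are all assumptions made for illustration; the function name `neuron_output` is likewise hypothetical.

```python
import math


def sigmoid(z):
    """One common activation function; it squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation):
    """Sum each input times its weight, add the bias, then apply the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Made-up signals arriving from three neurons in the preceding layer.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

out = neuron_output(inputs, weights, bias, sigmoid)
print(out)  # a value between 0 and 1
```

Passing a different `activation` (for example a step function) changes what the neuron computes without touching the weighted sum.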

A neuron consists of three basic components: weights, thresholds and a single activation
function. An artificial neural network (ANN) model based on biological neural systems is shown
in the figure.

Expressing Linear Perceptrons as Neurons:

Perceptron Model :

Simple Perceptron for Pattern Classification:


Perceptron network is capable of performing pattern classification into two or more categories.
The perceptron is trained using the perceptron learning rule. We will first consider classification
into two categories and then the general multiclass classification later. For classification into
only two categories, all we need is a single output neuron. Here we will use bipolar neurons. The
simplest architecture that could do the job consists of a layer of N input neurons, an output layer
with a single output neuron, and no hidden layers. This is the same architecture as we saw before
for Hebbian learning. However, we will use a different transfer function here for the output
neurons as given below in equation. Figure represents a single layer perceptron network.

Perceptron Algorithm:

Single Layer Perceptron:

The single-layer perceptron was the first neural network model, proposed in 1958 by Frank
Rosenblatt. It is one of the earliest models for learning. Our goal is to find a linear decision
function parameterized by the weight vector w and the bias parameter b.

To understand the perceptron layer, it is necessary to comprehend artificial neural networks
(ANNs). The artificial neural network (ANN) is an information processing system whose
mechanism is inspired by the functionality of biological neural circuits. An artificial neural
network consists of several processing units that are interconnected.

The perceptron was the first neural model to be proposed. The neuron's local memory holds a
vector of weights. The output of a single-layer perceptron is computed by multiplying each
component of the input vector by the corresponding component of the weight vector and summing
the results. This sum is the input of an activation function, whose value is displayed at the
output.

Let us focus on the implementation of a single-layer perceptron for an image classification
problem using TensorFlow. The best example for illustrating a single-layer perceptron is
"logistic regression." Training logistic regression involves the following steps:

• The weights are initialized with random values at the start of each training run.
• For each element of the training set, the error is calculated as the difference between the
desired output and the actual output. The calculated error is used to adjust the weights.
• The process is repeated until the error made on the entire training set is less than the
specified limit, or until the maximum number of iterations has been reached.
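
The three training steps above can be sketched in plain Python. This is not the TensorFlow implementation the text mentions: it is a minimal perceptron-rule sketch with a bipolar threshold output, and the learning rate, the stopping parameters, the function name `train_perceptron`, and the toy data are all assumptions chosen for illustration.

```python
import random


def sign(z):
    """Bipolar threshold output: +1 or -1."""
    return 1 if z >= 0 else -1


def train_perceptron(samples, lr=0.1, max_iters=1000, error_limit=0, seed=0):
    rng = random.Random(seed)
    n = len(samples[0][0])
    # Step 1: initialize the weights (and bias) with random values.
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    b = rng.uniform(-0.5, 0.5)
    for _ in range(max_iters):
        mistakes = 0
        for x, target in samples:
            # Step 2: error = desired output - actual output, used to adjust weights.
            y = sign(sum(wi * xi for wi, xi in zip(w, x)) + b)
            error = target - y
            if error != 0:
                mistakes += 1
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
        # Step 3: stop once the error over the whole training set is small enough.
        if mistakes <= error_limit:
            break
    return w, b


# Made-up, linearly separable data: (features, target in {-1, +1}).
samples = [((1, 1), -1), ((2, 0), -1), ((0, 2), -1), ((1, 2), -1),
           ((4, 4), 1), ((5, 3), 1), ((3, 5), 1), ((4, 5), 1)]
w, b = train_perceptron(samples)
predictions = [sign(sum(wi * xi for wi, xi in zip(w, x)) + b) for x, _ in samples]
print(predictions)
```

Because the toy data is linearly separable, the loop is expected to stop well before `max_iters`; on data that is not separable it would simply run until the iteration limit.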

Multi-layer Perceptron:

A multi-layer perceptron is also known as an MLP. It consists of fully connected dense layers,
which transform any input dimension to the desired dimension. A multi-layer perceptron is a
neural network that has multiple layers. To create a neural network, we combine neurons
together so that the outputs of some neurons are inputs of other neurons.

A multi-layer perceptron has one input layer with one neuron (or node) for each input, one
output layer with a single node for each output, and any number of hidden layers, each of which
can have any number of nodes. A schematic diagram of a Multi-Layer Perceptron (MLP) is depicted
below.

In the multi-layer perceptron diagram above, we can see that there are three inputs and thus
three input nodes, and the hidden layer has three nodes. The output layer gives two outputs, so
there are two output nodes. The nodes in the input layer take the input and forward it for
further processing: in the diagram above, the nodes in the input layer forward their output to
each of the three nodes in the hidden layer, and in the same way, the hidden layer processes
the information and passes it to the output layer.

Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid
activation function takes real values as input and converts them to numbers between 0 and 1
using the sigmoid formula:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
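
To make the layer-by-layer flow concrete, here is a small forward-pass sketch for a network with three inputs, three hidden nodes, and two outputs, as in the diagram described above. All weights and biases are invented for illustration (a trained network would learn them), and the helper name `layer_forward` is hypothetical.

```python
import math


def sigmoid(z):
    """Maps any real value to a number between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Fully connected layer: every node sees every input, then applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Invented parameters: one row of weights per node in the layer.
hidden_w = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.4]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.3, -0.1, 0.2], [-0.2, 0.4, 0.1]]
output_b = [0.05, -0.05]

x = [1.0, 0.5, -1.0]                                 # three inputs
hidden = layer_forward(x, hidden_w, hidden_b)        # three hidden activations
output = layer_forward(hidden, output_w, output_b)   # two outputs
print(hidden)
print(output)
```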

Expressing Linear Perceptrons as Neurons:

Neural Network, Non-linear classification example using Neural Networks: XOR/XNOR:

XOR problem with neural networks:


Among the various logic gates, XOR (also known as "exclusive or") is the logical operation on
two binary inputs that outputs 1 when the inputs differ and 0 when they are the same. The
outputs generated by the XOR logic are not linearly separable: no single hyperplane (a straight
line, for two inputs) separates the 1 outputs from the 0 outputs. So let us see what the XOR
logic is and how to implement it using neural networks.

From the truth table, it can be inferred that XOR produces an output of 1 for different input
states and an output of 0 for identical inputs. The output of the XOR logic is given by the
equation:

$$
y = x_1 \oplus x_2 = x_1 \bar{x}_2 + \bar{x}_1 x_2
$$

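
Because XOR is not linearly separable, a single threshold unit cannot compute it, but two layers of such units can. The sketch below builds XOR as AND(OR(x1, x2), NAND(x1, x2)). The weights are hand-picked assumptions for illustration, not the result of training, and the helper names `step`, `unit`, and `xor` are hypothetical.

```python
def step(z):
    """Threshold activation: 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0


def unit(x1, x2, w1, w2, b):
    """A single threshold neuron with two inputs."""
    return step(w1 * x1 + w2 * x2 + b)


def xor(x1, x2):
    # Hidden layer: one unit acts as OR, the other as NAND (hand-picked weights).
    h_or = unit(x1, x2, 2, 2, -1)
    h_nand = unit(x1, x2, -1, -1, 2)
    # Output layer: AND of the two hidden units.
    return unit(h_or, h_nand, 1, 1, -1)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))
# prints:
# 0 0 0
# 0 1 1
# 1 0 1
# 1 1 0
```

The hidden layer is what makes this possible: it remaps the four input points so that the output unit only has to solve a linearly separable problem.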
Implementation of Perceptron Algorithm for AND Logic Gate with 2-bit Binary Input :

Perceptron algorithm states that:

Prediction (y`) = 1 if Wx+b > 0 and 0 if Wx+b ≤ 0

Also, the steps in this method are very similar to how Neural Networks learn, which is as
follows:

• Initialize weight values and bias
• Forward Propagate
• Check the error
• Backpropagate and Adjust weights and bias
• Repeat for all training examples

AND Gate

From our knowledge of logic gates, we know that an AND logic table is given by the diagram
below
AND Gate

The question is, what are the weights and bias for the AND perceptron?

First, we need to understand that the output of an AND gate is 1 only if both inputs (in this case,
x1 and x2) are 1. So, following the steps listed above;

Row 1

From w1*x1+w2*x2+b, initializing w1, w2, as 1 and b as –1, we get;

x1(1)+x2(1)–1

Passing the first row of the AND logic table (x1=0, x2=0), we get;

0+0–1 = –1

From the Perceptron rule, if Wx+b≤0, then y`=0. Therefore, this row is correct, and no need for
Backpropagation.

Row 2

Passing (x1=0 and x2=1), we get;

0+1–1 = 0

From the Perceptron rule, if Wx+b≤0, then y`=0. This row is correct, as the output is 0 for the
AND gate.

Row 3 (x1=1 and x2=0) is symmetric to row 2 and gives 1+0–1 = 0, so y`=0. From the Perceptron
rule, this works for rows 1, 2 and 3.

Row 4

Passing (x1=1 and x2=1), we get;

1+1–1 = 1

Again, from the perceptron rule, this is still valid.

Therefore, we can conclude that the model to achieve an AND gate, using the Perceptron
algorithm is;

x1+x2–1

OR Gate

OR Gate

From the diagram, the OR gate is 0 only if both inputs are 0.

Row 1

• From w1x1+w2x2+b, initializing w1 and w2 as 1 and b as –1, we get;

x1(1)+x2(1)–1

• Passing the first row of the OR logic table (x1=0, x2=0), we get;

0+0–1 = –1

• From the Perceptron rule, if Wx+b≤0, then y`=0. Therefore, this row is correct.

Row 2

• Passing (x1=0 and x2=1), we get;

0+1–1 = 0

• From the Perceptron rule, if Wx+b≤0, then y`=0. Therefore, this row is incorrect, as the
output is 1 for the OR gate.
• So we want values that will make inputs x1=0 and x2=1 give y` a value of 1. If we
change w2 to 2, we have;

0+2–1 = 1

• From the Perceptron rule, this is correct for both rows 1 and 2.

Row 3

• Passing (x1=1 and x2=0), we get;

1+0–1 = 0

• From the Perceptron rule, if Wx+b≤0, then y`=0. Therefore, this row is incorrect, as the
output is 1 for the OR gate.
• Since it is similar to row 2, we can just change w1 to 2, which gives;

2+0–1 = 1

• From the Perceptron rule, this is correct for rows 1, 2 and 3.

Row 4

• Passing (x1=1 and x2=1), we get;

2+2–1 = 3

• Again, from the perceptron rule, this is still valid. Quite Easy!

Therefore, we can conclude that the model to achieve an OR gate, using the Perceptron
algorithm is;

2x1+2x2–1

NOT Gate

NOT Gate

From the diagram, the output of a NOT gate is the inverse of a single input. So, following the
steps listed above;

Row 1

• From w1x1+b, initializing w1 as 1 (since there is a single input), and b as –1, we get;

x1(1)–1

• Passing the first row of the NOT logic table (x1=0), we get;

0–1 = –1

• From the Perceptron rule, if Wx+b≤0, then y`=0. This row is incorrect, as the output is 1
for the NOT gate.
• So we want values that will make input x1=0 give y` a value of 1. If we change b to 1,
we have;

0+1 = 1

• From the Perceptron rule, this works.

Row 2

• Passing (x1=1), we get;

1+1 = 2

• From the Perceptron rule, if Wx+b > 0, then y`=1. This row is also incorrect, as the output
is 0 for the NOT gate.
• So we want values that will make input x1=1 give y` a value of 0. If we change w1 to –1,
we have;

–1+1 = 0

• From the Perceptron rule, if Wx+b ≤ 0, then y`=0. Therefore, this works (for both row 1
and row 2).

Therefore, we can conclude that the model to achieve a NOT gate, using the Perceptron
algorithm is;

–x1+1

NOR Gate

NOR Gate

From the diagram, the NOR gate is 1 only if both inputs are 0.

Row 1

• From w1x1+w2x2+b, initializing w1 and w2 as 1, and b as –1, we get;

x1(1)+x2(1)–1

• Passing the first row of the NOR logic table (x1=0, x2=0), we get;

0+0–1 = –1

• From the Perceptron rule, if Wx+b≤0, then y`=0. This row is incorrect, as the output is 1
for the NOR gate.
• So we want values that will make input x1=0 and x2=0 give y` a value of 1. If we
change b to 1, we have;

0+0+1 = 1

• From the Perceptron rule, this works.

Row 2

• Passing (x1=0, x2=1), we get;

0+1+1 = 2

• From the Perceptron rule, if Wx+b > 0, then y`=1. This row is incorrect, as the output is 0
for the NOR gate.
• So we want values that will make input x1=0 and x2=1 give y` a value of 0. If we
change w2 to –1, we have;

0–1+1 = 0

• From the Perceptron rule, this is valid for both row 1 and row 2.

Row 3

• Passing (x1=1, x2=0), we get;

1+0+1 = 2

• From the Perceptron rule, if Wx+b > 0, then y`=1. This row is incorrect, as the output is 0
for the NOR gate.
• So we want values that will make input x1=1 and x2=0 give y` a value of 0. If we
change w1 to –1, we have;

–1+0+1 = 0

• From the Perceptron rule, this is valid for rows 1, 2 and 3.

Row 4

• Passing (x1=1, x2=1), we get;

–1–1+1 = –1

• From the Perceptron rule, this still works.

Therefore, we can conclude that the model to achieve a NOR gate, using the Perceptron
algorithm is;

-x1-x2+1

NAND Gate

From the diagram, the NAND gate is 0 only if both inputs are 1.

Row 1

• From w1x1+w2x2+b, initializing w1 and w2 as 1, and b as -1, we get;

x1(1)+x2(1)-1

• Passing the first row of the NAND logic table (x1=0, x2=0), we get;

0+0-1 = -1

• From the Perceptron rule, if Wx+b≤0, then y`=0. This row is incorrect, as the output is 1
for the NAND gate.
• So we want values that will make input x1=0 and x2=0 give y` a value of 1. If we
change b to 1, we have;

0+0+1 = 1

• From the Perceptron rule, this works.

Row 2

• Passing (x1=0, x2=1), we get;

0+1+1 = 2

• From the Perceptron rule, if Wx+b > 0, then y`=1. This row is also correct (and by symmetry
so is row 3).

Row 4

• Passing (x1=1, x2=1), we get;

1+1+1 = 3

• This is not the expected output, as the output is 0 for a NAND combination of x1=1 and
x2=1.
• Changing values of w1 and w2 to -1, and value of b to 2, we get;

-1-1+2 = 0

• From the Perceptron rule, y`=0 for this row. Rows 1 to 3 still give positive sums (2, 1 and
1), so it works for all rows.

Therefore, we can conclude that the model to achieve a NAND gate, using the Perceptron
algorithm is;

-x1-x2+2
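
The five perceptrons derived above can be checked mechanically against the standard truth tables, using the rule y` = 1 if Wx+b > 0 and 0 otherwise. This is a small verification sketch; the `GATES` table and the `predict` helper are names introduced here for illustration.

```python
def predict(weights, bias, inputs):
    """Perceptron rule from the text: 1 if Wx + b > 0, otherwise 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0


# gate name -> (weights, bias, truth table mapping inputs to expected output)
GATES = {
    "AND":  ((1, 1), -1, {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}),
    "OR":   ((2, 2), -1, {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}),
    "NOT":  ((-1,), 1, {(0,): 1, (1,): 0}),
    "NOR":  ((-1, -1), 1, {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 0}),
    "NAND": ((-1, -1), 2, {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 0}),
}

results = {}
for name, (weights, bias, table) in GATES.items():
    results[name] = all(predict(weights, bias, x) == y for x, y in table.items())
    print(name, "OK" if results[name] else "MISMATCH")
```

These weights are not unique: any weights and bias that put the 1 rows strictly above zero and the 0 rows at or below zero would pass the same check.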
Feed-Forward Neural Networks

Feed Forward Network:

Why are neural networks used?

Neural networks can, in theory, approximate any function, regardless of its complexity.
Supervised learning is a method of determining the correct Y for a fresh X by learning a function
that maps a given X to a specified Y. But what are the differences between neural networks and
other methods of machine learning? The answer is based on inductive bias, the set of
assumptions a model makes about the relationship between X and Y.

Machine learning models are built on assumptions such as the one where X and Y are related. An
Inductive Bias of linear regression is the linear relationship between X and Y. In this way, a line
or hyperplane gets fitted to the data.

When X and Y have a complex relationship, it can be difficult for a Linear Regression method
to predict Y. In this situation, the model must be able to fit a nonlinear curve or surface
that approximates the relationship.

Settings such as the number of layers in the network sometimes need manual adjustment,
depending on the complexity of the function. In most cases, this is accomplished by trial and
error combined with experience. These manually chosen settings are called hyperparameters.

What is a feed forward neural network?

Feed forward neural networks are artificial neural networks in which nodes do not form loops.
This type of neural network is also known as a multi-layer neural network, and all information
is passed only forward.

During data flow, input nodes receive data, which travels through the hidden layers and exits
through the output nodes. No links exist in the network that could be used to send information
back from the output nodes.

A feed forward neural network approximates functions in the following way:

• For a classifier, the target function is y = f*(x).
• This function assigns an input x to a category y.
• The feed forward model defines a mapping y = f(x; θ) and learns the value of the
parameters θ that gives the closest approximation of f*.

Feed forward neural networks serve as the basis for object detection in photos, as shown in the
Google Photos app.

Working principle of a feed forward neural network:

In its most basic form, a Feed-Forward Neural Network is a single layer perceptron. A sequence of
inputs enter the layer and are multiplied by the weights in this model. The weighted input values
are then summed together to form a total. If the sum of the values is more than a predetermined
threshold, which is normally set at zero, the output value is usually 1, and if the sum is less than
the threshold, the output value is usually -1.
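
The thresholding step described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `perceptron_output` and the weight values are hypothetical, and a sum exactly equal to the threshold is assumed to give -1, which the text does not specify.

```python
def perceptron_output(inputs, weights, threshold=0.0):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else -1."""
    total = sum(x * w for x, w in zip(inputs, weights))
    # Assumption: a sum exactly equal to the threshold is mapped to -1.
    return 1 if total > threshold else -1


# Hypothetical weights, chosen only for illustration.
weights = [0.6, -0.4]

print(perceptron_output([1.0, 0.5], weights))  # weighted sum is 0.4, above the threshold
print(perceptron_output([0.2, 1.0], weights))  # weighted sum is -0.28, below the threshold
```

The first call returns 1 and the second returns -1, matching the rule that the output depends only on which side of the threshold the weighted sum falls.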

The single-layer perceptron is a popular feed forward neural network model that is frequently
used for classification. Its weights can be learned from data rather than set by hand.

The neural network compares the outputs of its nodes with the desired values and, using a rule
known as the delta rule, alters its weights through training to create more accurate output
values. This training procedure is a form of gradient descent. The technique of updating
weights in multi-layered perceptrons is virtually the same, but there the process is referred
to as back-propagation.

To summarize: in its simplest form, the feed forward neural network is a single-layer
perceptron. The model multiplies the inputs by weights as they enter the layer and adds the
weighted values together. If the sum rises above a threshold, usually set at zero, the output
value is usually 1, while if it falls below the threshold, it is usually -1.

The single-layer perceptron is often used for classification, and its weights are adjusted
through training with the delta rule, which compares the outputs with the intended values.
This training is a form of gradient descent. Multi-layered perceptrons update their weights in
a similar way, but the process is known as back-propagation: the network's hidden layers are
adjusted according to the output values produced by the final layer.

Layers of feed forward neural network:


Linear Neurons and their Limitations

Linear neurons are artificial neurons where the output is a linear function of the input.
Mathematically, this can be represented as:

$y = \sum_i w_i x_i + b$

where $x_i$ are the inputs, $w_i$ the weights, and $b$ the bias.

In linear neurons:

• No activation function, or a simple linear activation function $f(x) = x$, is applied.

Applications of Linear Neurons

Linear neurons are often used in simple models such as:

• Linear regression: Predicts a continuous output.
• Perceptron: A simple linear classifier (but typically with a step activation function for
classification).

Limitations of Linear Neurons

1. Inability to Model Nonlinear Relationships:


o Linear neurons cannot model complex, nonlinear relationships between inputs and
outputs.
o For example, they fail in tasks like XOR classification where the decision
boundary is nonlinear.

2. Stacking Multiple Layers Still Yields a Linear Model:


o If multiple layers of linear neurons are stacked, the overall function remains
linear. Mathematically: $f_2(f_1(x)) = w_2(w_1 x + b_1) + b_2 = (w_2 w_1)x + (w_2 b_1 + b_2)$,
which is again of the linear form $wx + b$. This limits the
representational power of deep networks.

3. Limited Usefulness in Complex Neural Networks:


o Neural networks gain power from their ability to model complex, nonlinear
interactions, which requires non-linear activation functions (e.g., ReLU, sigmoid,
tanh).

4. Poor Performance in Complex Tasks:


o Tasks such as image classification, natural language processing, and most real-
world problems require non-linear relationships to be learned.
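
The second limitation in the list above, that stacked linear layers collapse into a single linear function, can be checked numerically. The sketch below uses scalar weights for brevity; the helper `make_linear` and all numeric values are hypothetical and chosen only for illustration.

```python
def make_linear(w, b):
    """Return a one-input linear neuron computing w * x + b."""
    def neuron(x):
        return w * x + b
    return neuron


# Two hypothetical linear layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5
layer1 = make_linear(w1, b1)
layer2 = make_linear(w2, b2)


def stacked(x):
    """Apply layer1, then layer2, with no activation in between."""
    return layer2(layer1(x))


# A single linear neuron with w = w2 * w1 and b = w2 * b1 + b2.
collapsed = make_linear(w2 * w1, w2 * b1 + b2)

for x in [-2.0, 0.0, 1.5, 10.0]:
    print(x, stacked(x), collapsed(x))
```

For every input the two columns agree, so the two-layer stack has no more representational power than one linear neuron.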

Overcoming These Limitations

To overcome the limitations of linear neurons:

• Use Nonlinear Activation Functions:
o Apply functions like ReLU, sigmoid, or tanh to introduce nonlinearity.
• Combine Linear Models with Nonlinear Preprocessing:
o Use kernel methods (as in Support Vector Machines) or transform data into
higher-dimensional spaces.

Activation Functions

An activation function in the context of neural networks is a mathematical function applied to the
output of a neuron. The purpose of an activation function is to introduce non-linearity into the
model, allowing the network to learn and represent complex patterns in the data. Without non-
linearity, a neural network would essentially behave like a linear regression model, regardless of
the number of layers it has.

A neuron first calculates the weighted sum of its inputs and adds a bias to it. The activation
function is then applied to this value and decides whether, and how strongly, the neuron is
activated.

There are several commonly used activation functions such as the ReLU, Softmax, Tanh and the
Sigmoid functions. Each of these functions has a specific usage. For a binary classification
CNN model, sigmoid or softmax is preferred, and for multi-class classification softmax is
generally used. In simple terms, activation functions in a CNN model determine whether a neuron
should be activated or not: using mathematical operations, they decide whether the input to the
network is important for the prediction.

Sigmoid Function

• It is a function which is plotted as an ‘S’ shaped graph.

• Equation : $A = \frac{1}{1 + e^{-x}}$

• Nature : Non-linear. Notice that for X values between -2 and 2 the curve is very steep.
This means small changes in X bring about large changes in the value of Y.

• Value Range : 0 to 1

• Uses : Usually used in the output layer of a binary classifier, where the result is either 0
or 1. Because the value of the sigmoid function lies between 0 and 1, the result can be
predicted to be 1 if the value is greater than 0.5 and 0 otherwise.
Tanh Function

• The activation that almost always works better than the sigmoid function is the Tanh function,
also known as the hyperbolic tangent function. It is actually a mathematically scaled and
shifted version of the sigmoid function. Both are similar and can be derived from each other.

• Equation :- $f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1 = 2\,\mathrm{sigmoid}(2x) - 1$

• Value Range :- -1 to +1

• Nature :- non-linear

• Uses :- Usually used in hidden layers of a neural network. As its values lie between -1 and
1, the mean of the hidden layer outputs comes out to be 0 or very close to it, which helps in
centering the data. This makes learning for the next layer much easier.
ReLU Function

• It stands for Rectified Linear Unit. It is the most widely used activation function, chiefly
implemented in the hidden layers of a neural network.

• Equation :- $A(x) = \max(0, x)$. It gives an output x if x is positive and 0 otherwise.

• Value Range :- [0, inf)

• Nature :- non-linear, which means we can easily backpropagate the errors and have
multiple layers of neurons being activated by the ReLU function.

• Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves
simpler mathematical operations. At any time only a few neurons are activated, making the
network sparse and therefore efficient and easy to compute.

In simple words, ReLU typically learns much faster than the sigmoid and Tanh functions.
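
The three activation functions discussed above can be written directly from their equations. The following is a minimal sketch using only Python's standard `math` module; the function names are illustrative, and the plain sigmoid formula is assumed to be called with moderate input values (very large negative inputs would overflow `math.exp`).

```python
import math


def sigmoid(x):
    """Sigmoid: squashes x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: squashes x into the range (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectified linear unit: x if x is positive, 0 otherwise."""
    return max(0.0, x)


for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    # tanh is a scaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
    shifted = 2.0 * sigmoid(2.0 * x) - 1.0
    print(x, sigmoid(x), tanh(x), shifted, relu(x))
```

The `tanh(x)` and `shifted` columns agree up to floating-point rounding, which illustrates how the two functions can be derived from each other.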

Softmax output layers

The activation function is an integral part of a neural network. Without an activation
function, a neural network is just a linear regression model. In other words, the activation
function is what gives the neural network its non-linearity.

What is SoftMax Activation Function?

The SoftMax activation function is commonly used in machine learning,
particularly in neural networks for classification tasks. It
converts a vector of raw prediction scores (logits) into probabilities.

Key Characteristics of the SoftMax Function:

1. Normalization: The SoftMax activation function normalizes the input
values into a probability distribution, ensuring that the sum of all output
values is 1. This makes it suitable for classification problems where the
output needs to represent probabilities over multiple classes.
2. Exponentiation: By exponentiating the inputs, the SoftMax function
amplifies the differences between the input values, making
the largest value more pronounced in the output probabilities.
3. Differentiability: The SoftMax function is differentiable, which is essential for
backpropagation in neural networks.

Example

Suppose we have a dataset in which every observation has five
features, FeatureX1 to FeatureX5, and the target variable has three classes.
Now, let’s create a simple neural network to solve this problem. Here, we have
an input layer with five neurons, as we have five features in the dataset. Next,
we have one hidden layer with four neurons. Each of these neurons uses inputs,
weights, and biases to calculate a value, which is represented as Zij here.

For example, the first neuron of the first layer is represented as Z11. Similarly,
the second neuron of the first layer is represented as Z12, and so on.

We apply the activation function, let’s say a tanh activation function, to these
values and send the result to the output layer.

The number of neurons in the output layer depends on the number of classes in
the dataset. Since we have three classes in the dataset, we will have three
neurons in the output layer. Each of these neurons will give the probability of
an individual class. This means the first neuron will give you the probability that
the data point belongs to class 1. Similarly, the second neuron will give you the
probability that the data point belongs to class 2, and the third neuron the
probability that it belongs to class 3.

Hence,
• The softmax function converts each input value to a value between 0 and 1,
such that the values sum to 1.
• When someone says, “softmax the result of ~,” it should be understood as
“convert the result of ~ into an easy-to-understand probability.”
• Softmax is a simple mechanism that (1) takes the exponential of each input and (2) divides it
by the sum of all the exponentials.
• The formula is also very simple if you understand the flow of the process:
$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$
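
The two steps named above (exponentiate, then divide by the sum) can be sketched as follows. The function name and the example scores are hypothetical. Subtracting the largest score before exponentiating is an addition not mentioned in the text: it is a common trick to avoid overflow and does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of raw scores (logits) into probabilities that sum to 1."""
    # Assumption: shift by the maximum for numerical stability (result is unchanged).
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]   # step 1: exponentiate
    total = sum(exps)
    return [e / total for e in exps]               # step 2: divide by the sum


# Hypothetical outputs of the three output neurons for one data point.
probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # 1.0 up to floating-point rounding
```

The largest score receives the largest probability, and the gap between the first and second class is wider than the gap between the raw scores would suggest, which is the amplifying effect of exponentiation.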

Chapter-2

Training Feed-Forward Neural Networks

Training a feed-forward neural network (FFNN) involves optimizing its parameters
(weights and biases) to minimize the difference between predicted and actual outputs. Here's
a step-by-step overview:

1. Architecture of a Feed-Forward Neural Network

• Input Layer: Takes in the input features.
• Hidden Layers: Consist of neurons with activation functions to introduce non-linearity.
• Output Layer: Produces the final output, often with a specific activation function (e.g., softmax
for classification, linear for regression).

2. Forward Propagation

This step calculates the output of the network for a given input.

For each layer l:
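
A common way to write the per-layer computation is $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$ followed by $a^{(l)} = g(z^{(l)})$, where $a^{(0)}$ is the input and $g$ is the layer's activation function. This notation is an assumption here, since it is not given in these notes. The sketch below implements it with plain Python lists; the helper names, layer sizes, and weight values are all hypothetical.

```python
import math


def dense(weights, biases, activations):
    """Compute z_j = sum_i W[j][i] * a[i] + b[j] for every neuron j of a layer."""
    return [sum(w * a for w, a in zip(row, activations)) + b
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """Propagate input x through a list of (weights, biases, activation) layers."""
    a = x
    for weights, biases, activation in layers:
        z = dense(weights, biases, a)      # weighted sum plus bias
        a = [activation(v) for v in z]     # element-wise activation
    return a


def identity(v):
    """Linear output activation, as used for regression."""
    return v


# Hypothetical network: 2 inputs -> 2 tanh hidden neurons -> 1 linear output.
layers = [
    ([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1], math.tanh),
    ([[1.0, -1.0]], [0.2], identity),
]
output = forward([1.0, 2.0], layers)
print(output)
```

Each layer only needs the activations of the previous layer, which is why the loop carries a single variable `a` forward through the network.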


Gradient Descent
Delta Rule

Delta Learning Rule: Developed by Widrow and Hoff, the delta rule is one of the most
common learning rules. It relies on supervised learning. The rule states that the
change in the synaptic weight of a node is proportional to the product of the error and the
input.

The Delta Rule

The Delta Rule updates the weights to reduce the error between the predicted
output and the target output. It's derived from the gradient of the loss function
(usually Mean Squared Error).

Mathematical Formulation:

For a linear unit with output $y = \sum_i w_i x_i$, target $t$, and learning rate $\eta$:

$E = \frac{1}{2}(t - y)^2, \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \eta\,(t - y)\,x_i$
For networks with linear activation functions and no hidden units, the graph of the squared
error against the weights is a paraboloid in n-space. Because the squared error is never
negative, this graph is concave upward and has a minimum value. The vertex of the paraboloid
is the point where the error is smallest, and the weight vector corresponding to this point is
the ideal weight vector. We can use the delta learning rule with both a single output unit and
several output units. Applying the delta rule assumes that the error can be directly measured.
The aim of applying the delta rule is to reduce the difference between the actual and expected
output, that is, the error.
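
As a rough illustration of the delta rule for a single linear output unit, the sketch below changes each weight by (learning rate × error × input) after every training example. The function name, learning rate, epoch count, and the toy data (generated from t = 2·x1 − x2 + 0.5) are all assumptions made for this example; the bias is treated as a weight on a constant input of 1.

```python
def train_delta_rule(samples, learning_rate=0.1, epochs=500):
    """Train one linear unit with the delta rule: w_i += lr * (t - y) * x_i."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = sum(w * x for w, x in zip(weights, inputs)) + bias
            error = target - output
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error  # bias sees a constant input of 1
    return weights, bias


# Hypothetical data generated from t = 2*x1 - x2 + 0.5.
samples = [
    ([0.0, 0.0], 0.5),
    ([1.0, 0.0], 2.5),
    ([0.0, 1.0], -0.5),
    ([1.0, 1.0], 1.5),
]
weights, bias = train_delta_rule(samples)
print(weights, bias)  # close to [2.0, -1.0] and 0.5
```

Because the error surface of a linear unit is a paraboloid, the updates move steadily toward its vertex, and the learned weights approach the values used to generate the data.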

Gradient Descent with Sigmoidal Neurons

Gradient descent is widely used for optimizing neural networks, including those with sigmoidal
neurons.
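
For a single sigmoidal neuron with output $y = \mathrm{sigmoid}(z)$ and squared error $E = \frac{1}{2}(t - y)^2$, the chain rule gives $\partial E / \partial w_i = -(t - y)\,y\,(1 - y)\,x_i$. The sketch below applies this update to a toy problem. The OR data set, learning rate, epoch count, and function names are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Output of one sigmoidal neuron."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def total_error(weights, bias, samples):
    """Sum of squared errors E = 1/2 * (t - y)^2 over all samples."""
    return sum(0.5 * (t - predict(weights, bias, x)) ** 2 for x, t in samples)


def train_sigmoid_neuron(samples, learning_rate=0.5, epochs=2000):
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            y = predict(weights, bias, inputs)
            # (t - y) is the error; y * (1 - y) is the sigmoid's derivative.
            delta = (target - y) * y * (1.0 - y)
            weights = [w + learning_rate * delta * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * delta
    return weights, bias


# Hypothetical training data: the logical OR function.
or_samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
              ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
weights, bias = train_sigmoid_neuron(or_samples)
print(total_error([0.0, 0.0], 0.0, or_samples))  # error before training: 0.5
print(total_error(weights, bias, or_samples))    # smaller after training
```

The only difference from the delta rule for a linear unit is the extra factor `y * (1 - y)`, which comes from differentiating the sigmoid.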
The Backpropagation algorithm
How does back propagation work?

Consider a network with four layers: an input layer, hidden layer I, hidden layer II and a
final output layer. These fall into three main types of layer:
1. Input layer
2. Hidden layer
3. Output layer
Each layer has its own role: the input layer receives the data, the hidden layers transform it,
and the output layer produces the result that is compared with the desired output.
Example

Network:

• 1 hidden layer with 2 neurons.
• Sigmoid activation.

Data:

• Input $x = [0.5, 0.2]$
• Target $t = 1$
• Initial weights and biases: $W, b$.

Forward Propagation:

1. Compute activations.
2. Compute loss.

Backward Propagation:

1. Compute output error $\delta^{(L)}$.
2. Propagate error to hidden layers.
3. Update weights and biases.
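
The example above can be traced in code. The sketch below uses the input and target from the example, but the initial weights, the single sigmoid output neuron, the squared-error loss, and the learning rate are assumptions, since the notes do not give concrete values for W and b.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Forward propagation: returns hidden activations and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def loss(output, target):
    return 0.5 * (target - output) ** 2


def backprop_step(x, target, W1, b1, W2, b2, learning_rate=0.5):
    """One forward pass, one backward pass, and one gradient-descent update."""
    hidden, output = forward(x, W1, b1, W2, b2)
    # 1. Output error: dLoss/dz at the output neuron.
    delta_out = (output - target) * output * (1.0 - output)
    # 2. Propagate the error back to each hidden neuron (using the old W2).
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    # 3. Update weights and biases.
    new_W2 = [w - learning_rate * delta_out * h for w, h in zip(W2, hidden)]
    new_b2 = b2 - learning_rate * delta_out
    new_W1 = [[w - learning_rate * d * xi for w, xi in zip(row, x)]
              for row, d in zip(W1, delta_hidden)]
    new_b1 = [b - learning_rate * d for b, d in zip(b1, delta_hidden)]
    return new_W1, new_b1, new_W2, new_b2


x, target = [0.5, 0.2], 1.0
# Hypothetical initial weights and biases.
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.0]
W2, b2 = [0.3, -0.1], 0.0

_, before = forward(x, W1, b1, W2, b2)
W1, b1, W2, b2 = backprop_step(x, target, W1, b1, W2, b2)
_, after = forward(x, W1, b1, W2, b2)
print(loss(before, target), loss(after, target))
```

After one update the loss on this example is lower than before, which is the behaviour a gradient-descent step with a small enough learning rate is expected to show.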

Stochastic and Minibatch Gradient


Mini-batch gradient descent
Train, Validate, and Test in NN

The field of machine learning has expanded tremendously thanks to neural networks. These
neural nets are employed for a wide variety of reasons because they are very flexible models that
can fit almost any kind of data, provided that we have sufficient computing resources to train
them in a timely manner. To efficiently exploit these learning structures, we need to make
sure that the model generalizes the information that is being processed.

The problem here is that if we feed all the data we have to the model for training, there is no
way to test whether the model has correctly extracted a function from the information. The
usual measure of this is the fraction of unseen examples that the model predicts correctly.
This metric is called accuracy, and it’s essential for assessing the performance of our model.

Training vs. Testing

Alright, we can begin by making a training set and a testing set. We can now train our model and
verify its accuracy using the testing set. The model has never seen the test data during training.
Therefore, the accuracy result we will obtain will be valid. We can use different techniques to
train a neural network model, but the easiest to understand and implement is backpropagation.
Now, let’s say that we get a less-than-favorable performance from our training-testing approach.
We can maybe change some hyperparameters from our model and try again.

However, if we do so, we will be using the results from the test set to tune the training of the
model. There is something wrong with this approach in theory because we are adding a feedback
loop from our test set. This will bias the accuracy results that we generate because we are
changing parameters based on the results we achieve. In other words, we are using the data from
the test set to optimize our model.

Purpose of Validation Sets

To avoid this, we perform a sort of “blind test” only at the end. In contrast, to iterate and make
changes throughout the development of the model, we use a validation set. Now we can use
this validation set to fine-tune various hyperparameters to help the models fit the data.
Additionally, this set will act as a sort of index for the actual testing accuracy of the model. This
is why having a validation data set is important.

We can now train a model, validate it, and change different hyperparameters to optimize
performance and then test the model once to report its results.

Let’s see if we can apply this to an example.

Implementation

To implement these notions in a classic supervised learning fashion, we must first obtain a
labeled data set to work with. Our example is a data set with two classes that uses coordinates
as a single feature.

The first thing to note is that there is an outlier in our data. It’s good practice to find these using
common statistical methods, examine them, and then remove those that don’t add information to
the model. This is part of an important step called data pre-processing.

Splitting Our Dataset

Now that we have our data ready, we can split it into training, validation, and testing sets.
One option is to add a column to our data set that marks the group of each example, but we
could also make three separate sets.

We must ensure that these groups are balanced so that our model is less biased. This means
that they must have more or less the same amount of examples from each label. Failure in
balancing could lead to the model not having enough examples of a class to learn accurately.
This could also put the test results in jeopardy.

In this binary classification example, our training set has two “1” labels and only one “0” label.
However, our validation set has one of each, and our testing set has two “0” labels and only one
“1” label. Because our data is minimal, this is satisfactory.

However, we could also change the groups that we defined, repeat training and validation for
each configuration, and compare how the model performs across them. This is called
cross-validation. K-fold cross-validation is widely used in ML. However, it’s not covered in
this tutorial.
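
One possible way to produce balanced training, validation, and testing sets is to split each label's examples separately, so every set receives roughly the same share of each class. The sketch below is only an illustration: the function name, the 60/20/20 fractions, the fixed seed, and the toy data are assumptions, not values from this tutorial.

```python
import random


def stratified_split(examples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (features, label) pairs into train/validation/test lists,
    keeping the label proportions roughly equal across the three sets."""
    rng = random.Random(seed)
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)

    train, validation, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train += group[:n_train]
        validation += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]   # whatever remains goes to the test set
    return train, validation, test


# Hypothetical data: 20 points with coordinates as features and labels 0 or 1.
examples = [((i, i + 1), i % 2) for i in range(20)]
train, validation, test = stratified_split(examples)
print(len(train), len(validation), len(test))
```

With ten examples per label, each set ends up with the same number of "0" and "1" labels, so no class is under-represented during training or testing.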

Training and Testing Our Model

Moving on, we can now train our model. In the case of our feed-forward neural net, we could
use a backpropagation algorithm to do so. This algorithm will compute an error for each
training example and use it to finely adjust the weights of the connections in our neural net. We
run this algorithm for as many iterations as we can until it’s just about to overfit our data,
which gives us our trained model.

In order to verify the correct training of the model, we feed our trained model the validation
dataset and ask it to classify the different examples. This will give us validation accuracy.

If we were to have an error in the validation phase, we could change around any hyper-
parameters to make the model perform better and try again. Maybe we can add a hidden
layer, change our batch size, or even adjust our learning rate depending on the optimization
methods.

We can also use the validation dataset for early stopping to prevent the model from overfitting
data. This would be a form of regularization. Now that we have a model that we fancy, we
simply use the test dataset to report our results, as the validation dataset has already been used to
tune the hyper-parameters of our network.
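
Early stopping can be sketched as a rule applied to the validation loss recorded after each training epoch: keep the weights from the epoch with the lowest validation loss, and stop once the loss has failed to improve for a set number of epochs. The function name, the `patience` parameter, and the loss values below are hypothetical.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical validation losses: they fall, then rise as the model overfits.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(losses))  # → 3
```

In a real training loop the losses would be computed epoch by epoch on the validation set, and the model weights from the best epoch would be saved and restored at the end.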

Preventing Overfitting in Deep neural Networks

Overfitting occurs when a model learns not only the underlying patterns but also the noise in the
training data, leading to poor generalization on unseen data.

Training a deep neural network that can generalize well to new data is a challenging problem.

A model with too little capacity cannot learn the problem, whereas a model with too much
capacity can learn it too well and overfit the training dataset. Both cases result in a model that
does not generalize well.

A modern approach to reducing generalization error is to use a larger model together with
regularization during training that keeps the weights of the model small. These techniques
not only reduce overfitting, but they can also lead to faster optimization of the model and better
overall performance.

1. Regularization Techniques:
