Today, deep learning has become one of the most popular and visible areas of machine
learning, thanks to its success in a variety of applications, such as computer vision,
natural language processing, and reinforcement learning.
Deep learning can be used for supervised, unsupervised, as well as reinforcement
machine learning, and it handles each of these in a different way.
Supervised Machine Learning: Supervised machine learning is the technique in which a
neural network learns to make predictions or classify data based on labeled datasets.
Here we supply both the input features and the target variables. The neural network
learns by adjusting its weights based on the cost, or error, that comes from the
difference between the predicted and the actual target; this process is known as
backpropagation. Deep learning architectures like convolutional neural networks and
recurrent neural networks are used for many supervised tasks, such as image
classification and recognition, sentiment analysis, and language translation.

Unsupervised Machine Learning: Unsupervised machine learning is the technique in which
a neural network learns to discover patterns in, or to cluster, an unlabeled dataset.
Here there are no target variables; the machine has to determine the hidden patterns
or relationships within the data on its own. Deep learning models like autoencoders
and generative models are used for unsupervised tasks such as clustering,
dimensionality reduction, and anomaly detection.

Reinforcement Machine Learning: Reinforcement learning is the technique in which an
agent learns to make decisions in an environment so as to maximize a reward signal.
The agent interacts with the environment by taking actions and observing the resulting
rewards. Deep learning can be used to learn policies, or sets of actions, that
maximize the cumulative reward over time. Deep reinforcement learning algorithms like
Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG) are used for tasks
such as robotics and game playing.

Artificial Neural Networks
Artificial neural networks, also known as neural networks or neural nets, are built on
the principles of the structure and operation of human neurons. An artificial neural
network's input layer, which is the first layer, receives input from external sources
and passes it on to the hidden layer, which is the second layer. Each neuron in the
hidden layer gets information from the neurons in the previous layer, computes the
weighted sum, and transfers it to the neurons in the next layer. These connections are
weighted: the impact of each input from the preceding layer is scaled by giving it a
distinct weight, and these weights are adjusted during the training process to enhance
the performance of the model.
Fully Connected Artificial Neural Network
Artificial neurons, also known as units, are the building blocks of an artificial
neural network; they are arranged in a series of layers. The complexity of a neural
network, whether a layer has a dozen units or millions of units, depends on the
complexity of the underlying patterns in the dataset. Commonly, an artificial neural
network has an input layer, an output layer, and one or more hidden layers. The input
layer receives data from the outside world which the neural network needs to analyze
or learn about.
In a fully connected artificial neural network, there is an input layer followed by
one or more hidden layers connected one after the other. Each neuron receives input
from the neurons of the previous layer or from the input layer, and the output of one
neuron becomes the input to neurons in the next layer of the network; this process
continues until the final layer produces the output of the network. As the data passes
through the hidden layers, it is transformed into representations that are useful to
the output layer, which finally produces the network's response to the incoming data.
Units are linked to one another from one layer to the next in most neural networks.
Each of these links has a weight that controls how much one unit influences another.
As the data moves from unit to unit, the neural network learns more and more about it,
ultimately producing an output from the output layer.
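To make the weighted-sum computation concrete, here is a minimal NumPy sketch of a
single fully connected layer; all values are hypothetical, and the sigmoid activation
is just one common choice:

import numpy as np

def sigmoid(z):
    # Squash each weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])      # inputs from the previous layer (hypothetical)
W = np.array([[0.2, -0.4, 0.1],     # one row of weights per neuron in this layer
              [0.7,  0.3, -0.6]])
b = np.array([0.1, -0.2])           # one bias per neuron

# Each neuron: weighted sum of its inputs plus a bias, then an activation
hidden = sigmoid(W @ x + b)
print(hidden)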
Difference between Machine Learning and Deep Learning
Machine learning and deep learning are both subsets of artificial intelligence, but
there are many similarities and differences between them.
- Machine learning applies statistical algorithms to learn the hidden patterns and relationships in the dataset; deep learning uses artificial neural network architectures to learn them.
- Machine learning can work with a smaller amount of data; deep learning requires a larger volume of data.
- Machine learning is better for simple, low-complexity tasks; deep learning is better for complex tasks like image processing and natural language processing.
- Machine learning takes less time to train a model; deep learning takes more time.
- In machine learning, a model is created from relevant features that are manually extracted from images to detect an object in an image; in deep learning, relevant features are extracted from images automatically, making it an end-to-end learning process.
- Machine learning models are less complex, and it is easy to interpret their results; deep learning models are more complex, work like black boxes, and their results are not easy to interpret.
- Machine learning can work on a CPU or requires less computing power; deep learning requires a high-performance computer with a GPU.

Types of neural networks
Deep Learning models are able to automatically learn features from the data, which
makes them well-suited for tasks such as image recognition, speech recognition, and
natural language processing. The most widely used architectures in deep learning
are feedforward neural networks, convolutional neural networks (CNNs), and
recurrent neural networks (RNNs).
Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow
of information through the network. FNNs have been widely used for tasks such as
image classification, speech recognition, and natural language processing.
Convolutional Neural Networks (CNNs) are designed specifically for image and video
recognition tasks. CNNs are able to automatically learn features from images, which
makes them well-suited for tasks such as image classification, object detection, and
image segmentation.
Recurrent Neural Networks (RNNs) are a type of neural network that is able to
process sequential data, such as time series and natural language. RNNs are able to
maintain an internal state that captures information about the previous inputs,
which makes them well-suited for tasks such as speech recognition, natural language
processing, and language translation.
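As a concrete illustration, here is a minimal sketch of the three architectures in
Keras; the layer sizes and input shapes are hypothetical choices, not prescriptions:

import tensorflow as tf
from tensorflow.keras import layers

# Feedforward network: a linear flow of dense layers
fnn = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(100,)),
    layers.Dense(10, activation='softmax'),
])

# Convolutional network: convolution + pooling for image data
cnn = tf.keras.Sequential([
    layers.Conv2D(16, 3, activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])

# Recurrent network: processes a sequence step by step
rnn = tf.keras.Sequential([
    layers.SimpleRNN(32, input_shape=(None, 8)),
    layers.Dense(10, activation='softmax'),
])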
Applications of Deep Learning
The main applications of deep learning can be divided into computer vision, natural
language processing (NLP), and reinforcement learning.
Computer vision
In computer vision, deep learning models enable machines to identify and understand
visual data. Some of the main applications of deep learning in computer vision
include:
- Object detection and recognition: Deep learning models can be used to identify and locate objects within images and videos, making it possible for machines to perform tasks such as self-driving, surveillance, and robotics.
- Image classification: Deep learning models can be used to classify images into categories such as animals, plants, and buildings. This is used in applications such as medical imaging, quality control, and image retrieval.
- Image segmentation: Deep learning models can be used to segment images into different regions, making it possible to identify specific features within images.

Natural language processing (NLP)
In NLP, deep learning models enable machines to understand and generate human
language. Some of the main applications of deep learning in NLP include:
- Automatic text generation: Deep learning models can learn from a corpus of text, and new text like summaries or essays can be automatically generated using these trained models.
- Language translation: Deep learning models can translate text from one language to another, making it possible to communicate with people from different linguistic backgrounds.
- Sentiment analysis: Deep learning models can analyze the sentiment of a piece of text, making it possible to determine whether it is positive, negative, or neutral. This is used in applications such as customer service, social media monitoring, and political analysis.
- Speech recognition: Deep learning models can recognize and transcribe spoken words, making it possible to perform tasks such as speech-to-text conversion, voice search, and voice-controlled devices.

Reinforcement learning
In reinforcement learning, deep learning is used to train agents to take actions in an
environment so as to maximize a reward. Some of the main applications of deep learning
in reinforcement learning include:
- Game playing: Deep reinforcement learning models have been able to beat human experts at games such as Go, Chess, and Atari.
- Robotics: Deep reinforcement learning models can be used to train robots to perform complex tasks such as grasping objects, navigation, and manipulation.
- Control systems: Deep reinforcement learning models can be used to control complex systems such as power grids, traffic management, and supply chain optimization.

Challenges in Deep Learning
Deep learning has made significant advancements in various fields, but there are
still some challenges that need to be addressed. Here are some of the main
challenges in deep learning:
- Data availability: Deep learning requires large amounts of data to learn from, and gathering enough training data is a major concern.
- Computational resources: Training deep learning models is computationally expensive and requires specialized hardware like GPUs and TPUs.
- Time: Depending on the computational resources, training on sequential data can take a very long time, even days or months.
- Interpretability: Deep learning models are complex and work like black boxes, so it is very difficult to interpret their results.
- Overfitting: When a model is trained for too long, it becomes too specialized to the training data, leading to overfitting and poor performance on new data.

Advantages of Deep Learning
- High accuracy: Deep learning algorithms can achieve state-of-the-art performance in various tasks, such as image recognition and natural language processing.
- Automated feature engineering: Deep learning algorithms can automatically discover and learn relevant features from data without the need for manual feature engineering.
- Scalability: Deep learning models can scale to handle large and complex datasets, and can learn from massive amounts of data.
- Flexibility: Deep learning models can be applied to a wide range of tasks and can handle various types of data, such as images, text, and speech.
- Continual improvement: Deep learning models can continually improve their performance as more data becomes available.

Disadvantages of Deep Learning
- High computational requirements: Deep learning models require large amounts of data and computational resources to train and optimize.
- Need for large amounts of labeled data: Deep learning models often require a large amount of labeled data for training, which can be expensive and time-consuming to acquire.
- Interpretability: Deep learning models can be challenging to interpret, making it difficult to understand how they make decisions.
- Overfitting: Deep learning models can sometimes overfit the training data, resulting in poor performance on new and unseen data.
- Black-box nature: Deep learning models are often treated as black boxes, making it difficult to understand how they work and how they arrive at their predictions.

In summary, while deep learning offers many advantages, including high accuracy and
scalability, it also has disadvantages, such as high computational requirements, the
need for large amounts of labeled data, and interpretability challenges. These
limitations need to be carefully considered when deciding whether to use deep learning
for a specific task.
[Figure: a biological neuron, showing the dendrites, synapse, axon, and cell body]
Comparison of artificial neural networks (ANNs) and biological neural networks (BNNs):
- Learning: ANNs need very precise structures and formatted data; BNNs can tolerate ambiguity.
- Processor: ANNs use a complex, high-speed processor (one or a few); BNNs use simple, low-speed processors in large numbers.
- Memory: in ANNs it is separate from the processor, localized, and non-content addressable; in BNNs it is integrated into the processor, distributed, and content-addressable.
- Computing: in ANNs it is centralized, sequential, and based on stored programs; in BNNs it is distributed, parallel, and self-learning.
- Reliability: ANNs are very vulnerable; BNNs are robust.
- Expertise: ANNs excel at numerical and symbolic manipulations; BNNs at perceptual problems.
- Operating environment: ANNs need a well-defined, well-constrained environment; BNNs cope with poorly defined, unconstrained ones.
- Fault tolerance: ANNs have the potential of fault tolerance; in BNNs, performance degrades even with partial damage.
Overall, while BNNs and ANNs share many basic components, there are significant
differences in their complexity, flexibility, and adaptability. BNNs are highly
complex and adaptable systems that can process information in parallel, and their
plasticity allows them to learn and adapt over time. In contrast, ANNs are simpler
systems that are designed to perform specific tasks, and their connections are
usually fixed, with the network architecture determined by the designer.
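(The import and data-loading steps were lost at this seam; a minimal sketch, assuming
the Keras MNIST dataset and the variable names used in the steps below:)

Python3

import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt

# Load the handwritten-digit dataset (assumed source for x_train/x_test below)
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()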
Step 3: Now display the shape and image of a single image in the dataset. Each image
is a 28*28 matrix; the length of the training set is 60,000 and the testing set is
10,000.
Python3
len(x_train)
len(x_test)
x_train[0].shape
plt.matshow(x_train[0])
Output:
Sample image from the training dataset
Step 4: Now normalize the dataset so that the computations are fast and accurate.
Python3
# Normalizing the dataset
x_train = x_train / 255
x_test = x_test / 255

# Flattening the dataset in order
# to compute for model building
x_train_flatten = x_train.reshape(len(x_train), 28*28)
x_test_flatten = x_test.reshape(len(x_test), 28*28)
Step 5: Building a neural network with a single-layer perceptron. Here the model
contains only an input layer and an output layer; there are no hidden layers.
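(The model cell was missing here; a sketch of the single-layer perceptron described
above — the optimizer, loss, and epoch count are assumptions:)

Python3

model = keras.Sequential([
    # One output layer of 10 neurons, fed directly by the 784 flattened pixels
    keras.layers.Dense(10, input_shape=(784,), activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train_flatten, y_train, epochs=5)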
model.evaluate(x_test_flatten, y_test)
Output:
Model's performance on the testing data
In this article, we will understand the concept of a multi-layer perceptron and its
implementation in Python using the TensorFlow library.
Multi-layer Perceptron
A multi-layer perceptron is also known as an MLP. It consists of fully connected dense
layers, which transform any input dimension to the desired dimension. A multi-layer
perceptron is a neural network that has multiple layers. To create a neural network we
combine neurons together so that the outputs of some neurons are inputs of other
neurons.
A gentle introduction to neural networks and TensorFlow can be found here:
Neural Networks | Introduction to TensorFlow
A multi-layer perceptron has one input layer with one neuron (or node) for each input,
one output layer with a single node for each output, and any number of hidden layers,
each of which can have any number of nodes. A schematic diagram of a Multi-Layer
Perceptron (MLP) is depicted below.
In the multi-layer perceptron diagram above, we can see that there are three inputs
and thus three input nodes, and the hidden layer has three nodes. The output layer
gives two outputs, therefore there are two output nodes. The nodes in the input layer
take input and forward it for further processing; in the diagram above, the nodes in
the input layer forward their output to each of the three nodes in the hidden layer,
and in the same way, the hidden layer processes the information and passes it to the
output layer.
Every node in the multi-layer perceptron uses a sigmoid activation function. The
sigmoid activation function takes real values as input and converts them to numbers
between 0 and 1 using the sigmoid formula.
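For reference, that formula (the standard logistic sigmoid) is
sigmoid(x) = 1 / (1 + e^(-x)), which maps any real-valued input into the range (0, 1).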
Now that we are done with the theory of the multi-layer perceptron, let's go ahead and
implement some code in Python using the TensorFlow library.
Stepwise Implementation
Step 1: Import the necessary libraries.
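(The import cell was lost in extraction; a minimal sketch that also loads MNIST, which
is what produces the download message shown below:)

Python3

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Load the MNIST dataset (downloads mnist.npz on first use)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()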
Output:
Downloading data from
https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] – 2s 0us/step
Python3
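(The snippet was missing; a plausible reconstruction of the normalization described
just below:)

# Convert pixel values to floats in [0, 1]
x_train = x_train / 255.0
x_test = x_test / 255.0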
We convert the pixel values into floating-point values to make the predictions.
Scaling the grayscale values down is beneficial because the values become small and
the computation becomes easier and faster. Since the pixel values range from 0 to 255,
dividing all the values by 255 converts the range to 0 to 1.
Step 4: Understand the structure of the dataset
Python3
print("Feature matrix:", x_train.shape) print("Target matrix:", x_test.shape)
print("Feature matrix:", y_train.shape) print("Target matrix:", y_test.shape)
Output:
Feature matrix: (60000, 28, 28)
Target matrix: (10000, 28, 28)
Feature matrix: (60000,)
Target matrix: (10000,)
Thus we see that we have 60,000 records in the training dataset and 10,000 records in
the test dataset, and every image in the dataset is of size 28×28.
Step 5: Visualize the data.
Python3
fig, ax = plt.subplots(10, 10)
k = 0
for i in range(10):
    for j in range(10):
        ax[i][j].imshow(x_train[k].reshape(28, 28), aspect='auto')
        k += 1
plt.show()
Output
Python3
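(The model-definition cell was missing; a sketch of an MLP consistent with the
article's description — the hidden-layer sizes are assumptions:)

model = tf.keras.models.Sequential([
    # Flatten each 28x28 image into a 784-element vector
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # Hidden layers (sizes are hypothetical)
    tf.keras.layers.Dense(256, activation='sigmoid'),
    tf.keras.layers.Dense(128, activation='sigmoid'),
    # Output layer: one node per digit class
    tf.keras.layers.Dense(10, activation='sigmoid'),
])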
Python3
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
The compile function is used here, specifying the loss, optimizer, and metrics. The
loss function used is sparse_categorical_crossentropy and the optimizer used is Adam.
Step 8: Fit the model.
Python3
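(The snippet was missing; a sketch of the fit call — the epoch count, batch size, and
validation split are assumptions:)

model.fit(x_train, y_train, epochs=10,
          batch_size=2000,
          validation_split=0.2)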
Output:
Python3
results = model.evaluate(x_test, y_test, verbose=0)
print('test loss, test acc:', results)
Output:
test loss, test acc: [0.27210235595703125, 0.9223999977111816]
We get an accuracy of about 92% by using model.evaluate() on the test samples.
Model Architecture
Weights and bias:
The weights and biases used by both layers have to be declared initially. The weights
are initialized randomly in order to avoid all units producing the same output, while
the biases are initialized to zero. The calculation is done from scratch according to
the rules given below, where W1, W2 and b1, b2 are the weights and biases of the first
and second layer respectively. Here A stands for the activation of a particular layer.
Cost Function:
The cost function of this model is the same one used with logistic regression. Hence,
in this tutorial we will be using the cross-entropy cost function:
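(The formula image was missing; the standard logistic-regression cross-entropy cost
over m examples is:)
J = -(1/m) * Σ_{i=1..m} [ y_i * log(a_i) + (1 - y_i) * log(1 - a_i) ]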
# X --> input dataset of shape (input size, number of examples)
# Y --> labels of shape (output size, number of examples)
W1 = np.random.randn(4, X.shape[0]) * 0.01
b1 = np.zeros(shape=(4, 1))
W2 = np.random.randn(Y.shape[0], 4) * 0.01
b2 = np.zeros(shape=(Y.shape[0], 1))
def back_propagate(W1, b1, W2, b2, cache):
    # Retrieve A1 and A2 from dictionary "cache"
    A1 = cache['A1']
    A2 = cache['A2']
    # Backward propagation: calculate dW1, db1, dW2, db2
    dZ2 = A2 - Y
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.multiply(np.dot(W2.T, dZ2), 1 - np.power(A1, 2))
    dW1 = (1 / m) * np.dot(dZ1, X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    # Updating the parameters according to the algorithm
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    return W1, W2, b1, b2
Code: Training the custom model. Now we will train the model using the functions
defined above; the number of epochs can be chosen according to the convenience and
power of the processing unit.

# Please note that the weights and biases are global
# Here num_iterations is the number of epochs
for i in range(0, num_iterations):
    # Forward propagation. Inputs: "X, parameters". Returns: "A2, cache".
    A2, cache = forward_propagation(X, W1, W2, b1, b2)

    # Cost function. Inputs: "A2, Y". Outputs: "cost".
    cost = compute_cost(A2, Y)

    # Backpropagation. Inputs: "parameters, cache". Outputs: updated parameters.
    W1, W2, b1, b2 = back_propagate(W1, b1, W2, b2, cache)

    # Print the cost every 1000 iterations
    if print_cost and i % 1000 == 0:
        print("Cost after iteration %i: %f" % (i, cost))
Deep learning (DL) is characterized by the use of neural networks with multiple
layers to model and solve complex problems. Each layer in the neural network plays
a unique role in the process of converting input data into meaningful and
insightful outputs. The article explores the layers that are used to construct a
neural network.
Table of Content
- Role of Deep Learning Layers
- MATLAB Input Layer
- MATLAB Fully Connected Layers
- MATLAB Convolution Layers
- MATLAB Recurrent Layers
- MATLAB Activation Layers
- MATLAB Pooling and Unpooling Layers
- MATLAB Normalization Layer and Dropout Layer
- MATLAB Output Layers

Role of Deep Learning Layers
A layer in a deep learning model serves as a fundamental building block in the model's
architecture. The structure of the network is responsible for processing and
transforming input data. The flow of information through these layers is sequential,
with each layer taking input from the preceding layers and passing its transformed
output to the subsequent layers. This cascading process continues through the network
until the final layer produces the model's ultimate output.
The input to a layer consists of features or representations derived from the data
processed by earlier layers. Each layer performs a specific computation or set of
operations on this input, introducing non-linearity and abstraction to the
information. The transformed output, often referred to as activations or feature
maps, encapsulates higher-level representations that capture complex patterns and
relationships within the data. The nature and function of each layer vary based on
its type within the neural network architecture.
For instance:
- Dense (Fully Connected) Layer: Neurons in this layer are connected to every neuron in the previous layer, creating a dense network of connections. This layer is effective at capturing global patterns in the data.
- Convolutional Layer: Specialized for grid-like data, such as images, this layer employs convolution operations to detect spatial patterns and features.
- Recurrent Layer: Suited to sequential data, recurrent layers utilize feedback loops to consider context from previous time steps, making them suitable for tasks like natural language processing.
- Pooling Layer: Reduces spatial dimensions and focuses on retaining essential information, aiding in downsampling and feature selection.

MATLAB Input Layer
- inputLayer: receives and processes data in a specialized format, serving as the initial stage for information entry into a neural network.
- sequenceInputLayer: receives sequential data for a neural network and incorporates normalization of the data during the input process.
- featureInputLayer: processes feature data for a neural network and integrates data normalization. This layer is suitable when dealing with a dataset of numerical scalar values that represent features, without spatial or temporal dimensions.
- imageInputLayer: processes 2-D images in a neural network and applies data normalization during the input stage.
- image3dInputLayer: receives 3-D images for a neural network.

MATLAB Fully Connected Layers
- fullyConnectedLayer: performs matrix multiplication with a weight matrix and subsequently adds a bias vector.

MATLAB Convolution Layers
- convolution1dLayer: applies sliding convolutional filters to 1-D input data.
- convolution2dLayer: applies sliding convolutional filters to 2-D input data.
- convolution3dLayer: applies sliding convolutional filters to 3-D input data.
- transposedConv2dLayer: increases the resolution of two-dimensional feature maps through upsampling.
- transposedConv3dLayer: increases the resolution of three-dimensional feature maps through upsampling.

MATLAB Recurrent Layers
- lstmLayer: a recurrent neural network (RNN) layer specifically designed to capture and learn long-term dependencies among different time steps in time-series and sequential data.
- lstmProjectedLayer: an LSTM layer that learns long-term dependencies among time steps in time-series and sequential data through the use of learnable weights designed for projection.
- bilstmLayer: a bidirectional LSTM (BiLSTM) layer that captures long-term dependencies in both forward and backward directions among time steps in time-series or sequential data. This bidirectional learning is valuable when the RNN needs to gather insights from the entire time series at each individual time step.
- gruLayer: a gated recurrent unit (GRU) layer designed to capture dependencies among different time steps in time-series and sequential data.
- gruProjectedLayer: a GRU layer that learns dependencies among time steps in time-series and sequential data through the use of learnable weights designed for projection.

MATLAB Activation Layers
- reluLayer: applies a threshold operation to each element of the input, setting any value less than zero to zero.
- leakyReluLayer: applies a threshold operation where any input value less than zero is multiplied by a constant scalar.
- clippedReluLayer: applies a threshold operation that sets any input value below zero to zero and caps any value exceeding the defined ceiling at that ceiling value.
- eluLayer: an exponential linear unit (ELU) activation layer that performs the identity operation for positive inputs and applies an exponential nonlinearity for negative inputs.
- geluLayer: a Gaussian error linear unit (GELU) layer that weights the input by its probability under a Gaussian distribution.
- tanhLayer: applies the hyperbolic tangent (tanh) function to the layer's inputs.
- swishLayer: applies the swish function to the layer's inputs.

MATLAB Pooling and Unpooling Layers
- averagePooling1dLayer: performs downsampling by dividing the input into 1-D pooling regions and computing the average of each region.
- averagePooling2dLayer: performs downsampling by dividing the input into rectangular pooling regions and computing the average value of each region.
- averagePooling3dLayer: performs downsampling by dividing the three-dimensional input into cuboidal pooling regions and computing the average values of each region.
- globalAveragePooling1dLayer: performs downsampling by outputting the average across the time or spatial dimensions of the input.
- globalAveragePooling2dLayer: performs downsampling by computing the mean across the height and width dimensions of the input.
- globalAveragePooling3dLayer: performs downsampling by computing the mean across the height, width, and depth dimensions of the input.
- maxPooling1dLayer: performs downsampling by dividing the input into 1-D pooling regions and computing the maximum of each region.
- maxUnpooling2dLayer: reverses the pooling operation on the output of a 2-D max pooling layer.

MATLAB Normalization Layer and Dropout Layer
- batchNormalizationLayer: normalizes a mini-batch of data independently across all observations for each channel. To enhance the training speed of a convolutional neural network and reduce sensitivity to network initialization, place batch normalization layers between convolutional layers and nonlinearities, such as ReLU layers.
- groupNormalizationLayer: normalizes a mini-batch of data independently across distinct subsets of channels for each observation. To expedite the training of a convolutional neural network and minimize sensitivity to network initialization, place group normalization layers between convolutional layers and nonlinearities, such as ReLU layers.
- layerNormalizationLayer: normalizes a mini-batch of data independently across all channels for each observation. To accelerate the training of recurrent and multilayer perceptron neural networks and diminish sensitivity to network initialization, place layer normalization layers after the learnable layers, such as LSTM and fully connected layers.
- dropoutLayer: randomly sets input elements to zero with a specified probability.

MATLAB Output Layers
- softmaxLayer: applies the softmax function to the input.
- sigmoidLayer: applies a sigmoid function to the input, constraining the output to the range (0, 1).
- classificationLayer: computes the cross-entropy loss for classification and weighted-classification tasks with mutually exclusive classes.
- regressionLayer: computes the half-mean-squared-error loss for regression tasks.
Activation Functions
To put it in simple terms, an artificial neuron calculates the 'weighted sum' of its
inputs and adds a bias, as shown in the figure below by the net input.
Mathematically,
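(Reconstructed from the description above, for inputs x_i, weights w_i, and bias b:)
net = Σ_i (w_i · x_i) + b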
Now the value of the net input can be anything from -inf to +inf. The neuron doesn't
know how to bound this value and thus cannot decide on a firing pattern. This is why
the activation function is an important part of an artificial neural network:
activation functions decide whether a neuron should be activated or not, and in doing
so they bound the value of the net input.
The activation function is a non-linear transformation that we apply to the input
before sending it to the next layer of neurons or finalizing it as output.
Step Function:
The step function is one of the simplest kinds of activation functions. Here we
consider a threshold value, and if the value of the net input, say y, is greater than
the threshold, the neuron is activated.
Mathematically,
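(Reconstructed from the description above:)
f(y) = 1 if y ≥ threshold, and 0 otherwise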
Sigmoid Function:
Sigmoid function is a widely used activation function. It is defined as:
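(The formula image was missing; the standard definition is:)
sigmoid(x) = 1 / (1 + e^(-x))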
Graphically,
This is a smooth function and is continuously differentiable. Its biggest advantage
over the step and linear functions is that it is non-linear. This is an incredibly
useful property of the sigmoid function: it means that when multiple neurons use the
sigmoid as their activation function, the output is non-linear as well. The function
ranges from 0 to 1, with an S shape.
ReLU:
The ReLU function is the Rectified linear unit. It is the most widely used
activation function. It is defined as:
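(The formula image was missing; the standard definition is:)
f(x) = max(0, x)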
Graphically,
The main advantage of using the ReLU function over other activation functions is that
it does not activate all the neurons at the same time. What does this mean? If you
look at the ReLU function, you will see that if the input is negative it converts it
to zero, and the neuron does not get activated.
Leaky ReLU:
The Leaky ReLU function is an improved version of the ReLU function. Instead of
defining the ReLU function as 0 for x less than 0, we define it as a small linear
component of x. It can be defined as:
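(The formula image was missing; the common form, with a small slope such as 0.01, is:)
f(x) = x for x > 0, and 0.01·x otherwise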
Graphically,
The biological neural network has been modeled in the form of Artificial Neural
Networks with artificial neurons simulating the function of a biological neuron.
The artificial neuron is depicted in the below picture:
Structure of an Artificial Neuron
Each neuron consists of three major components:
- A set of 'i' synapses, each having a weight w_i. A signal x_i forms the input to the i-th synapse of weight w_i. The value of any weight may be positive or negative; a positive weight has an excitatory effect, while a negative weight has an inhibitory effect on the output of the summation junction.
- A summation junction that weights the input signals by their respective synaptic weights. Because it is a linear combiner or adder of the weighted input signals, its output can be expressed as shown after this list.
- A threshold activation function (or simply the activation function, also known as a squashing function), which produces an output signal only when an input signal exceeding a specific threshold value comes as an input. It is similar in behaviour to a biological neuron, which transmits the signal only when the total input signal meets the firing threshold.
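(The summation expression was lost in extraction; for inputs x_i and weights w_i it
is:)
y = Σ_i (w_i · x_i)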
Types of Activation Function :
There are different types of activation functions. The most commonly used activation
functions are listed below:
A. Identity Function: The identity function is used as an activation function for the
input layer. It is a linear function of the form
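f(x) = x for all x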
B. Threshold/Step Function: The threshold function is almost like the step function,
the only difference being that a threshold value θ is used instead of 0. Expressing it
mathematically,
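(Reconstructed, with θ the threshold:)
f(x) = 1 if x ≥ θ, and 0 otherwise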
C. ReLU (Rectified Linear Unit) Function: It is the most popularly used activation
function in the areas of convolutional neural networks and deep learning. It is of
the form:
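(The formula image was missing; the standard definition is:)
f(x) = max(0, x)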
This means that f(x) is zero when x is less than zero and f(x) is equal to x when x is
above or equal to zero. This function is differentiable, except at the single point
x = 0; in that sense, the derivative of ReLU at 0 is actually a subgradient.
D. Sigmoid Function: It is by far the most commonly used activation function in
neural networks. The need for sigmoid function stems from the fact that many
learning algorithms require the activation function to be differentiable and hence
continuous. There are two types of sigmoid function:
1. Binary Sigmoid Function
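(The article breaks off here; the binary sigmoid is the standard logistic curve,
f(x) = 1 / (1 + e^(-x)), with output in the range (0, 1).)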
This function is very useful: when the input is negative, the derivative of the
function is not zero, so the learning of the neurons doesn't stop. Let us illustrate
the use of LReLU with the help of a Python program.
Python3
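(The snippet was missing; a reconstruction that reproduces the output below, assuming
a negative slope of 0.2:)

import torch
import torch.nn as nn

# Leaky ReLU with (assumed) negative slope 0.2
r = nn.LeakyReLU(0.2)
input = torch.Tensor([1, -2, 3, -5])
output = r(input)
print(output)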
Output:
tensor([ 1.0000, -0.4000,  3.0000, -1.0000])

Sigmoid Activation Function:
The sigmoid function is a non-linear and differentiable activation function. It is an
S-shaped curve that does not pass through the origin. It produces an output that lies
between 0 and 1, and the output values are often treated as probabilities. It is often
used for binary classification. It is slow to compute, and graphically the sigmoid has
the following transformative behavior:
Python3
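(The snippet was missing; a reconstruction that reproduces the output below:)

import torch
import torch.nn as nn

sig = nn.Sigmoid()
input = torch.Tensor([1, -2, 3, -5])
output = sig(input)
print(output)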
Output:
tensor([0.7311, 0.1192, 0.9526, 0.0067])

Tanh Activation Function:
Tanh function is a non-linear and differentiable function similar to the sigmoid
function but output values range from -1 to +1. It is an S-shaped curve that passes
through the origin and, graphically Tanh has the following transformative behavior:
The problem with the tanh activation function is that it is slow and the vanishing
gradient problem persists. Let us illustrate the use of the tanh function with the
help of a Python program.
Python3
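(The snippet was missing; a reconstruction, matching the corrected output below:)

import torch
import torch.nn as nn

t = nn.Tanh()
input = torch.Tensor([1, -2, 3, -5])
output = t(input)
print(output)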
Output:
tensor([ 0.7616, -0.9640,  0.9951, -0.9999])

Softmax Activation Function:
The softmax function is different from other activation functions as it is placed
at the last to normalize the output. We can use other activation functions in
combination with Softmax to produce the output in probabilistic form. It is used in
multiclass classification and generates an output of probabilities whose sum is 1.
The range of output lies between 0 and 1. Softmax has the following transformative
behavior:
Python3
import torch
import torch.nn as nn

# Calling the Softmax function with
# dimension = 0 as dimension starts from 0
sm = nn.Softmax(dim=0)

# Defining tensor
input = torch.Tensor([1, -2, 3, -5])

# Applying function to the tensor
output = sm(input)
print(output)
Output:
tensor([1.1846e-01, 5.8980e-03, 8.7534e-01, 2.9365e-04])
It maps the resulting values into the desired range, such as between 0 and 1 or -1 and
1, depending on the choice of activation function. For example, using the logistic
activation function maps all inputs in the real number domain into the range 0 to 1.
f(x) = 1 / (1 + e^(-x))    (1)
Hidden layers are neuron nodes stacked between the inputs and outputs, allowing neural
networks to learn more complicated features (such as XOR logic).
Backpropagation allows the information from the cost to flow backward through the
network in order to compute the gradient. Therefore, loop over the nodes starting from
the final node in reverse topological order to compute the derivative of the final
node's output. Doing so tells us which parameters are responsible for the most error,
so we can change them appropriately in that direction.
Note: If gradient descent is working properly, the cost function should decrease
after every iteration.
The sigmoid function produces results similar to the step function in that the output
is between 0 and 1. The curve crosses 0.5 at z = 0, so we can set up rules for the
activation function, such as: if the sigmoid neuron's output is larger than or equal
to 0.5, it outputs 1; if the output is smaller than 0.5, it outputs 0.
The sigmoid function has no jerk on its curve: it is smooth, and it has a very nice
and simple derivative, being differentiable everywhere on the curve.
Derivation of Sigmoid:
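(The derivation image was missing; the key result, from the standard definition
σ(z) = 1 / (1 + e^(-z)), is:)
σ'(z) = σ(z) · (1 − σ(z))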
Sigmoids saturate and kill gradients. A very common property of the sigmoid is that
when the neuron's activation saturates at either 0 or 1, the gradient in these regions
is almost zero. Recall that during backpropagation, this local gradient is multiplied
by the gradient of this gate's output for the whole objective. Therefore, if the local
gradient is very small, it will effectively "kill" the gradient and almost no signal
will flow through the neuron to its weights and, recursively, to its data.
Additionally, extra care must be taken when initializing the weights of sigmoid
neurons to prevent saturation: if the initial weights are too large, most neurons will
saturate and the network will barely learn.
2. ReLU:
ReLU is the most widely used activation function, since it appears in almost all
convolutional neural networks. ReLU is half-rectified from the bottom, and the
function and its derivative are both monotonic:
f(x) = max(0, x)
Models that are close to linear are easy to optimize. Since ReLU shares a lot of the
properties of linear functions, it tends to work well on most problems. The only issue
is that the derivative is not defined at z = 0, which we can overcome by assigning the
derivative the value 0 at z = 0. However, this means that for z <= 0 the gradient is
zero and the neuron again can't learn.
3. Leaky ReLU:
Leaky ReLU is an improved version of the ReLU function. In the ReLU function, the
gradient is 0 for x < 0, which makes the neurons die for activations in that region.
Leaky ReLU is defined to address this problem: instead of defining the ReLU function
as 0 for x less than 0, we define it as a small linear component of x.
Leaky ReLUs are one attempt to fix the dying-ReLU problem. Instead of the function
being zero when x < 0, a leaky ReLU has a small negative slope (of 0.01, or so). That
is, the function computes:
f(x) = x for x > 0, and 0.01·x otherwise    (2)
4. Tanh:
Tanh squashes a real-valued number to the range [-1, 1]. Like the sigmoid, its
activations saturate, but unlike the sigmoid neuron its output is zero-centred;
therefore the tanh non-linearity is always preferred to the sigmoid non-linearity. A
tanh neuron is simply a scaled sigmoid neuron.
Tanh is also like the logistic sigmoid, but better: negative inputs are mapped
strongly negative and zero inputs are mapped to near zero on the tanh graph. The
function is differentiable and monotonic, though its derivative is not monotonic. Both
the tanh and logistic sigmoid activation functions are used in feed-forward nets.
Tanh is actually just a scaled version of the sigmoid function:
tanh(x) = 2·sigmoid(2x) − 1
5. Softmax:
The sigmoid function can be applied easily, and ReLUs do not vanish the effect during
your training process. However, when you want to deal with classification problems,
they cannot help much: the sigmoid function can only handle two classes, which is not
all we want. The softmax function squashes the output of each unit to be between 0 and
1, just like a sigmoid function, and it also divides each output such that the total
sum of the outputs is equal to 1:
where z is the vector of inputs to the output layer (if you have 10 output units,
there are 10 elements in z), and j indexes the output units, so j = 1, 2, ..., K.
Example:
softmax(z)_j = e^(z_j) / Σ_{k=1..K} e^(z_k)    (3)
Softmax function turns logits [1.2, 0.9, 0.4] into probabilities [0.46, 0.34,
0.20], and the probabilities sum to 1.
Python3
# Import TensorFlow 2 with version 1 behavior
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Define the placeholders for
# the input and output data
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
Next, we need to define the variables that represent the parameters of our linear
regression model. In this example, we will use a single variable (w) to represent
the slope of the best-fit line. We initialize the value of w with a random value,
say 0.5.
Here is the code to define the variable for the model parameters:
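(The variable, model, and cost cells were missing; a minimal sketch matching the
surrounding text — y_pred and cost are hypothetical names:)

Python3

# Model parameter: the slope, initialized to 0.5
w = tf.Variable(0.5, dtype=tf.float32)

# Linear model prediction
y_pred = w * x

# Mean squared error cost function
cost = tf.reduce_mean(tf.square(y_pred - y))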
Once we have defined the cost function, we can use the TensorFlow
tf.train.GradientDescentOptimizer() function to create an optimizer that uses the
gradient descent algorithm to minimize the cost function. The
tf.train.GradientDescentOptimizer() function takes the learning rate as an input
parameter. The learning rate is a hyperparameter that determines the size of the
steps that the algorithm takes to reach the minimum of the cost function.
Here is the code to create the gradient descent optimizer:
Python3
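(The snippet was missing; a sketch with a hypothetical learning rate of 0.01:)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)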
Once we have defined the optimizer, we can use the minimize() method of the
optimizer to minimize the cost function. The minimize() method takes the cost
function as an input parameter and returns an operation that, when executed,
performs one step of gradient descent on the cost function.
Here is the code to minimize the cost function using the gradient descent
optimizer:
Python3
# Minimize the cost function
train = optimizer.minimize(cost)
Once we have defined the gradient descent optimizer and the train operation, we can
use the TensorFlow Session class to train our model. The Session class provides a
way to execute TensorFlow operations. To train the model, we need to initialize the
variables that we have defined earlier (i.e., the model parameters and the
optimizer) and then run the train operation in a loop for a specified number of
iterations.
Here is the code to train the linear regression model using the gradient descent
optimizer:
Python3
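(The snippet was missing; a sketch using hypothetical toy data:)

# Toy dataset (hypothetical values)
x_data = [1, 2, 3, 4]
y_data = [2, 4, 6, 8]

with tf.Session() as sess:
    # Initialize the model parameters and optimizer state
    sess.run(tf.global_variables_initializer())
    # Run the train operation for 1000 iterations
    for i in range(1000):
        sess.run(train, feed_dict={x: x_data, y: y_data})
    # Evaluate the trained model parameter
    print(sess.run(w))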
In the above code, we have defined a Session object and used the
global_variables_initializer() method to initialize the variables. Next, we have
run the train operation in a loop for 1000 iterations. In each iteration, we have
fed the input and output data to the train operation using the feed_dict parameter.
Finally, we evaluated the trained model by running the w variable to get the value of
the model parameter. This trains a linear regression model on the toy dataset using
gradient descent; the model learns the weight w that minimizes the mean squared error
between the predicted and true output values.
Visualizing the convergence of Gradient Descent using Linear Regression
Linear regression is a method for modeling the linear relationship between a
dependent variable (also known as the response or output variable) and one or more
independent variables (also known as the predictor or input variables). The goal of
linear regression is to find the values of the model parameters (coefficients) that
minimize the difference between the predicted values and the true values of the
dependent variable.
The linear regression model can be expressed as follows:
y = w_1·x_1 + w_2·x_2 + ... + w_n·x_n + b
where:
- y is the predicted value of the dependent variable,
- x_1, ..., x_n are the independent variables,
- w_1, ..., w_n are the coefficients (model parameters) associated with the independent variables,
- b is the intercept (a constant term).
To train the linear regression model, you need a dataset with input features
(independent variables) and labels (dependent variables). You can then use an
optimization algorithm, such as gradient descent, to find the values of the model
parameters that minimize the loss function.
The loss function measures the difference between the predicted values and the true
values of the dependent variable. There are various loss functions that can be used
for linear regression, such as mean squared error (MSE) and mean absolute error
(MAE). The MSE loss function is defined as follows:
MSE = (1/N) · Σ_{i=1..N} (y_i − ŷ_i)²
where:
- ŷ_i is the predicted value for the i-th sample,
- y_i is the true value for the i-th sample,
- N is the total number of samples.
The MSE loss function measures the average squared difference between the predicted
values and the true values; a lower MSE value indicates that the model is performing
better.
Python3
import tensorflow as tf
import matplotlib.pyplot as plt

# Set up the data and model
X = tf.constant([[1.], [2.], [3.], [4.]])
y = tf.constant([[2.], [4.], [6.], [8.]])
w = tf.Variable(0.)
b = tf.Variable(0.)

# Define the model and loss function
def model(x):
    return w * x + b

def loss(predicted_y, true_y):
    return tf.reduce_mean(tf.square(predicted_y - true_y))

# Set the learning rate
learning_rate = 0.001

# Training loop
losses = []
for i in range(250):
    with tf.GradientTape() as tape:
        predicted_y = model(X)
        current_loss = loss(predicted_y, y)
    gradients = tape.gradient(current_loss, [w, b])
    w.assign_sub(learning_rate * gradients[0])
    b.assign_sub(learning_rate * gradients[1])
    losses.append(current_loss.numpy())

# Plot the loss
plt.plot(losses)
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.show()
Output:
Loss vs Iteration
The loss function calculates the mean squared error (MSE) between the predicted values
and the true labels. The model function defines the linear regression model, which is
a linear function of the form y = w·x + b.
The training loop performs 250 iterations of gradient descent. At each iteration,
the with tf.GradientTape() as tape: block activates the gradient tape, which
records the operations for computing the gradients of the loss with respect to the
model parameters.
Inside the block, the predicted values are calculated using the model function and
the current values of the model parameters. The loss is then calculated using the
loss function and the predicted values and true labels.
After the loss has been calculated, the gradients of the loss with respect to the
model parameters are computed using the gradient method of the gradient tape. The
model parameters are then updated by subtracting the learning rate multiplied by
the gradients from the current values of the parameters. This process is repeated
until the training loop is completed.
Finally, the model parameters will contain the optimized values that minimize the
loss function, and the model will be trained to predict the dependent variable
given the independent variables.
A list called losses stores the loss at each iteration. After the training loop is
completed, the losses list contains the loss values from every iteration.
The plt.plot function plots the losses list as a function of the iteration number,
which is simply the index of the loss in the list. The plt.xlabel and plt.ylabel
functions add labels to the x-axis and y-axis of the plot, respectively. Finally,
the plt.show function displays the plot.
The resulting plot shows how the loss changes over the course of the training
process. As the model is trained, the loss should decrease, indicating that the
model is learning and the model parameters are being optimized to minimize the
loss. Eventually, the loss should converge to a minimum value, indicating that the
model has reached a good solution. The rate at which the loss decreases and the
final value of the loss will depend on various factors, such as the learning rate,
the initial values of the model parameters, and the complexity of the model.
Visualizing the Gradient Descent
Gradient descent is an optimization algorithm that is used to find the values of
the model parameters that minimize the loss function. The algorithm works by
starting with initial values for the parameters and then iteratively updating the
values to minimize the loss.
The gradient descent update rule for linear regression can be written as follows:
w_i := w_i − α · (∂L/∂w_i)
where:
- w_i is the i-th model parameter,
- α is the learning rate (a hyperparameter that determines the step size of the update),
- ∂L/∂w_i is the partial derivative of the MSE loss function with respect to the i-th model parameter.
This equation updates the value of each parameter in the direction that reduces the
loss. The learning rate determines the size of the update, with a smaller learning
rate resulting in smaller steps and a larger learning rate resulting in larger
steps.
The process of performing gradient descent can be visualized as taking small steps
downhill on a loss surface, with the goal of reaching the global minimum of the
loss function. The global minimum is the point on the loss surface where the loss
is the lowest.
Here is an example of how to plot the loss surface and the trajectory of the
gradient descent algorithm:
Python3
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1) - 1
y = 4 + 3 * X + np.random.randn(100, 1)

# Initialize model parameters
w = np.random.randn(2, 1)
b = np.random.randn(1)[0]

# Set the learning rate
alpha = 0.1

# Set the number of iterations
num_iterations = 20

# Create a mesh to plot the loss surface
w1, w2 = np.meshgrid(np.linspace(-5, 5, 100),
                     np.linspace(-5, 5, 100))

# Compute the loss for each point on the grid
loss = np.zeros_like(w1)
for i in range(w1.shape[0]):
    for j in range(w1.shape[1]):
        loss[i, j] = np.mean((y - w1[i, j] * X - w2[i, j] * X**2)**2)

# Perform gradient descent
for i in range(num_iterations):
    # Compute the gradient of the loss
    # with respect to the model parameters
    grad_w1 = -2 * np.mean(X * (y - w[0] * X - w[1] * X**2))
    grad_w2 = -2 * np.mean(X**2 * (y - w[0] * X - w[1] * X**2))
    # Update the model parameters
    w[0] -= alpha * grad_w1
    w[1] -= alpha * grad_w2

# Plot the loss surface
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(projection='3d')
ax.plot_surface(w1, w2, loss, cmap='coolwarm')
ax.set_xlabel('w1')
ax.set_ylabel('w2')
ax.set_zlabel('Loss')

# Mark the final parameter values reached by gradient descent
ax.plot(w[0], w[1],
        np.mean((y - w[0] * X - w[1] * X**2)**2),
        'o', c='red', markersize=10)
plt.show()
Output:
Gradient Descent finding global minima
This code generates synthetic data for a quadratic regression problem, initializes
the model parameters, and performs gradient descent to find the values of the model
parameters that minimize the mean squared error loss. The code also plots the loss
surface and the trajectory of the gradient descent algorithm on the loss surface.
The resulting plot shows how the gradient descent algorithm takes small steps
downhill on the loss surface and eventually reaches the global minimum of the loss
function. The global minimum is the point on the loss surface where the loss is the
lowest.
It is important to choose an appropriate learning rate for the gradient descent
algorithm. If the learning rate is too small, the algorithm will take a long time
to converge to the global minimum. On the other hand, if the learning rate is too
large, the algorithm may overshoot the global minimum and may not converge to a
good solution.
Another important consideration is the initialization of the model parameters. If
the initialization is too far from the global minimum, the gradient descent
algorithm may take a long time to converge. It is often helpful to initialize the
model parameters to small random values.
It is also important to choose an appropriate stopping criterion for the gradient
descent algorithm. One common stopping criterion is to stop the algorithm when the
loss function stops improving or when the improvement is below a certain threshold.
Another option is to stop the algorithm after a fixed number of iterations.
Overall, gradient descent is a powerful optimization algorithm that can be used to
find the values of the model parameters that minimize the loss function for a wide
range of machine learning problems.
Conclusion
In this blog, we have discussed gradient descent optimization in TensorFlow and how
to implement it to train a linear regression model. We have seen that TensorFlow
provides several optimizers that implement different variations of gradient
descent, such as stochastic gradient descent and mini-batch gradient descent.
Gradient descent is a powerful optimization algorithm that is widely used in
machine learning and deep learning to find the optimal solution to a given problem.
It is an iterative algorithm that updates the parameters of a function by taking
steps in the opposite direction of the gradient of the function. TensorFlow makes
it easy to implement gradient descent by providing built-in optimizers and
functions for computing gradients.
One of the critical issues while training a neural network on sample data is
overfitting. When the number of epochs used to train a neural network model is more
than necessary, the training model learns patterns that are specific to the sample
data to a great extent. This makes the model incapable of performing well on a new
dataset: it gives high accuracy on the training set (the sample data) but fails to
achieve good accuracy on the test set. In other words, the model loses its
generalization capacity by overfitting the training data. To mitigate overfitting and
to increase the generalization capacity of the neural network, the model should be
trained for an optimal number of epochs. A part of the training data is dedicated to
validation of the model, to check the performance of the model after each epoch of
training. Loss and accuracy on the training set as well as on the validation set are
monitored to identify the epoch after which the model starts overfitting.
keras.callbacks.callbacks.EarlyStopping()
Either loss or accuracy values can be monitored by the early stopping callback
function. If the loss is being monitored, training comes to a halt when an increase in
loss values is observed; if accuracy is being monitored, training comes to a halt when
a decrease in accuracy values is observed.
Syntax with default values:
keras.callbacks.callbacks.EarlyStopping(monitor='val_loss', min_delta=0,
patience=0, verbose=0, mode='auto', baseline=None, restore_best_weights=False)
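(The import and data-preparation cells were missing; a minimal sketch consistent with
the variable names used below — the reshaping and one-hot encoding are assumptions:)

Python3

from keras.datasets import mnist
from keras.utils import to_categorical

# Load MNIST and reshape to (samples, 28, 28, 1) for the Conv2D model below
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255

# One-hot encode the labels (categorical_crossentropy expects this)
y_train = to_categorical(train_labels)
y_test = to_categorical(test_labels)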
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation="relu",
                        input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D(2, 2))
model.add(layers.Conv2D(64, (3, 3), activation="relu"))
model.add(layers.MaxPooling2D(2, 2))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dense(10, activation="softmax"))

model.summary()
Output: Summary of the model
Step 4: Compiling the model with the RMSprop optimizer, categorical cross-entropy loss function, and accuracy as the success metric
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=['accuracy'])
Step 5: Creating a validation set and training set by partitioning the current
training set
Python3
val_images = train_images[:10000]
partial_images = train_images[10000:]
val_labels = y_train[:10000]
partial_labels = y_train[10000:]
Python3
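The training code itself is not preserved in this extract. A minimal sketch, assuming the labels have already been one-hot encoded for categorical_crossentropy in the earlier (elided) preprocessing steps and using the patience of 5 mentioned below (the batch size of 128 is an assumption), might be:
from keras.callbacks import EarlyStopping

# Monitor the validation loss and stop 5 epochs after it stops improving
earlystopping = EarlyStopping(monitor='val_loss', patience=5, verbose=1)

history = model.fit(partial_images, partial_labels,
                    batch_size=128,
                    epochs=25,
                    validation_data=(val_images, val_labels),
                    callbacks=[earlystopping])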
Training stopped at the 11th epoch. Since the patience in the callback is set to 5, the model trained for 5 more epochs after the best epoch, so the optimal epoch is 6, not 11.
Observing loss values without using the EarlyStopping callback function: train the model up to 25 epochs and plot the training loss values and validation loss values against the number of epochs. The resulting plot confirms that the lowest validation loss is achieved at epoch 6, after which the model starts overfitting. Therefore, the optimal number of epochs to train this dataset is 6.
Classifying handwritten digits is a basic problem of machine learning and can be solved in many ways; here we will implement solutions using TensorFlow.
Using a Linear Classifier Algorithm with tf.contrib.learn
A linear classifier achieves the classification of handwritten digits by making a choice based on the value of a linear combination of the features, also known as feature values, which is typically presented to the machine in a vector called a feature vector.
Modules required:
NumPy:
$ pip install numpy
Matplotlib:
$ pip install matplotlib
Tensorflow:
$ pip install tensorflow
Steps to follow
Step 1: Importing all dependencies
Python3
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

learn = tf.contrib.learn

tf.logging.set_verbosity(tf.logging.ERROR)
Python3
mnist = learn.datasets.load_dataset('mnist')
data = mnist.train.images
labels = np.asarray(mnist.train.labels, dtype=np.int32)
test_data = mnist.test.images
test_labels = np.asarray(mnist.test.labels, dtype=np.int32)
Python3
feature_columns = learn.infer_real_valued_columns_from_input(data)
classifier = learn.LinearClassifier(n_classes=10, feature_columns=feature_columns)
classifier.fit(data, labels, batch_size=100, steps=1000)
Python3
classifier.evaluate(test_data, test_labels)
print(classifier.evaluate(test_data, test_labels)["accuracy"])
Output :
0.9137
Step 7: Predicting data
Python3
prediction = classifier.predict(np.array([test_data[0]], dtype=float),
                                as_iterable=False)
print("prediction : {}, label : {}".format(prediction, test_labels[0]))
Output :
prediction : [7], label : 7
Full code for classifying handwritten digits
Python3
Steps to follow
Step 1: Importing all dependencies
Python3
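The import block for this second approach is not preserved in this extract; based on the calls made below (tf.keras, np.argmax, and a plotting helper named draw), it presumably resembled the following sketch. The draw() helper is a hypothetical reconstruction:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# Hypothetical helper used later to display a digit image
def draw(img):
    plt.imshow(img, cmap=plt.cm.binary)
    plt.show()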
Python3
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)
Python3
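The model definition and training code are not preserved here. A minimal sketch consistent with the rest of the section is below; the exact architecture is an assumption, but the saved filename 'epic_num_reader.h5' is taken from the load_model call further down:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=3)

# Save the trained model so it can be reloaded below
model.save('epic_num_reader.h5')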
predictions = model.predict([x_test])
print('label -> ', y_test[2])
print('prediction -> ', np.argmax(predictions[2]))

draw(x_test[2])
Python3
new_model = tf.keras.models.load_model('epic_num_reader.h5')
Python3
predictions = new_model.predict([x_test])

print('label -> ', y_test[2])
print('prediction -> ', np.argmax(predictions[2]))

draw(x_test[2])
Python3
import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
from torchsummary import summary
import torch.nn.functional as F
Python3
# Load the MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data',
                                           train=True,
                                           transform=torchvision.transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='./data',
                                          train=False,
                                          transform=torchvision.transforms.ToTensor(),
                                          download=True)
Python3
The Classifier class inherits from PyTorch’s nn.Module class and defines the
architecture of the CNN. The __init__ method is called when an instance of the
class is created and it sets up the layers of the network.
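The class definition itself is not preserved in this extract; a minimal sketch, reconstructed line by line from the description that follows, would be:
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self):
        super(Classifier, self).__init__()
        # Two convolutional layers, a shared max-pooling layer, two dropout
        # layers, and two fully connected layers, exactly as described below
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.dropout1(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout2(x)
        x = x.view(-1, 64 * 7 * 7)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x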
- self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1): This line creates a 2D convolutional layer with 1 input channel, 32 output channels, a kernel size of 3, and padding of 1. The convolutional layer applies a set of filters (also called kernels) to the input image in order to extract features from it.
- self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1): This line creates another 2D convolutional layer with 32 input channels, 64 output channels, a kernel size of 3, and padding of 1. This layer is connected to the output of the first convolutional layer, allowing the network to learn more complex features from the previous layer's output.
- self.pool = nn.MaxPool2d(2, 2): This line creates a max pooling layer with a kernel size of 2 and a stride of 2. Max pooling is a down-sampling operation that selects the maximum value from a small neighborhood for each input channel. It helps to reduce the dimensionality of the data, reduce the computational cost, and prevent overfitting.
- self.dropout1 = nn.Dropout2d(0.25): This line creates a dropout layer with a probability of 0.25. Dropout is a regularization technique that randomly drops out some neurons during training, which helps to reduce overfitting.
- self.dropout2 = nn.Dropout2d(0.5): This line creates another dropout layer with a probability of 0.5.
- self.fc1 = nn.Linear(64 * 7 * 7, 128): This line creates a fully connected (linear) layer with 64 * 7 * 7 input features and 128 output features. Fully connected layers are used to make the final predictions based on the features learned by the previous layers.
- self.fc2 = nn.Linear(128, 10): This line creates another fully connected layer with 128 input features and 10 output features. This layer will produce the final output of the network with 10 classes.
Next, there is the forward pass method of the network. It takes an input x and applies a series of operations defined by the layers in the __init__ method.
- x = self.pool(F.relu(self.conv1(x))): This line applies the ReLU activation function (F.relu) to the output of the first convolutional layer (self.conv1), and then applies max pooling (self.pool) to the result.
- x = self.dropout1(x): This line applies dropout to the output of the first pooling layer.
- x = self.pool(F.relu(self.conv2(x))): This line applies the ReLU activation function to the output of the second convolutional layer (self.conv2), and then applies max pooling to the result.
- x = self.dropout2(x): This line applies dropout to the output of the second pooling layer.
- x = x.view(-1, 64 * 7 * 7): This line reshapes the tensor x to a 1D tensor, with -1 indicating that the number of elements in the tensor is inferred from the other dimensions.
- x = F.relu(self.fc1(x)): This line applies the ReLU activation function to the output of the first fully connected layer (self.fc1).
- x = self.fc2(x): This line applies the final fully connected layer (self.fc2) to the output of the previous layer and returns the result, which will be the final output of the network.
This CNN architecture is a simple one, and it can be used as a starting point for more complex tasks. However, it could be improved by adding more layers, using different types of layers, or tuning the hyperparameters for better performance.
GPU vs CPU
Python3
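The code block itself is missing from this extract; a minimal sketch that produces the output shown below is:
# Pick the GPU when CUDA is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device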
Output:
device(type='cuda')
This piece of code is used to select the device on which we should train our model. If we are running our code in Google Colab, we can check whether the cuda device is available; if it is, we use it, otherwise we fall back to the normal CPU.
CUDA is NVIDIA's platform for running computations on the GPU, which is well suited to training ML models.
Model Summary
After defining the model we can use the class to create a model object and view the
summary of the model. The summary option can be used to print the summary of the
model like below.
Python3
# Instantiate the model
model = Classifier()

# Move the model to the GPU if available
model.to(device)

summary(model, (1, 28, 28))
Output:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 28, 28] 320
MaxPool2d-2 [-1, 32, 14, 14] 0
Dropout2d-3 [-1, 32, 14, 14] 0
Conv2d-4 [-1, 64, 14, 14] 18,496
MaxPool2d-5 [-1, 64, 7, 7] 0
Dropout2d-6 [-1, 64, 7, 7] 0
Linear-7 [-1, 128] 401,536
Linear-8 [-1, 10] 1,290
================================================================
Total params: 421,642
Trainable params: 421,642
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.43
Params size (MB): 1.61
Estimated Total Size (MB): 2.04
----------------------------------------------------------------
Step 4: Define the loss function and optimizer
Now, we need to define a loss function and an optimizer. For this example, we will
be using the cross-entropy loss and the ADAM optimizer.
Python3
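The code block is not preserved here, but the explanation below spells it out exactly; a sketch:
# Cross-entropy loss for the 10-class classification problem
criterion = nn.CrossEntropyLoss()

# Adam optimizer over the model's parameters with a learning rate of 0.001
optimizer = optim.Adam(model.parameters(), lr=0.001)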
The code defines the loss function and optimizer for the neural network.
nn.CrossEntropyLoss() is a PyTorch function that creates an instance of the cross-
entropy loss function. Cross-entropy loss is commonly used in classification
problems as it measures the dissimilarity between the predicted class probabilities
and the true class. It is calculated by taking the negative logarithm of the
predicted class probability for the true class.
optimizer = optim.Adam(model.parameters(), lr=0.001): This line creates an instance
of the optim.Adam class, which is an optimization algorithm commonly used for deep
learning. The Adam optimizer is an extension of stochastic gradient descent that
uses moving averages of the parameters to provide a running estimate of the second
raw moments of the gradients; the term Adam is derived from adaptive moment
estimation. It requires the model’s parameters to be passed as the first argument
and the learning rate is set to 0.001. The learning rate is a hyperparameter that
controls the step size at which the optimizer makes updates to the model’s
parameters.
The optimizer and the loss function are used during the training process to update
the model’s parameters and to evaluate the model’s performance, respectively.
Step 5: Train the model
Now, we can train our model using the training dataset. We will be using a batch
size of 100 and will train the model for 10 epochs. The below code is training the
neural network on a dataset using a loop that iterates over the number of training
epochs and over the data in the training dataset.
- batch_size = 100 and num_epochs = 10 define the batch size and number of epochs for the training process. The batch size is the number of samples from the training dataset that are used in one forward and backward pass of the neural network. The number of epochs is the number of times the entire training dataset is passed through the network.
- torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True) creates a PyTorch DataLoader for the training dataset. The DataLoader takes the training dataset as an input and returns an iterator over the dataset. The iterator will return a set of samples (images and labels) in each iteration, where the number of samples is determined by the batch size. By setting shuffle=True, the DataLoader will randomly shuffle the dataset before each epoch.
- The outer loop, for epoch in range(num_epochs), iterates over the number of training epochs.
- The inner loop, for i, (images, labels) in enumerate(train_loader), iterates over the DataLoader, which returns batches of images and labels. The images are passed through the model using outputs = model(images) to get the model's predictions.
- The loss is calculated by passing the model's predictions and the true labels to the loss function using loss = criterion(outputs, labels).
- The optimizer is used to update the model's parameters in the direction that minimizes the loss. This is done in the following 3 steps: optimizer.zero_grad() clears the gradients of all optimizable parameters; loss.backward() computes the gradients of the loss with respect to the model's parameters; optimizer.step() updates the model's parameters based on the computed gradients.
- After the end of each epoch, the code prints the current epoch and the loss at the end of the epoch.
At the end of the training process, the model’s parameters will have been updated
to minimize the loss on the training dataset.
It's worth noting that it is also useful to use a validation set to evaluate the model's performance during training, so we can detect overfitting and adjust the model accordingly. We can achieve this by splitting the training set into two parts: training and validation. Then, use the training set for training, and use the validation set for evaluating the model's performance during training.
Python3
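The training loop itself is not preserved in this extract. A minimal sketch following the step-by-step description above is below; note that the original block also tracked validation loss and accuracy (as the output beneath shows), which is omitted here for brevity:
batch_size = 100
num_epochs = 10
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)

        # Forward pass and loss computation
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print('Epoch [{}/{}], Loss:{:.4f}'.format(epoch + 1, num_epochs, loss.item()))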
Output:
Epoch [1/10], Loss:0.2086, Validation Loss:14.6681, Accuracy:0.99, Validation
Accuracy:0.94
Epoch [2/10], Loss:0.1703, Validation Loss:11.0446, Accuracy:0.95, Validation
Accuracy:0.94
Epoch [3/10], Loss:0.1617, Validation Loss:8.9060, Accuracy:0.98, Validation
Accuracy:0.97
Epoch [4/10], Loss:0.1670, Validation Loss:7.7104, Accuracy:0.98, Validation
Accuracy:0.97
Epoch [5/10], Loss:0.0723, Validation Loss:7.1193, Accuracy:1.00, Validation
Accuracy:0.96
Epoch [6/10], Loss:0.0970, Validation Loss:7.5116, Accuracy:1.00, Validation
Accuracy:0.98
Epoch [7/10], Loss:0.1623, Validation Loss:6.8909, Accuracy:0.99, Validation
Accuracy:0.96
Epoch [8/10], Loss:0.1251, Validation Loss:7.2684, Accuracy:1.00, Validation
Accuracy:0.97
Epoch [9/10], Loss:0.0874, Validation Loss:6.9928, Accuracy:1.00, Validation
Accuracy:0.98
Epoch [10/10], Loss:0.0405, Validation Loss:6.0112, Accuracy:0.99, Validation
Accuracy:0.99
In this example, we have covered the basic steps to train a deep-learning model
using PyTorch on the MNIST dataset. This model can be further improved by using
more complex architectures, data augmentation, and other techniques. PyTorch is a
powerful and flexible library that allows you to build and train a wide range of
models, and this example is just the beginning of what you can do with it.
Step 6: Plot Training and Validation curve to check overfitting or underfitting
Once the model is trained, We can plot the Training and Validation Loss and
accuracy curve. This can give us an idea of how the model is performing on unseen
data, and if it’s overfitting or underfitting.
Python3
import matplotlib.pyplot as plt

# Plot the training and validation loss over time
plt.plot(range(num_epochs), losses, color='red',
         label='Training Loss', marker='o')
plt.plot(range(num_epochs), val_losses, color='blue', linestyle='--',
         label='Validation Loss', marker='x')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()

# Plot the training and validation accuracy over time
plt.plot(range(num_epochs), accuracies, label='Training Accuracy',
         color='red', marker='o')
plt.plot(range(num_epochs), val_accuracies, label='Validation Accuracy',
         color='blue', linestyle=':', marker='x')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.show()
Output:
Training and Validation Loss
Training and Validation Accuracy
Note that the loss generally decreases with each epoch while the accuracy increases. This is the expected behavior.
Step 7: Evaluation
Another important aspect is the choice of the evaluation metric. In this example,
we used accuracy as the evaluation metric, which is a good starting point for many
problems. However, it’s important to be aware that accuracy can be misleading in
some cases, especially when the classes are imbalanced. In those cases, other
metrics such as precision, recall, F1-score, or AUC-ROC should be used.
After training the model, you can evaluate its performance on the test dataset by
making predictions and comparing them to the true labels. One way to evaluate the
performance of a classification model is to use a classification report, which is a
summary of the model’s performance across all classes.
The first thing is to evaluate the model on the test dataset and calculate its
overall accuracy by comparing the predicted labels to the true labels using the
torch.max() function.
Then, it generates a classification report using the classification_report function
from the scikit-learn library. The classification report gives you a summary of the
model’s performance across all classes by calculating several metrics such as
precision, recall, f1-score, and support.
Precision – Precision is the number of true positives divided by the number of true
positives plus the number of false positives. It is a measure of how many of the
positive predictions were correct.
Recall – Recall is the number of true positives divided by the number of true
positives plus the number of false negatives. It is a measure of how many of the
actual positive cases were correctly predicted.
F1-score – The F1-score is the harmonic mean of precision and recall. It is a
single number that represents the balance between precision and recall.
Support – Support is the number of instances in the test set that belong to a
specific class.
It is important to note that the classification report is calculated based on the
predictions made on the entire test set, and not just a sample of the test set.
Here is an example of how to evaluate the model and generate a classification
report:
Python3
from sklearn.metrics import classification_report

# Create a DataLoader for the test dataset
test_loader = torch.utils.data.DataLoader(test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

# Evaluate the model on the test dataset
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    y_true = []
    y_pred = []
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        predicted = predicted.to('cpu')
        labels = labels.to('cpu')
        y_true.extend(labels)
        y_pred.extend(predicted)

print('Test Accuracy: {}%'.format(100 * correct / total))

# Generate a classification report
print(classification_report(y_true, y_pred))
Output:
Test Accuracy: 99.1%
precision recall f1-score support
Python3
# We can run this Python code on a Jupyter notebook
# to automatically install the correct version of
# PyTorch.
# http://pytorch.org/
from os import path
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag

platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
accelerator = 'cu80' if path.exists('/opt/bin/nvidia-smi') else 'cpu'

!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.3.1.post4-{platform}-linux_x86_64.whl torchvision
With PyTorch installed, let us now have a look at the code. Write the two lines
given below to import the necessary library functions and objects.
Python3
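The two import lines referred to above are not preserved in this extract; given the use of Variable below, they were presumably:
import torch
from torch.autograd import Variable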
We also define some data and assign them to variables x_data and y_data as given
below:
Python3
x_data = Variable(torch.Tensor([[1.0], [2.0], [3.0]]))
y_data = Variable(torch.Tensor([[2.0], [4.0], [6.0]]))
Here, x_data is our independent variable and y_data is our dependent variable. This will be our dataset for now. Next, we need to define our model. There are two main steps associated with defining our model:
1. Initializing our model.
2. Declaring the forward pass.
We use the class given below:
Python3
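The class definition is not preserved in this extract. A minimal sketch that performs both steps (the class name and the our_model variable, which is used in the training loop below, are assumptions consistent with the surrounding code):
class LinearRegressionModel(torch.nn.Module):

    def __init__(self):
        # Initializing our model: one input feature, one output
        super(LinearRegressionModel, self).__init__()
        self.linear = torch.nn.Linear(1, 1)

    def forward(self, x):
        # Declaring the forward pass
        y_pred = self.linear(x)
        return y_pred

our_model = LinearRegressionModel()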
After this, we select the optimizer and the loss criteria. Here, we will use the
mean squared error (MSE) as our loss function and stochastic gradient descent (SGD)
as our optimizer. Also, we arbitrarily fix a learning rate of 0.01.
Python3
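The code block is missing here, but the prose above specifies it exactly; a sketch:
# Mean squared error loss and stochastic gradient descent with lr = 0.01
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(our_model.parameters(), lr=0.01)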
We now arrive at our training step. We perform the following tasks 500 times during training:
1. Perform a forward pass by passing our data and finding out the predicted value of y.
2. Compute the loss using MSE.
3. Reset all the gradients to 0, perform a backpropagation, and then update the weights.
Python3
for epoch in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    pred_y = our_model(x_data)

    # Compute and print loss
    loss = criterion(pred_y, y_data)

    # Zero gradients, perform a backward pass, and update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))
Once the training is completed, we test if we are getting correct results using the
model that we defined. So, we test it for an unknown value of x_data, in this case,
4.0.
Python3
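The test code is not preserved in this extract; a minimal sketch for the unseen input 4.0 would be:
# Testing the trained model on an unseen input
new_var = Variable(torch.Tensor([[4.0]]))
print("Prediction for x = 4.0:", our_model(new_var).item())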
Python3
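The import block for this article is not preserved; based on the modules used below (np, tf, plt), it presumably read:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt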
In order to make the random numbers predictable, we will define fixed seeds for
both Numpy and Tensorflow.
Python3
np.random.seed(101)
tf.set_random_seed(101)
Now, let us generate some random data for training the Linear Regression Model.
Python3
# Generating random linear data
# There will be 50 data points ranging from 0 to 50
x = np.linspace(0, 50, 50)
y = np.linspace(0, 50, 50)

# Adding noise to the random linear data
x += np.random.uniform(-4, 4, 50)
y += np.random.uniform(-4, 4, 50)

n = len(x)  # Number of data points
Python3
Output:
Now we will start creating our model by defining the placeholders X and Y, so that
we can feed our training examples X and Y into the optimizer during the training
process.
Python3
X = tf.placeholder("float")
Y = tf.placeholder("float")
Now we will declare two trainable TensorFlow Variables for the Weights and Bias, initializing them randomly using np.random.randn().
Python3
W = tf.Variable(np.random.randn(), name="W")
b = tf.Variable(np.random.randn(), name="b")
Now we will define the hyperparameters of the model, the Learning Rate and the
number of Epochs.
Python3
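The blocks defining the hyperparameters, hypothesis, cost, optimizer, and initializer are not preserved in this extract. The session below references cost, optimizer, and init, so a sketch consistent with it would be as follows (the specific learning rate and epoch count are assumptions):
learning_rate = 0.01
training_epochs = 1000

# Hypothesis: y_pred = W * X + b
y_pred = tf.add(tf.multiply(X, W), b)

# Mean squared error cost function
cost = tf.reduce_sum(tf.pow(y_pred - Y, 2)) / (2 * n)

# Gradient descent optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Global variables initializer
init = tf.global_variables_initializer()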
# Starting the Tensorflow Session
with tf.Session() as sess:

    # Initializing the Variables
    sess.run(init)

    # Iterating through all the epochs
    for epoch in range(training_epochs):

        # Feeding each data point into the optimizer using Feed Dictionary
        for (_x, _y) in zip(x, y):
            sess.run(optimizer, feed_dict={X: _x, Y: _y})

        # Displaying the result after every 50 epochs
        if (epoch + 1) % 50 == 0:
            # Calculating the cost at the current epoch
            c = sess.run(cost, feed_dict={X: x, Y: y})
            print("Epoch", (epoch + 1), ": cost =", c, "W =", sess.run(W), "b =", sess.run(b))

    # Storing necessary values to be used outside the Session
    training_cost = sess.run(cost, feed_dict={X: x, Y: y})
    weight = sess.run(W)
    bias = sess.run(b)
Output:
Python3
# Calculating the predictions
predictions = weight * x + bias
print("Training cost =", training_cost, "Weight =", weight, "bias =", bias, '\n')
Output:
Note that in this case both the Weight and bias are scalars. This is because we have considered only one independent variable in our training data. If we have m independent variables in our training dataset, the Weight will be an m-dimensional vector while the bias will be a scalar.
Finally, we will plot our result.
Python3
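The plotting code is not preserved in this extract; a minimal sketch (assuming matplotlib.pyplot was imported as plt) would be:
# Plotting the original data points and the fitted regression line
plt.plot(x, y, 'ro', label='Original data')
plt.plot(x, predictions, label='Fitted line')
plt.title('Linear Regression Result')
plt.legend()
plt.show()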
Hyperparameter tuning
Python3
# Necessary imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)

# Creating the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Assuming X and y are your feature matrix and target variable
# Fit the GridSearchCV object to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
Output:
Tuned Logistic Regression Parameters: {'C': 0.006105402296585327}
Best score is 0.853
Python3
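The code for this example is not preserved in this extract. A sketch consistent with the printed output below (a randomized search over a decision tree, reusing the X and y from the previous example; the exact parameter ranges are assumptions) might be:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hyperparameter distributions to sample from
param_dist = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, None],
    "max_features": np.arange(1, 9),
    "min_samples_leaf": np.arange(1, 9),
}

tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)

print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))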
Output:
Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': None,
'max_features': 8, 'min_samples_leaf': 7}
Best score is 0.842
Drawback: because only a subset of combinations is evaluated, there is no guarantee that the outcome is the ideal hyperparameter combination.
3. Bayesian Optimization
Grid search and random search are often inefficient because they evaluate many unsuitable hyperparameter combinations without considering the previous iterations' results. Bayesian optimization, on the other hand, treats the search for optimal hyperparameters as an optimization problem. It considers the previous evaluation results when selecting the next hyperparameter combination and applies a probabilistic function to choose the combination that will likely yield the best results. This method discovers a good hyperparameter combination in relatively few iterations.
Data scientists use a probabilistic model when the objective function is unknown.
The probabilistic model estimates the probability of a hyperparameter combination’s
objective function result based on past evaluation results.
P(score(y)|hyperparameters(x))
It is a “surrogate” of the objective function, which can be the root-mean-square
error (RMSE), for example. The objective function is calculated using the training
data with the hyperparameter combination, and we try to optimize it (maximize or
minimize, depending on the objective function selected).
Applying the probabilistic model to the hyperparameters is computationally
inexpensive compared to the objective function. Therefore, this method typically
updates and improves the surrogate probability model every time the objective
function runs. Better hyperparameter predictions decrease the number of objective
function evaluations needed to achieve a good result. Gaussian processes, random
forest regression, and tree-structured Parzen estimators (TPE) are examples of
surrogate models.
The Bayesian optimization model is complex to implement, but off-the-shelf
libraries like Ray Tune can simplify the process. It’s worth using this type of
model because it finds an adequate hyperparameter combination in relatively few
iterations. However, compared to grid search or random search, we must compute
Bayesian optimization sequentially, so it doesn’t allow distributed processing.
Therefore, Bayesian optimization takes longer yet uses fewer computational
resources.
Drawback: Requires an understanding of the underlying probabilistic model.
Challenges in Hyperparameter Tuning
- Dealing with high-dimensional hyperparameter spaces: efficient exploration and optimization
- Handling expensive function evaluations: balancing computational efficiency and accuracy
- Incorporating domain knowledge: utilizing prior information for informed tuning
- Developing adaptive hyperparameter tuning methods: adjusting parameters during training

Applications of Hyperparameter Tuning
- Model selection: choosing the right model architecture for the task
- Regularization parameter tuning: controlling model complexity for optimal performance
- Feature preprocessing optimization: enhancing data quality and model performance
- Algorithmic parameter tuning: adjusting algorithm-specific parameters for optimal results

Advantages of Hyperparameter tuning:
- Improved model performance
- Reduced overfitting and underfitting
- Enhanced model generalizability
- Optimized resource utilization
- Improved model interpretability

Disadvantages of Hyperparameter tuning:
- Computational cost
- Time-consuming process
- Risk of overfitting
- No guarantee of optimal performance
- Requires expertise

Frequently Asked Questions (FAQs)
1. What are the methods of hyperparameter tuning?
There are several methods for hyperparameter tuning, including grid search, random search, and Bayesian optimization. Grid search exhaustively evaluates all possible combinations of hyperparameter values, while random search randomly samples combinations. Bayesian optimization uses a probabilistic model to guide the search for optimal hyperparameters.
2. What is the difference between parameter tuning and hyperparameter tuning?
Parameters are the coefficients or weights learned during the training process of a machine learning model, while hyperparameters are settings that control the training process itself. For example, the learning rate is a hyperparameter that controls how quickly the model learns from the data.
3. What is the purpose of hyperparameter tuning?
The purpose of hyperparameter tuning is to find the best set of hyperparameters for a given machine learning model. This can improve the model's performance on unseen data, prevent overfitting, and reduce training time.
4. Which hyperparameter to tune first?
The order in which you tune hyperparameters depends on the specific model and dataset. However, a good rule of thumb is to start with the most important hyperparameters, such as the learning rate, and then move on to less important ones.
5. What is hyperparameter tuning and cross validation?
Cross validation is a technique used to evaluate the performance of a machine learning model. Hyperparameter tuning is often performed within a cross-validation loop to ensure that the selected hyperparameters generalize well to unseen data.
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with, say, K outputs, and representing them vertically. Now slide that neural network across the whole image; as a result, we will get another image with a different width, height, and depth. Instead of just R, G, and B channels, we now have more channels but a smaller width and height. This operation is called Convolution. If the patch size is the same as that of the image, it would be a regular neural network. Because of this small patch, we have fewer weights.
Image source: Deep Learning Udacity
Now let’s talk about a bit of mathematics that is involved in the whole convolution
process.
- Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights and the same depth as that of the input volume (3 if the input layer is an image input).
- For example, if we have to run convolution on an image with dimensions 34x34x3, the possible size of the filters is a x a x 3, where 'a' can be 3, 5, or 7, but smaller than the image dimension.
- During the forward pass, we slide each filter across the whole input volume step by step, where each step is called a stride (which can have a value of 2, 3, or even 4 for high-dimensional images), and compute the dot product between the kernel weights and the patch from the input volume.
- As we slide our filters, we get a 2-D output for each filter; stacking them together gives an output volume with a depth equal to the number of filters. The network will learn all the filters.
Layers used to build ConvNets
A complete Convolution Neural Network architecture is also known as a covnet. A covnet is a sequence of layers, and every layer transforms one volume to another through a differentiable function.
Types of layers:
Let's take an example by running a covnet on an image of dimension 32 x 32 x 3.
- Input Layers: It's the layer in which we give input to our model. In CNNs, the input is generally an image or a sequence of images. This layer holds the raw input of the image with width 32, height 32, and depth 3.
- Convolutional Layers: This is the layer used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are smaller matrices, usually of 2x2, 3x3, or 5x5 shape. Each kernel slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we'll get an output volume of dimension 32 x 32 x 12.
- Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. They apply an element-wise activation function to the output of the convolution layer. Some common activation functions are ReLU: max(0, x), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume will have dimensions 32 x 32 x 12.
- Pooling Layer: This layer is periodically inserted in the covnet, and its main function is to reduce the size of the volume, which makes the computation fast, reduces memory, and also prevents overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16x16x12.
Image source: cs231n.stanford.edu
- Flattening: The resulting feature maps are flattened into a one-dimensional vector after the convolution and pooling layers so they can be passed into a fully connected layer for classification or regression.
- Fully Connected Layers: They take the input from the previous layer and compute the final classification or regression task.
Image source: cs231n.stanford.edu
- Output Layer: The output from the fully connected layers is then fed into a logistic function for classification tasks, like sigmoid or softmax, which converts the output for each class into a probability score for that class.
Example: Let's consider an image and apply the convolution layer, activation layer, and pooling layer operations to extract the inside features.
Input image:
Input image
Steps:
1. Import the necessary libraries.
2. Set the parameter.
3. Define the kernel.
4. Load the image and plot it.
5. Reformat the image.
6. Apply the convolution layer operation and plot the output image.
7. Apply the activation layer operation and plot the output image.
8. Apply the pooling layer operation and plot the output image.
Python3
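The code for these steps is not preserved in this extract. A minimal sketch following the list above is below; the filename and the edge-detection kernel values are assumptions for illustration:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Define the kernel: a 3 x 3 edge-detection filter (values are an assumption)
kernel = tf.constant([[-1.0, -1.0, -1.0],
                      [-1.0,  8.0, -1.0],
                      [-1.0, -1.0, -1.0]])

# Load the image (hypothetical filename) and plot it
image = tf.io.decode_image(tf.io.read_file('input_image.jpg'), channels=1)
plt.imshow(tf.squeeze(image), cmap='gray')
plt.show()

# Reformat the image and kernel to the shapes tf.nn.conv2d expects
image = tf.expand_dims(tf.image.convert_image_dtype(image, tf.float32), axis=0)
kernel = tf.reshape(kernel, [3, 3, 1, 1])

# Apply the convolution layer operation and plot the output image
conv = tf.nn.conv2d(image, kernel, strides=1, padding='SAME')
plt.imshow(tf.squeeze(conv), cmap='gray')
plt.show()

# Apply the activation layer operation and plot the output image
act = tf.nn.relu(conv)
plt.imshow(tf.squeeze(act), cmap='gray')
plt.show()

# Apply the pooling layer operation and plot the output image
pool = tf.nn.max_pool2d(act, ksize=2, strides=2, padding='SAME')
plt.imshow(tf.squeeze(pool), cmap='gray')
plt.show()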
The right side of this equation is a digital image by definition. Every element of this matrix is called an image element, picture element, or pixel.
DIGITAL IMAGE REPRESENTATION IN MATLAB:
According to block 1, if the input is an image and we get an image as output, then it is termed Digital Image Processing. According to block 2, if the input is an image and we get some kind of information or description as output, then it is termed Computer Vision. According to block 3, if the input is some description or code and we get an image as output, then it is termed Computer Graphics. According to block 4, if the input is a description, some keywords, or some code and we get a description or keywords as output, then it is termed Artificial Intelligence.
Advantages of Digital Image Processing:
- Improved image quality: Digital image processing algorithms can improve the visual quality of images, making them clearer, sharper, and more informative.
- Automated image-based tasks: Digital image processing can automate many image-based tasks, such as object recognition, pattern detection, and measurement.
- Increased efficiency: Digital image processing algorithms can process images much faster than humans, making it possible to analyze large amounts of data in a short amount of time.
- Increased accuracy: Digital image processing algorithms can provide more accurate results than humans, especially for tasks that require precise measurements or quantitative analysis.

Disadvantages of Digital Image Processing:
- High computational cost: Some digital image processing algorithms are computationally intensive and require significant computational resources.
- Limited interpretability: Some digital image processing algorithms may produce results that are difficult for humans to interpret, especially for complex or sophisticated algorithms.
- Dependence on quality of input: The quality of the output of digital image processing algorithms is highly dependent on the quality of the input images. Poor quality input images can result in poor quality output.
- Limitations of algorithms: Digital image processing algorithms have limitations, such as the difficulty of recognizing objects in cluttered or poorly lit scenes, or the inability to recognize objects with significant deformations or occlusions.
- Dependence on good training data: The performance of many digital image processing algorithms is dependent on the quality of the training data used to develop the algorithms. Poor quality training data can result in poor performance of the algorithm.

References:
- "Digital Image Processing" by Rafael C. Gonzalez and Richard E. Woods.
- "Computer Vision: Algorithms and Applications" by Richard Szeliski.
- "Digital Image Processing Using MATLAB" by Rafael C. Gonzalez, Richard E. Woods, and Steven L. Eddins.
Image processing and Computer Vision are both very exciting fields of Computer Science.
Computer Vision: In Computer Vision, computers or machines are made to gain high-level understanding from input digital images or videos with the purpose of automating tasks that the human visual system can do. It uses many techniques, and Image Processing is just one of them.
Image Processing: Image Processing is the field of enhancing images by tuning many parameters and features of the images. So Image Processing is a subset of Computer Vision. Here, transformations are applied to an input image, and the resultant output image is returned. Some of these transformations are sharpening, smoothing, stretching, etc.
Now, as both fields deal with visuals, i.e., images and videos, there seems to be a lot of confusion about the difference between these fields of computer science. In this article we will discuss the difference between them.
Difference between Image Processing and Computer Vision:
- Goal: Image processing is mainly focused on processing raw input images to enhance them or to prepare them for other tasks, while computer vision is focused on extracting information from the input images or videos to gain a proper understanding of them and predict the visual input like the human brain.
- Techniques: Image processing uses methods like anisotropic diffusion, hidden Markov models, independent component analysis, different filtering techniques, etc., while computer vision uses image processing as one of its methods along with other machine learning techniques, CNNs, etc.
- Relationship: Image Processing is a subset of Computer Vision; Computer Vision is a superset of Image Processing.
- Examples: Some Image Processing applications are rescaling images (digital zoom), correcting illumination, and changing tones, while some Computer Vision applications are object detection, face detection, and handwriting recognition.
The pooling operation involves sliding a two-dimensional filter over each channel
of feature map and summarising the features lying within the region covered by the
filter. For a feature map having dimensions nh x nw x nc, the dimensions of output
obtained after a pooling layer is
((nh - f) / s + 1) x ((nw - f) / s + 1) x nc
where,
-> nh - height of feature map
-> nw - width of feature map
-> nc - number of channels in the feature map
-> f - size of filter
-> s - stride length
A common CNN model architecture is to have a number of convolution and pooling
layers stacked one after the other.
Why use Pooling Layers?
- Pooling layers are used to reduce the dimensions of the feature maps. Thus, they reduce the number of parameters to learn and the amount of computation performed in the network.
- The pooling layer summarises the features present in a region of the feature map generated by a convolution layer. So, further operations are performed on summarised features instead of precisely positioned features generated by the convolution layer. This makes the model more robust to variations in the position of the features in the input image.
Types of Pooling Layers:
Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after a max-pooling layer would be a feature map containing the most prominent features of the previous feature map.
Python3
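The code for these examples is not preserved in this extract. A sketch consistent with the printed output below (which matches 2 x 2 average pooling with stride 2 on the same 4 x 4 image reused in the global-pooling example) might be the following; the max-pooling variant is included for completeness and would print [[9. 7.] [8. 6.]]:
import numpy as np
from keras.models import Sequential
from keras.layers import MaxPooling2D, AveragePooling2D

# define input image (the same 4 x 4 image used in the global-pooling example)
image = np.array([[2, 2, 7, 3],
                  [9, 4, 6, 1],
                  [8, 5, 2, 4],
                  [3, 1, 2, 6]])
image = image.reshape(1, 4, 4, 1)

# max pooling with a 2 x 2 filter and stride 2
max_model = Sequential([MaxPooling2D(pool_size=2, strides=2)])
print(np.squeeze(max_model.predict(image)))

# average pooling with a 2 x 2 filter and stride 2
avg_model = Sequential([AveragePooling2D(pool_size=2, strides=2)])
print(np.squeeze(avg_model.predict(image)))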
Output:
[[4.25 4.25]
 [4.25 3.5 ]]
Global Pooling
Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw x nc feature map is reduced to a 1 x 1 x nc feature map. This is equivalent to using a filter of dimensions nh x nw, i.e. the dimensions of the feature map. Further, it can be either global max pooling or global average pooling.
Code #3: Performing Global Pooling using keras
Python3
import numpy as np
from keras.models import Sequential
from keras.layers import GlobalMaxPooling2D
from keras.layers import GlobalAveragePooling2D

# define input image
image = np.array([[2, 2, 7, 3],
                  [9, 4, 6, 1],
                  [8, 5, 2, 4],
                  [3, 1, 2, 6]])
image = image.reshape(1, 4, 4, 1)

# define gm_model containing just a single global-max pooling layer
gm_model = Sequential([GlobalMaxPooling2D()])

# define ga_model containing just a single global-average pooling layer
ga_model = Sequential([GlobalAveragePooling2D()])

# generate pooled output
gm_output = gm_model.predict(image)
ga_output = ga_model.predict(image)

# print output image
gm_output = np.squeeze(gm_output)
ga_output = np.squeeze(ga_output)
print("gm_output: ", gm_output)
print("ga_output: ", ga_output)
Output:
gm_output: 9.0
ga_output: 4.0625
In convolutional neural networks (CNNs), the pooling layer is a common type of
layer that is typically added after convolutional layers. The pooling layer is used
to reduce the spatial dimensions (i.e., the width and height) of the feature maps,
while preserving the depth (i.e., the number of channels).
The pooling layer works by dividing the input feature map into a set of non-overlapping regions, called pooling regions. Each pooling region is then transformed into a single output value, which represents the presence of a particular feature in that region. The most common types of pooling operations are max pooling and average pooling.

In max pooling, the output value for each pooling region is simply the maximum value of the input values within that region. This has the effect of preserving the most salient features in each pooling region, while discarding less relevant information. Max pooling is often used in CNNs for object recognition tasks, as it helps to identify the most distinctive features of an object, such as its edges and corners.

In average pooling, the output value for each pooling region is the average of the input values within that region. This has the effect of preserving more information than max pooling, but may also dilute the most salient features. Average pooling is often used in CNNs for tasks such as image segmentation and object detection, where a more fine-grained representation of the input is required.
Pooling layers are typically used in conjunction with convolutional layers in a
CNN, with each pooling layer reducing the spatial dimensions of the feature maps,
while the convolutional layers extract increasingly complex features from the
input. The resulting feature maps are then passed to a fully connected layer, which
performs the final classification or regression task.
Advantages of Pooling Layer:
- Dimensionality reduction: The main advantage of pooling layers is that they help in reducing the spatial dimensions of the feature maps. This reduces the computational cost and also helps in avoiding overfitting by reducing the number of parameters in the model.
- Translation invariance: Pooling layers are also useful in achieving translation invariance in the feature maps. This means that the position of an object in the image does not affect the classification result, as the same features are detected regardless of the position of the object.
- Feature selection: Pooling layers can also help in selecting the most important features from the input, as max pooling selects the most salient features and average pooling preserves more information.

Disadvantages of Pooling Layer:
- Information loss: One of the main disadvantages of pooling layers is that they discard some information from the input feature maps, which can be important for the final classification or regression task.
- Over-smoothing: Pooling layers can also cause over-smoothing of the feature maps, which can result in the loss of some fine-grained details that are important for the final classification or regression task.
- Hyperparameter tuning: Pooling layers also introduce hyperparameters such as the size of the pooling regions and the stride, which need to be tuned in order to achieve optimal performance. This can be time-consuming and requires some expertise in model building.
Python3
import tensorflow as tf

# Display the version
print(tf.__version__)

# other imports
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Input, Conv2D, Dense, Flatten, Dropout
from tensorflow.keras.layers import GlobalMaxPooling2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.models import Model
Output:
2.4.1
The output of the above code should display the version of TensorFlow you are using, e.g. 2.4.1.
Now we have the required module support so let’s load in our data. The dataset of
CIFAR-10 is available on tensorflow keras API, and we can download it on our local
machine using tensorflow.keras.datasets.cifar10 and then distribute it to train and
test set using load_data() function.
Python3
# Load in the data
cifar10 = tf.keras.datasets.cifar10

# Distribute it to train and test set
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
Output:
The output of the above code will display the shape of all four partitions and will
look something like this
Here we can see we have 50,000 training images and 10,000 test images, and all the images are of size 32 by 32 with 3 color channels, i.e. they are color images. It is also visible that only a single label is assigned to each image.
Until now, we have our data with us. But we still cannot send it directly to our neural network; we need to process the data first. The first thing in the process is to scale down the pixel values. Currently, all the image pixels are in the range 0-255, and we need to reduce those values to a range between 0 and 1. This enables our model to easily track trends and train efficiently. We can do this simply by dividing all pixel values by 255.0. Another thing we want to do is to flatten (in simple words, rearrange into the form of a row) the label values using the flatten() function.
Python3
# Reduce pixel values
x_train, x_test = x_train / 255.0, x_test / 255.0

# flatten the label values
y_train, y_test = y_train.flatten(), y_test.flatten()
Now is a good time to see a few images from our dataset. We can visualize them in a subplot grid. Since the image size is just 32x32, don't expect much from them; they will look blurred. We can do the visualization using the subplot() function from matplotlib, looping over the first 25 images from our training dataset portion.
Python3
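The visualization code is not preserved in this extract; a minimal sketch following the description above would be:
# Visualize the first 25 training images in a 5 x 5 subplot grid
plt.figure(figsize=(8, 8))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.imshow(x_train[i])
    plt.axis('off')
plt.show()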
Though the images are not clear, there are enough pixels for us to specify which object is in each of them.
After completing all the steps, now is the time to build our model. We are going to use a Convolutional Neural Network (CNN) to train our model. It includes a convolution layer, the Conv2D layer, as well as pooling and normalization methods. Finally, we'll pass it into a dense layer and the final dense layer, which is our output layer. We are using the 'relu' activation function; the output layer uses a 'softmax' function.
Python3
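The model-building code is not preserved in this extract. A sketch using the functional-API layers imported earlier is below; the exact filter counts and dense width are assumptions:
# A sketch of the CNN described above, built with the imported layers
i = Input(shape=x_train[0].shape)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(i)
x = BatchNormalization()(x)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
x = Dropout(0.2)(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(10, activation='softmax')(x)

model = Model(i, x)
model.summary()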
Our model is now ready; it's time to compile it. We are using the model.compile() function to compile our model. For the parameters, we are using
- the adam optimizer
- sparse_categorical_crossentropy as the loss function
- metrics=['accuracy']
Python3
# Compile
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Now let's fit our model using model.fit(), passing all our data to it. We are going to train our model for 50 epochs; this gives a fair result, though you can tweak it if you want.
Python3
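The fitting code is not preserved in this extract; a minimal sketch (the variable name r matches the plotting code further down) would be:
# Train for 50 epochs, validating on the test set
r = model.fit(x_train, y_train,
              validation_data=(x_test, y_test),
              epochs=50)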
Output:
The model will start training, and it will look something like this
After this, our model is trained. Though it will work fine, to make our model more accurate we can add data augmentation to our data and then train it again. Calling model.fit() again on the augmented data will continue training where it left off. We are going to fit our data with a batch size of 32, shift the width and height by a range of 0.1, and flip the images horizontally. Then we call model.fit again for 50 epochs.
Python3
# Fit with data augmentation
# Note: if you run this AFTER calling
# the previous model.fit()
# it will CONTINUE training where it left off
batch_size = 32
data_generator = tf.keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)

train_generator = data_generator.flow(x_train, y_train, batch_size)
steps_per_epoch = x_train.shape[0] // batch_size

r = model.fit(train_generator,
              validation_data=(x_test, y_test),
              steps_per_epoch=steps_per_epoch,
              epochs=50)
Output:
The model will start training for 50 epochs. Though it is running on GPU it will
take at least 10 to 15 minutes.
Now we have trained our model, before making any predictions from it let’s
visualize the accuracy per iteration for better analysis. Though there are other
methods that include confusion matrix for better analysis of the model.
Python3
# Plot accuracy per iteration
plt.plot(r.history['accuracy'], label='acc', color='red')
plt.plot(r.history['val_accuracy'], label='val_acc', color='green')
plt.legend()
Output:
Let's make a prediction for an image from our model using the model.predict() function. Before sending the image to our model, we need its pixel values to be between 0 and 1 and its shape to be (1, 32, 32, 3), as our model expects the input in this form only. To make things easy, let us take an image from the dataset itself. It is already in the reduced-pixel format, but we still have to reshape it to (1, 32, 32, 3) using the reshape() function. Since we are using data from the dataset, we can compare the predicted output with the original output.
Python3
# label mapping
labels = '''airplane automobile bird cat deer
dog frog horse ship truck'''.split()

# select the image from our test dataset
image_number = 0

# display the image
plt.imshow(x_test[image_number])

# load the image in an array
n = np.array(x_test[image_number])

# reshape it
p = n.reshape(1, 32, 32, 3)

# pass in the network for prediction and
# save the predicted label
predicted_label = labels[model.predict(p).argmax()]

# load the original label
original_label = labels[y_test[image_number]]

# display the result
print("Original label is {} and predicted label is {}".format(
    original_label, predicted_label))
Output:
Now we have the output as Original label is cat and the predicted label is also
cat.
Let’s check it for some label which was misclassified by our model, e.g. for image
number 5722 we receive something like this:
Finally, let’s save our model using model.save() function as an h5 file. If you are
using Google colab you can download your model from the files section.
Python3
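The saving code is not preserved in this extract; a minimal sketch (the filename is a hypothetical example) would be:
# Save the trained model as an h5 file
model.save('cifar10_model.h5')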
Introduction:
Introduced in the 1980s by Yann LeCun, Convolution Neural Networks (also called CNNs or ConvNets) have come a long way. From being employed for simple digit classification tasks, CNN-based architectures are now used widely across many Deep Learning and Computer Vision-related tasks like object detection, image segmentation, and gaze tracking, among others. Using the PyTorch framework, this article will implement a CNN-based image classifier on the popular CIFAR-10 dataset.
Before going ahead with the code and installation, the reader is expected to
understand how CNNs work theoretically and with various related operations like
convolution, pooling, etc. The article also assumes a basic familiarity with the
PyTorch workflow and its various utilities, like Dataloaders, Datasets, Tensor
transforms, and CUDA operations. For a quick refresher of these concepts, the
reader is encouraged to go through the following articles:
Introduction to Convolutional Neural NetworkTraining Neural Networks with
Validation using PyTorchHow to set up and Run CUDA Operations in Pytorch?
Installation
For the implementation of the CNN and downloading the CIFAR-10 dataset, we’ll be
requiring the torch and torchvision modules. Apart from that, we’ll be using numpy
and matplotlib for data analysis and plotting. The required libraries can be
installed using the pip package manager through the following command:
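The exact command is not preserved in this extract; a typical invocation would be:
$ pip install torch torchvision numpy matplotlib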
Python3
import torch
import torchvision
import matplotlib.pyplot as plt
import numpy as np

# The below two lines are optional and are just there to avoid any SSL
# related errors while downloading the CIFAR-10 dataset
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Defining plotting settings
plt.rcParams['figure.figsize'] = 14, 6

# Initializing normalizing transform for the dataset
normalize_transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=(0.5, 0.5, 0.5),
                                     std=(0.5, 0.5, 0.5))])

# Downloading the CIFAR10 dataset into train and test sets
train_dataset = torchvision.datasets.CIFAR10(
    root="./CIFAR10/train", train=True,
    transform=normalize_transform, download=True)

test_dataset = torchvision.datasets.CIFAR10(
    root="./CIFAR10/test", train=False,
    transform=normalize_transform, download=True)

# Generating data loaders from the corresponding datasets
batch_size = 128
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size)

# Plotting 25 images from the 1st batch
dataiter = iter(train_loader)
images, labels = next(dataiter)
plt.imshow(np.transpose(torchvision.utils.make_grid(
    images[:25], normalize=True, padding=1, nrow=5).numpy(), (1, 2, 0)))
plt.axis('off')
Output:
Figure 1: Some sample images from the training dataset
Step-2: Plotting class distribution of the dataset
It’s generally a good idea to plot out the class distribution of the training set.
This helps in checking whether the provided dataset is balanced or not. To do this,
we iterate over the entire training set in batches and collect the respective
classes of each instance. Finally, we calculate the counts of the unique classes
and plot them.
Code:
Python3
# Iterating over the training dataset and storing the target class for each sample
classes = []
for batch_idx, data in enumerate(train_loader, 0):
    x, y = data
    classes.extend(y.tolist())

# Calculating the unique classes and the respective counts and plotting them
unique, counts = np.unique(classes, return_counts=True)
names = list(test_dataset.class_to_idx.keys())
plt.bar(names, counts)
plt.xlabel("Target Classes")
plt.ylabel("Number of training instances")
Output:
Figure 2: Class distribution of the training set
As shown in Figure 2, each of the ten classes has almost the same number of
training samples. Thus we don’t need to take additional steps to rebalance the
dataset.
Step-3: Implementing the CNN architecture
On the architecture side, we’ll be using a simple model that employs three
convolution layers with depths 32, 64, and 64, respectively, followed by two fully
connected layers for performing classification.
- Each convolutional layer involves a convolutional operation with a 3x3 convolution filter and is followed by a ReLU activation operation for introducing nonlinearity into the system, and a max-pooling operation with a 2x2 filter to reduce the dimensionality of the feature map.
- After the end of the convolutional blocks, we flatten the multidimensional layer into a low-dimensional structure for starting our classification blocks. After the first linear layer, the last output layer (also a linear layer) has ten neurons for each of the ten unique classes in our dataset.
The architecture is as follows:
Figure 3: Architecture of the CNN
For building our model, we'll make a CNN class inherited from the torch.nn.Module class to take advantage of the PyTorch utilities. Apart from that, we'll be using the torch.nn.Sequential container to combine our layers one after the other. The Conv2d(), ReLU(), and MaxPool2d() layers perform the convolution, activation, and pooling operations. We use padding of 1 to give sufficient learning space to the kernel, as padding gives the image more coverage area, especially for the pixels in the outer frame. After the convolutional blocks, the Linear() fully connected layers perform classification.
Code:
Python3
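The model code did not survive extraction; below is a minimal sketch consistent with the architecture described above (three 3×3 convolutions of depth 32, 64, and 64, each followed by ReLU and 2×2 max-pooling, then two linear layers). The hidden width of the first linear layer (512) is an assumption:

# Sketch of the CNN described above; classifier width 512 is assumed
class CNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential(
            # First convolutional block: depth 32
            torch.nn.Conv2d(3, 32, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2),
            # Second convolutional block: depth 64
            torch.nn.Conv2d(32, 64, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2),
            # Third convolutional block: depth 64
            torch.nn.Conv2d(64, 64, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2),
            # Flatten the 64 x 4 x 4 feature map (32 -> 16 -> 8 -> 4 after pooling)
            torch.nn.Flatten(),
            torch.nn.Linear(64 * 4 * 4, 512),
            torch.nn.ReLU(),
            # Ten output neurons, one per CIFAR-10 class
            torch.nn.Linear(512, 10))

    def forward(self, x):
        return self.model(x)

The training and evaluation blocks were likewise not preserved; the final step below assumes the test-set predictions have already been collected into y_pred (predicted class indices) and y_true (actual class indices), and plots a few test images with their labels: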
# Generating predictions for 'num_images' amount of images from the last batch of test set
num_images = 5
y_true_name = [names[y_true[idx]] for idx in range(num_images)]
y_pred_name = [names[y_pred[idx]] for idx in range(num_images)]

# Generating the title for the plot
title = f"Actual labels: {y_true_name}, Predicted labels: {y_pred_name}"

# Finally plotting the images with their actual and predicted labels in the title
plt.imshow(np.transpose(torchvision.utils.make_grid(
    images[:num_images].cpu(), normalize=True, padding=1).numpy(), (1, 2, 0)))
plt.title(title)
plt.axis("off")
Output:
Figure 6: Actual vs. Predicted labels for 5 sample images from the test set. Note
that the labels are in the same order as the respective images, from left to right.
As can be seen from Figure 6, the model is producing correct predictions for all
the images except the 2nd one as it misclassifies the dog as a cat!
Conclusion:
This article covered the PyTorch implementation of a simple CNN on the popular
CIFAR-10 dataset. The reader is encouraged to play around with the network
architecture and model hyperparameters to increase the model accuracy even more!
References
https://cs231n.github.io/convolutional-networks/
https://pytorch.org/docs/stable/index.html
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
Python3
import torch
from torchsummary import summary
import torch.nn as nn
import torch.nn.functional as F


class LeNet5(nn.Module):
    def __init__(self):
        # Call the parent class's init method
        super(LeNet5, self).__init__()
        # First Convolutional Layer
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1)
        # Max Pooling Layer
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Second Convolutional Layer
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1)
        # First Fully Connected Layer
        self.fc1 = nn.Linear(in_features=16 * 5 * 5, out_features=120)
        # Second Fully Connected Layer
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        # Output Layer
        self.fc3 = nn.Linear(in_features=84, out_features=10)

    def forward(self, x):
        # Pass the input through the first convolutional layer, activation and pooling
        x = self.pool(F.relu(self.conv1(x)))
        # Pass the output through the second convolutional layer, activation and pooling
        x = self.pool(F.relu(self.conv2(x)))
        # Reshape the output to be passed through the fully connected layers
        x = x.view(-1, 16 * 5 * 5)
        # First fully connected layer and activation function
        x = F.relu(self.fc1(x))
        # Second fully connected layer and activation function
        x = F.relu(self.fc2(x))
        # Pass the output through the output layer
        x = self.fc3(x)
        # Return the final output
        return x


lenet5 = LeNet5()
print(lenet5)
Output:
LeNet5(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1,
ceil_mode=False)
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=400, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)

Model Summary:
Print the summary of lenet5 to check the params:
Python3
# add the cuda to the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lenet5.to(device)

# Print the summary of the model
summary(lenet5, (1, 32, 32))
Output:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 6, 28, 28] 156
MaxPool2d-2 [-1, 6, 14, 14] 0
Conv2d-3 [-1, 16, 10, 10] 2,416
MaxPool2d-4 [-1, 16, 5, 5] 0
Linear-5 [-1, 120] 48,120
Linear-6 [-1, 84] 10,164
Linear-7 [-1, 10] 850
================================================================
Total params: 61,706
Trainable params: 61,706
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.06
Params size (MB): 0.24
Estimated Total Size (MB): 0.30
----------------------------------------------------------------

2. AlexNet

The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by a large margin, achieving a top-5 error rate of 17% while the second-best entry achieved 26%. It was introduced by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. AlexNet is quite similar to LeNet-5, only much bigger and deeper, and it was the first architecture to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. AlexNet has about 60 million parameters across 8 layers in total: 5 convolutional and 3 fully connected. AlexNet was the first to use Rectified Linear Units (ReLUs) as activation functions, and it was the first CNN architecture to use GPUs to improve performance.

AlexNet
Example Model of AlexNet
Python3
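The code for this block was not preserved; the sketch below, written in the same style as the LeNet-5 class above, reproduces the module printed in the output that follows (the layer hyperparameters are read off that output, and the classifier input of 256 * 6 * 6 matches the summary further down):

import torch.nn as nn
import torch.nn.functional as F


class AlexNet(nn.Module):
    def __init__(self):
        super(AlexNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)
        self.conv2 = nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2)
        self.conv3 = nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1)
        self.conv4 = nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1)
        self.conv5 = nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(256 * 6 * 6, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, 1000)

    def forward(self, x):
        # Conv blocks 1 and 2 are each followed by 3x3 max-pooling
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # Conv blocks 3-5 are stacked directly, pooling only after the last
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = self.pool(F.relu(self.conv5(x)))
        # Flatten the 256 x 6 x 6 feature map for the classifier head
        x = x.view(-1, 256 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


alexnet = AlexNet()
print(alexnet)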
Output:
AlexNet(
(conv1): Conv2d(3, 96, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
(pool): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1,
ceil_mode=False)
(conv2): Conv2d(96, 256, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(conv3): Conv2d(256, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv4): Conv2d(384, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv5): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fc1): Linear(in_features=9216, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=4096, bias=True)
(fc3): Linear(in_features=4096, out_features=1000, bias=True)
)

Model Summary:
Print the summary of alexnet to check the params:
Python3
# add the cuda to the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
alexnet.to(device)

# Print the summary of the model
summary(alexnet, (3, 224, 224))
Output:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 96, 55, 55] 34,944
MaxPool2d-2 [-1, 96, 27, 27] 0
Conv2d-3 [-1, 256, 27, 27] 614,656
MaxPool2d-4 [-1, 256, 13, 13] 0
Conv2d-5 [-1, 384, 13, 13] 885,120
Conv2d-6 [-1, 384, 13, 13] 1,327,488
Conv2d-7 [-1, 256, 13, 13] 884,992
MaxPool2d-8 [-1, 256, 6, 6] 0
Linear-9 [-1, 4096] 37,752,832
Linear-10 [-1, 4096] 16,781,312
Linear-11 [-1, 1000] 4,097,000
================================================================
Total params: 62,378,344
Trainable params: 62,378,344
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 5.96
Params size (MB): 237.95
Estimated Total Size (MB): 244.49
----------------------------------------------------------------
Output as in the Google Colab link: https://colab.research.google.com/drive/1kicnALE1T2c28hHPYeyFwNaOpkl_nFpQ?usp=sharing
3. GoogleNet (Inception v1)

The GoogleNet architecture was created by Christian Szegedy of Google Research and achieved a breakthrough result by lowering the top-5 error rate to below 7% in the ILSVRC 2014 challenge. This success was largely attributable to its much deeper architecture, enabled by its inception modules, which use parameters far more efficiently than preceding architectures. GoogleNet has about ten times fewer parameters than AlexNet (roughly 6 million instead of 60 million).

The architecture of the inception module looks as shown in the figure.

GoogleNet (Inception Module)

The notation "3 x 3 + 2(S)" means that the layer uses a 3 x 3 kernel, a stride of 2, and SAME padding. The input signal is fed to four different branches, each with a ReLU activation function and a stride of 1. These convolutional layers have varying kernel sizes (1 x 1, 3 x 3, and 5 x 5) to capture patterns at different scales. Additionally, each layer uses SAME padding, so all outputs have the same height and width as their inputs. This allows the feature maps from all four top convolutional layers to be concatenated along the depth dimension in the final depth-concat layer. The overall GoogleNet architecture is 22 layers deep.
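As a concrete sketch, an inception-style module in PyTorch might look as follows. This is a simplified illustration: the real GoogleNet module also places 1 x 1 reduction convolutions before the 3 x 3 and 5 x 5 branches, and the branch widths (c1, c3, c5, cp) here are arbitrary placeholders:

Python3
import torch
import torch.nn as nn
import torch.nn.functional as F


class InceptionModule(nn.Module):
    def __init__(self, in_channels, c1, c3, c5, cp):
        super().__init__()
        # 1 x 1 branch
        self.branch1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        # 3 x 3 branch (SAME padding keeps height and width unchanged)
        self.branch3 = nn.Conv2d(in_channels, c3, kernel_size=3, padding=1)
        # 5 x 5 branch
        self.branch5 = nn.Conv2d(in_channels, c5, kernel_size=5, padding=2)
        # Pooling branch followed by a 1 x 1 convolution
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, cp, kernel_size=1))

    def forward(self, x):
        outs = [F.relu(self.branch1(x)), F.relu(self.branch3(x)),
                F.relu(self.branch5(x)), F.relu(self.branch_pool(x))]
        # Depth-concat layer: stack branch outputs along the channel dimension
        return torch.cat(outs, dim=1)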
4. ResNet (Residual Network)

Residual Network (ResNet), the winner of the ILSVRC 2015 challenge, was developed by Kaiming He et al. and delivered an impressive top-5 error rate of 3.6% with an extremely deep CNN composed of 152 layers. An essential factor enabling the training of such a deep network is the use of skip connections (also known as shortcut connections): the signal that enters a layer is also added to the output of a layer located higher up the stack. Let's explore why this is beneficial.

When training a neural network, the goal is to make it model a target function h(x). By adding the input x to the output of the network (a skip connection), the network is forced to model f(x) = h(x) - x instead, a technique known as residual learning. Equivalently, h(x) = f(x) + x.

Skip (Shortcut) connection

When initializing a regular neural network, its weights are near zero, resulting in the network outputting values close to zero. With the addition of skip connections, the resulting network outputs a copy of its inputs, effectively modeling the identity function. If the target function is fairly close to the identity function (which is often the case), this speeds up training considerably. Moreover, if many skip connections are added, the network can begin to make progress even if several layers have not yet begun learning. The deep residual network can be viewed as a series of residual units, each of which is a small neural network with a skip connection.
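A minimal sketch of such a residual unit in PyTorch, assuming the input and output channel counts match so the identity shortcut can be added directly (real ResNets use a strided 1 x 1 convolution on the shortcut when the shape changes):

Python3
import torch.nn as nn
import torch.nn.functional as F


class ResidualUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # f(x): the residual that the stacked layers must learn
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # h(x) = f(x) + x: the skip connection adds the input back in
        return F.relu(out + x)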
5. DenseNet

The DenseNet model introduced the concept of a densely connected convolutional network, where the output of each layer is connected to the input of every subsequent layer. This design principle was developed to address the issue of accuracy decline caused by vanishing and exploding gradients in very deep networks: in simpler terms, due to the long distance between the input and output layers, the signal is lost before it reaches its destination.

DenseNet

All convolutions in a dense block are ReLU-activated and use batch normalization. Channel-wise concatenation is only possible if the height and width dimensions of the data remain unchanged, so convolutions in a dense block are all of stride 1. Pooling layers are inserted between dense blocks for further dimensionality reduction.

Intuitively, one might think that by concatenating all previously seen outputs, the number of channels and parameters would grow explosively. However, DenseNet is surprisingly economical in terms of learnable parameters. This is because each concatenated block, which may have a relatively large number of channels, is first fed through a 1×1 convolution, reducing it to a small number of channels; 1×1 convolutions are economical in terms of parameters. Then, a 3×3 convolution with the same number of channels is applied.

The resulting channels from each step of the DenseNet are concatenated to the collection of all previously generated outputs. Each step, which utilizes a pair of 1×1 and 3×3 convolutions, adds K channels to the data. Consequently, the number of channels increases linearly with the number of convolutional steps in the dense block. This growth rate K remains constant throughout the network, and DenseNet has demonstrated good performance with K values between 12 and 40.

Dense blocks and pooling layers are combined to form a full DenseNet network. DenseNet-121 has 121 layers; however, the structure is adjustable and can readily be extended to more than 200 layers.

DenseNet
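A sketch of one 1×1 + 3×3 step of a dense block is shown below; the 4x bottleneck width is an assumption borrowed from the original DenseNet paper:

Python3
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseLayer(nn.Module):
    # One 1x1 + 3x3 step of a dense block, adding K (growth_rate) channels
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        # 1x1 bottleneck cheaply reduces the concatenated input
        self.conv1 = nn.Conv2d(in_channels, 4 * growth_rate, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(4 * growth_rate)
        # 3x3 convolution produces the K new channels (stride 1 keeps H and W)
        self.conv2 = nn.Conv2d(4 * growth_rate, growth_rate, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(growth_rate)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        # Concatenate the K new channels onto everything produced so far
        return torch.cat([x, out], dim=1)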
In terms of speed, YOLO is one of the best models in object recognition, able to recognize objects and process frames at up to 150 FPS for small networks. However, in terms of accuracy, YOLO was not the state-of-the-art model, but it has a fairly good mean average precision (mAP) of 63% when trained on PASCAL VOC 2007 and PASCAL VOC 2012. (Fast R-CNN, the state of the art at that time, has an mAP of 71%.)

YOLO v2 and YOLO 9000 were proposed by J. Redmon and A. Farhadi in 2016 in the paper titled YOLO9000: Better, Faster, Stronger. At 67 FPS, YOLOv2 gives an mAP of 76.8%, and at 40 FPS it gives an mAP of 78.6% on the VOC 2007 dataset, bettering models like Faster R-CNN and SSD. YOLO 9000 used the YOLO v2 architecture but was able to detect more than 9000 classes; it has, however, an mAP of 19.7%.
Batch Normalization:
By adding batch normalization to the architecture, we improve the convergence of the model, which leads to faster training. It also reduces the need for other regularization such as dropout, without causing overfitting. It is also observed that adding batch normalization alone increases mAP by 2% compared to basic YOLO.

Anchor Boxes:
We remove the fully connected layer responsible for predicting bounding boxes and replace it with anchor-box prediction.

YOLOv1 with layers removed (in filled red color)

We change the input size from 448 x 448 to 416 x 416. This creates a 13 x 13 feature map when we downsample by 32x. The idea behind this is that an odd-sized feature map has a single center cell, and there is a good possibility that a large object lies at the center of the image. We also remove one pooling layer to get a 13 x 13 spatial output instead of 7 x 7.

With these changes, the mAP of the model slightly decreases (from 69.5% to 69.2%); however, recall increases from 81% to 88%.
Output of each object proposal
Dimensionality clusters:
We need to identify the number of anchors (priors) that provide the best results; let us call this number K. Our task is to identify the top-K bounding box shapes that give maximum accuracy. We use the K-means clustering algorithm for this purpose, but instead of minimizing the Euclidean distance, we use IOU as the similarity measure (so the distance to minimize is 1 - IOU); see the sketch after the figure below.

YOLO v2 uses K = 5 as the best trade-off: we can conclude from the graph below that increasing K beyond 5 does not change accuracy significantly. IOU-based clustering with K = 5 gives an mAP of 61%.
Dimension clusters(number of dimension for each anchors) vs mAP
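A minimal sketch of this clustering idea, assuming the ground-truth boxes are given as a float array of (width, height) pairs and computing IOU between boxes aligned at a common corner, as in the paper:

Python3
import numpy as np


def iou_wh(box, clusters):
    # IOU between one (w, h) box and each cluster, with boxes aligned at a corner
    w = np.minimum(box[0], clusters[:, 0])
    h = np.minimum(box[1], clusters[:, 1])
    intersection = w * h
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - intersection
    return intersection / union


def kmeans_anchors(boxes, k=5, iterations=100):
    # boxes: float array of shape (N, 2) holding (width, height) pairs
    clusters = boxes[np.random.choice(len(boxes), k, replace=False)].copy()
    for _ in range(iterations):
        # Assign each box to the cluster with the highest IOU (distance = 1 - IOU)
        assignments = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes])
        # Move each cluster to the mean shape of its assigned boxes
        for i in range(k):
            if np.any(assignments == i):
                clusters[i] = boxes[assignments == i].mean(axis=0)
    return clusters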
Multi-Scale Training:
YOLO v2 is trained on different input sizes, from 320 x 320 to 608 x 608, in steps of 32. The architecture randomly chooses a new image dimension every 10 batches. This establishes a trade-off between accuracy and image size: for example, YOLOv2 with an image size of 288 x 288 runs at 90 FPS and gives roughly as much mAP as Fast R-CNN.
Architecture:
YOLO v2 was evaluated on different backbone architectures such as VGG-16 and GoogleNet. The paper also proposed an architecture called Darknet-19. The reason for choosing Darknet-19 is its lower processing requirement than the other architectures: 5.58 billion operations, compared to 30.69 billion for VGG-16 on a 224 x 224 image and 8.52 billion for customized GoogleNet. The structure of Darknet-19 is given below.

For detection purposes, we replace the last convolution layer of this architecture with three 3 x 3 convolution layers of 1024 filters each, followed by a 1 x 1 convolution with the number of outputs we need for detection. For VOC we predict 5 boxes, each with 5 coordinates (tx, ty, tw, th, and to, the objectness score) and 20 class probabilities, so the total number of filters is 5 x (5 + 20) = 125.
Darknet-19 architecture
Training:
YOLOv2 is trained in two stages: first for classification, then for detection, using the modifications to the Darknet-19 architecture discussed above. The model is trained for 160 epochs with a starting learning rate of 10^-3, weight decay of 0.0005, and momentum of 0.9. The same strategy is used for training the model on both COCO and VOC.
Python3
import nltk
nltk.download('all')
Now, having installed NLTK successfully in our system, let’s perform some basic
operations on text data using NLTK.
Tokenization

Tokenization refers to breaking down the text into smaller units. It entails splitting paragraphs into sentences and sentences into words. It is one of the initial steps of any NLP pipeline. Let us have a look at the two major kinds of tokenization that NLTK provides:
Word Tokenization: It involves breaking down the text into words. Example: "I study Machine Learning on GeeksforGeeks." will be word-tokenized as ['I', 'study', 'Machine', 'Learning', 'on', 'GeeksforGeeks', '.'].

Sentence Tokenization: It involves breaking down the text into individual sentences. Example: "I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP." will be sentence-tokenized as ['I study Machine Learning on GeeksforGeeks.', "Currently, I'm studying NLP."].

In Python, both these tokenizations can be implemented in NLTK as follows:
Python3
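The code for this block was not preserved; a minimal sketch consistent with the output below:

from nltk.tokenize import word_tokenize, sent_tokenize

text = ("GeeksforGeeks is a great learning platform. "
        "It is one of the best for Computer Science students.")
# Split into words, then into sentences
print(word_tokenize(text))
print(sent_tokenize(text))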
Output:
['GeeksforGeeks', 'is', 'a', 'great', 'learning', 'platform', '.', 'It', 'is', 'one', 'of', 'the', 'best', 'for', 'Computer', 'Science', 'students', '.']
['GeeksforGeeks is a great learning platform.', 'It is one of the best for Computer Science students.']

Stemming and Lemmatization

When working with natural language, we are not much interested in the form of words; rather, we are concerned with the meaning that the words intend to convey. Thus, we try to map every word of the language to its root/base form. This process is called canonicalization.
E.g. The words ‘play’, ‘plays’, ‘played’, and ‘playing’ convey the same action –
hence, we can map them all to their base form i.e. ‘play’.
Now, there are two widely used canonicalization techniques: Stemming and
Lemmatization.
Stemming

Stemming generates the base word from the inflected word by removing the affixes of the word. It has a set of pre-defined rules that govern the dropping of these affixes. It must be noted that stemmers might not always result in semantically meaningful base words. Stemmers are faster and computationally less expensive than lemmatizers.
In the following code, we will be stemming words using Porter Stemmer – one of the
most widely used stemmers:
Python3
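The code block was not preserved; a sketch consistent with the output below:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Stem all the inflected forms of 'play'
for word in ["play", "plays", "played", "playing"]:
    print(stemmer.stem(word))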
Output:
play
play
play
play

We can see that all the variations of the word 'play' have been reduced to the same word, 'play'. In this case, the output is a meaningful word, 'play'. However, this is not always the case. Let us take an example.
Python3
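Again, the original block was not preserved; a one-line sketch matching the output:

from nltk.stem import PorterStemmer

print(PorterStemmer().stem("communication"))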
Output:
commun

The stemmer reduces the word 'communication' to the base word 'commun', which is meaningless in itself.
Lemmatization

Lemmatization involves grouping together the inflected forms of the same word. This way, we can reach the base form of any word, and that base form will be meaningful in nature. The base form here is called the lemma. Please note that these groups are stored in the lemmatizer; there is no removal of affixes as in the case of a stemmer. Lemmatizers are slower and computationally more expensive than stemmers.

Example: 'play', 'plays', 'played', and 'playing' have 'play' as the lemma. In Python, lemmatization can be implemented in NLTK as follows:
Python3
from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))
Output:
play
play
play
play

Please note that in lemmatizers, we need to pass the part of speech of the word along with the word as a function argument. Also, unlike stemmers, lemmatizers always result in meaningful base words. Let us take the same example as we took in the case of stemmers.
Python3
from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("Communication", 'v'))
Output:
Communication

Part of Speech Tagging

Part of speech (POS) tagging refers to assigning each word of a sentence to its part of speech. It is significant as it helps to give a better syntactic overview of a sentence.

Example: "GeeksforGeeks is a Computer Science platform." Let's see how NLTK's POS tagger will tag this sentence. In Python, POS tagging can be implemented with NLTK as follows:
Python3
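The code block (and its output) did not survive; a sketch of how this step would look with NLTK's pos_tag:

from nltk import pos_tag
from nltk.tokenize import word_tokenize

text = "GeeksforGeeks is a Computer Science platform."
# Tokenize first, then tag each token with its part of speech
print(pos_tag(word_tokenize(text)))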
In a similar way, we can create word vectors for different words as well on the
basis of given features. The words with similar vectors are most likely to have the
same meaning or are used to convey the same sentiment.
Approaches for Text Representation

1. Traditional Approach

The conventional method involves compiling a list of distinct terms and giving each one a unique integer value, or id, and after that, inserting each word's distinct id into the sentence. Every vocabulary word is handled as a feature in this instance. Thus, a large vocabulary will result in an extremely large feature size. Common traditional methods include:
1.1. One-Hot Encoding

One-hot encoding is a simple method for representing words in natural language processing (NLP). In this encoding scheme, each word in the vocabulary is represented as a unique vector, where the dimensionality of the vector is equal to the size of the vocabulary. The vector has all elements set to 0, except for the element corresponding to the index of the word in the vocabulary, which is set to 1.
Python3
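The code block was not preserved; a pure-Python sketch consistent with the output below. The example sentence is inferred from the encoded sequence, and the exact vocabulary ordering depends on Python's set iteration order, so it may differ between runs:

def one_hot_encode(text):
    words = text.split()
    vocabulary = set(words)
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    encoded = []
    for word in words:
        # All zeros, except a 1 at the word's vocabulary index
        vector = [0] * len(vocabulary)
        vector[word_to_index[word]] = 1
        encoded.append((word, vector))
    return vocabulary, word_to_index, encoded


vocabulary, word_to_index, encoded = one_hot_encode(
    "cat in the hat dog on the mat bird in the tree")
print("Vocabulary:", vocabulary)
print("Word to Index Mapping:", word_to_index)
print("One-Hot Encoded Matrix:")
for word, vector in encoded:
    print(f"{word}: {vector}")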
Output:
Vocabulary: {'mat', 'the', 'bird', 'hat', 'on', 'in', 'cat', 'tree', 'dog'}
Word to Index Mapping: {'mat': 0, 'the': 1, 'bird': 2, 'hat': 3, 'on': 4, 'in': 5, 'cat': 6, 'tree': 7, 'dog': 8}
One-Hot Encoded Matrix:
cat: [0, 0, 0, 0, 0, 0, 1, 0, 0]
in: [0, 0, 0, 0, 0, 1, 0, 0, 0]
the: [0, 1, 0, 0, 0, 0, 0, 0, 0]
hat: [0, 0, 0, 1, 0, 0, 0, 0, 0]
dog: [0, 0, 0, 0, 0, 0, 0, 0, 1]
on: [0, 0, 0, 0, 1, 0, 0, 0, 0]
the: [0, 1, 0, 0, 0, 0, 0, 0, 0]
mat: [1, 0, 0, 0, 0, 0, 0, 0, 0]
bird: [0, 0, 1, 0, 0, 0, 0, 0, 0]
in: [0, 0, 0, 0, 0, 1, 0, 0, 0]
the: [0, 1, 0, 0, 0, 0, 0, 0, 0]
tree: [0, 0, 0, 0, 0, 0, 0, 1, 0]

While one-hot encoding is a simple and intuitive method for representing words in NLP, it has several disadvantages, which may limit its effectiveness in certain applications:
- One-hot encoding results in high-dimensional vectors, making it computationally expensive and memory-intensive, especially with large vocabularies.
- It does not capture semantic relationships between words; each word is treated as an isolated entity without considering its meaning or context.
- It is restricted to the vocabulary seen during training, making it unsuitable for handling out-of-vocabulary words.

1.2. Bag of Words (BoW)

Bag-of-Words (BoW) is a text representation technique that represents a document as an unordered set of words and their respective frequencies. It discards the word order and captures the frequency of each word in the document, creating a vector representation.
Python3
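The code block was not preserved; a scikit-learn sketch whose documents are inferred from the matrix and vocabulary in the output below:

from sklearn.feature_extraction.text import CountVectorizer

documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Build the vocabulary and count word occurrences per document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print("Bag-of-Words Matrix:")
print(bow_matrix.toarray())
print("Vocabulary (Feature Names):", vectorizer.get_feature_names_out())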
Output:
Bag-of-Words Matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Vocabulary (Feature Names): ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']

While BoW is a simple and interpretable representation, the disadvantages below highlight its limitations in capturing certain aspects of language structure and semantics:
- BoW ignores the order of words in the document, leading to a loss of sequential information and context, making it less effective for tasks where word order is crucial, such as natural language understanding.
- BoW representations are often sparse, with many elements being zero, resulting in increased memory requirements and computational inefficiency, especially when dealing with large datasets.

1.3. Term Frequency-Inverse Document Frequency (TF-IDF)

Term Frequency-Inverse Document Frequency, commonly known as TF-IDF, is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It is widely used in natural language processing and information retrieval to evaluate the significance of a term within a specific document in a larger corpus. TF-IDF consists of two components:
Term Frequency (TF): Term Frequency measures how often a term (word) appears in a document. It is calculated using the formula:

TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)

Inverse Document Frequency (IDF): Inverse Document Frequency measures the importance of a term across a collection of documents. It is calculated using the formula:

IDF(t) = log(total number of documents / number of documents containing term t)

The TF-IDF score for a term t in a document d is then given by multiplying the TF and IDF values:

TF-IDF(t, d) = TF(t, d) x IDF(t)
The higher the TF-IDF score for a term in a document, the more important that term
is to that document within the context of the entire corpus. This weighting scheme
helps in identifying and extracting relevant information from a large collection of
documents, and it is commonly used in text mining, information retrieval, and
document clustering.
Let's implement Term Frequency-Inverse Document Frequency (TF-IDF) using Python with the scikit-learn library. It begins by defining a set of sample documents. The
TfidfVectorizer is employed to transform these documents into a TF-IDF matrix. The
code then extracts and prints the TF-IDF values for each word in each document.
This statistical measure helps assess the importance of words in a document
relative to their frequency across a collection of documents, aiding in information
retrieval and text analysis tasks.
Python3
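The code block was not preserved; a sketch consistent with the output below, with the two sample documents inferred from the printed terms:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["The quick brown fox jumps over the lazy dog.",
             "A journey of a thousand miles begins with a single step."]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents).toarray()
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF value for every word that appears in each document
for i in range(len(documents)):
    print(f"Document {i + 1}:")
    for word, score in zip(feature_names, tfidf_matrix[i]):
        if score > 0:
            print(f"{word}: {score}")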
Output:
Document 1:
dog: 0.3404110310756642
lazy: 0.3404110310756642
over: 0.3404110310756642
jumps: 0.3404110310756642
fox: 0.3404110310756642
brown: 0.3404110310756642
quick: 0.3404110310756642
the: 0.43455990318254417
Document 2:
step: 0.3535533905932738
single: 0.3535533905932738
with: 0.3535533905932738
begins: 0.3535533905932738
miles: 0.3535533905932738
thousand: 0.3535533905932738
of: 0.3535533905932738
journey: 0.3535533905932738

TF-IDF is a widely used technique in information retrieval and text mining, but its limitations should be considered, especially when dealing with tasks that require a deeper understanding of language semantics. For example:
- TF-IDF treats words as independent entities and doesn't consider semantic relationships between them. This limitation hinders its ability to capture contextual information and word meanings.
- Sensitivity to document length: longer documents tend to have higher overall term frequencies, potentially biasing TF-IDF towards longer documents.

2. Neural Approach

2.1. Word2Vec

Word2Vec is a neural
approach for generating word embeddings. It belongs to the family of neural word
embedding techniques and specifically falls under the category of distributed
representation models. It is a popular technique in natural language processing
(NLP) that is used to represent words as continuous vector spaces. Developed by a
team at Google, Word2Vec aims to capture the semantic relationships between words
by mapping them to high-dimensional vectors. The underlying idea is that words with
similar meanings should have similar vector representations. In Word2Vec every word
is assigned a vector. We start with either a random vector or one-hot vector.
There are two neural embedding methods for Word2Vec, Continuous Bag of Words (CBOW)
and Skip-gram.
2.2. Continuous Bag of Words (CBOW)
Continuous Bag of Words (CBOW) is a type of neural network architecture used in the
Word2Vec model. The primary objective of CBOW is to predict a target word based on
its context, which consists of the surrounding words in a given window. Given a
sequence of words in a context window, the model is trained to predict the target
word at the center of the window.
CBOW is a feedforward neural network with a single hidden layer. The input layer represents the context words, and the output layer represents the target word. The hidden layer contains the learned continuous vector representations (word embeddings) of the input words, which makes the architecture useful for learning distributed representations of words in a continuous vector space. The weights between the input layer and the hidden layer are learned during training, and the dimensionality of the hidden layer represents the size of the word embeddings (the continuous vector space).
After applying the above neural embedding methods we get trained vectors of each
word after many iterations through the corpus. These trained vectors preserve
syntactical or semantic information and are converted to lower dimensions. The
vectors with similar meaning or semantic information are placed close to each other
in space.
Let's understand with a basic example. The Python code below uses the vector_size parameter, which controls the dimensionality of the word vectors; you can adjust other parameters, such as window, based on your specific needs.
Note: Word2Vec models can perform better with larger datasets. If you have a large
corpus, you might achieve more meaningful word embeddings.
Python3
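The code block was not preserved; a gensim sketch consistent with the output below. The toy corpus and the vector_size of 100 are assumptions:

from gensim.models import Word2Vec

# A tiny toy corpus (assumed); Word2Vec performs better with much larger datasets
sentences = [["this", "is", "a", "sample", "sentence"],
             ["word", "embeddings", "capture", "semantic", "meaning"],
             ["another", "example", "with", "the", "word", "vector"]]

# sg=0 selects the CBOW training algorithm (sg=1 would select Skip-gram)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
print("Vector representation of 'word':", model.wv['word'])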
Output:
Vector representation of 'word': [-9.5800208e-03 8.9437785e-03 4.1664648e-03 9.2367809e-03 6.6457358e-03 2.9233587e-03 9.8055992e-03 -4.4231843e-03 -6.8048164e-03 4.2256550e-03 3.7299085e-03 -5.6668529e-03 ... 2.8835384e-03 -1.5386029e-03 9.9318363e-03 8.3507905e-03 2.4184163e-03 7.1170190e-03 5.8888551e-03 -5.5787875e-03]

In practice, the choice between CBOW and Skip-gram often depends on the specific characteristics of the data and the task at hand. CBOW might be preferred when training resources are limited and capturing syntactic information is important. Skip-gram, on the other hand, might be chosen when semantic relationships and the representation of rare words are crucial.
3. Pre-trained Word Embeddings

Pre-trained word embeddings are representations of
words that are learned from large corpora and are made available for reuse in
various natural language processing (NLP) tasks. These embeddings capture semantic
relationships between words, allowing the model to understand similarities and
relationships between different words in a meaningful way.
3.1. GloVe

GloVe is trained on global word co-occurrence statistics. It leverages global context to create word embeddings that reflect the overall meaning of words based on their co-occurrence probabilities. In this method, we take the corpus, iterate through it, and record the co-occurrence of each word with the other words in the corpus, yielding a co-occurrence matrix. Words which occur next to each other get a value of 1; if they are one word apart, 1/2; if two words apart, 1/3; and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:

It is a nice evening.
Good Evening!
Is it a nice evening?

             it         is         a          nice       evening    good
it           0
is           1+1        0
a            1/2+1      1+1/2      0
nice         1/3+1/2    1/2+1/3    1+1        0
evening      1/4+1/3    1/3+1/4    1/2+1/2    1+1        0
good         0          0          0          0          1          0

The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as well to calculate the co-occurrences by shifting the frame till the end of the corpus. This helps gather information about the context in which the word is used.
Initially, the vectors for each word are assigned randomly. Then we take two pairs of vectors and see how close they are to each other in space. If they occur together more often (i.e., have a higher value in the co-occurrence matrix) but are far apart in space, they are brought closer to each other. If they are close to each other in space but are rarely or never used together, they are moved further apart.
After many iterations of the above process, we’ll get a vector space representation
that approximates the information from the co-occurrence matrix. The performance of
GloVe is better than Word2Vec in terms of both semantic and syntactic capturing.
Python3
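The code block was not preserved; a sketch mirroring the FastText snippet further below, assuming the gensim-data model "glove-wiki-gigaword-300" (the exact pre-trained model used is not stated in the text):

import gensim.downloader as api

# Load a pre-trained GloVe model (assumed: trained on Wikipedia + Gigaword)
glove_model = api.load("glove-wiki-gigaword-300")

# Define word pairs to compute similarity for
word_pairs = [('learn', 'learning'), ('india', 'indian'), ('fame', 'famous')]

# Compute similarity for each pair of words
for pair in word_pairs:
    similarity = glove_model.similarity(pair[0], pair[1])
    print(f"Similarity between '{pair[0]}' and '{pair[1]}' using GloVe: {similarity:.3f}")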
Output:
Similarity between 'learn' and 'learning' using GloVe: 0.802
Similarity between 'india' and 'indian' using GloVe: 0.865
Similarity between 'fame' and 'famous' using GloVe: 0.589

3.2. FastText

Developed by Facebook, FastText extends Word2Vec by representing words as bags of character n-grams. This approach is particularly useful for handling out-of-vocabulary words and capturing morphological variations.
Python3
import gensim.downloader as api

# Load the pre-trained fastText model
fasttext_model = api.load("fasttext-wiki-news-subwords-300")

# Define word pairs to compute similarity for
word_pairs = [('learn', 'learning'), ('india', 'indian'), ('fame', 'famous')]

# Compute similarity for each pair of words
for pair in word_pairs:
    similarity = fasttext_model.similarity(pair[0], pair[1])
    print(f"Similarity between '{pair[0]}' and '{pair[1]}' using FastText: {similarity:.3f}")
Output:
Similarity between 'learn' and 'learning' using FastText: 0.642
Similarity between 'india' and 'indian' using FastText: 0.708
Similarity between 'fame' and 'famous' using FastText: 0.519

3.3. BERT (Bidirectional Encoder Representations from Transformers)

BERT is a transformer-based model that learns contextualized embeddings for words. It considers the entire context of a word by looking at both the left and right contexts, resulting in embeddings that capture rich contextual information.
Python3
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

word_pairs = [('learn', 'learning'), ('india', 'indian'), ('fame', 'famous')]

# Compute similarity for each pair of words
for pair in word_pairs:
    # padding=True keeps the two-word batch rectangular if token lengths differ
    tokens = tokenizer(list(pair), return_tensors='pt', padding=True)
    with torch.no_grad():
        outputs = model(**tokens)
    # Extract embeddings for the [CLS] token
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    similarity = torch.nn.functional.cosine_similarity(
        cls_embedding[0], cls_embedding[1], dim=0)
    print(f"Similarity between '{pair[0]}' and '{pair[1]}' using BERT: {similarity:.3f}")
Output:
Similarity between 'learn' and 'learning' using BERT: 0.930
Similarity between 'india' and 'indian' using BERT: 0.957
Similarity between 'fame' and 'famous' using BERT: 0.956

Considerations for Deploying Word Embedding Models

- You need to use the exact same pipeline during deployment as was used to create the training data for the word embedding. If you use a different tokenizer or a different method of handling white space, punctuation, etc., you might end up with incompatible inputs.
- Words in your input that don't have a pre-trained vector are known as out-of-vocabulary (OOV) words. You can replace those words with "UNK" (unknown) and handle them separately.
- Dimension mismatch: vectors can be of many lengths. If you train a model with vectors of length, say, 400 and then try to apply vectors of length 1000 at inference time, you will run into errors. So make sure to use the same dimensions throughout.

Advantages and Disadvantages of Word Embeddings

Advantages
- It is much faster to train than hand-built models like WordNet (which uses graph embeddings).
- Almost all modern NLP applications start with an embedding layer.
- It stores an approximation of meaning.

Disadvantages
- It can be memory intensive.
- It is corpus dependent: any underlying bias will have an effect on your model.
- It cannot distinguish between homophones, e.g., brake/break, cell/sell, weather/whether.

Conclusion

In conclusion, word embedding techniques such as TF-IDF, Word2Vec, and GloVe play a crucial role in natural language processing by representing words in a lower-dimensional space, capturing semantic and syntactic information.
Frequently Asked Questions (FAQs)

1. Does GPT use word embeddings?
GPT uses context-based embeddings rather than traditional word embeddings. It captures word meaning in the context of the entire sentence.

2. What is the difference between BERT and word embeddings?
BERT is contextually aware, considering the entire sentence, while traditional word embeddings, like Word2Vec, treat each word independently.

3. What are the two types of word embedding?
Word embeddings can be broadly evaluated in two categories: intrinsic and extrinsic. For intrinsic evaluation, word embeddings are used to calculate or predict semantic similarity between words, terms, or sentences.

4. How does word vectorization work?
Word vectorization converts words into numerical vectors, capturing semantic relationships. Techniques like TF-IDF, Word2Vec, and GloVe are common.

5. What are the benefits of word embeddings?
Word embeddings offer semantic understanding, capture context, and enhance NLP tasks. They reduce dimensionality, speed up training, and aid in language pattern recognition.
In this article, we will introduce a new variation of neural network, the Recurrent Neural Network (RNN), which works better than a simple neural network when data is sequential, like time-series data and text data.
What is a Recurrent Neural Network (RNN)?

A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other. But in cases where it is required to predict the next word of a sentence, the previous words are needed, and hence there is a need to remember them. Thus RNNs came into existence, solving this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence. The state is also referred to as the memory state, since it remembers the previous input to the network. The RNN uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the parameter complexity, unlike other neural networks.
Recurrent Neural Network

How does an RNN differ from a Feedforward Neural Network?

Artificial neural networks that do not have looping nodes are called feedforward neural networks. Because all information is only passed forward, this kind of neural network is also referred to as a multi-layer neural network. Information moves from the input layer to the output layer (through any hidden layers that are present) unidirectionally in a feedforward neural network. These networks are appropriate for tasks such as image classification, where input and output are independent. Nevertheless, their inability to retain previous inputs automatically renders them less useful for sequential data analysis.
Recurrent vs. Feedforward networks

Recurrent Neuron and RNN Unfolding

The fundamental processing unit in a Recurrent Neural Network (RNN) is a recurrent unit, which is not explicitly called a "recurrent neuron." This unit has the unique ability to maintain a hidden state, allowing the network to capture sequential dependencies by remembering previous inputs while processing. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) versions improve the RNN's ability to handle long-term dependencies.

Recurrent Neuron
RNN Unfolding
Types of RNN

There are four types of RNN, based on the number of inputs and outputs in the network: One to One, One to Many, Many to One, and Many to Many.

One to One: This type of RNN behaves the same as any simple neural network; it is also known as a Vanilla Neural Network. In this network, there is only one input and one output.
One to One RNN

One to Many: In this type of RNN, there is one input and many outputs associated with it. One of the most used examples of this network is image captioning, where, given an image, we predict a sentence having multiple words.
One to Many RNN

Many to One: In this type of network, many inputs are fed to the network at several states of the network, generating only one output. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to One RNN

Many to Many: In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem. One example of this problem is language translation, where we provide multiple words from one language as input and predict multiple words from the second language as output.
Many to Many RNN

Recurrent Neural Network Architecture

RNNs have the same input and output architecture as any other deep neural architecture. However, differences arise in the way information flows from input to output. Unlike deep neural networks, where we have a different weight matrix for each dense layer, in an RNN the weights across the network remain the same. The network calculates a hidden state h_t for every input x_t using the following formulas:

h_t = σ(U x_t + W h_(t-1) + b)
y_t = O(V h_t + c)

Hence, y_t = f(x_t, h_(t-1), W, U, V, b, c).

Here h_t is the state of the network at timestep t, and the parameters W, U, V, b, c are shared across timesteps.
Recurrent Neural Architecture
How does an RNN work?

The Recurrent Neural Network consists of multiple fixed activation function units, one for each time step. Each unit has an internal state, called the hidden state of the unit, which signifies the past knowledge the network holds at a given time step. This hidden state is updated at every time step to reflect the change in the network's knowledge about the past. It is updated using the following recurrence relation:
The formula for calculating the current state:

h_t = f(h_(t-1), x_t)

where h_t is the current state, h_(t-1) is the previous state, and x_t is the input at the current step.

Applying the activation function (tanh), the formula becomes:

h_t = tanh(w_hh h_(t-1) + w_xh x_t)

where w_hh is the weight at the recurrent neuron and w_xh is the weight at the input neuron.

The formula for calculating the output:

y_t = w_hy h_t

where y_t is the output and w_hy is the weight at the output layer. These parameters are updated using backpropagation. However, since an RNN works on sequential data, we use an updated version of backpropagation known as backpropagation through time.
Backpropagation Through Time (BPTT)

In an RNN the network is ordered, and each variable is computed one at a time in a specified order: first h1, then h2, then h3, and so on. Hence we apply backpropagation through all of these hidden time states sequentially. The loss function L(θ) depends on h3; h3 in turn depends on h2 and W; h2 in turn depends on h1 and W; and h1 in turn depends on h0 and W, where h0 is a constant starting state. For simplicity, we apply backpropagation on only one row at a time, and that piece is computed the same way as backpropagation in any simple deep neural network.
Python3
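The import cell was not preserved; the later steps reference numpy and Keras layers, so a plausible set of imports is:

# Assumed imports for the character-level RNN example below
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense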
Input Generation:
Generate some example data using text.
Python3
text = "This is GeeksforGeeks a software training institute"chars =
sorted(list(set(text)))char_to_index = {char: i for i, char in
enumerate(chars)}index_to_char = {i: char for i, char in enumerate(chars)}
Python3
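The data-preparation cell was not preserved; the sketch below produces the seq_length, X_one_hot, and y_one_hot referenced by the following steps. A sequence length of 3 is an assumption, though it is consistent with the 48 training samples implied by the 2/2 batch counts in the training log:

seq_length = 3  # assumed; yields 48 samples from the 51-character text
sequences, labels = [], []
for i in range(len(text) - seq_length):
    # Each sample is a window of characters plus the character that follows it
    sequences.append([char_to_index[c] for c in text[i:i + seq_length]])
    labels.append(char_to_index[text[i + seq_length]])

X = np.array(sequences)
y = np.array(labels)

# One-hot encode the input windows and the target characters
X_one_hot = np.eye(len(chars))[X]   # shape: (samples, seq_length, len(chars))
y_one_hot = np.eye(len(chars))[y]   # shape: (samples, len(chars))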
Model Building:
Build RNN Model using ‘relu’ and ‘softmax‘ activation function.
Python3
model = Sequential()
model.add(SimpleRNN(50, input_shape=(seq_length, len(chars)), activation='relu'))
model.add(Dense(len(chars), activation='softmax'))
Model Compilation:
The model.compile line configures the neural network for training by specifying the optimizer (Adam), the loss function (categorical crossentropy), and the training metric (accuracy).
Python3
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Model Training:
Using the input sequences (X_one_hot) and corresponding labels (y_one_hot) for 100
epochs, the model is trained using the model.fit line, which optimises the model
parameters to minimise the categorical crossentropy loss.
Python3
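The training cell itself was not preserved; given the description above, it is presumably a single call:

# Train on the one-hot encoded windows and targets for 100 epochs
model.fit(X_one_hot, y_one_hot, epochs=100)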
Output:
Epoch 1/100
2/2 [==============================] - 2s 54ms/step - loss: 2.8327 - accuracy:
0.0000e+00
Epoch 2/100
2/2 [==============================] - 0s 16ms/step - loss: 2.8121 - accuracy:
0.0000e+00
Epoch 3/100
2/2 [==============================] - 0s 16ms/step - loss: 2.7944 - accuracy:
0.0208
...
(epochs 4 through 94 omitted; loss decreases steadily from 2.78 to 0.52 while accuracy rises from 0.02 to 0.8750)
Epoch 95/100
2/2 [==============================] - 0s 16ms/step - loss: 0.5077 - accuracy:
0.8958
Epoch 96/100
2/2 [==============================] - 0s 15ms/step - loss: 0.4954 - accuracy:
0.9583
Epoch 97/100
2/2 [==============================] - 0s 11ms/step - loss: 0.4835 - accuracy:
0.9583
Epoch 98/100
2/2 [==============================] - 0s 12ms/step - loss: 0.4715 - accuracy:
0.9583
Epoch 99/100
2/2 [==============================] - 0s 15ms/step - loss: 0.4588 - accuracy:
0.9583
Epoch 100/100
2/2 [==============================] - 0s 10ms/step - loss: 0.4469 - accuracy:
0.9583
<keras.src.callbacks.History at 0x7bab7ab127d0>
Model Prediction:
Generate text using the trained model.
Python3
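The generation cell was not preserved; a sketch that produces 50 characters one at a time (matching the 50 predict calls in the output), feeding each predicted character back into the input window. The seed string is an assumption:

generated = "This is G"  # assumed seed
for _ in range(50):
    # Encode the last seq_length characters as a one-hot window
    window = [char_to_index[c] for c in generated[-seq_length:]]
    x_one_hot = np.eye(len(chars))[np.array([window])]
    prediction = model.predict(x_one_hot)
    # Append the most probable next character
    generated += index_to_char[np.argmax(prediction)]
print("Generated Text:")
print(generated)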
Output:
1/1 [==============================] - 1s 517ms/step
... (49 further 1/1 prediction steps omitted)
Generated Text:
This is Geeks a software training instituteais is is is is
Frequently Asked Questions (FAQs):
Q. 1 What is RNN?
Ans. Recurrent neural networks (RNNs) are a type of artificial neural network that
are primarily utilised in NLP (natural language processing) and speech recognition.
RNN is utilised in deep learning and in the creation of models that simulate
neuronal activity in the human brain.
Q. 2 Which type of problem can be solved by RNN?
Ans. Modelling time-dependent and sequential data problems, like text generation,
machine translation, and stock market prediction, is possible with recurrent neural
networks. Nevertheless, you will discover that the gradient problem makes RNN
difficult to train. The vanishing gradients issue affects RNNs.
Q. 3 What are the types of RNN?
Ans. The four types of RNN are:
One to One
One to Many
Many to One
Many to Many
Q. 4 What is the difference between RNN and CNN?
Ans. The following are the key distinctions between CNNs and RNNs: CNNs are
frequently employed in the solution of problems involving spatial data, like
images. Text and video data that is temporally and sequentially organised is better
analysed by RNNs. RNNs and CNNs are not designed alike.
Today, different Machine Learning techniques are used to handle different types of
data. One of the most difficult types of data to handle and forecast is
sequential data. Sequential data is different from other types of data in the sense
that while all the features of a typical dataset can be assumed to be order-
independent, this cannot be assumed for a sequential dataset. To handle such type
of data, the concept of Recurrent Neural Networks was conceived. It is different
from other Artificial Neural Networks in its structure. While other networks
“travel” in a linear direction during the feed-forward process or the back-
propagation process, the Recurrent Network follows a recurrence relation instead of
a feed-forward pass and uses Back-Propagation through time to learn. The Recurrent
Neural Network consists of multiple fixed activation function units, one for each
time step. Each unit has an internal state which is called the hidden state of the
unit. This hidden state signifies the past knowledge that the network currently
holds at a given time step. This hidden state is updated at every time step to
signify the change in the knowledge of the network about the past. The hidden state
is updated using the following recurrence relation:

h_t = f_W(h_{t-1}, x_t)

where h_t is the new hidden state, h_{t-1} is the old hidden state, x_t is the current input, and f_W is the fixed function with trainable weights.
Note: Typically, to understand the concepts of a Recurrent Neural Network, it is
often illustrated in its unrolled form and this norm will be followed in this
post. At each time step, the new hidden state is calculated using the recurrence
relation given above. This newly generated hidden state is then used to generate the hidden state at the next time step, and so on. The basic workflow of a Recurrent Neural Network is as follows:
Note that h_0 is the initial hidden state of the network. Typically, it is a vector of
zeros, but it can have other values also. One method is to encode the presumptions
about the data into the initial hidden state of the network. For example, for a
problem to determine the tone of a speech given by a renowned person, the person’s
past speeches’ tones may be encoded into the initial hidden state. Another
technique is to make the initial hidden state a trainable parameter. Although these
techniques add little nuances to the network, initializing the hidden state vector
to zeros is typically an effective choice.
Working of each Recurrent Unit:
1. Take as input the previously hidden state vector and the current input vector. Note that since the hidden state and current input are treated as vectors, each element in the vector is placed in a different dimension which is orthogonal to the other dimensions. Thus the product of two elements is non-zero only when both elements are non-zero and lie in the same dimension.
2. Multiply the hidden state vector by the hidden state weights and, similarly, multiply the current input vector by the current input weights. This generates the parameterized hidden state vector and the parameterized current input vector. Note that the weights for the different vectors are stored in the trainable weight matrix.
3. Perform the vector addition of the two parameterized vectors and then calculate the element-wise hyperbolic tangent to generate the new hidden state vector.
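As a minimal illustration, the sketch below performs one such update in NumPy; the weight matrices W_hh and W_xh and all dimensions are toy values invented for the example.
Python3
import numpy as np

# One step of a vanilla recurrent unit: h_t = tanh(W_hh @ h_prev + W_xh @ x_t)
hidden_dim, input_dim = 4, 3
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-state weights
W_xh = rng.normal(size=(hidden_dim, input_dim))   # current-input weights

h_prev = np.zeros(hidden_dim)      # initial hidden state (vector of zeros)
x_t = rng.normal(size=input_dim)   # current input vector

h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)  # new hidden state
print(h_t)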
During the training of the recurrent network, the network also generates an output
at each time step. This output is used to train the network using gradient
descent.
The Back-Propagation involved is similar to the one used in a typical Artificial Neural Network, with some minor changes. Let the predicted output of the network at any time step t be ȳ_t and the actual output be y_t. Then the error at each time step is given by:

E_t = L(y_t, ȳ_t)

where L is the loss function used. The total error is given by the summation of the errors at all the time steps:

E = Σ_t E_t

Similarly, the gradient of the total error with respect to the weights can be calculated as the summation of the gradients at each time step:

∂E/∂W = Σ_t ∂E_t/∂W

Using the chain rule of calculus, and using the fact that the output at a time step t is a function of the current hidden state of the recurrent unit, each per-step gradient expands into a product of derivatives that walks back through all the earlier hidden states. Note that the weight matrix W used in the above expression is different for the input vector and the hidden state vector and is only used in this manner for notational convenience. Thus, Back-Propagation Through Time only differs from typical Back-Propagation in the fact that the errors at each time step are summed up to calculate the total error.
Although the basic Recurrent Neural Network is fairly effective, it can suffer from a significant problem: for deep networks, the back-propagation process can lead to the following issues:
Vanishing Gradients: This occurs when the gradients become very small and tend towards zero.
Exploding Gradients: This occurs when the gradients become too large due to back-propagation.
The problem of exploding gradients may be mitigated by a hack: putting a threshold on the gradients being passed back in time (gradient clipping). But this is not regarded as a proper solution to the problem and may also reduce the efficiency of the network. To deal with such problems, two main variants of Recurrent Neural Networks were developed: Long Short Term Memory networks and Gated Recurrent Unit networks.
Recurrent Neural Networks (RNNs) are a type of artificial neural network that is
designed to process sequential data. Unlike traditional feedforward neural
networks, RNNs can take into account the previous state of the sequence while
processing the current state, allowing them to model temporal dependencies in data.
The key feature of RNNs is the presence of recurrent connections between the hidden
units, which allow information to be passed from one time step to the next. This
means that the hidden state at each time step is not only a function of the input
at that time step, but also a function of the previous hidden state.
In an RNN, the input at each time step is typically a vector representing the
current state of the sequence, and the output at each time step is a vector
representing the predicted value or classification at that time step. The hidden
state is also a vector, which is updated at each time step based on the current
input and the previous hidden state.
The basic RNN architecture suffers from the vanishing gradient problem, which can
make it difficult to train on long sequences. To address this issue, several
variants of RNNs have been developed, such as Long Short-Term Memory (LSTM) and
Gated Recurrent Unit (GRU) networks, which use specialized gates to control the
flow of information through the network and address the vanishing gradient problem.
Applications of RNNs include speech recognition, language modeling, machine
translation, sentiment analysis, and stock prediction, among others. Overall, RNNs
are a powerful tool for processing sequential data and modeling temporal
dependencies, making them an important component of many machine learning
applications.
The advantages of Recurrent Neural Networks (RNNs) are:
Ability to Process Sequential Data: RNNs can process sequential data of varying lengths, making them useful in applications such as natural language processing, speech recognition, and time-series analysis.
Memory: RNNs have the ability to retain information about the previous inputs in the sequence through the use of hidden states. This enables RNNs to perform tasks such as predicting the next word in a sentence or forecasting stock prices.
Versatility: RNNs can be used for a wide variety of tasks, including classification, regression, and sequence-to-sequence mapping.
Flexibility: RNNs can be combined with other neural network architectures, such as Convolutional Neural Networks (CNNs) or feedforward neural networks, to create hybrid models for specific tasks.
However, there are also some disadvantages of RNNs:
Vanishing Gradient Problem: The vanishing gradient problem can occur in RNNs, particularly in those with many layers or long sequences, making it difficult to learn long-term dependencies.
Computationally Expensive: RNNs can be computationally expensive, particularly when processing long sequences or using complex architectures.
Lack of Interpretability: RNNs can be difficult to interpret, particularly in terms of understanding how the network is making predictions or decisions.
Overall, while RNNs have some disadvantages, their ability to process sequential data and retain memory of previous inputs makes them a powerful tool for many machine learning applications.
Recurrent Neural Networks (RNNs) come to the rescue when the sequence of information needs to be captured (other use cases include time series, next-word prediction, etc.). Due to its internal memory, an RNN remembers past sequences along with the current input, which makes it capable of capturing context rather than just
individual words. For better understanding, please read the article Introduction to
Recurrent Neural Network and related articles in GeeksforGeeks
We will conduct Sentiment Analysis to understand text classification using
Tensorflow!
Importing Libraries and Dataset
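The imports below are a minimal sketch of what the rest of this section needs; the exact module paths (tensorflow.keras here) are an assumption.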
Python3
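# Importing the required libraries (a sketch)
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, GRU, LSTM, Bidirectional, Dense
from tensorflow.keras.preprocessing import sequence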
Python3
# Getting reviews with words that come under 5000
# most occurring words in the entire
# corpus of textual review data
vocab_size = 5000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

print(x_train[0])
Output:
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66,3941, 4, 173, 36, 256,
5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172,
112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50,
16, 6, 147, 2025, 19, 14, 22,
4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22,
17, 515, 17, 12, 16, 626, 18,
2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130,
12, 16, 38, 619, 5, 25, 124,
..]
These are the index values of the words, hence we don't see the actual review text.
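To read a review as words, the index-to-word mapping can be inverted, as sketched below using imdb.get_word_index().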
Python3
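# Getting all the words from the word_index dictionary
word_idx = imdb.get_word_index()

# Originally the index number is a value and not a key,
# hence converting the index into the key and the word into the value
word_idx = {i: word for word, i in word_idx.items()}

# again printing the review, this time as words
print([word_idx[i] for i in x_train[0]])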
Output:
['the', 'as', 'you', 'with', 'out', 'themselves', 'powerful', 'lets', 'loves',
'their', 'becomes', 'reaching', 'had', 'journalist', 'of', 'lot', 'from', 'anyone',
'to', 'have', 'after', 'out', 'atmosphere', 'never', 'more', 'room', 'and', 'it',
'so', 'heart', 'shows', 'to', 'years', 'of', 'every', 'never', 'going', 'and',
'help', 'moments', 'or', 'of', 'every', 'chest', 'visual', 'movie', 'except',
'her', 'was', 'several', 'of', 'enough', 'more', 'with', 'is', 'now', 'current',
'film', 'as', 'you', 'of', 'mine', 'potentially', 'unfortunately', 'of', 'you',
'than', 'him', 'that', 'with', 'out', 'themselves', 'her', 'get', 'for', 'was',
'camp', 'of', 'you', 'movie', 'sometimes', 'movie', 'that', 'with', 'scary', 'but',
'and', 'to', 'story', 'wonderful', 'that', 'in', 'seeing', 'in', 'character', 'to',
'of', '70s', 'and', 'with', 'heart', 'had', 'shadows', 'they', 'of', 'here',
'that', 'with', 'her', 'serious', 'to', 'have', 'does', 'when', 'from', 'why',
'what', 'have', 'critics', 'they', 'is', 'you', 'that', "isn't", 'one', 'will',
'very', 'to', 'as', 'itself', 'with', 'other', 'and', 'in', 'of', 'seen', 'over',
'and', 'for', 'anyone', 'of', 'and', 'br', "show's", 'to', 'whether', 'from',
'than', 'out', 'themselves', 'history', 'he', 'name', 'half', 'some', 'br', 'of',
'and', 'odd', 'was', 'two', 'most', 'of', 'mean', 'for', '1', 'any', 'an', 'boat',
'she', 'he', 'should', 'is', 'thought', 'and', 'but', 'of', 'script', 'you', 'not',
'while', 'history', 'he', 'heart', 'to', 'real', 'at', 'and', 'but', 'when',
'from', 'one', 'bit', 'then', 'have', 'two', 'of', 'script', 'their', 'with',
'her', 'nobody', 'most', 'that', 'with', "wasn't", 'to', 'with', 'armed', 'acting',
'watch', 'an', 'for', 'with', 'and', 'film', 'want', 'an']
Let’s check the range of the reviews we have in this dataset.
Python3
# Get the minimum and the maximum length of reviews
print("Max length of a review:: ", len(max((x_train+x_test), key=len)))
print("Min length of a review:: ", len(min((x_train+x_test), key=len)))
Output:
Max length of a review:: 2697
Min length of a review:: 70
We see that the longest review available is 2697 words and the shortest one is 70.
While working with Neural Networks, it is important to make all the inputs of a
fixed size. To achieve this objective, we will pad the review sentences.
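The padding step below is a minimal sketch: the maximum length of 400 words and the 64-sample validation slice are assumed values, chosen to produce the x_train_, y_train_, x_valid, and y_valid names used by the training calls that follow.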
Python3
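# Keeping a fixed length of all reviews to max 400 words (assumed value)
max_words = 400

x_train = sequence.pad_sequences(x_train, maxlen=max_words)
x_test = sequence.pad_sequences(x_test, maxlen=max_words)

# Reserving a small slice of the training data for validation (assumed split)
x_valid, y_valid = x_train[:64], y_train[:64]
x_train_, y_train_ = x_train[64:], y_train[64:]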
Python3
# fixing every word's embedding size to be 32
embd_len = 32

# Creating a RNN model
RNN_model = Sequential(name="Simple_RNN")
RNN_model.add(Embedding(vocab_size, embd_len, input_length=max_words))

# In case of a stacked (more than one layer of RNN)
# use return_sequences=True
RNN_model.add(SimpleRNN(128, activation='tanh', return_sequences=False))
RNN_model.add(Dense(1, activation='sigmoid'))

# printing model summary
print(RNN_model.summary())

# Compiling model
RNN_model.compile(
    loss="binary_crossentropy",
    optimizer='adam',
    metrics=['accuracy']
)

# Training the model
history = RNN_model.fit(x_train_, y_train_,
                        batch_size=64,
                        epochs=5,
                        verbose=1,
                        validation_data=(x_valid, y_valid))

# Printing model score on test data
print()
print("Simple_RNN Score---> ", RNN_model.evaluate(x_test, y_test, verbose=0))
Output:
The vanilla form of RNN gave us a test accuracy of 64.95%. A limitation of the simple RNN is that it is unable to handle long sentences well because of its vanishing gradient problem.
Gated Recurrent Units (GRU)
GRUs are lesser known but equally robust algorithms that address the limitations of simple RNNs. Please read the article Gated Recurrent Unit Networks for a better understanding of how they work.
Python3
# Defining GRU model
gru_model = Sequential(name="GRU_Model")
gru_model.add(Embedding(vocab_size, embd_len, input_length=max_words))
gru_model.add(GRU(128, activation='tanh', return_sequences=False))
gru_model.add(Dense(1, activation='sigmoid'))

# Printing the Summary
print(gru_model.summary())

# Compiling the model
gru_model.compile(
    loss="binary_crossentropy",
    optimizer='adam',
    metrics=['accuracy']
)

# Training the GRU model
history2 = gru_model.fit(x_train_, y_train_,
                         batch_size=64,
                         epochs=5,
                         verbose=1,
                         validation_data=(x_valid, y_valid))

# Printing model score on test data
print()
print("GRU model Score---> ", gru_model.evaluate(x_test, y_test, verbose=0))
Output:
The test accuracy of the GRU was found to be 88.14%. The GRU is a form of RNN that performs better than a simple RNN and is often faster to train than an LSTM due to its relatively fewer training parameters.
Long Short Term Memory (LSTM)
LSTM is better than the simple RNN at capturing the memory of sequential information. To understand the theoretical aspects of LSTM, please visit the article Long Short Term Memory Networks Explanation. Due to its increased complexity compared to GRU, it is slower to train, but in general LSTMs give better accuracy than GRUs.
Python3
# Defining LSTM model
lstm_model = Sequential(name="LSTM_Model")
lstm_model.add(Embedding(vocab_size, embd_len, input_length=max_words))
lstm_model.add(LSTM(128, activation='relu', return_sequences=False))
lstm_model.add(Dense(1, activation='sigmoid'))

# Printing Model Summary
print(lstm_model.summary())

# Compiling the model
lstm_model.compile(
    loss="binary_crossentropy",
    optimizer='adam',
    metrics=['accuracy']
)

# Training the model
history3 = lstm_model.fit(x_train_, y_train_,
                          batch_size=64,
                          epochs=5,
                          verbose=2,
                          validation_data=(x_valid, y_valid))

# Displaying the model accuracy on test data
print()
print("LSTM model Score---> ", lstm_model.evaluate(x_test, y_test, verbose=0))
Output:
Python3
# Defining Bidirectional LSTM model
bi_lstm_model = Sequential(name="Bidirectional_LSTM")
bi_lstm_model.add(Embedding(vocab_size, embd_len, input_length=max_words))
bi_lstm_model.add(Bidirectional(LSTM(128,
                                     activation='tanh',
                                     return_sequences=False)))
bi_lstm_model.add(Dense(1, activation='sigmoid'))

# Printing model summary
print(bi_lstm_model.summary())

# Compiling model summary
bi_lstm_model.compile(
    loss="binary_crossentropy",
    optimizer='adam',
    metrics=['accuracy']
)

# Training the model
history4 = bi_lstm_model.fit(x_train_, y_train_,
                             batch_size=64,
                             epochs=5,
                             verbose=2,
                             validation_data=(x_test, y_test))

# Printing model score on test data
print()
print("Bidirectional LSTM model Score---> ",
      bi_lstm_model.evaluate(x_test, y_test, verbose=0))
Output:
Now we have to count the words and store their frequency. For that, we will use a dictionary.
Python3
# Function to count the frequency
# of the words in the whole text file
def counting_words(words):
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    return word_count
Python3
# Calculating the probability of each word
def prob_cal(word_count_dict):
    probs = {}
    m = sum(word_count_dict.values())
    for key in word_count_dict.keys():
        probs[key] = word_count_dict[key] / m
    return probs
The further code is divided into five main parts, which cover the creation of all the different candidate words that are possible. To do this, we can use:
Lemmatization
Deletion of a letter
Switching of letters
Replacement of a letter
Insertion of a new letter
Let’s see the code implementation of each point
To do lemmatization, we will be using the pattern module. You can install it using the command below:
pip install pattern
Then you can run the code below.
Python3
# LemmWord: extracting and adding
# root word i.e. Lemma using pattern module
import pattern
from pattern.en import lemma, lexeme
from nltk.stem import WordNetLemmatizer

def LemmWord(word):
    return list(lexeme(wd) for wd in word.split())[0]
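The deletion and switching helpers used later are sketched below; their exact bodies are assumptions, written to be consistent with how colab_1 calls them.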
Python3
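# DeleteLetter: forming all the words possible
# by deleting one letter at a time (assumed helper)
def DeleteLetter(word):
    delete_list = []
    split_list = []
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))
    for a, b in split_list:
        delete_list.append(a + b[1:])
    return delete_list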
Python3
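# Switch_: forming all the words possible by switching
# two adjacent letters (assumed helper)
def Switch_(word):
    split_list = []
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))
    switch_l = [a + b[1] + b[0] + b[2:] for a, b in split_list if len(b) >= 2]
    return switch_l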
Python3
def Replace_(word):
    split_l = []
    replace_list = []

    # Replacing the letter one-by-one from the list of alphs
    for i in range(len(word)):
        split_l.append((word[0:i], word[i:]))
    alphs = 'abcdefghijklmnopqrstuvwxyz'
    replace_list = [a + l + (b[1:] if len(b) > 1 else '')
                    for a, b in split_l if b for l in alphs]
    return replace_list
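The insertion helper is sketched below under the same assumption.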
Python3
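# insert_: forming all the words possible by inserting
# a new letter at every position (assumed helper)
def insert_(word):
    split_l = []
    for i in range(len(word) + 1):
        split_l.append((word[0:i], word[i:]))
    alphs = 'abcdefghijklmnopqrstuvwxyz'
    insert_list = [a + l + b for a, b in split_l for l in alphs]
    return insert_list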
Now, we have implemented all five steps. It's time to merge all the words formed by those steps (i.e., by all the functions). To implement that, we will be using two different functions.
Python3
# Collecting all the words
# in a set (so that no word will repeat)
def colab_1(word, allow_switches=True):
    colab_1 = set()
    colab_1.update(DeleteLetter(word))
    if allow_switches:
        colab_1.update(Switch_(word))
    colab_1.update(Replace_(word))
    colab_1.update(insert_(word))
    return colab_1

# collecting the words reachable in two edits
def colab_2(word, allow_switches=True):
    colab_2 = set()
    edit_one = colab_1(word, allow_switches=allow_switches)
    for w in edit_one:
        if w:
            edit_two = colab_1(w, allow_switches=allow_switches)
            colab_2.update(edit_two)
    return colab_2
Now, the main task is to extract the correct words among all the candidates. To do so, we will be using a get_corrections function.
Python3
# Only storing those values which are in the vocab
def get_corrections(word, probs, vocab, n=2):
    suggested_word = []
    best_suggestion = []
    # wrapping word in a list so that list() does not
    # split it into individual characters
    suggested_word = list(
        (word in vocab and [word])
        or colab_1(word).intersection(vocab)
        or colab_2(word).intersection(vocab))

    # finding out the words with high frequencies
    best_suggestion = [[s, probs[s]] for s in list(reversed(suggested_word))]
    return best_suggestion
Now that the code is ready, we can test it for any user input with the code below. Let's print the top 3 suggestions made by the autocorrect.
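The driver below is a sketch; the word_count and vocab objects are assumed to have been built from the text corpus earlier (for example, word_count = counting_words(words) and vocab = set(words)).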
Python3
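# Assumed driver code: word_count and vocab are expected
# to have been built from the corpus earlier
my_word = input("Enter any word:")
probs = prob_cal(word_count)
suggestions = get_corrections(my_word, probs, vocab, 2)

# printing the top 3 suggestions
for word, prob in suggestions[:3]:
    print(word)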
Output :
Enter any word:daedd
dared
daned
died
Conclusion
So, we have implemented a basic autocorrect using the NLTK library and Python. As a further step, we can work on a higher-level autocorrect system that uses a larger dataset and works more efficiently. To enhance accuracy, we can also use transformers and more NLP-related techniques like n-grams, TF-IDF, and so on.
Step 3: Tokenization, which involves splitting sentences and words from the body of the text.
Step 4: Making the bag of words via a sparse matrix
Take all the different words of the reviews in the dataset without repetition of words.
One column for each word, therefore there are going to be many columns.
Rows are reviews.
If a word is there in the row of a dataset of reviews, then the count of the word will be there in the row of the bag of words under the column of the word.
Examples: Let’s take a dataset of reviews of only two reviews
Input : "dam good steak", "good food good service"
Output :
Python3
# Splitting the dataset into
# the Training set and Test set
# (train_test_split now lives in sklearn.model_selection,
# not the deprecated sklearn.cross_validation)
from sklearn.model_selection import train_test_split

# experiment with "test_size"
# to get better results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Step 7: Predicting the final results by using the .predict() method with X_test as its argument.
Python3
# Predicting the Test set results
y_pred = model.predict(X_test)
y_pred
Note: Accuracy with the random forest was 72%. (It may differ when the experiment is performed with a different test size; here test_size = 0.25.)
Step 8: To know the accuracy, a confusion matrix is needed. A confusion matrix is a 2×2 matrix.
TRUE POSITIVE: measures the proportion of actual positives that are correctly identified.
TRUE NEGATIVE: measures the proportion of actual negatives that are correctly identified.
FALSE POSITIVE: measures the proportion of actual negatives that are incorrectly identified as positive.
FALSE NEGATIVE: measures the proportion of actual positives that are incorrectly identified as negative.
Python3
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm
Restaurant Review Analysis Using NLP and SQLite
Normally, a lot of businesses end up as failures due to lack of profit and lack of proper improvement measures. Restaurant owners in particular face a lot of difficulties in improving their productivity. This project helps those who want to increase their productivity, which in turn increases their business profits. This is the main objective of this project.
What the project does is let the restaurant owner know about the drawbacks of the restaurant, such as the most disliked food items, from customers' text reviews, which are processed with an ML classification algorithm (Naive Bayes); the results get stored in a database using SQLite.
Tools & Technologies Used:
NLTK
Machine Learning
Python
Tkinter
Sqlite3
Pandas
Step-by-step Implementation:
Step 1: Importing Libraries and Initialization of data
Firstly, we import the NumPy, matplotlib, pandas, nltk, re, sklearn, Tkinter, and sqlite3 libraries, which are used for data manipulation, text data processing, pattern recognition, training the data, graphical user interfaces, and manipulation of data in the database.
Python3
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from tkinter import *
import sqlite3

dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
corpus = []
rras_code = "Wyd^H3R"
food_rev = {}
food_perc = {}

conn = sqlite3.connect('Restaurant_food_data.db')
c = conn.cursor()

for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not')
    review = [ps.stem(word) for word in review if not word in set(all_stopwords)]
    review = ' '.join(review)
    corpus.append(review)

cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

classifier = GaussianNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

variables = []
clr_variables = []

foods = ["Idly", "Dosa", "Vada", "Roti", "Meals", "Veg Biryani",
         "Egg Biryani", "Chicken Biryani", "Mutton Biryani", "Ice Cream",
         "Noodles", "Manchooriya", "Orange juice", "Apple Juice",
         "Pineapple juice", "Banana juice"]

for i in foods:
    food_rev[i] = []
    food_perc[i] = [0.0, 0.0]

def init_data():
    conn = sqlite3.connect('Restaurant_food_data.db')
    c = conn.cursor()
    for i in range(len(foods)):
        c.execute("INSERT INTO item VALUES(:item_name,:no_of_customers,\
            :no_of_positives,:no_of_negatives,:pos_perc,:neg_perc)",
            {
                'item_name': foods[i],
                'no_of_customers': "0",
                'no_of_positives': "0",
                'no_of_negatives': "0",
                'pos_perc': "0.0%",
                'neg_perc': "0.0%"
            }
        )
    conn.commit()
    conn.close()
Python3
root1 = Tk()
main = "Restaurant Review Analysis System/"
root1.title(main+"Welcome Page")

label = Label(root1, text="RESTAURANT REVIEW ANALYSIS SYSTEM",
              bd=2, font=('Arial', 47, 'bold', 'underline'))
ques = Label(root1, text="Are you a Customer or Owner ???")
cust = Button(root1, text="Customer", font=('Arial', 20),
              padx=80, pady=20, command=take_review)
owner = Button(root1, text="Owner", font=('Arial', 20),
               padx=100, pady=20, command=login)

'''conn=sqlite3.connect('Restaurant_food_data.db')
c=conn.cursor()
c.execute("CREATE TABLE item (Item_name text,No_of_customers text,\
    No_of_positive_reviews text,No_of_negative_reviews text,Positive_percentage \
    text,Negative_percentage text) ")
conn.commit()
conn.close()'''
#c.execute("DELETE FROM item")

root1.attributes("-zoomed", True)
label.grid(row=0, column=0)
ques.grid(row=1, column=0, sticky=W+E)
ques.config(font=("Helvetica", 30))
cust.grid(row=2, column=0)
owner.grid(row=3, column=0)
conn.commit()
conn.close()
root1.mainloop()
Python3
def take_review():
    root2 = Toplevel()
    root2.title(main+"give review")
    label = Label(root2, text="RESTAURANT REVIEW ANALYSIS SYSTEM",
                  bd=2, font=('Arial', 47, 'bold', 'underline'))
    req1 = Label(root2, text="Select the item(s) you have taken.....")
    conn = sqlite3.connect('Restaurant_food_data.db')
    c = conn.cursor()
    chk_btns = []
    selected_foods = []
    req2 = Label(root2, text="Give your review below....")
    rev_tf = Entry(root2, width=125, borderwidth=5)
    req3 = Label(root2, text="NOTE : Use not instead of n't.")
    global variables
    variables = []
    chk_btns = []
    for i in range(len(foods)):
        var = IntVar()
        chk = Checkbutton(root2, text=foods[i], variable=var)
        variables.append(var)
        chk_btns.append(chk)
    label.grid(row=0, column=0, columnspan=4)
    req1.grid(row=1, column=0, columnspan=4, sticky=W+E)
    req1.config(font=("Helvetica", 30))
    for i in range(4):
        for j in range(4):
            c = chk_btns[i*4+j]
            c.grid(row=i+3, column=j, columnspan=1, sticky=W)
    selected_foods = []
    submit_review = Button(root2, text="Submit Review", font=('Arial', 20),
                           padx=100, pady=20,
                           command=lambda: [estimate(rev_tf.get()),
                                            root2.destroy()])
    root2.attributes("-zoomed", True)
    req2.grid(row=7, column=0, columnspan=4, sticky=W+E)
    req2.config(font=("Helvetica", 20))
    rev_tf.grid(row=8, column=1, rowspan=3, columnspan=2, sticky=S)
    req3.grid(row=11, column=1, columnspan=2)
    submit_review.grid(row=12, column=0, columnspan=4)
    conn.commit()
    conn.close()

# Processing and storing the data
def estimate(s):
    conn = sqlite3.connect('Restaurant_food_data.db')
    c = conn.cursor()
    review = re.sub('[^a-zA-Z]', ' ', s)
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not')
    review = [ps.stem(word) for word in review if not word in set(all_stopwords)]
    review = ' '.join(review)
    X = cv.transform([review]).toarray()
    res = classifier.predict(X)  # list
    if "not" in review:
        res[0] = abs(res[0]-1)
    selected_foods = []
    for i in range(len(foods)):
        if variables[i].get() == 1:
            selected_foods.append(foods[i])
    c.execute("SELECT *,oid FROM item")
    records = c.fetchall()
    for i in records:
        rec = list(i)
        if rec[0] in selected_foods:
            n_cust = int(rec[1])+1
            n_pos = int(rec[2])
            n_neg = int(rec[3])
            if res[0] == 1:
                n_pos += 1
            else:
                n_neg += 1
            pos_percent = round((n_pos/n_cust)*100, 1)
            neg_percent = round((n_neg/n_cust)*100, 1)
            c.execute("""UPDATE item SET Item_name=:item_name,No_of_customers\
                =:no_of_customers,No_of_positive_reviews=:no_of_positives,\
                No_of_negative_reviews=:no_of_negatives,Positive_percentage\
                =:pos_perc,Negative_percentage=:neg_perc where oid=:Oid""",
                {
                    'item_name': rec[0],
                    'no_of_customers': str(n_cust),
                    'no_of_positives': str(n_pos),
                    'no_of_negatives': str(n_neg),
                    'pos_perc': str(pos_percent)+"%",
                    'neg_perc': str(neg_percent)+"%",
                    'Oid': foods.index(rec[0])+1
                }
            )
    selected_foods = []
    conn.commit()
    conn.close()
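Below are the owner-side login and menu screens as minimal sketches; the window layouts and the owner_menu helper name are assumptions, and the owner check uses the rras_code string defined earlier.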
Python3
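from tkinter import messagebox

# Owner login window (a sketch; layout is assumed)
def login():
    root4 = Toplevel()
    root4.title(main+"owner login")
    req = Label(root4, text="Enter the owner code....")
    code_tf = Entry(root4, width=40, borderwidth=5, show="*")

    def check_code():
        # rras_code is the secret owner code defined in Step 1
        if code_tf.get() == rras_code:
            owner_menu()
            root4.destroy()
        else:
            messagebox.showerror("Error", "Invalid owner code!")

    submit = Button(root4, text="Login", font=('Arial', 20), command=check_code)
    req.grid(row=0, column=0)
    code_tf.grid(row=1, column=0)
    submit.grid(row=2, column=0)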
Python3
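# Owner menu (a sketch): buttons for viewing and clearing the stored data
def owner_menu():
    root_menu = Toplevel()
    root_menu.title(main+"owner menu")
    view_btn = Button(root_menu, text="View item data", font=('Arial', 20),
                      command=access_data)
    clr_one_btn = Button(root_menu, text="Clear item data", font=('Arial', 20),
                         command=clr_itemdata)
    clr_all_btn = Button(root_menu, text="Clear all data", font=('Arial', 20),
                         command=clr_alldata)
    view_btn.grid(row=0, column=0)
    clr_one_btn.grid(row=1, column=0)
    clr_all_btn.grid(row=2, column=0)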
Python3
def access_data():
    root5 = Toplevel()
    root5.title(main+"Restaurant_Database")
    label = Label(root5, text="RESTAURANT REVIEW ANALYSIS SYSTEM",
                  bd=2, font=('Arial', 47, 'bold', 'underline'))
    title1 = Label(root5, text="S.NO",
                   font=('Arial', 10, 'bold', 'underline'))
    title2 = Label(root5, text="FOOD ITEM",
                   font=('Arial', 10, 'bold', 'underline'))
    title3 = Label(root5, text="NO.OF CUSTOMERS",
                   font=('Arial', 10, 'bold', 'underline'))
    title4 = Label(root5, text="NO.OF POSITIVE REVIEWS",
                   font=('Arial', 10, 'bold', 'underline'))
    title5 = Label(root5, text="NO.OF NEGATIVE REVIEWS",
                   font=('Arial', 10, 'bold', 'underline'))
    title6 = Label(root5, text="POSITIVE RATE",
                   font=('Arial', 10, 'bold', 'underline'))
    title7 = Label(root5, text="NEGATIVE RATE",
                   font=('Arial', 10, 'bold', 'underline'))
    label.grid(row=0, column=0, columnspan=7)
    title1.grid(row=1, column=0)
    title2.grid(row=1, column=1)
    title3.grid(row=1, column=2)
    title4.grid(row=1, column=3)
    title5.grid(row=1, column=4)
    title6.grid(row=1, column=5)
    title7.grid(row=1, column=6)
    conn = sqlite3.connect('Restaurant_food_data.db')
    c = conn.cursor()
    c.execute("SELECT *,oid from item")
    records = c.fetchall()
    pos_rates = []
    for record in records:
        record = list(record)
        pos_rates.append(float(record[-3][:-1]))
    max_pos = max(pos_rates)
    min_pos = min(pos_rates)
    for i in range(len(records)):
        rec_list = list(records[i])
        if str(max_pos)+"%" == rec_list[-3]:
            rec_lab = [Label(root5, text=str(rec_list[-1]), fg="green")]
            for item in rec_list[:-1]:
                lab = Label(root5, text=item, fg="green")
                rec_lab.append(lab)
        elif str(min_pos)+"%" == rec_list[-3]:
            rec_lab = [Label(root5, text=str(rec_list[-1]), fg="red")]
            for item in rec_list[:-1]:
                lab = Label(root5, text=item, fg="red")
                rec_lab.append(lab)
        else:
            rec_lab = [Label(root5, text=str(rec_list[-1]))]
            for item in rec_list[:-1]:
                lab = Label(root5, text=item)
                rec_lab.append(lab)
        for j in range(len(rec_lab)):
            rec_lab[j].grid(row=i+2, column=j)
    exit_btn = Button(root5, text="Exit", command=root5.destroy)
    exit_btn.grid(row=len(records)+5, column=3)
    conn.commit()
    conn.close()
    root5.attributes("-zoomed", True)
Python3
def clr_itemdata():
    root6 = Toplevel()
    root6.title(main+"clear_item_data")
    label = Label(root6, text="RESTAURANT REVIEW ANALYSIS SYSTEM",
                  bd=2, font=('Arial', 47, 'bold', 'underline'))
    req1 = Label(root6, text="Pick the items to clear their corresponding\
 item data....")
    chk_list = []
    global clr_variables
    clr_variables = []
    for i in range(len(foods)):
        var = IntVar()
        chk = Checkbutton(root6, text=foods[i], variable=var)
        clr_variables.append(var)
        chk_list.append(chk)
    label.grid(row=0, column=0, columnspan=4)
    req1.grid(row=1, column=0, columnspan=4, sticky=W+E)
    req1.config(font=("Helvetica", 30))
    for i in range(4):
        for j in range(4):
            c = chk_list[i*4+j]
            c.grid(row=i+3, column=j, columnspan=1, sticky=W)
    clr_item = Button(root6, text="Clear", font=('Arial', 20),
                      padx=100, pady=20,
                      command=lambda: [clr_data(), root6.destroy()])
    clr_item.grid(row=8, column=0, columnspan=4)
    root6.attributes("-zoomed", True)

def clr_alldata():
    confirm = messagebox.askquestion(
        "Confirmation", "Are you sure to delete all data??")
    if confirm == "yes":
        conn = sqlite3.connect('Restaurant_food_data.db')
        c = conn.cursor()
        for i in range(len(foods)):
            c.execute("""UPDATE item SET Item_name=:item_name,No_of_customers\
                =:no_of_customers,No_of_positive_reviews=:no_of_positives,\
                No_of_negative_reviews=:no_of_negatives,Positive_percentage=:\
                pos_perc,Negative_percentage=:neg_perc where oid=:Oid""",
                {
                    'item_name': foods[i],
                    'no_of_customers': "0",
                    'no_of_positives': "0",
                    'no_of_negatives': "0",
                    'pos_perc': "0.0%",
                    'neg_perc': "0.0%",
                    'Oid': i+1
                }
            )
        conn.commit()
        conn.close()
Clearing food item data
Finally, this is an idea to increase the productivity of businesses with technology; the improvement in productivity directly addresses these business problems.
Project Applications in Real-Life:
It can be used in any food restaurant/hotel.
Effective in food improvement measures that directly improve one's business.
User-friendly.
Reduces the chance of business loss.
In today’s era, companies work hard to make their customers happy. They launch new
technologies and services so that customers can use their products more. They try
to be in touch with each of their customers so that they can provide goods
accordingly. But practically, it’s very difficult and non-realistic to keep in
touch with everyone. So, here comes the usage of Customer Segmentation.
Customer Segmentation means the segmentation of customers on the basis of their
similar characteristics, behavior, and needs. This will eventually help the company
in many ways. Like, they can launch the product or enhance the features
accordingly. They can also target a particular sector as per their behaviors. All
of these lead to an enhancement in the overall market value of the company.
Customer Segmentation using Unsupervised Machine Learning in Python
Today we will be using Machine Learning to implement the task of Customer
Segmentation.
Import Libraries
The libraries we will be required are :
Pandas – This library helps to load the data frame in a 2D array format.
Numpy – Numpy arrays are very fast and can perform large computations.
Matplotlib / Seaborn – This library is used to draw visualizations.
Sklearn – This module contains multiple libraries having pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.
Python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')
Importing Dataset
The dataset taken for the task includes the details of customers: their marital status, their income, number of items purchased, types of items purchased, and so on.
Python3
df = pd.read_csv('new.csv')
df.head()
Output:
Python3
df.shape
Output:
(2240, 25)
To get information about the dataset, like checking for null values, counts of values, etc., we will use the .info() method.
Data Preprocessing
Python3
df.info()
Output:
Python3
df.describe().T
Output:
Python3
for col in df.columns:
    temp = df[col].isnull().sum()
    if temp > 0:
        print(f'Column {col} contains {temp} null values.')
Output:
Column Income contains 24 null values.
Now, once we have the count of the null values and we know there are very few of them, we can drop them (it will not affect the dataset much).
Python3
df = df.dropna()
print("Total missing values are:", len(df))
Output:
Total missing values are: 2216
To find the total number of unique values in each column, we can use the nunique() method.
Python3
df.nunique()
Output:
Here we can observe that there are columns which contain a single value in the whole column, so they have no relevance in the model development.
Also, the dataset has a column Dt_Customer containing dates, which we can convert into 3 columns, i.e. day, month, year.
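The conversion below is a sketch that assumes Dt_Customer stores dates as 'dd-mm-yyyy' strings.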
Python3
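# Splitting Dt_Customer into day, month and year parts
# (assumes 'dd-mm-yyyy' formatted strings)
parts = df['Dt_Customer'].str.split('-', expand=True)
df['day'] = parts[0].astype(int)
df['month'] = parts[1].astype(int)
df['year'] = parts[2].astype(int)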
Now that we have all the important features, we can drop features like Z_CostContact, Z_Revenue, and Dt_Customer.
Python3
df.drop(['Z_CostContact', 'Z_Revenue', 'Dt_Customer'],
        axis=1, inplace=True)
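Separating the column names by data type gives the two lists shown in the output below; a sketch: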
Python3
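# Collecting categorical (object) and float columns (a sketch)
objects, floats = [], []
for col in df.columns:
    if df[col].dtype == object:
        objects.append(col)
    elif df[col].dtype == float:
        floats.append(col)

print(objects)
print(floats)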
Output:
['Education', 'Marital_Status', 'Accepted']
['Income']
To get the count plot for the columns of datatype object, refer to the code below.
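A sketch of that count plot, assuming the objects list built above: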
Python3
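# Count plots for the object-typed columns (a sketch)
plt.subplots(figsize=(15, 10))
for i, col in enumerate(objects):
    plt.subplot(2, 2, i + 1)
    sb.countplot(df[col])
plt.show()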
Python3
df['Marital_Status'].value_counts()
Output:
Now let's see the comparison of the features with respect to the values of the responses.
Python3
plt.subplots(figsize=(15, 10))
for i, col in enumerate(objects):
    plt.subplot(2, 2, i + 1)
    sb.countplot(df[col], hue=df['Response'])
plt.show()
Output:
Label Encoding
Label Encoding is used to convert the categorical values into numerical values so that the model can understand them.
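A sketch of the encoding step over the object-typed columns: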
Python3
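# Encoding every object-typed column with LabelEncoder (a sketch)
for col in df.columns:
    if df[col].dtype == object:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])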
Output:
Standardization
Standardization is a method of feature scaling, which is an integral part of feature engineering. It scales down the data, making it easier for the machine learning model to learn from it. It reduces the mean to '0' and the standard deviation to '1'.
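A sketch of the scaling step, applied in place so the later steps operate on the scaled values (an assumed approach):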
Python3
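# Scaling the now fully numeric dataframe in place (a sketch)
scaler = StandardScaler()
df[df.columns] = scaler.fit_transform(df)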
Segmentation
We will be using T-distributed Stochastic Neighbor Embedding (t-SNE). It helps in visualizing high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
Python3
from sklearn.manifold import TSNE

model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(df)
plt.figure(figsize=(7, 7))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1])
plt.show()
Output:
There are certainly some clusters which are clearly visible in the 2-D representation of the given data. Let's use the KMeans algorithm to find those clusters in the high-dimensional plane itself.
KMeans Clustering can also be used to cluster the different points in a plane.
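A sketch of the inertia computation over a range of cluster counts (the 1 to 20 range and the KMeans settings are assumed):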
Python3
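# Computing the within-cluster inertia for k = 1..20 (a sketch)
error = []
for n_clusters in range(1, 21):
    model = KMeans(init='k-means++', n_clusters=n_clusters,
                   max_iter=500, random_state=22)
    model.fit(df)
    error.append(model.inertia_)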
Here inertia is nothing but the sum of squared distances within the clusters.
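Plotting those inertia values against the number of clusters gives the elbow curve; a sketch: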
Python3
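# Plotting inertia against the number of clusters to locate the elbow
plt.figure(figsize=(10, 5))
sb.lineplot(x=range(1, 21), y=error)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()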
Output:
Here, by using the elbow method, we can say that k = 6 is the optimal number of clusters, since after k = 6 the value of the inertia does not decrease drastically.
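A sketch of fitting the final model with six clusters; segments holds the cluster label of every row and feeds the scatter plot below: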
Python3
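# Fitting KMeans with the chosen k = 6 (a sketch)
model = KMeans(init='k-means++', n_clusters=6,
               max_iter=500, random_state=22)
segments = model.fit_predict(df)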
A scatterplot will be used to see all the 6 clusters formed by KMeans Clustering.
Python3
plt.figure(figsize=(7, 7))
sb.scatterplot(tsne_data[:, 0], tsne_data[:, 1], hue=segments)
plt.show()
Output:
Say we watched a funny video on YouTube; the next time we open the YouTube app, we get recommendations for similar funny videos in our feed. Ever thought about how? This is nothing but an application of Machine Learning, using which recommender systems are built to provide a personalized experience and increase customer engagement.
In this article, we will try to build a very basic recommender system that can
recommend songs based on which songs you hear.
Importing Libraries & Dataset
Python libraries make it very easy for us to handle the data and perform typical
and complex tasks with a single line of code.
Pandas – This library helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
Matplotlib/Seaborn – This library is used to draw visualizations.
Sklearn – This module contains multiple libraries having pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.
Python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

import warnings
warnings.filterwarnings('ignore')
The dataset we are going to use contains data about songs released in the span of
around 100 years. Along with some general information about the songs, some scientific measures of sound are also provided, like loudness, acousticness, speechiness, and so on.
Python3
tracks = pd.read_csv('tracks_records.csv')
tracks.head()
Output:
First five rows of the dataset
Data Cleaning
Data cleaning is one of the important steps without which the data will be of no use, because raw data contains a lot of noise that must be removed; otherwise, the observations made from it will be inaccurate, and if we are building a model upon it then its performance will be poor as well. Steps included in data cleaning are outlier removal, null value imputation, and fixing the skewness of the data.
Python3
tracks.shape
Output:
(586672, 19)
Python3
tracks.info()
Output:
Basic information about the columns of the dataset
Now, let's check if there are null values in the columns of our data frame.
Python3
tracks.isnull().sum()
Output:
Number of null values in each column
The genre of music is a very important indicator of the type of music, which is why we will remove rows with null values in it. We could have imputed them as well, but we have a huge dataset of around 6 lakh rows, so removing 50,000 won't affect much (depending upon the case).
Python3
tracks.dropna(inplace=True)
tracks.isnull().sum().plot.bar()
plt.show()
Output:
After removing rows containing null values
Now let’s remove some columns which we won’t be using to build our recommender
system.
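A sketch of the column drop; the exact column names (id and id_artists here) are an assumption: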
Python3
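# Dropping identifier columns not useful for recommendations (assumed names)
tracks = tracks.drop(['id', 'id_artists'], axis=1)

To get a first look at the structure of the data, t-SNE can be run on a numeric sample; a sketch (the 500-row sample size is assumed):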
Python3
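# t-SNE on a small numeric sample of the data (a sketch)
model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(
    tracks.select_dtypes(include=np.number).head(500))
plt.figure(figsize=(7, 7))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1])
plt.show()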
Output:
Scatter plot of the output of t-SNE
Here we can observe some clusters.
Formation of clusters in 2-D space
As we know, multiple versions of the same song are released; hence we need to remove the different versions of the same song. We are building a content-based recommender system whose main worker is the cosine similarity function, and our system would recommend the other versions of a song if they were available, which is not what we want.
Python3
tracks['name'].nunique(), tracks.shape
Output:
(408902, (536847, 17))
So, our concern was right; let's remove the duplicate rows based on the song names.
Python3
tracks = tracks.sort_values(by=['popularity'], ascending=False)
tracks.drop_duplicates(subset=['name'], keep='first', inplace=True)
Python3
plt.figure(figsize=(10, 5))
sb.countplot(tracks['release_year'])
plt.axis('off')
plt.show()
Output:
Countplot of the number of songs in subsequent years
Here we can see a boom in the music industry from the year 1900 to somewhere around
1990.
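A sketch of collecting the float-typed columns counted in the output below: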
Python3
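# Collecting the float-typed columns (a sketch)
floats = []
for col in tracks.columns:
    if tracks[col].dtype == 'float':
        floats.append(col)

len(floats)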
Output:
10
There is a total of 10 such columns with float values in them. Let's draw their distribution plots to get insights into the distribution of the data.
Python3
plt.subplots(figsize=(15, 5))
for i, col in enumerate(floats):
    plt.subplot(2, 5, i + 1)
    sb.distplot(tracks[col])
plt.tight_layout()
plt.show()
Output:
Distribution plot of the continuous features
Some of the features have a normal distribution, while the distribution of some others is skewed.
Python3
%%capture
song_vectorizer = CountVectorizer()
song_vectorizer.fit(tracks['genres'])
As the dataset is too large, the computation cost/time would be too high, so we will show the implementation of the recommender system using only the 10,000 most popular songs.
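A sketch of that selection: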
Python3
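# Keeping only the 10,000 most popular songs (a sketch)
tracks = tracks.sort_values(by=['popularity'], ascending=False).head(10000)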
Below is a helper function to get similarities for the input song with each song in
the dataset.
Python3
def get_similarities(song_name, data):
    # Getting vector for the input song.
    text_array1 = song_vectorizer.transform(
        data[data['name'] == song_name]['genres']).toarray()
    num_array1 = data[data['name'] == song_name].select_dtypes(
        include=np.number).to_numpy()

    # We will store similarity for each row of the dataset.
    sim = []
    for idx, row in data.iterrows():
        name = row['name']

        # Getting vector for current song.
        text_array2 = song_vectorizer.transform(
            data[data['name'] == name]['genres']).toarray()
        num_array2 = data[data['name'] == name].select_dtypes(
            include=np.number).to_numpy()

        # Calculating similarities for text as well as numeric features
        text_sim = cosine_similarity(text_array1, text_array2)[0][0]
        num_sim = cosine_similarity(num_array1, num_array2)[0][0]
        sim.append(text_sim + num_sim)

    return sim
To calculate the similarity between the two vectors we have used the concept of
cosine similarity.
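For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their norms; a tiny self-contained sketch:
Python3
import numpy as np

# cos(a, b) = (a . b) / (||a|| * ||b||)
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(np.array([1, 0, 1]), np.array([1, 1, 1])))  # ~0.816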
Python3
def recommend_songs(song_name, data=tracks):
    # Base case
    if tracks[tracks['name'] == song_name].shape[0] == 0:
        print('This song is either not so popular or you have entered '
              'an invalid name.\n Some songs you may like:\n')
        for song in data.sample(n=5)['name'].values:
            print(song)
        return

    data['similarity_factor'] = get_similarities(song_name, data)

    data.sort_values(by=['similarity_factor', 'popularity'],
                     ascending=[False, False],
                     inplace=True)

    # First song will be the input song itself as the similarity will be highest.
    display(data[['name', 'artists']][2:7])
Now, it's time to see the recommender system at work. Let's see which songs the recommender system will recommend if a user listens to the famous song 'Shape of You'.
Python3
recommend_songs('Shape of You')
Output:
Recommended songs if you hear ‘Shape of you’
Let’s try this on one more song.
Python3
recommend_songs('Love Someone')
Output:
Recommended songs if you hear ‘Love Someone’
Shown below is the case when the song name entered is incorrect.
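(The call below uses a hypothetical, deliberately misspelled title.)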
Python3
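recommend_songs('Shapes of Yuo')  # hypothetical, deliberately misspelled title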
Output:
If the input song name is not in the dataset
Conclusion
Although this model requires a lot of changes before it can be used in any real-world music app or website, it gives an overview of how recommendation systems are built and used.
Python3
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
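The data creation and random center initialization are sketched below; the make_blobs parameters and the random seed are assumptions chosen to match the three-center setup shown in the output.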
Python3
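# Creating a sample dataset with 3 blobs (assumed parameters)
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)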
Python3
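# Randomly initializing the 3 cluster centers (a sketch)
k = 3
clusters = {}
np.random.seed(23)

for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    cluster = {'center': center, 'points': []}
    clusters[idx] = cluster

clusters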
Output:
{0: {'center': array([0.06919154, 1.78785042]), 'points': []}, 1: {'center': array([ 1.06183904, -0.87041662]), 'points': []}, 2: {'center': array([-1.11581855, 0.74488834]), 'points': []}}
Plot the randomly initialized centers with the data points
Python3
plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()
Output:
Data points with random centers
The plot displays a scatter plot of the data points (X[:, 0], X[:, 1]) with grid lines. It also marks the initial cluster centers (red stars) generated for K-means clustering.
Define Euclidean distance
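The Euclidean distance function and the three clustering helpers used below are sketched here; their bodies are assumptions written to be consistent with how they are called in the training loop.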
Python3
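# Euclidean distance between two points (a sketch)
def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))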
Python3
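# Assigning each data point to its nearest cluster center (a sketch)
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters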
Python3
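# Updating each center to the mean of its assigned points (a sketch)
def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
            clusters[i]['points'] = []
    return clusters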
Python3
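# Predicting the cluster of every data point (a sketch)
def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred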
clusters = assign_clusters(X, clusters)
clusters = update_clusters(X, clusters)
pred = pred_cluster(X, clusters)
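The predicted clusters and final centers can be plotted as sketched below, followed by the imports assumed by the scikit-learn Iris example that comes next.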
Python3
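# Plotting the data points colored by predicted cluster,
# along with the final centers (a sketch)
plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()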
Python3
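# Imports assumed by the scikit-learn based K-means example below
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from matplotlib import cm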
Python3
X, y = load_iris(return_X_y=True)
Elbow Method
Finding the ideal number of groups to divide the data into is a basic stage in any unsupervised algorithm. One of the most common techniques for figuring out this ideal value of k is the elbow approach.
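Sketches of the inertia loop and the elbow plot on the Iris data follow (the k range of 1 to 10 is assumed):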
Python3
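# Computing inertia (within-cluster sum of squares) for k = 1..10 (a sketch)
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=2)
    km.fit(X)
    sse.append(km.inertia_)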
Python3
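# Plotting inertia against k; the "elbow" suggests a good value of k
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()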
Output:
Elbow Method
From the above graph, we can observe an elbow-like situation at k=2 and k=3. So, we are considering k=3.
Build the Kmeans clustering model
Python3
kmeans = KMeans(n_clusters=3, random_state=2)
kmeans.fit(X)
Output:
KMeans(n_clusters=3, random_state=2)
Find the cluster centers
Python3
kmeans.cluster_centers_
Output:
array([[5.006     , 3.428     , 1.462     , 0.246     ],
       [5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
       [6.85      , 3.07368421, 5.74210526, 2.07105263]])
Predict the cluster group:
Python3
pred = kmeans.fit_predict(X)
pred
Output:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1,
1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1],
dtype=int32)
Plot the cluster centers with data points
Python3
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[:2]
    plt.scatter(center[0], center[1], marker='^', c='red')
# the first two Iris features are the sepal measurements
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")

plt.subplot(1, 2, 2)
plt.scatter(X[:, 2], X[:, 3], c=pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[2:4]
    plt.scatter(center[0], center[1], marker='^', c='red')
# the last two Iris features are the petal measurements
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
Output:
K-means clustering
The subplot on the left displays sepal length vs. sepal width with data points colored by cluster, and red markers indicate the K-means cluster centers. The subplot on the right shows petal length vs. petal width similarly.
Conclusion
In conclusion, K-means clustering is a powerful unsupervised machine learning algorithm for grouping unlabeled datasets. Its objective is to divide data into clusters, making similar data points part of the same group. The algorithm initializes cluster centroids and iteratively assigns data points to the nearest centroid, updating centroids based on the mean of points in each cluster.
Frequently Asked Questions (FAQs)
1. What is k-means clustering for data analysis?
K-means is a partitioning method that divides a dataset into 'k' distinct, non-overlapping subsets (clusters) based on similarity, aiming to minimize the variance within each cluster.
2. What is an example of k-means in real life?
Customer segmentation in marketing, where k-means groups customers based on purchasing behavior, allowing businesses to tailor marketing strategies for different segments.
3. What type of data suits the k-means clustering model?
K-means works well with numerical data, where the concept of distance between data points is meaningful. It's commonly applied to continuous variables.
4. Is K-means used for prediction?
K-means is primarily used for clustering and grouping similar data points. It does not predict labels for new data; it assigns them to existing clusters based on similarity.
5. What is the objective of k-means clustering?
The objective is to partition data into 'k' clusters, minimizing the intra-cluster variance. It seeks to form groups where data points within each cluster are more similar to each other than to those in other clusters.
python3
import numpy as np
import matplotlib.pyplot as plt
import cv2

%matplotlib inline

# Read in the image
image = cv2.imread('images/monarch.jpg')

# Change color to RGB (from BGR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

plt.imshow(image)
Now we have to prepare the data for K means. The image is a 3-dimensional shape but
to apply k-means clustering on it we need to reshape it to a 2-dimensional array.
Code:
python3
# Reshaping the image into a 2D array of pixels and 3 color values (RGB)
pixel_vals = image.reshape((-1, 3))

# Convert to float type
pixel_vals = np.float32(pixel_vals)
Now we will implement the K means algorithm for segmenting an image.
Code: Taking k = 3, which means that the algorithm will identify 3 clusters in the
image.
python3
# the below line of code defines the criteria for the algorithm to stop running,
# which will happen when 100 iterations are run or the epsilon (which is the
# required accuracy) becomes 85%
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.85)

# then perform k-means clustering with the number of clusters defined as 3
# also, random centers are initially chosen for k-means clustering
k = 3
retval, labels, centers = cv2.kmeans(pixel_vals, k, None, criteria,
                                     10, cv2.KMEANS_RANDOM_CENTERS)

# convert data into 8-bit values
centers = np.uint8(centers)
segmented_data = centers[labels.flatten()]

# reshape data into the original image dimensions
segmented_image = segmented_data.reshape((image.shape))

plt.imshow(segmented_image)
Output:
As you can see, with an increase in the value of k, the image becomes clearer and more distinct because the K-means algorithm can classify more classes/clusters of colors. K-means clustering works well when we have a small dataset. It can segment objects in images and also gives better results. But when it is applied to large datasets (a larger number of images), it looks at all the samples in one iteration, which takes up a lot of time.