Natural Language Processing Neural Networks
Artificial Intelligence (AI) is a field of Computer Science that focuses on creating systems that can
perform tasks that would typically require human intelligence, such as recognizing speech,
understanding natural language, making decisions, and learning. We use AI to build various
applications, including image and speech recognition, natural language processing (NLP),
robotics, and machine learning models like neural networks.
2.
Machine learning and Artificial Intelligence (AI) are closely related but distinct fields within
the broader domain of computer science. AI includes not only machine learning but also
other approaches, like rule-based systems, expert systems, and knowledge-based systems,
which do not necessarily involve learning from data. Many state-of-the-art AI systems are
built upon machine learning techniques, as these approaches have proven to be highly
effective in tackling complex, data-driven problems.
3.
Deep learning is a subfield of machine learning that focuses on the development of artificial
neural networks with multiple layers, also known as deep neural networks. These networks
are particularly effective in modeling complex, hierarchical patterns and representations in
data. Deep learning is inspired by the structure and function of the human brain, specifically
the biological neural networks that make up the brain.
4.
Neural networks are one of many types of ML algorithms that are used to model complex
patterns in data. They are composed of three types of layers: an input layer, one or more hidden
layers, and an output layer.
5.
Explain TensorFlow.
TensorFlow is an open-source platform developed by Google designed primarily for high-
performance numerical computation. It offers a collection of workflows that can be used to
develop and train models, making machine learning robust and efficient. TensorFlow is
highly customizable, which helps developers create experimental learning architectures and
iterate on them to produce the desired results.
6.
Cognitive computing is a type of AI that mimics human thought processes. We use this form
of computing to solve problems that are too complex for traditional computer systems, and it
offers several major benefits.
7.
Natural Language Processing (NLP) and Natural Language Understanding (NLU) are two
closely related subfields within the broader domain of Artificial Intelligence (AI), focused on
the interaction between computers and human languages. Although they are often used
interchangeably, they emphasize different aspects of language processing.
NLP deals with the development of algorithms and techniques that enable computers to
process, analyze, and generate human language. NLP covers a wide range of tasks,
including text analysis, sentiment analysis, machine translation, summarization, part-of-
speech tagging, named-entity recognition, and more. The goal of NLP is to enable
computers to effectively handle text and speech data, extract useful information, and
generate human-like language outputs.
NLU, on the other hand, is a subset of NLP that focuses specifically on the comprehension and
interpretation of meaning from human language inputs. NLU aims to disambiguate the
nuances, context, and intent in human language, helping machines grasp not just the
structure but also the underlying meaning, sentiment, and purpose. NLU tasks may include
sentiment analysis, question-answering, intent recognition, and semantic parsing.
8.
Some examples of weak AI include rule-based systems and decision trees; in general, systems
built for a single, narrowly defined task fall under weak AI. Strong AI, on the other hand, refers
to systems with general, human-level intelligence that could teach themselves to solve new
problems; techniques such as neural networks and deep learning move in that direction but
remain narrow in scope.
9.
Data mining is the process of discovering patterns, trends, and useful information from large
datasets using various algorithms, statistical methods, and machine learning techniques. It
has gained significant importance due to the growth of data generation and storage
capabilities. The need for data mining arises from several aspects, including decision-
making.
10.
Healthcare: Data mining is used to predict patient outcomes, detect fraud and abuse, measure
the effectiveness of certain treatments, and improve patient-doctor relationships.
Finance: The finance and banking industry depends on high-quality, reliable data. Data mining can be
used to predict stock prices, forecast loan repayments, and determine credit ratings.
Retail: It is used to predict consumer behavior and to spot buying patterns that improve customer
service and satisfaction.
11.
1. Language understanding - This refers to the ability to interpret the meaning of a piece
of text.
12.
LSTM stands for Long Short-Term Memory, and it is a type of recurrent neural
network (RNN) architecture that is widely used in artificial intelligence and natural language
processing. LSTM networks have been successfully used in a wide range of applications,
including speech recognition, language translation, and video analysis, among others.
13.
Artificial Narrow Intelligence (ANI), also known as Weak AI, refers to AI systems that are
designed and trained to perform a specific task or a narrow range of tasks. These systems
are highly specialized and can perform their designated task with a high degree of accuracy
and efficiency.
14.
A data cube is a multidimensional representation of data that can be used to support
various types of analysis and modeling. Data cubes are often used in machine learning and
data mining applications to help identify patterns, trends, and correlations in complex
datasets.
15.
Model accuracy refers to how often a model correctly predicts the outcome of a specific task
on a given dataset. Model performance, on the other hand, is a broader term that
encompasses various aspects of a model's performance, including its accuracy, precision,
recall, F1 score, AUC-ROC, etc. Depending on the problem you're solving, one metric may
be more important than the other.
16.
Generative Adversarial Networks (GANs) are a class of deep learning models that consist of
two primary components working together in a competitive setting. GANs are used to
generate new, synthetic data that closely resemble a given real-world dataset. The two main
components of a GAN are:
Generator: The generator is a neural network that takes random noise as input and
generates synthetic data samples. The aim of the generator is to produce realistic data that
mimic the distribution of the real-world data. As the training progresses, the generator
becomes better at generating data that closely resemble the original dataset, without actually
replicating any specific instances.
Discriminator: The discriminator is a neural network that receives both real samples and the
generator's synthetic samples and learns to tell them apart. Its feedback pushes the generator
to improve, and the two networks are trained together in this adversarial loop.
17.
Deep learning models involve handling various types of data, which require specific data
structures to store and manipulate the data efficiently. Some of the most common data
structures used in deep learning are:
Tensors: Tensors are multi-dimensional arrays and are the fundamental data structure used
in deep learning frameworks like TensorFlow and PyTorch. They are used to represent a
wide variety of data, including scalars, vectors, matrices, or higher-dimensional arrays.
Matrices: Matrices are two-dimensional arrays and are a special case of tensors. They are
widely used in linear algebra operations that are common in deep learning, such as matrix
multiplication, transpose, and inversion.
Vectors: Vectors are one-dimensional arrays and can also be regarded as a special case of
tensors. They are used to represent individual data points, model parameters, or
intermediate results during calculations.
Arrays: Arrays are fixed-size, homogeneous data structures that can store elements in a
contiguous memory location. Arrays can be one-dimensional (similar to vectors) or multi-
dimensional (similar to matrices or tensors).
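A minimal NumPy sketch of these structures is shown below; the same ideas carry over to TensorFlow and PyTorch tensors, and the variable names are illustrative only.
import numpy as np

scalar = np.array(3.14)                       # 0-D: a single value
vector = np.array([1.0, 2.0, 3.0])            # 1-D array (vector)
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2-D array (matrix)
tensor = np.zeros((2, 3, 4))                  # 3-D array (higher-order tensor)

print(scalar.ndim, vector.ndim, matrix.ndim, tensor.ndim)   # 0 1 2 3
print(matrix @ vector[:2])                    # matrix-vector product, a common deep learning operation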
18.
The hidden layer in a neural network is responsible for mapping the input to the output. The
hidden layer's function is to extract and learn features from the input data that are relevant
for the given task. These features are then used by the output layer to make predictions or
classifications.
In other words, the hidden layer acts as a "black box" that transforms the input data into a
form that is more useful for the output layer.
19.
Mention some advantages of neural networks.
Neural networks can detect non-linear relationships between variables and can
identify all types of interactions between predictor variables.
Neural networks can handle large amounts of data and extract meaningful insights
from it. This makes them useful in a variety of applications, such as image
recognition, speech recognition, and natural language processing.
Neural networks are able to filter out noise and extract meaningful features from
data. This makes them useful in applications where the data may be noisy or contain
irrelevant information.
Neural networks can adapt to changes in the input data and adjust their parameters
accordingly. This makes them useful in applications where the input data is dynamic
or changes over time.
20.
The main difference between stemming and lemmatization is that stemming is a crude, rule-based
process that chops off word endings, while lemmatization is a more sophisticated, dictionary-based
approach that returns the word's lemma (its dictionary form).
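A small illustration using NLTK is given below; this is a sketch that assumes NLTK is installed and the WordNet corpus has already been downloaded (e.g., via nltk.download('wordnet')), and the sample words are arbitrary.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "better"]:
    # The stemmer applies suffix-stripping rules; the lemmatizer looks up the dictionary form.
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))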
21.
Extraction-based: It does not create new phrases or words; instead, it reuses phrases and
words that already exist in the text and presents only those. Extraction-based summarization ranks
all the sentences according to their relevance to the text and presents you
with the most important sentences.
Abstraction-based: It generates new phrases and words and puts them together into
meaningful sentences. Along with that, abstraction-based summarization includes the
most important facts found in the text. It tries to capture the meaning of the whole text and
present that meaning to you.
22.
Corpus in NLP refers to a large collection of texts. A corpus can be used for various tasks
such as building dictionaries, developing statistical models, or simply for reading
comprehension.
23.
Binarizing of data is the process of converting data features of any entity into vectors of
binary numbers to make classifier algorithms more productive. The binarizing technique is
used for the recognition of shapes, objects, and characters. Using this, it is easy to
distinguish the object of interest from the background in which it is found.
24.
Perception is the process of interpreting sensory information, and there are three main types
of perception: visual, auditory, and tactile.
Vision: It is used in the form of face recognition, medical imaging analysis, 3D scene
modeling, video recognition, human pose tracking, and many more.
Auditory: Machine audition has a wide range of applications, such as speech synthesis,
voice recognition, and music recording. These solutions are integrated into voice assistants
and smartphones.
Tactile: With this, machines are able to acquire intelligent reflexes and better interact with the
environment.
25.
Decision trees have some advantages, such as being easy to understand and interpret, but
they also have some disadvantages, such as being prone to overfitting.
26.
The marginalization process is used to eliminate certain variables from a set of data, in order
to make the data more manageable. In probability theory, marginalization involves
integrating over a subset of variables in a joint distribution to obtain the distribution of the
remaining variables. The process essentially involves "summing out" the variables that are
not of interest, leaving only the variables that are desired.
27.
An artificial neural network is a machine learning (ML) algorithm that is used to simulate the workings of the
human brain. ANNs consist of interconnected nodes (also known as neurons) that process
and transmit information in a way that mimics the behavior of biological neurons.
The primary function of an artificial neural network is to learn from input data, such as
images, text, or numerical values, and then make predictions or classifications based on that
data. ANNs can be used for a wide range of tasks, such as image recognition, natural
language processing, and predictive analytics.
28.
Cognitive computing is a subfield of AI that focuses on creating systems that can mimic
human cognition and perform tasks that require human-like intelligence. The primary goal of
cognitive computing is to enable computers to interact more naturally with humans,
understand complex data, reason, learn from experience, and make decisions
autonomously.
There is no strict categorization of cognitive computing types; however, the key capabilities
and technologies associated with cognitive computing can be grouped as follows:
NLP: NLP techniques enable cognitive computing systems to understand, process, and
generate human language in textual or spoken form.
Computer Vision: Computer vision deals with the interpretation and understanding of visual
information, such as images and videos. In cognitive computing, it is used to extract useful
information from visual data, recognize objects, understand scenes, and analyze emotions
or expressions.
29.
Deep learning frameworks are software libraries and tools designed to simplify the
development, training, and deployment of deep learning models. They provide a range of
functionalities that support the implementation of complex neural networks and the execution
of mathematical operations required for their training and inference processes. Some
popular deep learning frameworks are TensorFlow, Keras, and PyTorch.
30.
Speech recognition and video recognition are two distinct areas within AI and involve
processing and understanding different types of data. While they share some commonalities
in terms of using machine learning and pattern recognition techniques, they differ in the data,
algorithms, and objectives associated with each domain.
Speech Recognition focuses on the automatic conversion of spoken language into textual
form. This process involves understanding and transcribing the spoken words, phrases, and
sentences from an audio signal.
Video Recognition deals with the analysis and understanding of visual information in the
form of videos. This process primarily involves extracting meaningful information from a
series of image frames, such as detecting objects, recognizing actions, identifying scenes,
and tracking moving objects.
31.
A pooling layer is a type of layer used in a convolutional neural network (CNN). Pooling
layers downsample the input feature maps by summarizing each pooled region, for example by
taking its maximum or average value. This reduces the dimensionality of the feature maps and
makes the CNN more robust to small changes in the input.
32.
Boltzmann machines are a type of energy-based model that learns a probability distribution
over its inputs using a network of symmetrically connected, stochastic units. These units act like
neurons in a neural network, and restricted variants of them can be stacked to build deep learning
models such as deep belief networks.
33.
Regular grammar is a type of grammar that specifies a set of rules for how strings can be
formed from a given alphabet. These rules can be used to generate new strings or to check
if a given string is valid.
34.
There are many ways to obtain data for NLP projects. Some common sources of data
include texts, transcripts, social media posts, and reviews. You can also use web scraping
and other methods to collect data from the internet.
35.
Regular expressions are a type of syntax used to match patterns in strings. They can be
used to find, replace, or extract text. In layman's terms, regular expressions are a way to
describe patterns in data. They are commonly used in programming, text editing, and data
processing tasks to manipulate and extract text in a more efficient and precise way.
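A brief sketch of these uses with Python's built-in re module is shown below; the email pattern is deliberately simplified and the example strings are illustrative.
import re

text = "Contact us at support@example.com or sales@example.org."
# Find all email-like substrings (a simplified pattern, for illustration only).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}", text)
print(emails)                                     # ['support@example.com', 'sales@example.org']

# Replace every digit with '#', a typical find-and-replace use of regular expressions.
masked = re.sub(r"\d", "#", "Card number 1234")
print(masked)                                     # 'Card number ####'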
36.
How is NLTK different from spaCy?
Both NLTK and spaCy are popular NLP libraries in Python, but they have some key
differences:
NLTK is a general-purpose NLP library that provides a wide range of tools and algorithms for
basic NLP tasks such as tokenization, stemming, and part-of-speech tagging. NLTK also has
tools for text classification, sentiment analysis, and machine translation. In contrast, spaCy
focuses more on advanced NLP tasks such as named entity recognition, dependency
parsing, and semantic similarity.
spaCy is generally considered to be faster and more efficient than NLTK due to its optimized
Cython-based implementation. spaCy is designed to process large volumes of text quickly
and efficiently, making it well-suited for production environments.
37.
There are several powerful tools and libraries available for Natural Language Processing
(NLP) tasks, which cater to various needs like text processing, tokenization, sentiment
analysis, machine translation, among others. Some of the best NLP tools and libraries
include:
NLTK: NLTK is a popular Python library for working with human language data. It provides
easy-to-use interfaces to over 50 corpora and lexical resources, along with text processing
libraries for classification, tokenization, stemming, tagging, parsing, and more.
spaCy: spaCy is a modern, high-performance, and industry-ready NLP library for Python. It
offers state-of-the-art algorithms for fast and accurate text processing, and includes features
like part-of-speech tagging, named entity recognition, dependency parsing, and word
vectors.
Gensim: Gensim is a Python library designed for topic modeling and document similarity
analysis. It specializes in unsupervised semantic modeling and is particularly useful for tasks
like topic extraction, document comparison, and information retrieval.
38.
Yes, chatbots are derived from NLP. NLP is used to process and understand human
language so that chatbots can respond in a way that is natural for humans.
39.
Embedding is a technique to represent data in a vector space so that similar data points are
close together. Some techniques to accomplish embedding are word2vec and GloVe.
Word2vec: It learns vectors such that words used in similar contexts end up with similar
representations, which helps capture context. It establishes the association of a word with other
words of similar meaning through the vectors it creates.
GloVe: It is used for word representation. GloVe generates word
embeddings by aggregating the global word-word co-occurrence matrix from a corpus. The
resulting vectors exhibit linear substructures of word meaning in the vector space.
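A minimal word2vec sketch using the Gensim library is shown below; the parameter names assume Gensim 4.x, and the toy corpus is far too small to produce meaningful vectors, so it serves only to show the API shape.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]
# Train a small word2vec model on the toy corpus.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["cat"].shape)                 # (50,) -- the learned embedding for "cat"
print(model.wv.most_similar("cat", topn=2))  # nearest words in the embedding space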
1.
2.
Gradient descent is a popular optimization algorithm that is used to find the minimum of a
function iteratively. It's widely used in machine learning and deep learning for training models
by minimizing the error or loss function, which measures the difference between the
predicted and actual values.
3.
Ensuring fair comparison: Normalization brings all features to a comparable range, mitigating
the effect of different magnitudes or units of measurement, and ensuring that each feature
contributes equally to the model's predictions.
Faster convergence: Gradient-based optimization algorithms can converge faster when data
are normalized, as the search space becomes more uniformly scaled and the gradients have
a more consistent magnitude.
Reducing numerical issues: Normalizing data can help prevent numerical issues like over- or
underflow that may arise when dealing with very large or very small numbers during
calculations.
4.
Sigmoid: Maps the input to a value between 0 and 1, allowing for smooth gradient updates.
However, it suffers from the vanishing gradient problem and is not zero-centered.
Tanh: Maps the input to a value between -1 and 1, providing a zero-centered output. Like the
sigmoid function, it can also suffer from the vanishing gradient problem.
ReLU (Rectified Linear Unit): Outputs 0 for negative input values and retains the input for
positive values. It helps alleviate the vanishing gradient problem and has faster computation
time, but the output is not zero-centered and can suffer from the dying ReLU issue.
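The three functions above can be sketched in a few lines of NumPy; the input values are arbitrary examples.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))   # values squashed into (0, 1)
print(tanh(x))      # values squashed into (-1, 1), zero-centered
print(relu(x))      # negatives clipped to 0, positives unchanged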
5.
Data augmentation is a technique used to increase the amount of data available for training
a machine learning model. This is especially important for deep learning models, which
require large amounts of data to train.
6.
The Swish function, defined as swish(x) = x * sigmoid(x), is an activation function. It is a smooth,
non-linear, and differentiable function that has been shown to outperform some of the traditional
activation functions, like ReLU, in certain deep learning tasks.
7.
Forward propagation is the process of computing the output of a neural network given an
input. Forward propagation involves passing an input through the network, layer by layer,
until the output is produced. Each layer applies a transformation to the output of the previous
layer using a set of weights and biases. The activation function is applied to the transformed
output, producing the final output of the layer.
On the other hand, backpropagation is the process of computing the gradient of the loss
function with respect to the weights of the network. It is used to update the weights and
biases of the network during the training process. It involves calculating the gradient of the
loss function with respect to each weight and bias in the network. The gradient is then used
to update the weights and biases using an optimization algorithm such as gradient descent.
8.
Classification is a type of supervised learning task in machine learning and statistics, where
the objective is to assign input data points to one of several predefined categories or labels.
In a classification problem, the model is trained on a dataset with known labels and learns to
predict the category to which a new, unseen data point belongs. Examples of classification
tasks include spam email detection, image recognition, and medical diagnosis.
Pattern recognition: Classification algorithms are capable of identifying and learning complex
patterns in data, enabling them to predict the category of new inputs accurately.
Anomaly detection: Classification models can be used to detect unusual or anomalous data
points that don't fit the learned patterns.
9.
Convolutional neural networks are a type of neural network that is well-suited for image
classification tasks. In classification, the model learns to classify input data into one or more
predefined classes or categories based on the features of the data. There are various
benefits of classification, and it has numerous practical applications in different fields, such
as:
Object Recognition: It is used in image and speech recognition to identify objects, faces, or
voices.
Sentiment Analysis: It helps understand the polarity of textual data, which can be used to
gauge customer feedback, opinions, and emotions.
Email Spam Filtering: It can be used to classify emails into spam or non-spam categories
to improve email communication.
10.
Autoencoders are a type of neural network that is used for dimensionality reduction. The
different types of autoencoders include Denoising, Sparse, Undercomplete, etc.
Sparse Autoencoder: This has a sparsity penalty, a value close to zero but not exactly zero.
It is applied on the hidden layer in addition to the reconstruction error, which prevents
overfitting.
Undercomplete Autoencoder: This does not need any regularization because it maximizes
the probability of the data rather than simply copying the input to the output.
11.
12.
LSTM stands for Long Short-Term Memory. It is a neural network architecture that is used for
modeling time series data. LSTM has three main components:
The forget gate: This gate decides how much information from the previous state is to be
retained in the current state.
The input gate: This gate decides how much new information from the current input is to be
added to the current state.
The output gate: This gate decides what information from the current state is to be output.
13.
Transfer learning is a machine learning technique where you use knowledge from one
domain and apply it to another domain. This is usually done to accelerate the learning
process or to improve performance.
Learn from smaller datasets: If you have a small dataset, you can use transfer learning to
learn from a larger dataset in the same domain. This will help you to build better models.
Learn from different domains: You can use transfer learning to learn from different domains.
For example, if you want to build a computer vision model, you can use knowledge from the
medical domain.
Better performance: Transfer learning can help you improve the performance of your
models by reusing knowledge learned in related domains.
Pre-trained models: If you use a pre-trained model, you can save time and resources. This is
because you don’t have to train the model from scratch.
Use of fine-tune models: You can fine-tune models using transfer learning. Also, you can
adapt the model to your specific needs.
14.
The cost/loss function is an important part of machine learning that maps a set of input
parameters to a real number that represents the cost or loss. The cost/loss function is used
for optimization problems. The goal of optimization is to find the set of input parameters that
minimize the cost/loss function.
15.
Epoch, batch, and iteration are all important terms in machine learning. An epoch is one complete
pass of the entire training dataset through the model; a batch is the subset of training samples
processed in one iteration; and an iteration is a single update of the model's parameters on one
batch. For example, 2,000 training samples with a batch size of 100 give 20 iterations per epoch.
16.
Explain dropouts.
Dropout is a method used to prevent overfitting in a neural network. It refers to randomly
dropping out (temporarily removing) some of the network's units during training. The idea is loosely
analogous to sexual reproduction in nature, where genes from different parents are combined and
must work in ever-changing combinations, which discourages fragile co-adaptations.
17.
As more layers are added and the distance from the final layer increases, backpropagation
becomes less effective at sending useful error information to the lower layers. As the error signal
is propagated back, the gradients shrink and become very small relative to the network
weights, so the early layers barely learn. These shrinking gradients are known as vanishing gradients.
18.
Batch gradient descent is an optimization algorithm that calculates the gradient of the cost
function with respect to the weights of the model using the entire training dataset in each update
step. The weights are then updated in the direction that decreases the cost function.
19.
20.
One of the biggest drawbacks of Machine learning is that it can be biased if the data used to
train the algorithm is not representative of the real world. For example, if an algorithm is
trained using data that is mostly from one gender or one race, it may be biased against other
genders or races.
Other common challenges include algorithm selection and data acquisition.
21.
Sentiment analysis is the process of analyzing text to determine the emotional tone of the
text in NLP. This can be helpful in customer service to understand how customers are
feeling, or in social media to understand the general public sentiment about a topic.
22.
Breadth-First Search (BFS) and Depth-First Search (DFS) are two algorithms used for graph
traversal.
BFS algorithm starts from the root node (or any other selected node) and visits all the nodes
at the same level before moving to the next level.
On the other hand, DFS algorithm starts from the root node (or any other selected node) and
explores as far as possible along each branch before backtracking.
23.
Supervised learning involves training a model with labeled data, where both input features
and output labels are provided. The model learns the relationship between inputs and
outputs to make predictions for unseen data. Common supervised learning tasks include
classification and regression.
Unsupervised learning, on the other hand, uses unlabeled data where only input features
are provided. The model seeks to discover hidden structures or patterns in the data, such as
clusters or data representations. Common unsupervised learning tasks include clustering,
dimensionality reduction, and anomaly detection.
24.
Text extraction is the process of extracting text from images, PDFs, or other sources. This can be
done with OCR (optical character recognition) or with other tools that convert the content into a
machine-readable text format.
25.
1. They can be biased if the data used to train the model is not representative of the
real world.
2. Linear models can also be overfit if the data used to train the model is too small.
3. Linear models assume a linear relationship between the input features and the output
variable, which may not hold in reality. This can lead to poor predictions and
decreased model performance.
26.
Artificial intelligence interview questions like this can be easy and difficult at the same time,
as you may know the answer but not have it on the tip of your tongue. Hence, a quick refresher
can help a lot. Dimensionality reduction refers to reducing the number of random
variables under consideration. This can be achieved by different techniques, including principal
component analysis, low variance filter, missing values ratio, high correlation filter, random forest, and
others.
27.
This is a popular AI interview question. A cost function is a scalar function that helps to
identify how wrong an AI model is with regard to its ability to determine the relationship
between X and Y. In other words, it tells us the neural network’s error factor.
The neural network works better when the cost function is lower. For instance, it takes the
output predicted by the neural network and the actual output and then computes how
incorrect the model was in its prediction.
So, the cost function will give a lower number if the predictions don't differ too much from the
actual values, and vice versa.
28.
Learning rate: This parameter controls how quickly the network updates its parameters during training.
Momentum: This parameter helps the optimizer escape local minima and smooths out the jumps
during gradient descent.
Number of epochs: This parameter refers to the number of times the whole training
dataset is fed to the network during training. One can increase the number of epochs as long as
validation accuracy keeps improving; once validation accuracy starts to decrease while training
accuracy still increases, the model is overfitting.
Number of hidden layers: This parameter specifies the number of layers between the input
and output layers.
Number of neurons in each hidden layer: This parameter specifies the number of neurons in
each hidden layer.
Activation functions: Activation functions are responsible for determining a neuron's output
based on the weighted sum of its inputs. Widely used activation functions include Sigmoid,
ReLU, Tanh, and others.
29.
Intermediate tensors are temporary data structures in a computational graph that store
intermediate results when executing a series of operations in Artificial Intelligence,
particularly in deep learning frameworks. These tensors represent the values produced
during the forward pass of a neural network while processing input data before reaching the
final output.
Yes, sessions have a lifetime, which starts when the session is created and ends when the
session is closed or the script is terminated. In TensorFlow 1.x, sessions were used to
execute and manage operations in a computational graph. A session allowed the allocation
of memory for tensor values and held necessary resources to execute the operations. In
TensorFlow 2.x, sessions and computational graphs have been replaced with a more
dynamic and eager execution approach, allowing for simpler and more Pythonic code.
30.
Exploding variables are a phenomenon in which the magnitude of a variable grows rapidly
over time, often leading to numerical instability and overflow errors. This can happen when a
variable is repeatedly multiplied or divided by a value that is greater than 1 or less than -1.
As a result, the variable's value grows exponentially or collapses to zero, causing
computational problems.
31.
Linear regression is a basic tool in statistical learning, but it cannot be used to build a deep
learning model. Deep learning models require non-linear functions to learn complex patterns
in data.
32.
Hyperparameters are parameters that are not learned by the model. They are set by the
user and used to control the model's behavior.
33.
An Artificial Super Intelligence (ASI) system is one that has not been achieved yet. Also known
as Super AI, it is a hypothetical system that would surpass human intelligence and execute any
task better than a human. The concept of ASI suggests that such an AI could exceed all
human intelligence, make complex decisions in harsh conditions, and think just like a human
would, or even better, while also forming emotional, sensible relationships.
34.
Overfitting occurs when a model learns the training data too well, including capturing noise
and random fluctuations. This often results in a model that performs poorly on unseen or
validation data. Techniques to prevent overfitting include cross-validation, regularization
(such as L1/L2), dropout, and early stopping.
35.
36.
What is the difference between full listing hypothesis and minimum redundancy hypothesis?
The full listing hypothesis states that every word form, including inflected and derived forms, is
stored as a whole entry in the lexicon. The minimum redundancy hypothesis states that only base
morphemes are stored, and complex words are decomposed into, or composed from, those
morphemes using rules, so that no information is stored more than once.
1.
Step 1: Give weights (x,y) random values and then compute the error, also called Sum of
Squares Error (SSE).
Step 2: Compute the gradient or the change in SSE when you change the value of the
weights (x,y) by a small amount. This step helps us identify the direction in which we must
move x and y to minimize SSE.
Step 3: Adjust the weights with the gradients for achieving optimal values for the minimal
SSE.
Step 4: Use the adjusted weights to make new predictions and calculate the new error.
Step 5: Repeat steps 2 to 4 until further adjustments stop producing a significant reduction
in error.
These types of artificial intelligence interview questions help hiring managers properly gauge
a candidate's expertise in this domain. Hence, you must thoroughly understand such
questions and list all the steps properly to move ahead.
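A minimal NumPy sketch of these steps for a simple linear model y ≈ w*x + b is shown below; the synthetic data, learning rate, and iteration count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
y_true = 3.0 * X + 2.0 + rng.normal(0, 1, size=50)

w, b = rng.normal(), rng.normal()        # Step 1: random initial weights
lr = 1e-4                                # learning rate (an assumed value for this toy data)

for step in range(5000):                 # Step 5: repeat until improvement stalls
    y_pred = w * X + b
    error = y_pred - y_true
    sse = np.sum(error ** 2)             # Steps 1/4: compute the Sum of Squares Error
    grad_w = 2 * np.sum(error * X)       # Step 2: gradient of SSE with respect to w
    grad_b = 2 * np.sum(error)           # gradient of SSE with respect to b
    w -= lr * grad_w                     # Step 3: adjust the weights along the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2), round(sse, 2))   # w and b should approach 3 and 2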
2.
Write a function to create one-hot encoding for categorical variables in a Pandas DataFrame
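A possible implementation is sketched below using pandas.get_dummies; the DataFrame contents and column names are illustrative assumptions.
import pandas as pd

def one_hot_encode(df, columns):
    """Return a copy of df with the given categorical columns one-hot encoded."""
    return pd.get_dummies(df, columns=columns)

df = pd.DataFrame({"city": ["Paris", "Tokyo", "Paris"], "price": [10, 12, 9]})
print(one_hot_encode(df, ["city"]))
# The 'city' column is replaced by indicator columns such as city_Paris and city_Tokyo.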
3.
4.
There are a number of ways to handle an imbalanced dataset, such as using different
algorithms, weighting the classes, or oversampling the minority class.
Algorithm selection: Some algorithms are better suited to handle imbalanced data than
others. For example, decision trees and random forests tend to work well on imbalanced
data, while algorithms like logistic regression or support vector machines may struggle.
Class weighting: By assigning higher weights to the minority class, you can make the
algorithm give more importance to it during training. This can help prevent the algorithm from
always predicting the majority class.
Oversampling: You can create synthetic samples of the minority class by randomly
duplicating existing samples or generating new samples based on the existing ones. This
can balance the class distribution and help the algorithm learn more about the minority class.
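A brief sketch of class weighting and naive random oversampling with scikit-learn is shown below; the synthetic dataset and parameter choices are assumptions for demonstration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# Class weighting: give the minority class more importance during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Naive random oversampling: duplicate minority-class samples to balance the data.
minority = np.where(y == 1)[0]
extra = np.random.default_rng(0).choice(minority, size=len(y) - 2 * len(minority), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y), np.bincount(y_bal))   # class counts before and after oversampling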
5.
The vanishing gradient problem is a difficulty encountered when training artificial neural
networks using gradient-based learning methods. This problem is resolved by replacing the
activation function of the network. You can use the Long Short-Term Memory (LSTM)
network to solve the problem.
It has three gates: the input, forget, and output gates. The forget gate continuously
decides which information should be dropped as it flows through the network. In this way, the
network maintains both short-term and long-term memory, so information can be carried through
the network and retrieved even at a late stage to establish the context of a prediction.
6.
7.
Write a Python function to sort a list of numbers using the merge sort algorithm
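A possible implementation is sketched below; it returns a new sorted list rather than sorting in place.
def merge_sort(numbers):
    """Sort a list of numbers using the merge sort algorithm."""
    if len(numbers) <= 1:
        return numbers
    mid = len(numbers) // 2
    left = merge_sort(numbers[:mid])              # recursively sort each half
    right = merge_sort(numbers[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):       # merge the two sorted halves
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 7]))   # [1, 2, 5, 7, 9]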
8.
Sigmoid and softmax functions are used in classification problems. Sigmoid maps values to
a range of 0-1, which is useful for binary classification problems. Softmax maps values to a
range of 0-1 and also ensures that all values sum to 1, which is useful for multi-class
classification problems.
9.
Implement a Python function to calculate the sigmoid activation function value for any given
input.
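A possible implementation is sketched below; the test inputs are arbitrary.
import math

def sigmoid(x):
    """Sigmoid activation: maps any real input to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # 0.5
print(sigmoid(2))    # ~0.88
print(sigmoid(-2))   # ~0.12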
10.
Write a Python function to calculate R-squared (coefficient of determination) given true and
predicted values.
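A possible implementation is sketched below; the sample values are illustrative.
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

print(r_squared([3, 5, 7, 9], [2.8, 5.1, 7.2, 8.9]))   # close to 1 for a good fit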
11.
Pragmatic analysis is the process of analyzing text data in order to determine the speaker's
intention. This is useful in many applications, such as customer service and market
research. The main focus is on reinterpreting what was said in terms of what was actually meant,
which covers the aspects of language that require real-world knowledge. Pragmatic analysis
discovers this intended effect by applying a set of rules that characterize cooperative
dialogues. Basically, it means abstracting the meaningful use of language in real situations.
12.
Parsing is the process of breaking down a string of text into smaller pieces, or tokens. This
can be done using a regex, or a more sophisticated tool like a parser combinator. There are
various techniques for parsing in NLP, including rule-based approaches, statistical
approaches, and machine learning-based approaches. Some common parsing algorithms
include the Earley parser, the CYK parser, and the chart parser. These algorithms use
various methods such as probability models, tree-based representations, and context-free
grammars to parse a text and identify its grammatical structure.
14.
Implement a Python function to calculate the precision and recall of a binary classifier, given
true positive, false positive, true negative, and false negative values.
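A possible implementation is sketched below; the example counts are made up, and the true-negative count is accepted only to match the stated signature.
def precision_recall(tp, fp, tn, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN). TN is not needed for either metric."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(precision_recall(tp=40, fp=10, tn=45, fn=5))   # (0.8, ~0.889)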
15.
A human brain learns from its experiences, or from the past experiences it holds in its memory.
Just like the human brain, Limited Memory Artificial Intelligence learns from past data
already in memory and makes decisions based on it. However, this data is stored only for a
specific time and is not added to a permanent library of experiences. Self-driving cars are one
of the best technology examples of Limited Memory AI. Self-driving cars store data while
driving, such as how many vehicles are moving around them, the vehicles' speeds, and the traffic
lights. From these experiences, they learn how to drive properly on the road in heavy and
moderate traffic. A few companies are focused on these types of technologies.
16.
Write a Python function to compute the Euclidean distance between two points.
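A possible implementation for points of any dimension is sketched below; the sample points are arbitrary.
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points given as equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))   # 5.0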
17.
Describe the differences between stochastic gradient descent (SGD) and mini-batch
gradient descent.
Stochastic gradient descent (SGD) updates the model's weights using the gradient
calculated from a single training example. It converges faster because of frequent weight
updates; however, it can have a noisy convergence due to high variance in gradients.
Mini-batch gradient descent calculates the gradient using a small batch of training examples.
It strikes a balance between the computational efficiency of batch gradient descent and the
faster convergence of SGD. The noise in weight updates is reduced, leading to a more
stable convergence.
18.
Implement a function to calculate precision, recall, and F1-score given an input of actual and
predicted labels.
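A possible implementation for binary labels is sketched below; the sample labels are illustrative, and the positive class defaults to 1.
def precision_recall_f1(actual, predicted, positive=1):
    """Compute precision, recall, and F1-score from lists of actual and predicted labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))   # (0.67, 0.67, 0.67) approximately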
19.
Understanding data: You need to understand the distribution of your data to decide which
standardization technique is appropriate. For example, if the data is normally distributed, you
can use z-score normalization.
20.
Here's a basic implementation of Naïve Bayes Classifier in Python using the scikit-learn
library. This example demonstrates the process of loading a dataset, splitting it into training
and testing sets, fitting the model, and calculating its accuracy.
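The original code is not shown, so a minimal sketch along those lines is given below; the choice of the bundled Iris dataset and the split ratio are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load a dataset and split it into training and testing sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a Gaussian Naive Bayes model and evaluate its accuracy on the test set.
model = GaussianNB()
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))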
21.
22.
Entropy is the unpredictability in the data; the more uncertainty there is, the higher the entropy.
Entropy is used by information gain to make decisions: the lower the entropy after a split, the
larger the information gain.
Information gain is used in random forests and decision trees to decide the best split. Thus,
the bigger the information gain, the better the split and the lower the resulting entropy.
Entropy is used to calculate the information gain of a dataset before and after a split.
It quantifies the uncertainty, or impurity, in the data. The main purpose is to
reduce entropy and increase information gain. The feature providing the maximum information gain
is considered most important by the algorithm and is used for training the model.
23.
Here's a basic implementation of the Random Forest Regressor in Python using the scikit-
learn library. This example demonstrates the process of loading a dataset, splitting it into
training and testing sets, fitting the model, and calculating the predictions.
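The original code is not shown, so a minimal sketch along those lines is given below; the bundled diabetes dataset, n_estimators, and the split ratio are assumptions.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load a regression dataset and split it into training and testing sets.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the Random Forest Regressor and compute predictions on the test set.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, predictions))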
24.
Kernel tricks are a technique used in Artificial Intelligence, particularly in machine learning
algorithms, to transform a non-linearly separable problem into a linearly separable one. They
are commonly used in Support Vector Machines (SVMs) and other kernel-based algorithms
for solving complex classification or regression tasks.
The main idea behind kernel tricks is to map the input data from a lower-dimensional space
to a higher-dimensional space, in which the data points become linearly separable. This
mapping is done using a mathematical function called the kernel function.
25.
Write code for the K-nearest neighbors algorithm in Python.
Here's a basic implementation of the K-Nearest Neighbors (KNN) algorithm in Python using
the scikit-learn library. This example demonstrates the process of loading a dataset, splitting
it into training and testing sets, fitting the model, and calculating its accuracy.
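The original code is not shown, so a minimal sketch along those lines is given below; the Iris dataset, k = 5, and the split ratio are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load a dataset and split it into training and testing sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a KNN classifier with 5 neighbors and evaluate its accuracy.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))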
26.
Organize the data into a table with the category heads mentioned below.
All the rows must be ordered from the poorest to the richest.
Fill the '% of Population that is richer' column by adding all terms in 'Fraction of Population'
below that row.
Calculate the Score for each of the rows. The formula for the Score is:
Score = Fraction of Income * (Fraction of Population + 2 * % of Population that is richer).
Next, add all the terms in the 'Score' column. Let us call it 'Sum.'
Finally, calculate the Gini coefficient using the formula: Gini coefficient = 1 - Sum.
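A small sketch that follows these steps is shown below; the income and population fractions are made-up illustrative numbers, already ordered from poorest to richest.
def gini(income_fractions, population_fractions):
    # '% of population that is richer' for each row: sum of population fractions below it.
    richer = [sum(population_fractions[i + 1:]) for i in range(len(population_fractions))]
    # Score = Fraction of Income * (Fraction of Population + 2 * % richer), summed over rows.
    scores = [f_inc * (f_pop + 2 * rich)
              for f_inc, f_pop, rich in zip(income_fractions, population_fractions, richer)]
    return 1 - sum(scores)

# Perfect equality: every group holds an income share equal to its population share -> Gini = 0.
print(round(gini([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]), 3))
# Unequal distribution: the richest quarter holds most of the income -> Gini = 0.5 here.
print(round(gini([0.05, 0.10, 0.15, 0.70], [0.25, 0.25, 0.25, 0.25]), 3))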
Deep learning is the branch of machine learning that is based on artificial neural
network architectures, which makes it capable of learning complex patterns and relationships
within data. An artificial neural network (ANN) uses layers of interconnected nodes, called
neurons, that work together to process and learn from the input data.
In a fully connected Deep neural network, there is an input layer and one or more hidden
layers connected one after the other. Each neuron receives input from the previous layer
neurons or the input layer. The output of one neuron becomes the input to other neurons in
the next layer of the network, and this process continues until the final layer produces the
output of the network. The layers of the neural network transform the input data through a
series of nonlinear transformations, allowing the network to learn complex representations of
the input data.
Today Deep learning has become one of the most popular and visible areas of machine
learning, due to its success in a variety of applications, such as computer vision, natural
language processing, and Reinforcement learning.
An artificial neural network is inspired by the structure and function of biological
neurons in the human brain; it is also known as a neural network or neural net. An ANN uses layers of
interconnected nodes called artificial neurons that work together to process and learn from the
input data. The first layer of an artificial neural network is known as the input layer; it takes
input from external sources and transfers it to the next layer, known as the hidden layer,
where each neuron receives inputs from the previous layer's neurons, computes the weighted
sum, and transfers it to the neurons in the next layer. These connections are weighted, meaning
the influence of each input from the previous layer is scaled by a weight assigned to it, and these
weights are adjusted during the training process to improve the performance of the model. The
output of one neuron becomes the input to the neurons in the next layer of the network, and this
process continues until the final layer produces the output of the network.
Machine learning and deep learning both are subsets of artificial intelligence but there are
many similarities and differences between them.
Machine Learning: Applies statistical algorithms to learn the hidden patterns and relationships in the dataset.
Deep Learning: Uses artificial neural network architectures to learn the hidden patterns and relationships in the dataset.
Machine Learning: Takes less time to train the model.
Deep Learning: Takes more time to train the model.
Machine Learning: Less complex and easier to interpret the results.
Deep Learning: More complex; it works like a black box, so interpretations of the results are not easy.
Deep learning has many applications, and it can be broadly divided into computer vision,
natural language processing (NLP), and reinforcement learning.
Computer vision: Deep learning employs neural networks with several layers,
which enables automated learning and recognition of complex patterns in
images, so machines can perform image classification, image segmentation, object
detection, and image generation tasks accurately. It has greatly increased the
precision and effectiveness of computer vision algorithms, enabling a variety of uses
in industries including healthcare, transportation, and entertainment.
Deep learning has made significant advancements in various fields, but there are still some
challenges that need to be addressed. Here are some of the main challenges in deep
learning:
1. Data availability: Deep learning requires large amounts of data to learn from, so
gathering enough data for training is a big concern.
4. Interpretability: Deep learning models are complex and work like a black box, so it is very
difficult to interpret their results.
5. Overfitting: When the model is trained over and over, it becomes too specialized for
the training data, leading to overfitting and poor performance on new data.
The concept of artificial neural networks comes from biological neurons found in animal
brains, so the two share many similarities in structure and function.
Learning: In biological neurons, learning is achieved through synaptic plasticity, the
ability of synapses to strengthen or weaken over time in response to increases or
decreases in their activity. In artificial neural networks, the learning process is
called backpropagation, which adjusts the weights between the nodes based on the
difference, or cost, between the predicted and actual outputs.
Activation: In biological neurons, activation is the firing of the neuron, which happens
when the signals processed in the cell body, or soma, are strong enough to reach the
threshold and generate an action potential that travels along the axon. In artificial
neural networks, activation is performed by mathematical functions known as
activation functions, which map the input to the output.
Biological neurons to Artificial neurons
Deep learning can be used for supervised, unsupervised, as well as reinforcement machine
learning, and it uses a variety of architectures and training setups to handle each of these.
8. What is a Perceptron?
The perceptron is one of the simplest artificial neural network architectures. It was introduced by
Frank Rosenblatt in 1957. It is the simplest type of feedforward neural network, consisting
of a single layer of input nodes that are fully connected to a layer of output nodes, and it can
learn linearly separable patterns. It uses a slightly different type of artificial neuron
known as the threshold logic unit (TLU), first introduced by McCulloch and Walter Pitts in
the 1940s. A TLU computes the weighted sum of its inputs and then applies a step function to
compare this weighted sum to a threshold; the most common step function used in
perceptrons is the Heaviside step function.
A perceptron has a single layer of threshold logic units, with each TLU connected to all
inputs. When all the neurons in a layer are connected to every neuron of the previous layer,
it is known as a fully connected layer or dense layer. During training, the weights of the
perceptron are adjusted to minimize the difference between the actual and predicted values
using the perceptron learning rule, i.e., w_i = w_i + η (y − ŷ) x_i.
Here, x_i and w_i are the i-th input feature and the weight of the i-th input feature, η is the
learning rate, and y and ŷ are the actual and predicted outputs.
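A tiny sketch of this learning rule applied to the logical AND function is shown below; the learning rate and number of epochs are arbitrary choices, and AND is used because it is linearly separable.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                          # logical AND targets

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(10):
    for x_i, target in zip(X, y):
        weighted_sum = np.dot(w, x_i) + b
        y_pred = 1 if weighted_sum >= 0 else 0      # Heaviside step function
        error = target - y_pred
        w += lr * error * x_i                       # w_i = w_i + eta * (y - y_hat) * x_i
        b += lr * error

print(w, b)
print([1 if np.dot(w, x) + b >= 0 else 0 for x in X])   # [0, 0, 0, 1]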
The differences between the single-layer perceptron and multilayer perceptron are as
follows:
Architecture: A single-layer perceptron has only one layer of neurons, which takes
the input and produces an output. While a multilayer perceptron has one or more
hidden layers of neurons between the input and output layers.
Complexity: A single-layer perceptron is a simple linear classifier that can only learn
linearly separable patterns. While a multilayer perceptron can learn more complex
and nonlinear patterns by using nonlinear activation functions in the hidden layers.
A feedforward neural network (FNN) is a type of artificial neural network, in which the
neurons are arranged in layers, and the information flows only in one direction, from the
input layer to the output layer, without any feedback connections. The term “feedforward”
means information flows forward through the neural network in a single direction from the
input layer through one or more hidden layers to the output layer without any loops or cycles.
In a feedforward neural network (FNN) the weight is updated after the forward pass. During
the forward pass, the input is fed and it computes the prediction after the series of nonlinear
transformations to the input. then it is compared with the actual output and errors are
calculated.
During the backward pass also known as backpropagation, Based on the differences, the
error is first propagated back to the output layer, where the gradient of the loss function with
respect to the output is computed. This gradient is then propagated backward through the
network to compute the gradient of the loss function with respect to the weights and biases
of each layer. Here chain rules of calculus are applied with respect to weight and bias to find
the gradient. These gradients are then used to update the weights and biases of the network
so that it can improve its performance on the given task.
Originally developed for use in video games and other graphical applications, GPUs have
grown in significance in a number of disciplines, such as artificial intelligence, machine
learning, and scientific research, where they are used to speed up computationally
demanding tasks like training deep neural networks.
One of the main benefits of GPUs is their capacity for parallel computation, which uses a
significant number of processing cores to speed up complicated calculations. Since machine
learning and other data-driven applications rely heavily on high-dimensional data manipulations
and matrix operations, GPUs are particularly well suited for these workloads.
12. What are the different layers in ANN? What is the notation for representing a node
of a particular layer?
There are commonly three different types of layers in an artificial neural network (ANN):
Input Layer: This is the layer that receives the input data and passes it on to the
next layer. The input layer is typically not counted as one of the hidden layers of the
network.
Hidden Layers: These are the layers between the input and output layers. They perform
intermediate computations on the data they receive, extracting and transforming features;
a network can have one or more hidden layers.
Output Layer: This is the output-producing layer of the network. A binary
classification problem might only have one output neuron, but a multi-class
classification problem might have numerous output neurons, one for each class. The
number of neurons in the output layer depends on the type of problem being solved.
A node is commonly denoted a_i^(l), meaning the i-th node of layer l. For
instance, the input layer's first node may be written as a_1^(0), whereas the third hidden
layer's second node might be written as a_2^(3). With this notation, it is simple to refer to
specific network nodes and to understand the structure of the network as a whole.
In deep learning and neural networks, In the forward pass or propagation, The input data
propagates through the input layer to the hidden layer to the output layer. During this
process, each layer of the neural network performs a series of mathematical operations on
the input data and transfers it to the next layer until the output is generated.
Once the forward propagation is complete, the backward propagation, also known
as backpropagation or back prop, is started. During the backward pass, the generated
output is compared to the actual output and based on the differences between them the
error is measured and it is propagated backward through the neural network layer. Where
the gradient of the loss function with respect to the output is computed. This gradient is then
propagated backward through the network to compute the gradient of the loss function with
respect to the weights and biases of each layer. Here chain rules of calculus are applied with
respect to weight and bias to find the gradient. These gradients are then used to update the
weights and biases of the network so that it can improve its performance on the given task.
In simple terms, the forward pass involves feeding input data into the neural network to
produce an output, while the backward pass refers to utilizing the output to compute the
error and modify the network’s weights and biases.
The cost function is the mathematical function that is used to measure the quality of
prediction during training in deep neural networks. It measures the differences between the
generated output of the forward pass of the neural network to the actual outputs, which are
known as losses or errors. During the training process, the weights of the network are
adjusted to minimize the losses. which is achieved by computing the gradient of the cost
function with respect to weights and biases using backpropagation algorithms.
The cost function is also known as the loss function or objective function. In deep learning,
different types of cost functions are used depending on the type of problem and
neural network involved. Some of the common cost functions are as follows:
Binary Cross-Entropy for binary classification measures the difference between the
predicted probability of the positive outcome and the actual outcome.
Mean Squared Error for regression to measure the average squared difference
between actual and predicted outputs.
15. What are activation functions in deep learning and where it is used?
Deep learning uses activation functions, which are mathematical operations that are
performed on each neuron’s output in a neural network to provide nonlinearity to the
network. The goal of activation functions is to inject non-linearity into the network so that it
can learn the more complex relationships between the input and output variables.
In other words, the activation function in a neural network takes the output of the preceding
linear operation (which is usually the weighted sum of the input values, i.e., w*x + b) and maps it
to a desired range, because repeatedly applying the weighted sum (i.e., w*x + b) alone would only
ever produce a linear function. The activation function transforms the linear output into a
non-linear output, which makes the neural network capable of approximating more complex tasks.
To compute the gradients of the loss function with respect to the network weights during backpropagation, activation functions must be differentiable. This allows the network to use gradient descent or other optimization techniques to find the weights that minimize the loss function.
Although several activation functions, such as ReLU and Hardtanh, have points where the derivative is not defined, they are still differentiable almost everywhere. At these points the gradient is conventionally set to zero or to a small value, which does not have a substantial impact on the network's overall training.
16. What are the different types of activation functions used in deep learning?
In deep learning, several types of activation functions are used, each with its own strengths and weaknesses. Some of the most common activation functions are as follows.
Sigmoid function: It maps any input value to a value between 0 and 1. It is mainly used in binary classification problems, where it maps the output of the preceding layer to a probability value.
Softmax function: It is the extension of the sigmoid function used for multi-class classification problems in the output layer of the neural network. It maps the output of the previous layer to a probability distribution across the classes, giving each class a probability value between 0 and 1, with the probabilities over all classes summing to 1. The class with the highest probability is taken as the predicted class.
ReLU (Rectified Linear Unit) function: It is a non-linear function that returns the
input value for positive inputs and 0 for negative inputs. Deep neural networks
frequently employ this function since it is both straightforward and effective.
Leaky ReLU function: It is similar to the ReLU function, but it adds a small slope for
negative input values to prevent dead neurons.
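A minimal NumPy sketch of these four functions, purely for illustration:

```python
# NumPy sketches of the activation functions listed above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes any value into (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))               # subtract the max for numerical stability
    return e / e.sum()                       # probabilities that sum to 1

def relu(x):
    return np.maximum(0, x)                  # passes positives, zeroes out negatives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # small slope for negative inputs

z = np.array([-2.0, 0.5, 3.0])
print(sigmoid(z), softmax(z), relu(z), leaky_relu(z))
```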
Backpropagation is the method used while training a neural network to adjust its weights and biases. It computes the gradient of the cost function with respect to the parameters of the neural network and then updates the parameters in the opposite direction of the gradient using an optimization algorithm, with the aim of minimizing the loss.
During training, the forward pass passes the input data through the network and generates an output. The cost function then compares this generated output with the actual output, and backpropagation computes the gradient of the cost function with respect to the output of the neural network. This gradient is then propagated backward through the network to compute the gradient of the loss function with respect to the weights and biases of each layer, applying the chain rule of differentiation to the parameters of each layer.
Once the gradients are computed, an optimization algorithm is used to update the parameters of the network. Some of the most common optimization algorithms are stochastic gradient descent (SGD), mini-batch gradient descent, etc.
The goal of the training process is to minimize the cost function by adjusting the weights and
biases during the backpropagation.
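A minimal PyTorch training-loop sketch of this cycle, with illustrative sizes and random data:

```python
# Forward pass, backpropagation, and an SGD parameter update (illustrative sketch).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 10)
y = torch.randn(64, 1)

for step in range(100):
    optimizer.zero_grad()         # clear gradients from the previous iteration
    loss = loss_fn(model(x), y)   # forward pass + cost function
    loss.backward()               # backpropagation: gradients via the chain rule
    optimizer.step()              # move parameters opposite to the gradient
```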
18. How are the number of hidden layers and the number of neurons per hidden layer
selected?
There is no one-size-fits-all answer to this question, so choosing the number of hidden layers and neurons per hidden layer in a neural network usually depends on practical observation and experimentation. There are, however, a few general principles and heuristics that can be used as a starting point.
The number of hidden layers can be guided by the complexity of the problem being solved. Simple problems can often be solved with just one hidden layer, whereas more complicated problems may require two or more hidden layers. However, adding more layers also increases the risk of overfitting, so the number of layers should be chosen based on the trade-off between model complexity and generalization performance.
The number of neurons per hidden layer can be determined based on the number of
input features and the desired level of model complexity. There is no hard and fast
rule, and the number of neurons can be adjusted based on the results of
experimentation and validation.
In practice, it is often useful to start with a simple model and gradually increase its
complexity until the desired performance is achieved. This process can involve adding more
hidden layers or neurons or experimenting with different architectures and hyperparameters.
It is also important to regularly monitor the training and validation performance to detect
overfitting and adjust the model accordingly.
Overfitting is a problem in machine learning that occurs when a model fits the training data too closely, to the point that it starts picking up noise and unimportant patterns. Because of this, the model performs well on training data but badly on fresh, unseen data, resulting in poor generalization performance.
1. Simplify the model: Overfitting is less likely in a simpler model with fewer layers and parameters. In practice, it is often beneficial to begin with a simple model and progressively increase its complexity until the desired performance is attained.
A complete cycle of training a deep learning model on the entire training dataset is called an epoch. During a single epoch, every training sample in the dataset is processed by the model, and its weights and biases are adjusted in response to the computed loss or error. The number of epochs is a hyperparameter set by the user and is always a positive integer.
Iteration refers to the procedure of running a batch of data through the model, figuring out
the loss, and changing the model’s parameters. Depending on the number of batches in the
dataset, one or more iterations can be possible within a single epoch.
A batch in deep learning is a subset of the training data that is used to update the weights of a model during training. In batch training, the entire training set is divided into smaller groups, and the model is updated after processing each batch. An epoch can be made up of one or more batches.
In mini-batch training, the batch size is greater than one and less than the total number of samples. Batch size is a hyperparameter set by the user, and the number of iterations per epoch is calculated by dividing the total number of training samples by the batch size.
Deep learning training datasets are often separated into smaller batches, and the model
analyses each batch sequentially, one at a time, throughout each epoch. On the validation
dataset, the model performance can be assessed after each epoch. This helps in monitoring
the model’s progress.
For example: Let’s use 5000 training samples in the training dataset. Furthermore, we want
to divide the dataset into 100 batches. If we choose to use five epochs, the total number of
iterations will be as follows:
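For clarity, the same arithmetic written out in a few lines of Python:

```python
# The epoch/batch/iteration arithmetic from the example above.
total_samples = 5000
num_batches = 100                                    # batches per epoch
epochs = 5

batch_size = total_samples // num_batches            # 50 samples per batch
iterations_per_epoch = total_samples // batch_size   # 100 iterations per epoch
total_iterations = iterations_per_epoch * epochs     # 500 iterations in total
print(batch_size, iterations_per_epoch, total_iterations)
```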
The learning rate in deep learning is a hyperparameter that controls how much the optimizer adjusts the neural network's weights during training. It determines the step size with which the optimizer updates the model parameters with respect to the loss function, so that the loss can be minimized during training.
With a high learning rate, the model may converge quickly, but it may also overshoot or bounce around the optimal solution. On the other hand, a low learning rate makes the model converge more slowly, but it may produce a more accurate solution.
Choosing the appropriate learning rate is crucial for the successful training of deep neural
networks.
Cross-entropy is the commonly used loss function in deep learning for classification
problems. The cross-entropy loss measures the difference between the real probability
distribution and the predicted probability distribution over the classes.
The formula for the cross-entropy loss over K classes, for a single instance, is:
Loss = - Σk=1..K yk · log(ŷk)
Here, yk and ŷk are the actual and predicted values for class k of a single instance, where k is one of the K classes.
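A small NumPy sketch of this loss for a single instance; the label and predicted probabilities below are made up.

```python
# K-class cross-entropy for one instance (illustrative values).
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot vector over K classes, y_pred: predicted probabilities over K classes
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))

y_true = np.array([0, 1, 0])          # the true class is class 2
y_pred = np.array([0.2, 0.7, 0.1])    # softmax output of the network
print(cross_entropy(y_true, y_pred))  # -log(0.7) ≈ 0.357
```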
Gradient descent is the core of the learning process in machine learning and deep learning. It is the method used to minimize the cost or loss function by iteratively adjusting the model parameters, i.e. the weights and biases of the network layers. The objective is to reduce the disparity between the model's predicted output and the actual output, which is what the cost function measures.
The gradient of a function is the vector of its partial derivatives with respect to its inputs; it points in the direction of steepest ascent, while the negative gradient points in the direction of steepest descent.
In deep learning, the gradient is the vector of partial derivatives of the objective or cost function with respect to the model parameters, i.e. the weights and biases. This gradient is used to update the parameters in the direction of the negative gradient so that the cost function decreases and the performance of the model improves. The magnitude of the update is determined by the learning rate, which controls the step size of the update.
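A tiny worked example of the update rule w ← w - learning_rate × gradient, using an illustrative one-parameter quadratic loss:

```python
# Gradient descent on a single parameter (the quadratic loss is just for illustration).
def loss(w):
    return (w - 3.0) ** 2          # minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)          # derivative of the loss w.r.t. w

w, lr = 0.0, 0.1
for _ in range(50):
    w = w - lr * grad(w)            # step in the direction of the negative gradient
print(round(w, 4))                  # close to 3.0
```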
Regularization
Data augmentation
Transfer learning
Hyperparameter tuning
There are several variants of gradient descent that differ in the way the step size or learning
rate is chosen and the way the updates are made. Here are some popular variants:
Stochastic Gradient Descent (SGD): In SGD, only one training example is used to
compute the gradient and update the parameters at each iteration. This can be faster
than batch gradient descent but may lead to more noise in the updates.
There are several different types of neural networks used in deep learning. Some of the most important neural network architectures are as follows:
7. Attention Mechanism
9. Transformers
27. What is the difference between Shallow Networks and Deep Networks?
Deep networks and shallow networks are two types of artificial neural networks that can
learn from data and perform tasks such as classification, regression, clustering, and
generation.
Shallow networks: A shallow network has a single hidden layer between the input and output layers, whereas a deep network has several hidden layers. Because shallow networks have fewer parameters, they are easier to train and less computationally expensive than deep networks. They are appropriate for basic or low-complexity tasks where the input-output relationships are relatively straightforward and do not require extensive feature representation.
Deep Networks: Deep networks, also known as deep neural networks, are identified by the presence of many hidden layers between the input and output layers. The multiple layers enable deep networks to learn hierarchical data representations, capturing detailed patterns and characteristics at different levels of abstraction. They have a higher capacity for feature extraction, can learn more complex and nuanced relationships in the data, and have produced state-of-the-art results in many machine learning and AI tasks.
A deep learning framework is a collection of software libraries and tools that makes developing and training deep learning models easier. It offers a high-level interface for creating and training deep neural networks, in addition to lower-level abstractions for implementing custom functions and topologies. TensorFlow, PyTorch, Keras, Caffe, and MXNet are a few of the well-known deep learning frameworks.
Deep neural networks experience the vanishing or exploding gradient problem when the gradients of the cost function with respect to the parameters of the model become either too small (vanishing) or too large (exploding) during training.
In the case of vanishing gradients, the adjustments to the weights and biases made during the backpropagation phase become so small that they are no longer meaningful. As a result, the model may perform poorly because it fails to pick up on key aspects of the data.
In the case of exploding gradients, the updates to the weights and biases get so large that the model overshoots its optimal values and fails to converge to a reasonable solution.
Techniques such as careful weight initialization, normalization methods, and careful selection of activation functions can be used to deal with these problems.
Gradient clipping is a technique used to prevent the exploding gradient problem during the
training of deep neural networks. It involves rescaling the gradient when its norm exceeds a
certain threshold. The idea is to clip the gradient, i.e., set a maximum value for the norm of
the gradient, so that it does not become too large during the training process. This technique
ensures that the gradients don't become too large and prevents the model from diverging.
Gradient clipping is commonly used in recurrent neural networks (RNNs) to prevent the
exploding gradient problem.
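A minimal sketch of gradient clipping using PyTorch's built-in utility; the model, data, and max_norm value are illustrative assumptions.

```python
# Gradient clipping during one training step of an RNN-style model (illustrative).
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 20, 8)        # batch of 4 sequences, 20 time steps
out, _ = model(x)
loss = out.pow(2).mean()          # dummy loss just to create gradients

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if the norm exceeds 1.0
optimizer.step()
```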
The amount of the previous gradient that should be integrated into the current update is
determined by the momentum term, a hyperparameter. While a low momentum number
makes the model more sensitive to changes in gradient direction, a high momentum value
indicates that the model will continue to move in the same direction for longer periods of
time.
Zero Initialization: As the name suggests, the initial value of each weight is set to
zero during initialization. As a result, all of their derivatives with respect to the loss
function are identical, resulting in the same value for each weight in subsequent
iterations. The hidden units are also symmetric as a consequence, which may cause
training to converge slowly or perhaps prohibit learning altogether.
Xavier Initialization: It sets the initial weights to be drawn from a distribution with a mean of zero and a variance of 1/fan_avg, where fan_avg = (fan_in + fan_out)/2 and fan_in and fan_out are the numbers of input and output neurons of the layer. This method is commonly used with activation functions like the sigmoid, softmax, or tanh functions. It is also known as Glorot Initialization.
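A small sketch of Glorot/Xavier initialization, done both by hand and with PyTorch's built-in initializer; the layer sizes are illustrative.

```python
# Glorot/Xavier initialization: weights drawn with variance 1/fan_avg (illustrative sizes).
import torch
import torch.nn as nn

fan_in, fan_out = 256, 128
fan_avg = (fan_in + fan_out) / 2
std = (1.0 / fan_avg) ** 0.5                    # variance 1/fan_avg -> std = sqrt(1/fan_avg)
w_manual = torch.randn(fan_out, fan_in) * std   # weights drawn from N(0, 1/fan_avg)

layer = nn.Linear(fan_in, fan_out)
nn.init.xavier_normal_(layer.weight)            # PyTorch's Glorot/Xavier normal initializer
```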
This can be done by replacing the output layer of the pre-trained model with a new layer that
is suitable for our problem or freezing some of the layers of the pre-trained model and only
training the remaining layers on the new task or dataset. The goal is to modify the pre-
trained network’s weights by further training in order to adapt it to the new dataset and task.
This procedure enables the network to learn the important characteristics of the new task.
The basic objective of fine-tuning is to adapt the pre-trained network to the new task and dataset. This may involve changing the network design or modifying hyperparameters like the learning rate.
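A minimal sketch of this idea; the use of torchvision's ResNet-18 as the pre-trained model and the 10-class output layer are illustrative assumptions.

```python
# Fine-tuning sketch: freeze the pre-trained layers and replace the output layer.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")    # pre-trained on a large dataset

for param in model.parameters():
    param.requires_grad = False                      # freeze the pre-trained layers

model.fc = nn.Linear(model.fc.in_features, 10)       # new output layer for the new task
# Only model.fc's parameters will be updated during further training.
```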
Batch Normalization is a technique used in deep learning to help prevent vanishing/exploding gradient problems. It normalizes and scales the inputs before or after the activation function of each hidden layer, so that the distribution of the inputs has zero mean and a standard deviation of 1. It computes the mean and standard deviation of each mini-batch of inputs and uses them for the normalization, which is why it is known as batch normalization.
Because the weights of the layer must be changed to adjust for the new distribution, it can
be more difficult for the network to learn when the distribution of inputs to a layer changes.
This can result in a slower convergence and less precision. By normalizing the inputs to
each layer, batch normalization reduces internal covariate shifts. This helps the network to
learn more effectively and converge faster by ensuring that the distribution of inputs to each
layer stays consistent throughout training.
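A minimal PyTorch sketch of inserting batch normalization after a hidden layer; the sizes are illustrative.

```python
# Batch normalization after a hidden layer (illustrative sizes).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalize each mini-batch to zero mean and unit standard deviation
    nn.ReLU(),
    nn.Linear(64, 2),
)
x = torch.randn(32, 20)   # batch of 32 samples
print(model(x).shape)     # torch.Size([32, 2])
```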
Dropout is one of the most popular regularization techniques used in deep learning to
prevent overfitting. The basic idea behind this is to randomly drop out or set to zero some of
the neurons of the previously hidden layer so that its contribution is temporarily removed
during the training for both forward and backward passes.
In each iteration, the neurons to drop are selected randomly and their outputs are set to zero, so they do not affect the downstream neurons of the next layer during the forward pass. During backpropagation, there is no weight update for these randomly selected neurons in the current iteration. In this way, a subset of randomly selected neurons is completely ignored during that particular iteration.
This forces the network to learn more robust features and prevents overfitting when the network is complex enough to capture noise during training.
During testing, all the neurons are used, and their outputs are scaled by the keep probability so that the overall behaviour of the network is consistent with what was seen during training (many implementations instead scale the retained activations up during training, known as inverted dropout).
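A minimal PyTorch sketch of dropout in training versus evaluation mode; note that PyTorch uses inverted dropout, scaling the surviving activations during training.

```python
# Dropout behaviour in train() vs eval() mode.
import torch
import torch.nn as nn

layer = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

layer.train()
print(layer(x))    # roughly half the values are zeroed, the rest are scaled to 2.0

layer.eval()
print(layer(x))    # all ones: dropout is disabled at test time
```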
Convolutional Neural Networks (CNNs) are the type of neural network commonly used for
Computer Vision tasks like image processing, image classification, object detection, and
segmentation tasks. It applies filters to the input image to detect patterns, edges, and
textures and then uses these features to classify the image.
A CNN is a type of feedforward neural network (FNN) used to extract features from grid-like data by applying different types of filters, also known as kernels, for example to visual data such as images or videos where spatial patterns play an extensive role. It uses a process known as convolution to extract features from images.
It is composed of multiple layers including the convolution layer, the pooling layer, and the
fully connected layer. In the convolutional layers, useful features are extracted from the input
data by applying a kernel, The kernel value is adjusted during the training process, and it
helps to identify patterns and structures within the input data.
The pooling layers then reduce the spatial dimensionality of the feature maps, making them
more manageable for the subsequent layers. Finally, the fully connected layers use the
extracted features to make a prediction or classification.
In CNNs, the convolutional layer is used to extract features from the input data. It processes the input images using a set of learnable filters known as kernels, which are usually small, such as 2×2, 3×3, or 5×5. It computes the dot product between the kernel weights and the corresponding patch of the input image as the kernel slides over the input data. The output of this layer is referred to as a feature map.
Convolution is an effective method because it enables CNN to extract local features while
keeping the spatial relationships between the features in the input data. This is especially
helpful in the processing of images where the location of features within an image is often
just as important as the features themselves.
In order to extract the most relevant features from the input data, during the training process,
the value in the kernel is optimized. When the kernel is applied to the input data, it moves
over the data in the form of a sliding window, performing element-wise multiplication at each
position and adding the results to create a single output value.
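A minimal sketch of a single convolutional layer in PyTorch; the channel counts, kernel size, and input size are illustrative.

```python
# One convolutional layer applied to one image-like input (illustrative sizes).
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 32, 32)        # one RGB image of size 32x32
feature_maps = conv(x)                # kernels slide over the image, computing dot products
print(feature_maps.shape)             # torch.Size([1, 16, 32, 32]) with padding of 1
```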
Stride is the number of pixels or units that a kernel is moved across the input data while
performing convolution operations in Convolutional Neural Networks (CNNs). It is one of the
hyperparameters of a CNN that can be manipulated to control the output feature map’s size.
During the forward pass, we slide each filter through the entire input image matrix step by
step, where each step is known as stride (which can have a value of 2, 3, or even 4 for high-
dimensional images), and we compute the dot product between the kernel weights and
patch from input volume.
The pooling layer is a type of layer that usually comes after one or more convolutional layers in convolutional neural networks (CNNs). The primary objective of the pooling layer is to reduce the spatial dimensionality of the feature maps while maintaining the most crucial characteristics produced by the convolution operations. Reducing the spatial dimensions makes computation faster, reduces memory usage, and also helps prevent overfitting. Pooling also makes the features more invariant to small translations in the input data, which can improve the model's robustness to changes in the input.
Two common types of pooling layers are max pooling and average pooling. In max pooling,
the maximum value within each subregion is selected and propagated to the output feature
map. In average pooling, the average value within each subregion is calculated and used as
the output value.
There are two main types of padding: same padding and valid padding.
Same Padding: The term “same padding” describes the process of adding padding
to an image or feature map such that the output has the same spatial dimensions as
the input. The same padding adds additional rows and columns of pixels around the
edges of the input data so that the size of the output feature map will be the same as
the size of the input data. This is achieved by adding rows and columns of pixels with
a value of zero around the edges of the input data before the convolution operation.
Valid Padding: Convolutional neural networks (CNNs) employ the valid padding
approach to analyze the input data without adding any extra rows or columns of
pixels around the input data’s edges. This means that the size of the output feature
map is smaller than the size of the input data. Valid padding is used when it is
desired to reduce the size of the output feature map in order to reduce the number of
parameters in the model and improve its computational efficiency.
42. Write the formula for finding the output shape of a Convolutional Neural Network model.
The general formula for calculating the output size of a convolutional layer is:
Output size = ((W - K + 2P) / S) + 1 (rounded down)
Where,
W = input size
K = kernel (filter) size
P = padding
S = stride
With "same" padding, enough padding is added so that the output size equals ceil(W / S), i.e. the spatial size is preserved when the stride is 1.
Data augmentation is a technique used in deep learning during preprocessing to introduce small variations into the training dataset, so that the model can improve its generalization ability by seeing a greater variety of data. It is also used to increase the number of training samples by creating modified versions of the original dataset.
In CNNs, data augmentation is often carried out by randomly applying a series of image transformations to the original training images, such as the following:
Rotation
Scaling
Flipping
Cropping
Shearing
Translation
Adding noise
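A sketch of such a pipeline using torchvision.transforms; the particular transforms and parameter values are illustrative choices.

```python
# Common image augmentations applied on the fly to training images (illustrative choices).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                               # rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),                 # scaling + cropping
    transforms.RandomHorizontalFlip(p=0.5),                              # flipping
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), shear=10),  # translation + shearing
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # applied to each training image during loading
```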
Deconvolution is a deep learning method for upscaling feature maps in a convolutional neural network (CNN). During convolution, the kernel slides over the input to extract the important features and shrinks the output, while in deconvolution the kernel slides over the output to generate a larger, more detailed output. Briefly, we can say that deconvolution is the opposite of the convolution operation.
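A minimal sketch using PyTorch's transposed convolution layer, which is the usual way this upscaling is implemented; the sizes are illustrative.

```python
# Transposed ("deconvolution") layer upscaling a feature map (illustrative sizes).
import torch
import torch.nn as nn

deconv = nn.ConvTranspose2d(in_channels=16, out_channels=8, kernel_size=2, stride=2)
x = torch.randn(1, 16, 7, 7)       # small feature map
print(deconv(x).shape)              # torch.Size([1, 8, 14, 14]), i.e. spatially twice as large
```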
45. What is the difference between object detection and image segmentation?
Object detection and image segmentation are both computer vision tasks used to analyze
and understand images, but they differ in their goals and output.
46. What are Recurrent Neural Networks (RNNs) and how do they work?
Recurrent Neural Networks are a type of artificial neural network specifically designed to work with sequential or time-series data. They are widely used in natural language processing tasks like language translation, speech recognition, sentiment analysis, natural language generation, and summarization. RNNs differ from feedforward neural networks in that the input data does not only flow in a single direction; the architecture also contains a loop or cycle that acts as a "memory" and preserves information over time. This makes RNNs well suited to data where context is important, such as natural language.
The basic concept of RNNs is that they analyze input sequences one element at a time while maintaining a hidden state that contains a summary of the sequence's previous elements. The hidden state is updated at each time step based on the current input and the previous hidden state. This allows RNNs to capture the temporal dependencies between elements of the sequence and use that information to make predictions.
Working: The fundamental component of an RNN is the recurrent unit, which takes as inputs the current input vector and the previous hidden state and produces a new hidden state as output. This hidden state is then used as input at the next time step in the sequence. An RNN can be expressed mathematically as a sequence of equations that update the hidden state at each time step:
ht = f(U·ht-1 + W·xt + b)
Where,
ht = hidden state at time t
ht-1 = hidden state at the previous time step
xt = input at time t
U, W = weight matrices
b = bias
f = activation function
And the output of the RNN at each time step will be:
yt = g(V·ht + c)
Where,
yt = output at time t
V = weight matrix
c = bias
g = activation function
Here, W, U, V, b, and c are the learnable parameters, and they are optimized during backpropagation.
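A NumPy sketch of this recurrence with random, illustrative parameters:

```python
# Unrolling the RNN recurrence ht = f(U·ht-1 + W·xt + b), yt = g(V·ht + c).
import numpy as np

T, input_dim, hidden_dim, output_dim = 5, 4, 8, 3
U = np.random.randn(hidden_dim, hidden_dim)   # hidden-to-hidden weights
W = np.random.randn(hidden_dim, input_dim)    # input-to-hidden weights
V = np.random.randn(output_dim, hidden_dim)   # hidden-to-output weights
b = np.zeros(hidden_dim)
c = np.zeros(output_dim)

x = np.random.randn(T, input_dim)             # an input sequence of length T
h = np.zeros(hidden_dim)                      # initial hidden state
for t in range(T):
    h = np.tanh(U @ h + W @ x[t] + b)         # update the hidden state
    y = V @ h + c                             # output at time t (g = identity here)
```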
Backpropagation through time (BPTT) is a technique for updating the weights of a recurrent neural network (RNN) over time by applying the backpropagation algorithm to the unfolded network. It enables the network to learn from the data's temporal dependencies and adapt its behaviour accordingly.
Forward Pass: The input sequence is fed into the RNN one element at a time, starting from the first element. Each input element is processed through the recurrent connections, and the hidden state of the RNN is updated.
1. Given a sequence of inputs and outputs, the RNN is unrolled into a feed-forward
network with one layer per time step.
2. The network of the RNN is initialized with some initial hidden state that contains
information about the previous inputs and hidden states in the sequence. It computes
the outputs and the hidden states for each time step by applying the recurrent
function.
3. The network computes the difference between the predicted and expected outputs
for each time step and adds it up across the entire series.
4. The gradients of the error with respect to the weights are calculated by the network
by applying the chain rule from the last time step to the first time step, propagating
the error backwards through time. The loss is then backpropagated through time,
starting from the last time step and moving backwards in time. So, this is known as
Backpropagation through time (BPTT).
5. The network’s weights are updated using an optimization algorithm, such as gradient
descent or its variants, which takes gradients and a learning rate into account.
During the backpropagation process, the gradients at each time step are obtained and used
to update the weights of the recurrent networks. The accumulation of gradients over multiple
time steps enables the RNN to learn and capture dependencies and patterns in sequential
data.
LSTM stands for Long Short-Term Memory. It is a modified version of the RNN (Recurrent Neural Network) designed to address the vanishing and exploding gradient problems that can occur during the training of traditional RNNs. An LSTM selectively remembers and forgets information over multiple time steps, which gives it a great edge in capturing long-term dependencies in the input sequence.
RNN has a single hidden state that passes through time, which makes it difficult for the
network to learn long-term dependencies. To address this issue LSTM uses a memory cell,
which is a container that holds information for an extended period of time. This memory cell
is controlled by three gates i.e. input gate, forget gate, and the output gate. These gates
regulate which information should be added, removed, or output from the memory cell.
LSTMs function by selectively passing or retaining information from one time step to the next using a combination of memory cells and gating mechanisms. The LSTM cell is made up of a number of parts, such as:
Cell state (C): This is where the data from the previous step is kept in the LSTM’s
memory component. It is passed through the LSTM cell via gates that control the
flow of information into and out of the cell.
Hidden state (h): This is the output of the LSTM cell, which is a transformed version
of the cell state. It can be used to make predictions or be passed on to another
LSTM cell later on in the sequence.
Forget gate (f): The forget gate removes information that is no longer relevant from the cell state. The gate receives two inputs, xt (the input at the current time step) and ht-1 (the previous hidden state), which are multiplied by weight matrices, and a bias is added. The result is passed through a sigmoid activation function, which outputs a value between 0 and 1 for each element of the cell state (values close to 0 mean forget, values close to 1 mean keep).
Input Gate(i): The input gate uses as input the current input and the previous hidden
state and applies a sigmoid activation function to determine which parts of the input
should be added to the cell state. The output of the input gate (again a fraction
between 0 and 1) is multiplied by the output of the tanh block that produces the new
values that are added to the cell state. This gated vector is then added to the
previous cell state to generate the current cell state.
Output Gate(o): The output gate extracts the important information from the current
cell state and delivers it as output. First, The tanh function is used in the cell to
create a vector. Then, the information is regulated using the sigmoid function and
filtered by the values to be remembered using inputs ht-1 and xt. At last, the values of
the vector and the regulated values are multiplied to be sent as an output and input
to the next cell.
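A minimal sketch using PyTorch's built-in LSTM layer on a dummy batch of sequences; all sizes are illustrative.

```python
# Running an LSTM over a batch of sequences (illustrative sizes).
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(4, 15, 10)            # batch of 4 sequences, 15 time steps, 10 features

output, (h_n, c_n) = lstm(x)
print(output.shape)   # torch.Size([4, 15, 20])  hidden state at every time step
print(h_n.shape)      # torch.Size([1, 4, 20])   final hidden state
print(c_n.shape)      # torch.Size([1, 4, 20])   final cell state
```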
GRU stands for Gated Recurrent Unit. GRUs are recurrent neural networks (RNNs) that can process sequential data such as text, audio, or time series. A GRU uses gating mechanisms to control the flow of information in and out of the network, allowing it to learn from the temporal dependencies in the data and adjust its behaviour accordingly.
A GRU is similar to an LSTM in that it uses gating mechanisms, but it has a simpler architecture with fewer gates, making it computationally more efficient and easier to train. It uses two types of gates: the reset gate (r) and the update gate (z).
1. Reset Gate (r): It determines which parts of the previous hidden state should be forgotten or reset. It takes the previous hidden state and the current input as inputs and outputs a value between 0 and 1 for each element of the hidden state.
2. Update Gate (z): It decides which parts of the current hidden state should be updated with new information from the current input. It takes the previous hidden state and the current input as inputs and outputs a value between 0 and 1 for each element of the hidden state.
GRU models have been demonstrated to be useful in NLP applications such as language
modelling, sentiment analysis, machine translation, and text generation. They are especially
beneficial when it is critical to record long-term dependencies and grasp the context. GRU is
a popular choice in NLP research and applications due to its simplicity and computational
efficiency.
An encoder-decoder network is a kind of neural network that learns to map an input sequence to an output sequence that may have a different length and structure. It is made up of two primary parts: an encoder and a decoder.
Encoder: The encoder is a neural network that processes the input sequence (such as a sentence) and compresses it into a fixed-size encoded vector, often called the context vector, that summarizes the information in the input.
Decoder: The decoder is another neural network that takes the encoded vector as input and generates an output sequence (such as another sentence, an image, or a video) that is related to the input sequence. At each step, the decoder generates an output and updates its internal hidden state based on the encoded vector and the previously generated outputs.
The training process of an Encoder-Decoder network involves feeding pairs of input and
target sequences to the model and minimizing the difference between the predicted output
sequence and the true target sequence using a suitable loss function. Encoder-Decoder
networks are used for a variety of tasks, such as machine translation (translating text from
one language to another), text summarization, chatbots, and image captioning (turning
pictures into meaningful phrases).
Autoencoders are a type of neural network architecture used for unsupervised learning tasks
like dimensionality reduction, feature learning, etc. Autoencoders work on the principle of
learning a low-dimensional representation of high-dimensional input data by compressing it
into a latent representation and then reconstructing the input data from the compressed
representation. It consists of two main parts an encoder and a decoder. The encoder maps
an input to a lower-dimensional latent representation, while the decoder maps the latent
representation back to the original input space. In most cases, neural networks are used to implement the encoder and decoder, and they are trained together to minimize the difference between the original input data and the reconstructed data.
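A minimal PyTorch autoencoder sketch with illustrative layer sizes:

```python
# Autoencoder: encoder compresses the input to a small latent vector, decoder reconstructs it.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))

x = torch.randn(32, 784)                            # e.g. flattened 28x28 images
latent = encoder(x)                                 # low-dimensional representation
reconstruction = decoder(latent)                    # mapped back to the original space
loss = nn.functional.mse_loss(reconstruction, x)    # reconstruction error to minimize
```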
Generative Adversarial Networks (GANs) are a type of neural network architecture used for unsupervised learning tasks like image synthesis and generative modeling. A GAN is composed of two neural networks: a generator and a discriminator. The generator takes samples from a random distribution (typically a Gaussian) as input and generates synthetic data, while the discriminator takes both real and synthetic data as input and predicts whether each input is real or synthetic. The goal of the generator is to produce synthetic data that is indistinguishable from the real data, while the discriminator tries to tell real inputs apart from synthetic ones.
An attention mechanism is a neural network component, for example a separate attention layer within an encoder-decoder network, that allows the model to focus on certain parts of the input while executing a task. It accomplishes this by dynamically assigning weights to different input components, reflecting their relative importance or relevance. This selective attention enables the model to concentrate on key information, capture dependencies, and understand relationships in the data.
The attention mechanism is especially useful for tasks that involve sequential or structured data, such as natural language processing, where long-term dependencies and contextual information are critical for good performance. It allows the model to selectively attend to the important features or contexts, which increases its capacity to manage complicated relationships and dependencies in the data, resulting in better overall performance on various tasks.
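As one concrete and widely used formulation, here is a scaled dot-product attention sketch in PyTorch; the shapes are illustrative and this is not necessarily the exact variant described above.

```python
# Scaled dot-product attention: weights come from a softmax over query-key similarities.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of each query with each key
    weights = F.softmax(scores, dim=-1)             # attention weights over the input positions
    return weights @ V                              # weighted sum of the values

Q = torch.randn(1, 6, 32)   # 6 positions, dimension 32 (illustrative)
K = torch.randn(1, 6, 32)
V = torch.randn(1, 6, 32)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 6, 32])
```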
Feed-Forward Neural Networks: Following the attention layers, the model applies a
point-wise feed-forward neural network to each position separately. This enables the
model to learn complex non-linear correlations in the data.
Transfer learning is a machine learning approach in which the knowledge gained by training a model on one task is applied to another related task. The basic idea behind transfer learning is that a model trained on a big, diverse dataset learns broad features that are helpful for many different tasks, and it can then be modified or fine-tuned to perform a specific task with a smaller, more specific dataset.
1. Fine-tuning: Fine-tuning is used to adapt a pre-trained model that has already been
trained on a big dataset and refine it with further training on a new smaller dataset
that is specific to the present task. With fine-tuning, weights of the pre-trained model
can be adjusted according to the new present task while training on the new dataset.
This can improve the performance of the model on the new task.
2. Feature extraction: In this case, the features of the pre-trained model are extracted,
and these extracted features can be used as the input for the new model. This can
be useful when the new task involves a different input format than the original task.
5. One-shot learning: This involves applying information gained from previous tasks to
train a model on just one or a small number of samples of a new problem.
Deep learning techniques like distributed and parallel training are used to accelerate the
training process of bigger models. Through the use of multiple computing resources,
including CPUs, GPUs, or even multiple machines, these techniques distribute the training
process in order to speed up training and improve scalability.
When storing a complete dataset or model on a single machine is not feasible, multiple
machines must be used to store the data or model. When the model is split across multiple
machines, then it is known as model parallelism. In model parallelism, different parts of the
model are assigned to different devices or machines. Each device or machine is responsible
for computing the forward and backward passes for the part of the model assigned to it.
When the data is so big that it must be distributed across multiple machines, this is known as data parallelism. Distributed training is used to train the model simultaneously on multiple devices, each of which processes a separate portion of the data. The results are then combined to update the model parameters, which speeds up convergence and improves the performance of the model.
Parallel training involves training multiple instances of the same model on different devices or machines. Each instance trains on a different subset of the data, and the results are combined periodically to update the model parameters. This technique can be particularly useful for training very large models or dealing with very large datasets.
Both parallel and distributed training need specialized hardware and software configurations,
and performance may benefit from careful optimization. However, they may significantly cut
down on the amount of time needed to train deep neural networks.
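A minimal sketch of data parallelism with PyTorch's nn.DataParallel; the model and data are illustrative, and DistributedDataParallel is generally preferred for serious workloads.

```python
# Simple data parallelism: the model is replicated on each GPU and every replica
# processes a different slice of the input batch (illustrative sketch).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # splits each input batch across the GPUs
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(512, 100).to(next(model.parameters()).device)
out = model(x)                          # each device computes on its share of the batch
```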