120 Important Deep Learning Interview Questions + Answers
Notes
-- Amar Sharma
Important Deep Learning Interview Questions with Answers
1. What is deep learning? How is it different from machine
learning?
Answer:
Deep learning is a subset of machine learning that uses
neural networks with multiple layers to automatically learn
representations from data.
Key differences:
Deep learning requires large datasets and computational
power.
It learns features directly from data, whereas traditional
machine learning often requires feature engineering.
Deep learning algorithms are typically based on neural
networks with many hidden layers.
2. What is a neural network, and how does it work?
Answer:
A neural network is a computational model inspired by the
human brain, consisting of layers of interconnected nodes
(neurons).
Input layer receives data.
Hidden layers perform computations and learn features.
Output layer provides predictions.
The network learns by adjusting weights using a process
called backpropagation and an optimization algorithm like
gradient descent.
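As a minimal illustration (not part of the original notes), here is a NumPy sketch of one forward pass through a single hidden layer; the layer sizes and the choice of ReLU are arbitrary assumptions:

import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # input layer: 4 features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # hidden layer parameters
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)  # output layer parameters

h = relu(W1 @ x + b1)                          # hidden layer computation
y = W2 @ h + b2                                # output layer produces raw predictions
print(y)

Training then adjusts W1, b1, W2, b2 through backpropagation, as described in the next question.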
3. What is backpropagation?
Answer:
Backpropagation is an algorithm used to train neural
networks by minimizing the error.
The error from the output layer is propagated backward
through the network.
Gradients are computed for each weight using the chain rule.
Weights are updated using an optimizer (e.g., SGD or Adam)
to reduce the error.
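A minimal sketch of one backpropagation step for a single linear neuron with squared error (the input values and learning rate below are illustrative assumptions):

import numpy as np

x, y_true = np.array([1.0, 2.0]), 3.0
w, b, lr = np.array([0.5, -0.5]), 0.0, 0.1

y_pred = w @ x + b                        # forward pass
error = y_pred - y_true                   # dL/dy_pred for L = 0.5 * (y_pred - y_true)^2
grad_w = error * x                        # chain rule: dL/dw = dL/dy_pred * dy_pred/dw
grad_b = error                            # dy_pred/db = 1
w, b = w - lr * grad_w, b - lr * grad_b   # optimizer step (plain gradient descent)
print(w, b)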
4. What are activation functions, and why are they
important?
Answer:
Activation functions introduce non-linearity to neural
networks, enabling them to model complex relationships.
Common functions:
ReLU (Rectified Linear Unit): Fast convergence, avoids
vanishing gradient issues.
Sigmoid: Output between 0 and 1, used for binary
classification.
Softmax: Outputs probabilities for multiclass classification.
Tanh: Outputs between -1 and 1, centered around zero.
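For reference, a NumPy sketch of these functions (a simplified illustration; library versions add further numerical safeguards):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities that sum to 1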
5. What is overfitting, and how can it be prevented?
Answer:
Overfitting occurs when a model performs well on training
data but poorly on unseen data.
Prevention techniques:
Use regularization methods like L1/L2 (Ridge, Lasso).
Apply dropout layers.
Reduce model complexity.
Use more training data or data augmentation.
Perform early stopping during training.
6. What is the difference between batch size, epochs, and
iterations?
Answer:
Batch size: Number of samples processed before updating
the model's weights.
Epoch: One complete pass through the entire training
dataset.
Iteration: One batch update during training. For example, if
you have 1000 samples and a batch size of 100, there will be
10 iterations per epoch.
7. What is the vanishing gradient problem, and how can it
be mitigated?
Answer:
The vanishing gradient problem occurs when gradients
become very small in deep networks, slowing or stopping
learning.
Mitigation techniques:
Use activation functions like ReLU.
Initialize weights properly (e.g., Xavier or He initialization).
Use batch normalization.
Build networks with skip connections (e.g., ResNet).
8. What is transfer learning?
Answer:
Transfer learning involves using a pre-trained model on a
new task. Instead of training from scratch, the model's pre-
trained weights are fine-tuned for the target task.
This is useful when data is limited and for tasks like image
recognition or natural language processing.
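A hedged Keras-style sketch of the idea (this assumes TensorFlow/Keras and its bundled ImageNet weights; the 10-class head and the choice of ResNet50 are placeholders, and the commented-out data names are assumed):

import tensorflow as tf

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                                 # freeze the pre-trained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="softmax"),   # new head for the target task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(target_images, target_labels, epochs=5)    # fine-tune on the target dataset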
9. Explain the difference between CNNs and RNNs.
Answer:
CNNs (Convolutional Neural Networks): Designed for spatial
data like images. They use convolutional layers to capture
spatial hierarchies.
RNNs (Recurrent Neural Networks): Designed for sequential
data like time series or text. They have memory cells to
capture temporal dependencies.
10. What are gradient descent and its variants?
Answer:
Gradient descent is an optimization algorithm used to
minimize the loss function.
Common variants:
Batch Gradient Descent: Uses the entire dataset for each
update (slow for large datasets).
Stochastic Gradient Descent (SGD): Uses one sample per
update (faster but noisy).
Mini-batch Gradient Descent: Uses a subset (batch) of the
data for each update (balances speed and stability).
Adam Optimizer: Combines momentum and adaptive learning
rates for efficient training.
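An illustrative NumPy sketch of mini-batch gradient descent on a linear regression problem (the data, batch size, and learning rate are made up for the example):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w, lr, batch_size = np.zeros(3), 0.1, 100
for epoch in range(20):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]                    # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # MSE gradient
        w -= lr * grad                                           # parameter update
print(w)   # approaches [2, -1, 0.5]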
11. What is the role of the loss function in neural networks?
Answer:
The loss function measures the difference between the
predicted output and the actual target value. It guides the
optimization process by providing a metric for minimizing the
error.
Common loss functions:
Mean Squared Error (MSE): For regression tasks.
Binary Cross-Entropy: For binary classification.
Categorical Cross-Entropy: For multi-class classification.
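Illustrative NumPy versions of these losses (simplified sketches; framework implementations add reductions, clipping, and label handling):

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                    # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.5])))   # 0.25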
12. What are weight initialization techniques, and why are
they important?
Answer:
Weight initialization techniques help ensure faster
convergence and avoid issues like vanishing/exploding
gradients.
Random Initialization: Assigns random values to weights.
Xavier Initialization: Keeps the variance of activations
constant across layers.
He Initialization: Optimized for ReLU activations.
13. What is the difference between L1 and L2 regularization?
Answer:
L1 Regularization: Adds the absolute value of weights to the
loss function (Lasso). Encourages sparsity, making some
weights zero.
L2 Regularization: Adds the squared value of weights to the
loss function (Ridge). Penalizes large weights and prevents
overfitting.
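A small NumPy illustration of adding each penalty to a base loss (the lambda values and the base loss number are arbitrary):

import numpy as np

def l1_penalty(weights, lam=0.01):
    return lam * np.sum(np.abs(weights))     # Lasso-style term, encourages sparsity

def l2_penalty(weights, lam=0.01):
    return lam * np.sum(weights ** 2)        # Ridge-style term, shrinks large weights

w = np.array([0.5, -1.2, 0.0, 2.0])
base_loss = 0.8                              # e.g., cross-entropy on a batch (made up)
print(base_loss + l1_penalty(w), base_loss + l2_penalty(w))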
14. What are autoencoders, and how are they used?
Answer:
Autoencoders are neural networks used for unsupervised
learning, designed to reconstruct input data. They have an
encoder (to compress data) and a decoder (to reconstruct
it).
Applications:
Dimensionality reduction.
Anomaly detection.
Denoising data.
15. What is the role of batch normalization?
Answer:
Batch normalization normalizes the input of each layer to
improve stability and convergence during training.
Benefits:
Reduces internal covariate shift.
Allows for higher learning rates.
Acts as a regularizer, reducing the need for dropout.
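A minimal NumPy sketch of the training-time normalization step (gamma, beta, and epsilon follow the usual convention; the running statistics used at inference are omitted):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)                    # statistics computed over the batch
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta              # learnable scale and shift

batch = np.random.rand(32, 10)               # 32 samples, 10 features
print(batch_norm(batch).mean(axis=0))        # roughly zero per feature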
16. What is a recurrent neural network (RNN), and how does
it handle sequential data?
Answer:
RNNs are designed to process sequences of data by
maintaining a hidden state that captures information about
previous time steps. They handle dependencies in data like
time series, text, or speech.
Variants like LSTM (Long Short-Term Memory) and GRU
(Gated Recurrent Unit) address issues like vanishing
gradients.
17. What is the purpose of dropout in deep learning?
Answer:
Dropout is a regularization technique that randomly sets a
fraction of neurons to zero during training.
Prevents overfitting by introducing noise.
Encourages the network to learn more robust features.
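An illustrative NumPy version of inverted dropout during training (the 0.5 rate is an arbitrary example):

import numpy as np

def dropout(activations, rate=0.5, training=True):
    if not training:
        return activations                           # full network at inference
    mask = np.random.rand(*activations.shape) >= rate
    return activations * mask / (1.0 - rate)         # rescale to keep the expected value

h = np.ones((4, 6))
print(dropout(h))                                    # roughly half the entries zeroed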
18. What are GANs (Generative Adversarial Networks)?
Answer:
GANs are neural networks consisting of two components:
Generator: Creates fake data resembling real data.
Discriminator: Distinguishes between real and fake data.
They are trained together, improving the generator's ability to
create realistic data.
Applications:
Image generation.
Style transfer.
Data augmentation.
19. What is the difference between supervised, unsupervised,
and reinforcement learning?
Answer:
Supervised Learning: The model learns from labeled data
(e.g., classification, regression).
Unsupervised Learning: The model identifies patterns in
unlabeled data (e.g., clustering, dimensionality reduction).
Reinforcement Learning: The model learns by interacting
with the environment and receiving feedback in the form of
rewards or penalties.
20. What are attention mechanisms in deep learning?
Answer:
Attention mechanisms allow the model to focus on relevant
parts of the input while making predictions.
Example: In machine translation, the attention mechanism
helps the model focus on specific words in the source
sentence while translating.
Applications:
Transformer models like BERT and GPT.
Image captioning.
Text summarization.
21. What are the main components of a convolutional neural
network (CNN)?
Answer:
Convolutional Layers: Extract features by applying filters over
the input.
Pooling Layers: Reduce the spatial dimensions of feature
maps (e.g., max pooling).
Fully Connected Layers: Combine high-level features for
classification or regression.
Dropout/Bias Layers: Prevent overfitting and improve
generalization.
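A hedged Keras-style sketch wiring these components together (assumes TensorFlow/Keras; filter counts, kernel sizes, and the 28x28 input are placeholder choices):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolutional layer
    tf.keras.layers.MaxPooling2D((2, 2)),                                             # pooling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),                                     # fully connected layer
    tf.keras.layers.Dropout(0.25),                                                    # regularization
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()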
22. What is the difference between a feedforward neural
network and a recurrent neural network?
Answer:
Feedforward Neural Network (FNN): Processes input data in
one direction, without loops. Ideal for tasks like image
recognition.
Recurrent Neural Network (RNN): Processes sequential data
with feedback loops to maintain memory. Used for time-
series and language modeling.
23. What are LSTMs and GRUs? How are they different?
Answer:
LSTMs (Long Short-Term Memory): Use gates (input, forget,
output) to maintain long-term dependencies in sequences.
GRUs (Gated Recurrent Units): A simplified version of
LSTMs, combining forget and input gates into one update
gate.
GRUs are computationally faster, while LSTMs handle
complex dependencies better.
24. What is the difference between parameterized and non-
parameterized layers?
Answer:
Parameterized Layers: Contain trainable parameters (e.g.,
Dense, Convolutional layers).
Non-parameterized Layers: Do not contain trainable
parameters but modify data (e.g., Activation, Pooling layers).
25. What is the exploding gradient problem, and how is it
mitigated?
Answer:
Exploding gradients occur when large gradient values cause
instability during training.
Solutions:
Gradient clipping: Restrict gradients to a maximum value.
Use better initialization methods.
Use architectures like LSTMs/GRUs for sequential data.
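A small NumPy sketch of clipping a gradient vector by its norm (the threshold value is arbitrary):

import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)      # rescale so the norm equals the threshold
    return grad

g = np.array([30.0, -40.0])                  # norm 50, well above the threshold
print(clip_by_norm(g))                       # rescaled to norm 5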
26. What is the purpose of the softmax function?
Answer:
Softmax converts raw scores (logits) into probabilities that
sum to 1.
Used in the output layer for multi-class classification.
Formula:
Softmax(xi) = exp(xi) / sum(exp(xj)) for all j.
27. What is the difference between supervised pretraining
and self-supervised learning?
Answer:
Supervised Pretraining: The model is trained on a related
labeled dataset, then fine-tuned on the target dataset.
Self-Supervised Learning: The model generates pseudo-labels
from data (e.g., predicting masked tokens in BERT) and
learns representations without explicit labels.
28. What is the Transformer architecture, and how does it
work?
Answer:
The Transformer is a deep learning architecture designed for
sequence-to-sequence tasks. It uses:
Self-Attention Mechanism: To focus on relevant parts of
input sequences.
Positional Encoding: To maintain order in input sequences.
It replaced RNNs for tasks like machine translation (e.g.,
BERT, GPT models).
29. What are the main challenges in training deep neural
networks?
Answer:
Vanishing/exploding gradients.
Overfitting on training data.
High computational cost.
Difficulty in hyperparameter tuning.
Data scarcity or imbalance.
30. What is the difference between model-based and data-
based parallelism in deep learning?
Answer:
Model-based Parallelism: Splits the model across multiple
devices (e.g., splitting layers of a large neural network).
Data-based Parallelism: Splits the data into batches
processed in parallel across devices.
31. What is transfer learning, and why is it important in
deep learning?
Answer:
Transfer learning involves using a pre-trained model on a
related task and fine-tuning it for a target task.
Benefits:
Reduces training time.
Requires less data for the target task.
Leverages learned features from a larger dataset (e.g.,
ImageNet).
32. What is the purpose of an activation function in a neural
network?
Answer:
Activation functions introduce non-linearity into the network,
enabling it to learn complex patterns.
Common activation functions:
ReLU (Rectified Linear Unit): max(0, x).
Sigmoid: Outputs values between 0 and 1.
Tanh: Outputs values between -1 and 1.
Softmax: Converts outputs into probabilities.
33. What is knowledge distillation in deep learning?
Answer:
Knowledge distillation transfers knowledge from a large,
complex model (teacher) to a smaller, simpler model
(student) without significant performance loss.
Steps:
Train the teacher model.
Use the teacher's soft predictions to train the student.
34. What is the role of learning rate scheduling in training
deep learning models?
Answer:
Learning rate scheduling adjusts the learning rate during
training to balance convergence speed and stability.
Types of schedules:
Step decay: Reduce the learning rate at fixed intervals.
Exponential decay: Multiply the learning rate by a factor at
each step.
Cyclic learning rates: Oscillate the learning rate within a
range.
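An illustrative step-decay schedule in plain Python (the base rate, decay factor, and interval are placeholder values):

def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=10):
    return initial_lr * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 9, 10, 20, 30):
    print(epoch, step_decay(epoch))          # 0.1, 0.1, 0.05, 0.025, 0.0125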
35. What are the differences between instance
normalization, batch normalization, and layer normalization?
Answer:
Batch Normalization: Normalizes activations across a batch
of data. Useful for training stability.
Instance Normalization: Normalizes activations for each
sample. Often used in style transfer tasks.
Layer Normalization: Normalizes across features for each
sample. Effective for RNNs and transformer architectures.
36. What is the vanishing gradient problem, and how do
activation functions like ReLU address it?
Answer:
Vanishing gradients occur when gradients shrink
exponentially during backpropagation, preventing effective
weight updates.
ReLU: Avoids vanishing gradients by allowing gradients to
pass unchanged for positive values, as its derivative is either
0 or 1.
37. What are the differences between Adam and SGD
optimizers?
Answer:
SGD (Stochastic Gradient Descent): Updates weights using
the gradient of the loss function. Slower convergence.
Adam (Adaptive Moment Estimation): Combines momentum
and adaptive learning rates for faster convergence and
improved stability.
38. What are attention heads in the transformer model?
Answer:
Attention heads in transformers allow the model to focus on
different parts of the input simultaneously.
Multi-head attention splits the queries, keys, and values into
multiple parts, computes attention independently, and
combines results for better contextual understanding.
39. What is the difference between gradient clipping and
gradient normalization?
Answer:
Gradient Clipping: Limits the magnitude of gradients to a
pre-defined threshold to prevent exploding gradients.
Gradient Normalization: Scales gradients to have a consistent
magnitude, improving training stability.
40. What is the difference between early stopping and
checkpointing in training?
Answer:
Early Stopping: Stops training when performance on a
validation set stops improving, preventing overfitting.
Checkpointing: Saves model weights periodically during
training. Useful for recovering from interruptions or selecting
the best-performing model.
41. What is the difference between the encoder and decoder
in sequence-to-sequence models?
Answer:
Encoder: Processes the input sequence and encodes it into a
fixed-length vector or context.
Decoder: Takes the encoded context and generates the
output sequence step by step.
Examples: Used in machine translation (e.g., English to
French).
42. What is the role of positional encoding in transformers?
Answer:
Transformers do not process data sequentially, so positional
encoding is added to input embeddings to provide
information about the order of tokens.
Positional encodings are sinusoidal functions of different
frequencies.
43. What are the challenges of deploying deep learning
models in production?
Answer:
High inference latency and memory usage.
Ensuring model robustness to real-world data.
Scalability under high traffic.
Maintaining model versioning and reproducibility.
Monitoring performance drift over time.
44. What is Layerwise Relevance Propagation (LRP)?
Answer:
LRP is an explainability technique for neural networks. It
decomposes the output prediction back to the input features
to show their relevance.
It helps interpret model decisions and is used in sensitive
domains like healthcare.
45. What is the difference between semantic segmentation
and instance segmentation?
Answer:
Semantic Segmentation: Classifies each pixel of an image
into a category but does not differentiate between individual
objects.
Instance Segmentation: Identifies individual objects of the
same class and segments them separately.
46. What is a dilated convolution, and where is it used?
Answer:
A dilated convolution (also called atrous convolution)
expands the receptive field of convolution filters by inserting
spaces between kernel elements.
Used in:
Semantic segmentation (e.g., DeepLab).
Audio and time-series data analysis.
47. What are the benefits of using cosine similarity over dot
product for measuring vector similarity?
Answer:
Cosine Similarity: Measures the cosine of the angle between
two vectors, focusing on orientation rather than magnitude.
Benefits:
Better for normalized embeddings.
Prevents large magnitude differences from dominating the
similarity score.
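A quick NumPy comparison of the two measures (the vectors are made up to show the effect of magnitude):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])             # same direction, ten times the magnitude

dot = a @ b                                  # grows with magnitude: 140
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # 1.0: identical orientation
print(dot, cosine)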
48. What is zero-shot learning, and how does it work?
Answer:
Zero-shot learning enables a model to make predictions for
classes it has not seen during training.
Mechanism:
Leverages a shared semantic space (e.g., word embeddings)
to transfer knowledge from seen to unseen classes.
49. What is a Siamese network, and where is it used?
Answer:
A Siamese network uses two identical subnetworks to
compare inputs by learning a similarity metric.
Applications:
Face verification.
One-shot learning.
Signature verification.
50. What is the purpose of weight initialization in deep
learning?
Answer:
Proper weight initialization prevents vanishing or exploding
gradients and accelerates convergence.
Xavier Initialization: Suitable for activations like sigmoid or
tanh.
He Initialization: Designed for ReLU activation functions.
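An illustrative NumPy sketch of the two schemes for a layer with fan_in inputs and fan_out outputs (the uniform Xavier and normal He variants are shown; other variants exist):

import numpy as np

def xavier_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))            # keeps activation variance stable
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    std = np.sqrt(2.0 / fan_in)                          # accounts for ReLU zeroing half the inputs
    return np.random.normal(0.0, std, size=(fan_out, fan_in))

W_tanh_layer = xavier_uniform(256, 128)
W_relu_layer = he_normal(256, 128)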
51. What are vanishing and exploding gradients, and how do
they impact deep learning models?
Answer:
Vanishing Gradients: Gradients become very small, causing
weights to update slowly and halting learning.
Exploding Gradients: Gradients become very large, leading to
unstable updates and possible divergence.
Solutions:
Use activation functions like ReLU.
Implement gradient clipping.
Use batch normalization or better initialization methods like
He initialization.
52. What are the differences between data augmentation
and data synthesis?
Answer:
Data Augmentation: Applies transformations to existing data
(e.g., rotations, flips, noise). It enhances diversity without
altering class distribution.
Data Synthesis: Generates entirely new data using
techniques like GANs or simulations. Useful for handling
imbalanced or rare classes.
53. What are the key differences between RNNs, GRUs, and
LSTMs?
Answer:
RNNs: Process sequential data but suffer from vanishing
gradients for long sequences.
GRUs (Gated Recurrent Units): Simplified LSTMs with fewer
parameters; combine the forget and input gates.
LSTMs (Long Short-Term Memory): Use separate forget,
input, and output gates to handle long-term dependencies
effectively.
54. What is the purpose of gradient accumulation in deep
learning?
Answer:
Gradient accumulation splits the batch into smaller micro-
batches to compute gradients iteratively, then updates the
weights after processing all micro-batches.
Benefits:
Reduces memory usage for large models or small GPUs.
Simulates larger batch sizes for better convergence.
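A minimal NumPy sketch of the idea on a linear model: gradients from several micro-batches are accumulated, then averaged for a single weight update (all sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])

w, lr = np.zeros(4), 0.1
micro_batch, accum_steps = 32, 4                         # effective batch = 32 * 4 = 128

accum_grad = np.zeros(4)
for step in range(accum_steps):
    s = step * micro_batch
    Xb, yb = X[s:s + micro_batch], y[s:s + micro_batch]
    accum_grad += 2 * Xb.T @ (Xb @ w - yb) / len(Xb)     # MSE gradient on one micro-batch
w -= lr * accum_grad / accum_steps                       # one update for the effective batch
print(w)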
55. What are capsule networks, and how do they differ from
CNNs?
Answer:
Capsule networks model spatial relationships between
features using vectors, instead of scalars like CNNs.
Advantages:
Better handling of spatial hierarchies.
Preserves orientation and pose information.
Example: Used in tasks like image classification with fewer
training examples.
56. What is deep reinforcement learning (DRL), and what are its
applications?
Answer:
DRL combines deep learning and reinforcement learning,
where agents learn optimal policies through trial and error.
Applications:
Game playing (e.g., AlphaGo, Dota 2).
Robotics and control systems.
Autonomous vehicles.
57. How does dropout work in deep learning, and why is it
effective?
Answer:
Dropout randomly disables a fraction of neurons during
training, preventing overfitting by reducing co-dependencies
among neurons.
During inference, the full network is used with scaled-down
weights.
58. What is label smoothing, and why is it used?
Answer:
Label smoothing replaces hard labels (e.g., 1 or 0) with
smoothed probabilities (e.g., 0.9 and 0.1).
Benefits:
Reduces overconfidence in predictions.
Helps the model generalize better.
Example: Common in image classification with cross-entropy
loss.
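A quick NumPy illustration of smoothing a one-hot label (the smoothing factor 0.1 is a common but arbitrary choice):

import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    n_classes = one_hot.shape[-1]
    return one_hot * (1 - epsilon) + epsilon / n_classes   # spread epsilon over all classes

y = np.array([0.0, 1.0, 0.0])
print(smooth_labels(y))          # approximately [0.033, 0.933, 0.033]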
59. What is the difference between dense and sparse
embeddings?
Answer:
Dense Embeddings: Low-dimensional, continuous-valued
vectors (e.g., Word2Vec, BERT). Compact and efficient for
downstream tasks.
Sparse Embeddings: High-dimensional, mostly zero vectors
(e.g., one-hot encoding). Inefficient but straightforward.
60. What is the difference between teacher forcing and free-
running in sequence models?
Answer:
Teacher Forcing: During training, the model uses the ground
truth as input for the next time step. Speeds up convergence
but can lead to exposure bias.
Free-Running: During inference, the model uses its own
predictions as inputs. Better simulates real-world usage.
61. What is the purpose of skip connections in deep neural
networks?
Answer:
Skip connections, like those used in ResNet, allow gradients
to flow more easily through the network, mitigating the
vanishing gradient problem. They also enable the model to
learn identity mappings for shallow layers.
62. What are adversarial examples, and how do they affect
deep learning models?
Answer:
Adversarial examples are inputs deliberately perturbed to fool
a model into making incorrect predictions. They expose
vulnerabilities in deep learning models and highlight the need
for robust training techniques like adversarial training.
63. How does the attention mechanism work in deep learning
models?
Answer:
The attention mechanism assigns different weights to
different parts of the input sequence, focusing on the most
relevant features for a specific task. For example, in
translation tasks, attention aligns words between the source
and target languages.
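A minimal NumPy sketch of scaled dot-product attention, the core computation behind these weights (the matrix shapes are illustrative):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of each query with each key
    weights = softmax(scores, axis=-1)    # attention weights sum to 1 per query
    return weights @ V                    # weighted combination of the values

Q = np.random.rand(3, 8)                  # 3 query positions, dimension 8
K = np.random.rand(5, 8)                  # 5 key/value positions
V = np.random.rand(5, 8)
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 8)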
64. What are variational autoencoders (VAEs), and how do
they differ from standard autoencoders?
Answer:
VAEs generate new data by learning a probabilistic latent
space. Unlike standard autoencoders, they optimize a
variational lower bound using both reconstruction loss and a
KL divergence term to ensure smooth latent space
representations.
65. What is the difference between transfer learning and
fine-tuning?
Answer:
Transfer Learning: Reusing a pre-trained model's features
without modifying its weights.
Fine-Tuning: Adapting a pre-trained model to a specific task
by training some or all of its layers with a new dataset.
66. What is the concept of gradient penalty in GANs?
Answer:
Gradient penalty is used to enforce the Lipschitz continuity
condition in Wasserstein GANs. It adds a penalty to the loss
function based on the gradient norm of the discriminator,
stabilizing training.
67. What is self-supervised learning, and what are its applications?
Answer:
Self-supervised learning creates labels from raw data itself
to train models without manual annotations.
Applications:
Pre-training models like BERT and SimCLR.
Applications in computer vision and NLP.
68. What is label imbalance, and how can it be addressed in
deep learning?
Answer:
Label imbalance occurs when classes in a dataset are not
equally represented.
Solutions:
Oversampling minority classes.
Undersampling majority classes.
Using class weights in the loss function.
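For the class-weight option, one common (assumed) recipe is inverse-frequency weighting, sketched here in NumPy with a made-up label array:

import numpy as np

labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])       # imbalanced: 80% class 0
counts = np.bincount(labels)
class_weights = len(labels) / (len(counts) * counts)     # rare classes get larger weights
print(class_weights)                                     # [0.625, 2.5]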
69. What are group normalization and layer normalization?
Answer:
Group Normalization: Normalizes activations within groups of
channels. Effective for small batch sizes.
Layer Normalization: Normalizes across all features of a
single data point. Common in NLP tasks.
70. What are hyperparameter optimization techniques in deep
learning?
Answer:
Grid Search.
Random Search.
Bayesian Optimization.
Hyperband or Population-Based Training.
Tools like Optuna and Ray Tune help automate this process.
71. What is knowledge distillation, and why is it useful?
Answer:
Knowledge distillation transfers knowledge from a large,
complex model (teacher) to a smaller, faster model
(student). It improves inference speed while retaining high
accuracy.
72. What is the difference between BatchNorm, LayerNorm,
and InstanceNorm?
Answer:
BatchNorm: Normalizes over a mini-batch of samples.
LayerNorm: Normalizes across features of a single sample.
InstanceNorm: Normalizes across spatial dimensions for each
sample, commonly used in style transfer.
73. What is spectral normalization, and why is it used in
GANs?
Answer:
Spectral normalization constrains the Lipschitz constant of
the discriminator by normalizing its weight matrices. It
stabilizes training and prevents mode collapse.
74. What is the SWA (Stochastic Weight Averaging)
technique?
Answer:
SWA averages weights from multiple SGD steps during
training. It improves generalization by converging to flat
minima in the loss landscape.
75. What is the Softmax bottleneck problem in language
models?
Answer:
The Softmax bottleneck limits the expressiveness of language
models due to its restricted output distribution. Techniques
like adaptive Softmax and Mixture of Softmaxes help address
this issue.
76. How does the transformer architecture handle long
sequences efficiently?
Answer:
Transformers use self-attention mechanisms that process
sequences in parallel, unlike RNNs. They can model long-
range dependencies without sequential computation.
77. What is a mixture of experts (MoE) model?
Answer:
An MoE model combines several sub-models (experts) and
uses a gating mechanism to assign weights to each expert
for a given input. It is computationally efficient for scaling
large models.
78. What are the main differences between Mask R-CNN
and Faster R-CNN?
Answer:
Faster R-CNN: Detects objects and generates bounding
boxes.
Mask R-CNN: Extends Faster R-CNN by adding a mask head
for pixel-wise segmentation.
79. What is the purpose of cosine annealing in learning rate
scheduling?
Answer:
Cosine annealing gradually decreases the learning rate
following a cosine curve. It helps achieve better convergence
by encouraging the model to settle into a minimum slowly.
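An illustrative cosine-annealing schedule in plain Python (the maximum/minimum rates and the period are placeholder values):

import math

def cosine_annealing(epoch, total_epochs=100, lr_max=0.1, lr_min=0.001):
    cos = math.cos(math.pi * epoch / total_epochs)       # 1 at the start, -1 at the end
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos)

for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(cosine_annealing(epoch), 4))      # decays smoothly from 0.1 to 0.001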
80. What is the difference between active learning and
semi-supervised learning?
Answer:
Active Learning: Identifies the most informative samples to
label from an unlabeled pool.
Semi-Supervised Learning: Combines a small labeled dataset
with a large unlabeled dataset to improve performance.
81. What is the purpose of gradient clipping, and when is it
used?
Answer:
Gradient clipping limits the gradient magnitude to prevent
exploding gradients, commonly used in RNNs and deep
networks. It stabilizes training when gradients become
excessively large.
82. What is focal loss, and why is it useful?
Answer:
Focal loss is designed to address class imbalance by down-
weighting the loss for well-classified examples and focusing
on hard-to-classify examples.
Formula:
FL(pt) = -(1 - pt)^γ * log(pt)
where γ (gamma) controls the focusing effect.
83. What is dilated convolution, and how does it differ from
standard convolution?
Answer:
Dilated convolution increases the receptive field without
increasing the number of parameters by introducing spaces
between kernel elements. It is useful in tasks like semantic
segmentation.
84. What is weight regularization, and how does it work?
Answer:
Weight regularization reduces overfitting by penalizing large
weights.
L1 Regularization: Adds |w| to the loss.
L2 Regularization (Weight Decay): Adds w^2 to the loss.
85. What are the advantages of using mixed precision
training?
Answer:
Reduces memory usage.
Increases training speed.
Achieves comparable accuracy by using lower precision (e.g.,
FP16) for calculations and higher precision (e.g., FP32) for
key operations.
86. How does early stopping prevent overfitting?
Answer:
Early stopping halts training when performance on a
validation set stops improving. It prevents the model from
overfitting to the training data by stopping at an optimal
point.
87. What is the difference between supervised pretraining
and self-supervised pretraining?
Answer:
Supervised Pretraining: Pretraining on a labeled dataset
before fine-tuning on a specific task.
Self-Supervised Pretraining: Pretraining using self-generated
labels without human annotations, commonly used in NLP
and vision.
88. What is neural architecture search (NAS)?
Answer:
NAS is an automated process to find optimal neural network
architectures. Techniques include:
Reinforcement learning.
Evolutionary algorithms.
Gradient-based methods.
89. What is an encoder-decoder architecture?
Answer:
An encoder-decoder is used in sequence-to-sequence tasks.
Encoder: Compresses input into a latent representation.
Decoder: Generates output from the latent representation.
Examples: Translation and summarization.
90. What is the purpose of cosine similarity in NLP tasks?
Answer:
Cosine similarity measures the similarity between two
vectors by computing the cosine of the angle between them.
Commonly used for comparing word or sentence embeddings.
91. What is the difference between Seq2Seq models with and
without attention?
Answer:
Without Attention: Encodes the entire input into a fixed-
length vector, limiting performance on long sequences.
With Attention: Dynamically focuses on relevant parts of the
input sequence for better performance.
92. How does transfer learning benefit small datasets?
Answer:
Transfer learning leverages features learned from a large
dataset, reducing the need for extensive data. It avoids
overfitting and improves generalization on small datasets.
93. What are transposed convolutions, and where are they
used?
Answer:
Transposed (deconvolutional) convolutions increase spatial
resolution, often used in generative tasks like image super-
resolution or semantic segmentation.
94. What are GNNs (Graph Neural Networks), and where are
they applied?
Answer:
GNNs work on graph-structured data, propagating
information between nodes.
Applications:
Social networks.
Molecular analysis.
Recommendation systems.
95. What is the purpose of masked language models
(MLMs)?
Answer:
MLMs, like BERT, predict missing words in a sentence by
masking parts of the input. This bidirectional understanding
improves performance on NLP tasks.
96. What is layer-wise learning rate scaling?
Answer:
Layer-wise learning rate scaling assigns different learning
rates to different layers, often smaller rates for pre-trained
layers and larger rates for newly added layers.
97. What is knowledge graph embedding?
Answer:
Knowledge graph embedding represents entities and
relationships in a knowledge graph as low-dimensional
vectors.
Applications:
Question answering.
Recommendation systems.
98. What is a feature pyramid network (FPN)?
Answer:
FPN builds a multi-scale feature hierarchy by combining
low-resolution, semantically strong features with high-
resolution, spatially precise features. Common in object
detection.
99. What is the difference between a sparse and dense
layer?
Answer:
Sparse Layer: Uses sparse matrices to save memory and
computational resources.
Dense Layer: Fully connected, requiring more resources but
capturing all feature interactions.
100. What is capsule routing in capsule networks?
Answer:
Capsule routing ensures that lower-layer capsules send their
outputs to higher-layer capsules based on agreement scores.
This process preserves spatial hierarchies in data.
101. What is the vanishing gradient problem, and how is it
mitigated?
Answer:
The vanishing gradient problem occurs when gradients
become extremely small during backpropagation, preventing
effective weight updates in earlier layers. This often happens
in deep networks with activation functions like sigmoid or
tanh.
Mitigation Strategies:
1. Use ReLU activation functions: ReLU avoids vanishing
gradients by having a constant gradient for positive values.
2. Batch normalization: Normalizes layer inputs to stabilize
and maintain gradients.
3. Residual connections: Allow gradients to flow directly
through skip connections in deep networks (e.g., ResNet).
102. Explain the concept of teacher forcing in RNNs. Why is
it useful?
Answer:
Teacher forcing is a technique used in sequence-to-sequence
models where the actual target output is used as input to
the next time step during training, instead of the predicted
output.
Advantages:
Speeds up convergence by providing ground-truth inputs.
Challenges:
It introduces exposure bias (the discrepancy between training and
inference): at inference, the model might struggle without ground-truth
inputs. Scheduled sampling can gradually reduce reliance on
teacher forcing.
103. What is the difference between label smoothing and
hard labels?
Answer:
Hard Labels: Assign a one-hot encoding for the target
classes (e.g., [1, 0, 0]).
Label Smoothing: Modifies hard labels by assigning a small
probability to incorrect classes to make the model less
confident in its predictions.
For example: [0.9, 0.05, 0.05].
Advantages of Label Smoothing:
Improves generalization by preventing overconfidence.
Mitigates overfitting, especially on noisy data.
104. What is knowledge distillation, and how is it applied?
Answer:
Knowledge distillation transfers knowledge from a large,
complex model (teacher) to a smaller, simpler model
(student).
How it works:
The student model is trained to mimic the teacher's
softened output probabilities instead of the hard labels.
Loss Function:
L = (1 - a) * cross_entropy(y, z_student) + a *
KL_divergence(softmax(z_teacher / T), softmax(z_student / T))
where T is the temperature and a balances the loss terms.
Applications:
Deploying efficient models on resource-constrained devices.
Model compression.
105. What is gradient centralization, and why is it used?
Answer:
Gradient centralization normalizes gradients by subtracting
their mean before updating weights.
Benefits:
Improves optimization stability.
Helps models converge faster.
Reduces variance in gradients, especially in deep networks.
Commonly used in conjunction with optimizers like SGD or
Adam.
106. What is the difference between transductive and
inductive learning?
Answer:
Transductive Learning: Learns to predict labels only for the
given test data, without generalizing to unseen data.
Example: Graph-based semi-supervised learning.
Inductive Learning: Learns a general function or model that
can make predictions on unseen data. Example: Most deep
learning models like CNNs or RNNs.
107. What are adversarial examples, and how do you defend
against them?
Answer:
Adversarial examples are inputs deliberately perturbed to
deceive a model into making incorrect predictions while
appearing unchanged to humans.
Defenses:
1. Adversarial training: Train the model on adversarially
perturbed data.
2. Gradient masking: Obfuscate gradients to make it harder
for attackers to compute perturbations.
3. Input preprocessing: Techniques like JPEG compression or
Gaussian noise addition can reduce adversarial effects.
108. What are transformer models, and how do they differ
from RNNs?
Answer:
Transformer models use self-attention mechanisms to
process sequences, unlike RNNs that process inputs
sequentially.
Key Differences:
1. Parallelism: Transformers process all input tokens
simultaneously, while RNNs process sequentially.
2. Long-term dependencies: Transformers capture long-range
dependencies better using attention.
3. Efficiency: Transformers are more efficient with GPUs due
to parallelization but require more memory.
Examples: BERT, GPT, T5.
109. What is the purpose of positional encoding in
transformers?
Answer:
Positional encoding allows transformers, which lack inherent
sequence awareness, to incorporate the order of tokens in a
sequence.
Formula for sinusoidal encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where pos is the position, i is the dimension, and d is the
embedding size.
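A NumPy sketch of this sinusoidal encoding (the sequence length and embedding size below are placeholder values):

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                        # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimensions: cosine
    return pe

print(positional_encoding(50, 16).shape)                     # (50, 16)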
110. What is the concept of layer normalization, and how
does it differ from batch normalization?
Answer:
Layer Normalization: Normalizes inputs across features within
a single training example. Commonly used in NLP and
transformers.
Formula:
y = (x - mean) / sqrt(variance + ε)
Batch Normalization: Normalizes inputs across the batch for
each feature. Common in CNNs.
Differences:
Batch normalization depends on batch size; layer
normalization does not.
Layer normalization is more effective in sequence-based
models.
111. What are attention mechanisms, and why are they
important?
Answer:
Attention mechanisms allow models to focus on specific
parts of the input sequence when making predictions,
assigning varying importance (weights) to different tokens
or elements.
Types of Attention:
1. Self-Attention: Helps capture relationships within a single
sequence.
2. Cross-Attention: Used in sequence-to-sequence models to
relate input and output sequences.
Importance:
Captures long-range dependencies.
Enhances interpretability by showing what the model is
focusing on.
Forms the backbone of transformer architectures.
112. What are capsule networks, and how do they differ from
traditional CNNs?
Answer:
Capsule networks are designed to model spatial hierarchies
by encoding the pose and orientation of features in addition
to their presence.
Key Differences:
Capsules: Groups of neurons represent the probability and
parameters of detected objects.
Dynamic Routing: Capsules communicate with higher-level
capsules using iterative routing-by-agreement mechanisms,
unlike fixed pooling in CNNs.
Advantages:
Better at understanding hierarchical relationships.
More robust to changes in orientation and spatial distortions.
113. What is the difference between data augmentation and
data synthesis?
Answer:
Data Augmentation: Modifies existing data to increase
diversity (e.g., flipping, cropping, adding noise). Commonly
used for regularization.
Data Synthesis: Generates entirely new data points from a
model (e.g., GANs, VAEs).
Use Cases:
Augmentation improves robustness without altering the
dataset size drastically.
Synthesis is useful for creating data in underrepresented
categories.
114. What are GANs, and how do they work?
Answer:
Generative Adversarial Networks (GANs) consist of two
models:
Generator: Creates fake data.
Discriminator: Distinguishes between real and fake data.
Training Process:
The generator learns to create realistic data by fooling the
discriminator.
The discriminator improves by distinguishing fake from real
data.
Both models play a minimax game.
Loss Function:
min_G max_D E[log(D(real))] + E[log(1 - D(fake))]
Applications: Image generation, style transfer, and super-
resolution.
115. What is the role of the softmax function in neural
networks?
Answer:
The softmax function converts raw scores (logits) into
probabilities, ensuring they sum to 1. It is commonly used in
the output layer of classification tasks.
Formula:
softmax(xi) = exp(xi) / sum(exp(xj) for j in range(n))
Advantages:
Provides interpretable class probabilities.
Highlights the most likely class while suppressing others.
Helps during loss calculation using cross-entropy.
116. What are variational autoencoders (VAEs), and how are
they different from standard autoencoders?
Answer:
Variational autoencoders are probabilistic models that learn a
latent representation as a distribution (mean and variance)
rather than a deterministic vector.
Differences:
Standard Autoencoders: Compress input into fixed latent
vectors.
VAEs: Use a probabilistic approach to generate diverse
outputs.
Loss Function:
L = reconstruction_loss + KL_divergence(latent || prior)
Applications: Image synthesis, anomaly detection, and latent
space exploration.
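For the KL term, a NumPy sketch of the closed form for a diagonal Gaussian latent against a standard normal prior (the mean/log-variance values and the reconstruction loss number are made up):

import numpy as np

def kl_divergence(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    return -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))

mu = np.array([0.1, -0.2])
log_var = np.array([-0.05, 0.1])
reconstruction_loss = 0.45                                   # placeholder value
total_loss = reconstruction_loss + kl_divergence(mu, log_var)
print(total_loss)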
117. What is transfer learning, and why is it effective?
Answer:
Transfer learning involves reusing a pre-trained model on a
related task to improve performance and reduce training
time.
Effectiveness:
Pre-trained models like ResNet and BERT already learn
general features, reducing the need for large labeled
datasets.
Fine-tuning adapts these features to specific tasks.
Examples:
Using ImageNet-trained CNNs for medical imaging.
Adapting BERT for sentiment analysis.
118. What is the difference between gradient clipping and
gradient normalization?
Answer:
Gradient Clipping: Limits the magnitude of gradients to
prevent exploding gradients.
if ||gradient|| > threshold:
    gradient = gradient * (threshold / ||gradient||)
Gradient Normalization: Adjusts gradients by dividing them
by their norm, ensuring consistency across layers.
Use Cases:
Clipping is common in RNNs with vanishing/exploding
gradients.
Normalization is used for ensuring smooth optimization.
119. How do you calculate the receptive field of a
convolutional layer?
Answer:
The receptive field is the area of input pixels that influence a
particular output feature.
Formula: For n layers:
R_n = R_(n-1) + (K_n - 1) * stride_n
where R_n is the receptive field, K_n is the kernel size, and
stride_n is the stride of the current layer.
Importance: Determines the spatial context captured by a
convolutional layer.
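A small Python sketch of this calculation. Note that in the running form below the kernel term is multiplied by the cumulative product of the strides of the earlier layers (often called the jump); this reduces to the formula above when all strides are 1. The layer list is a made-up example:

def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs from input to output
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump        # widen the receptive field
        jump *= stride                   # cumulative stride of the layers so far
    return rf

print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))   # 10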
120. What are the limitations of backpropagation?
Answer:
Vanishing/exploding gradients: Can hinder optimization in
deep networks.
High computation cost: Requires significant memory and
computation for large networks.
Dependence on labeled data: Backpropagation requires labeled
datasets, which can be expensive to acquire.
Non-convexity: Optimization often converges to local minima
or saddle points.
Solutions: Advanced optimizers (Adam, RMSProp),
initialization techniques, and regularization strategies.
Amar Sharma
AI Engineer
Follow me on LinkedIn for more
informative content.