120 Important Deep Learning Interview Questions + Answers (Notes by Amar Sharma)

1. What is deep learning? How is it different from machine learning?
Answer: Deep learning is a subset of machine learning that uses neural networks with multiple layers to automatically learn representations from data. Key differences: deep learning requires large datasets and computational power; it learns features directly from data, whereas traditional machine learning often requires feature engineering; and deep learning algorithms are typically based on neural networks with many hidden layers.

2. What is a neural network, and how does it work?
Answer: A neural network is a computational model inspired by the human brain, consisting of layers of interconnected nodes (neurons). The input layer receives data, hidden layers perform computations and learn features, and the output layer provides predictions. The network learns by adjusting weights using a process called backpropagation and an optimization algorithm like gradient descent.

3. What is backpropagation?
Answer: Backpropagation is an algorithm used to train neural networks by minimizing the error. The error from the output layer is propagated backward through the network, gradients are computed for each weight using the chain rule, and weights are updated using an optimizer (e.g., SGD or Adam) to reduce the error.

4. What are activation functions, and why are they important?
Answer: Activation functions introduce non-linearity to neural networks, enabling them to model complex relationships. Common functions: ReLU (Rectified Linear Unit): fast convergence, avoids vanishing gradient issues. Sigmoid: output between 0 and 1, used for binary classification. Softmax: outputs probabilities for multi-class classification. Tanh: outputs between -1 and 1, centered around zero.

5. What is overfitting, and how can it be prevented?
Answer: Overfitting occurs when a model performs well on training data but poorly on unseen data. Prevention techniques: use regularization methods like L1/L2 (Lasso, Ridge), apply dropout layers, reduce model complexity, use more training data or data augmentation, and perform early stopping during training.

6. What is the difference between batch size, epochs, and iterations?
Answer: Batch size: the number of samples processed before updating the model's weights. Epoch: one complete pass through the entire training dataset. Iteration: one batch update during training. For example, with 1000 samples and a batch size of 100, there are 10 iterations per epoch.

7. What is the vanishing gradient problem, and how can it be mitigated?
Answer: The vanishing gradient problem occurs when gradients become very small in deep networks, slowing or stopping learning. Mitigation techniques: use activation functions like ReLU, initialize weights properly (e.g., Xavier or He initialization), use batch normalization, and build networks with skip connections (e.g., ResNet).

8. What is transfer learning?
Answer: Transfer learning involves using a pre-trained model on a new task. Instead of training from scratch, the model's pre-trained weights are fine-tuned for the target task. This is useful when data is limited and for tasks like image recognition or natural language processing.

9. Explain the difference between CNNs and RNNs.
Answer: CNNs (Convolutional Neural Networks) are designed for spatial data like images; they use convolutional layers to capture spatial hierarchies. RNNs (Recurrent Neural Networks) are designed for sequential data like time series or text; they have memory cells to capture temporal dependencies.
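To make the backpropagation and gradient-descent steps from Q2 and Q3 concrete, here is a minimal sketch of one weight update for a single-hidden-layer network in NumPy. The layer sizes, random data, and learning rate are illustrative assumptions, not values from the original notes.

```python
import numpy as np

# Minimal sketch: one backpropagation step for a tiny 2-layer network.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))          # 8 samples, 4 input features
y = rng.normal(size=(8, 1))          # regression targets

W1 = rng.normal(scale=0.1, size=(4, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 1)); b2 = np.zeros(1)
lr = 0.01

# Forward pass
z1 = X @ W1 + b1
h = np.maximum(0.0, z1)              # ReLU hidden layer
pred = h @ W2 + b2
loss = np.mean((pred - y) ** 2)      # MSE loss

# Backward pass (chain rule), then a gradient-descent update
d_pred = 2 * (pred - y) / len(X)
dW2 = h.T @ d_pred
db2 = d_pred.sum(axis=0)
dh = d_pred @ W2.T
dz1 = dh * (z1 > 0)                  # ReLU derivative
dW1 = X.T @ dz1
db1 = dz1.sum(axis=0)

W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(f"loss before update: {loss:.4f}")
```

In a real framework these gradients come from automatic differentiation; the manual version just shows what "propagating the error backward" computes.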
10. What are gradient descent and its variants?
Answer: Gradient descent is an optimization algorithm used to minimize the loss function. Common variants: Batch Gradient Descent uses the entire dataset for each update (slow for large datasets). Stochastic Gradient Descent (SGD) uses one sample per update (faster but noisy). Mini-batch Gradient Descent uses a subset (batch) of the data for each update (balances speed and stability). The Adam optimizer combines momentum and adaptive learning rates for efficient training.

11. What is the role of the loss function in neural networks?
Answer: The loss function measures the difference between the predicted output and the actual target value. It guides the optimization process by providing a metric for minimizing the error. Common loss functions: Mean Squared Error (MSE) for regression tasks, Binary Cross-Entropy for binary classification, and Categorical Cross-Entropy for multi-class classification.

12. What are weight initialization techniques, and why are they important?
Answer: Weight initialization techniques help ensure faster convergence and avoid issues like vanishing/exploding gradients. Random initialization assigns random values to weights. Xavier initialization keeps the variance of activations constant across layers. He initialization is optimized for ReLU activations.

13. What is the difference between L1 and L2 regularization?
Answer: L1 regularization adds the absolute value of the weights to the loss function (Lasso); it encourages sparsity, making some weights zero. L2 regularization adds the squared value of the weights to the loss function (Ridge); it penalizes large weights and prevents overfitting.

14. What are autoencoders, and how are they used?
Answer: Autoencoders are neural networks used for unsupervised learning, designed to reconstruct input data. They have an encoder (to compress data) and a decoder (to reconstruct it). Applications: dimensionality reduction, anomaly detection, and denoising data.

15. What is the role of batch normalization?
Answer: Batch normalization normalizes the input of each layer to improve stability and convergence during training. Benefits: reduces internal covariate shift, allows for higher learning rates, and acts as a regularizer, reducing the need for dropout.

16. What is a recurrent neural network (RNN), and how does it handle sequential data?
Answer: RNNs are designed to process sequences of data by maintaining a hidden state that captures information about previous time steps. They handle dependencies in data like time series, text, or speech. Variants like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) address issues like vanishing gradients.

17. What is the purpose of dropout in deep learning?
Answer: Dropout is a regularization technique that randomly sets a fraction of neurons to zero during training. It prevents overfitting by introducing noise and encourages the network to learn more robust features.

18. What are GANs (Generative Adversarial Networks)?
Answer: GANs are neural networks consisting of two components: a generator, which creates fake data resembling real data, and a discriminator, which distinguishes between real and fake data. They are trained together, improving the generator's ability to create realistic data. Applications: image generation, style transfer, and data augmentation.
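As an illustration of the encoder/decoder idea from Q14, here is a minimal autoencoder sketch in PyTorch. The 784-dimensional input (e.g., a flattened 28x28 image), the 32-dimensional bottleneck, and the fake mini-batch are illustrative assumptions rather than values from the notes.

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch (Q14): encoder compresses, decoder reconstructs.
class Autoencoder(nn.Module):
    def __init__(self, in_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)           # compress to a latent code
        return self.decoder(z)        # reconstruct the input

model = Autoencoder()
x = torch.rand(16, 784)               # a fake mini-batch
recon = model(x)
loss = nn.functional.mse_loss(recon, x)   # reconstruction loss
loss.backward()
```

Training simply minimizes the reconstruction loss; the learned latent code is what gets reused for dimensionality reduction or anomaly detection.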
19. What is the difference between supervised, unsupervised, and reinforcement learning?
Answer: Supervised learning: the model learns from labeled data (e.g., classification, regression). Unsupervised learning: the model identifies patterns in unlabeled data (e.g., clustering, dimensionality reduction). Reinforcement learning: the model learns by interacting with the environment and receiving feedback in the form of rewards or penalties.

20. What are attention mechanisms in deep learning?
Answer: Attention mechanisms allow the model to focus on relevant parts of the input while making predictions. Example: in machine translation, the attention mechanism helps the model focus on specific words in the source sentence while translating. Applications: Transformer models like BERT and GPT, image captioning, and text summarization.

21. What are the main components of a convolutional neural network (CNN)?
Answer: Convolutional layers extract features by applying filters over the input. Pooling layers reduce the spatial dimensions of feature maps (e.g., max pooling). Fully connected layers combine high-level features for classification or regression. Dropout layers prevent overfitting and improve generalization.

22. What is the difference between a feedforward neural network and a recurrent neural network?
Answer: A feedforward neural network (FNN) processes input data in one direction, without loops; it is ideal for tasks like image recognition. A recurrent neural network (RNN) processes sequential data with feedback loops to maintain memory; it is used for time-series and language modeling.

23. What are LSTMs and GRUs? How are they different?
Answer: LSTMs (Long Short-Term Memory) use gates (input, forget, output) to maintain long-term dependencies in sequences. GRUs (Gated Recurrent Units) are a simplified version of LSTMs, combining the forget and input gates into one update gate. GRUs are computationally faster, while LSTMs handle complex dependencies better.

24. What is the difference between parameterized and non-parameterized layers?
Answer: Parameterized layers contain trainable parameters (e.g., Dense, Convolutional layers). Non-parameterized layers do not contain trainable parameters but modify data (e.g., Activation, Pooling layers).

25. What is the exploding gradient problem, and how is it mitigated?
Answer: Exploding gradients occur when large gradient values cause instability during training. Solutions: gradient clipping (restrict gradients to a maximum value), better initialization methods, and architectures like LSTMs/GRUs for sequential data.

26. What is the purpose of the softmax function?
Answer: Softmax converts raw scores (logits) into probabilities that sum to 1. It is used in the output layer for multi-class classification. Formula: softmax(x_i) = exp(x_i) / sum_j exp(x_j).

27. What is the difference between supervised pretraining and self-supervised learning?
Answer: Supervised pretraining: the model is trained on a related labeled dataset, then fine-tuned on the target dataset. Self-supervised learning: the model generates pseudo-labels from the data itself (e.g., predicting masked tokens in BERT) and learns representations without explicit labels.

28. What is the Transformer architecture, and how does it work?
Answer: The Transformer is a deep learning architecture designed for sequence-to-sequence tasks. It uses a self-attention mechanism to focus on relevant parts of input sequences and positional encoding to maintain order in input sequences. It replaced RNNs for tasks like machine translation (e.g., BERT, GPT models).
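To make the attention idea from Q20 and Q28 concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. The sequence length, embedding size, and random projection matrices are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Minimal sketch of scaled dot-product self-attention (Q20/Q28).
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))       # token embeddings

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d_model)           # similarity of each token pair
weights = softmax(scores, axis=-1)            # each row sums to 1
output = weights @ V                          # weighted mix of value vectors
print(weights.round(2))
```

Multi-head attention simply runs several copies of this computation with smaller projections and concatenates the results.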
29. What are the main challenges in training deep neural networks?
Answer: Vanishing/exploding gradients, overfitting on training data, high computational cost, difficulty in hyperparameter tuning, and data scarcity or imbalance.

30. What is the difference between model-based and data-based parallelism in deep learning?
Answer: Model-based parallelism splits the model across multiple devices (e.g., splitting the layers of a large neural network). Data-based parallelism splits the data into batches processed in parallel across devices.

31. What is transfer learning, and why is it important in deep learning?
Answer: Transfer learning involves using a pre-trained model on a related task and fine-tuning it for a target task. Benefits: reduces training time, requires less data for the target task, and leverages features learned from a larger dataset (e.g., ImageNet).

32. What is the purpose of an activation function in a neural network?
Answer: Activation functions introduce non-linearity into the network, enabling it to learn complex patterns. Common activation functions: ReLU (Rectified Linear Unit): max(0, x). Sigmoid: outputs values between 0 and 1. Tanh: outputs values between -1 and 1. Softmax: converts outputs into probabilities.

33. What is knowledge distillation in deep learning?
Answer: Knowledge distillation transfers knowledge from a large, complex model (teacher) to a smaller, simpler model (student) without significant performance loss. Steps: train the teacher model, then use the teacher's soft predictions to train the student.

34. What is the role of learning rate scheduling in training deep learning models?
Answer: Learning rate scheduling adjusts the learning rate during training to balance convergence speed and stability. Types of schedules: step decay (reduce the learning rate at fixed intervals), exponential decay (multiply the learning rate by a factor at each step), and cyclic learning rates (oscillate the learning rate within a range).

35. What are the differences between instance normalization, batch normalization, and layer normalization?
Answer: Batch normalization normalizes activations across a batch of data; it is useful for training stability. Instance normalization normalizes activations for each sample; it is often used in style transfer tasks. Layer normalization normalizes across the features of each sample; it is effective for RNNs and transformer architectures.

36. What is the vanishing gradient problem, and how do activation functions like ReLU address it?
Answer: Vanishing gradients occur when gradients shrink exponentially during backpropagation, preventing effective weight updates. ReLU avoids vanishing gradients by allowing gradients to pass unchanged for positive values, since its derivative is either 0 or 1.

37. What are the differences between Adam and SGD optimizers?
Answer: SGD (Stochastic Gradient Descent) updates weights using the gradient of the loss function and converges more slowly. Adam (Adaptive Moment Estimation) combines momentum and adaptive learning rates for faster convergence and improved stability.

38. What are attention heads in the transformer model?
Answer: Attention heads in transformers allow the model to focus on different parts of the input simultaneously. Multi-head attention splits the queries, keys, and values into multiple parts, computes attention independently, and combines the results for better contextual understanding.
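As a small illustration of the scheduling idea in Q34, here is a sketch using PyTorch's built-in step-decay scheduler. The toy model, step size, and decay factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of learning rate scheduling (Q34): step decay halves the
# learning rate every 10 epochs.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... the actual training loop over batches would go here ...
    optimizer.step()                  # normally called once per batch
    scheduler.step()                  # advance the schedule once per epoch
    if epoch % 10 == 0:
        print(epoch, optimizer.param_groups[0]["lr"])
```

Swapping in ExponentialLR or a cyclic scheduler changes only the scheduler line; the training loop stays the same.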
39. What is the difference between gradient clipping and gradient normalization?
Answer: Gradient clipping limits the magnitude of gradients to a pre-defined threshold to prevent exploding gradients. Gradient normalization scales gradients to have a consistent magnitude, improving training stability.

40. What is the difference between early stopping and checkpointing in training?
Answer: Early stopping stops training when performance on a validation set stops improving, preventing overfitting. Checkpointing saves the model weights periodically during training; it is useful for recovering from interruptions or selecting the best-performing model.

41. What is the difference between the encoder and decoder in sequence-to-sequence models?
Answer: The encoder processes the input sequence and encodes it into a fixed-length vector or context. The decoder takes the encoded context and generates the output sequence step by step. Example: machine translation (e.g., English to French).

42. What is the role of positional encoding in transformers?
Answer: Transformers do not process data sequentially, so positional encoding is added to the input embeddings to provide information about the order of tokens. Positional encodings are sinusoidal functions of different frequencies.

43. What are the challenges of deploying deep learning models in production?
Answer: High inference latency and memory usage, ensuring model robustness to real-world data, scalability under high traffic, maintaining model versioning and reproducibility, and monitoring performance drift over time.

44. What is Layerwise Relevance Propagation (LRP)?
Answer: LRP is an explainability technique for neural networks. It decomposes the output prediction back to the input features to show their relevance. It helps interpret model decisions and is used in sensitive domains like healthcare.

45. What is the difference between semantic segmentation and instance segmentation?
Answer: Semantic segmentation classifies each pixel of an image into a category but does not differentiate between individual objects. Instance segmentation identifies individual objects of the same class and segments them separately.

46. What is a dilated convolution, and where is it used?
Answer: A dilated convolution (also called atrous convolution) expands the receptive field of convolution filters by inserting spaces between kernel elements. It is used in semantic segmentation (e.g., DeepLab) and in audio and time-series analysis.

47. What are the benefits of using cosine similarity over dot product for measuring vector similarity?
Answer: Cosine similarity measures the cosine of the angle between two vectors, focusing on orientation rather than magnitude. Benefits: better for normalized embeddings, and prevents large magnitude differences from dominating the similarity score.

48. What is zero-shot learning, and how does it work?
Answer: Zero-shot learning enables a model to make predictions for classes it has not seen during training. Mechanism: it leverages a shared semantic space (e.g., word embeddings) to transfer knowledge from seen to unseen classes.

49. What is a Siamese network, and where is it used?
Answer: A Siamese network uses two identical subnetworks to compare inputs by learning a similarity metric. Applications: face verification, one-shot learning, and signature verification.

50. What is the purpose of weight initialization in deep learning?
Answer: Proper weight initialization prevents vanishing or exploding gradients and accelerates convergence. Xavier initialization is suitable for activations like sigmoid or tanh. He initialization is designed for ReLU activation functions.
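Here is a minimal sketch of applying the Xavier and He initializations from Q50 with PyTorch's init utilities. The small architecture is an illustrative assumption, not something taken from the notes.

```python
import torch.nn as nn

# Minimal sketch of weight initialization (Q50).
model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10), nn.Tanh(),
)

for layer in model:
    if isinstance(layer, nn.Linear):
        # He (Kaiming) initialization pairs well with ReLU hidden layers.
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        nn.init.zeros_(layer.bias)

# For the final layer feeding a tanh, Xavier initialization is the usual choice:
nn.init.xavier_uniform_(model[4].weight)
```

The point is simply that the variance of the initial weights is matched to the activation function, which keeps signal and gradient magnitudes stable across layers.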
51. What are vanishing and exploding gradients, and how do they impact deep learning models?
Answer: Vanishing gradients: gradients become very small, causing weights to update slowly and halting learning. Exploding gradients: gradients become very large, leading to unstable updates and possible divergence. Solutions: use activation functions like ReLU, implement gradient clipping, and use batch normalization or better initialization methods like He initialization.

52. What are the differences between data augmentation and data synthesis?
Answer: Data augmentation applies transformations to existing data (e.g., rotations, flips, noise); it enhances diversity without altering the class distribution. Data synthesis generates entirely new data using techniques like GANs or simulations; it is useful for handling imbalanced or rare classes.

53. What are the key differences between RNNs, GRUs, and LSTMs?
Answer: RNNs process sequential data but suffer from vanishing gradients on long sequences. GRUs (Gated Recurrent Units) are simplified LSTMs with fewer parameters; they combine the forget and input gates. LSTMs (Long Short-Term Memory) use separate forget, input, and output gates to handle long-term dependencies effectively.

54. What is the purpose of gradient accumulation in deep learning?
Answer: Gradient accumulation splits the batch into smaller micro-batches to compute gradients iteratively, then updates the weights after processing all micro-batches. Benefits: reduces memory usage for large models or small GPUs, and simulates larger batch sizes for better convergence. (See the sketch after question 60.)

55. What are capsule networks, and how do they differ from CNNs?
Answer: Capsule networks model spatial relationships between features using vectors instead of the scalars used by CNNs. Advantages: better handling of spatial hierarchies, and preservation of orientation and pose information. Example: image classification with fewer training examples.

56. What is deep reinforcement learning (DRL), and what are its applications?
Answer: DRL combines deep learning and reinforcement learning, where agents learn optimal policies through trial and error. Applications: game playing (e.g., AlphaGo, Dota 2), robotics and control systems, and autonomous vehicles.

57. How does dropout work in deep learning, and why is it effective?
Answer: Dropout randomly disables a fraction of neurons during training, preventing overfitting by reducing co-dependencies among neurons. During inference, the full network is used with scaled-down weights.

58. What is label smoothing, and why is it used?
Answer: Label smoothing replaces hard labels (e.g., 1 or 0) with smoothed probabilities (e.g., 0.9 and 0.1). Benefits: reduces overconfidence in predictions and helps the model generalize better. Example: common in image classification with cross-entropy loss.

59. What is the difference between dense and sparse embeddings?
Answer: Dense embeddings: low-dimensional, continuous-valued vectors (e.g., Word2Vec, BERT); compact and efficient for downstream tasks. Sparse embeddings: high-dimensional, mostly zero vectors (e.g., one-hot encoding); inefficient but straightforward.

60. What is the difference between teacher forcing and free-running in sequence models?
Answer: Teacher forcing: during training, the model uses the ground truth as input for the next time step; this speeds up convergence but can lead to exposure bias. Free-running: during inference, the model uses its own predictions as inputs; this better simulates real-world usage.
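The following sketch illustrates the gradient accumulation loop described in Q54 in PyTorch. The toy model, random data, and accumulation factor of 4 are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of gradient accumulation (Q54): gradients from several
# micro-batches are summed before a single optimizer step.
model = nn.Linear(20, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4                        # simulate a 4x larger batch

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(8, 20)             # micro-batch of 8 samples
    y = torch.randint(0, 2, (8,))
    loss = loss_fn(model(x), y) / accum_steps   # scale so the sum is an average
    loss.backward()                    # gradients accumulate in .grad

    if (step + 1) % accum_steps == 0:
        optimizer.step()               # one update per 4 micro-batches
        optimizer.zero_grad()
```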
61. What is the purpose of skip connections in deep neural networks?
Answer: Skip connections, like those used in ResNet, allow gradients to flow more easily through the network, mitigating the vanishing gradient problem. They also enable the model to learn identity mappings for shallow layers.

62. What are adversarial examples, and how do they affect deep learning models?
Answer: Adversarial examples are inputs deliberately perturbed to fool a model into making incorrect predictions. They expose vulnerabilities in deep learning models and highlight the need for robust training techniques like adversarial training.

63. How does the attention mechanism work in deep learning models?
Answer: The attention mechanism assigns different weights to different parts of the input sequence, focusing on the most relevant features for a specific task. For example, in translation tasks, attention aligns words between the source and target languages.

64. What are variational autoencoders (VAEs), and how do they differ from standard autoencoders?
Answer: VAEs generate new data by learning a probabilistic latent space. Unlike standard autoencoders, they optimize a variational lower bound using both a reconstruction loss and a KL divergence term to ensure smooth latent space representations.

65. What is the difference between transfer learning and fine-tuning?
Answer: Transfer learning: reusing a pre-trained model's features without modifying its weights. Fine-tuning: adapting a pre-trained model to a specific task by training some or all of its layers with a new dataset.

66. What is the concept of gradient penalty in GANs?
Answer: A gradient penalty is used to enforce the Lipschitz continuity condition in Wasserstein GANs. It adds a penalty to the loss function based on the gradient norm of the discriminator, stabilizing training.

67. What is self-supervised learning, and what are its applications?
Answer: Self-supervised learning creates labels from the raw data itself to train models without manual annotations. Applications: pre-training models like BERT and SimCLR, and tasks across computer vision and NLP.

68. What is label imbalance, and how can it be addressed in deep learning?
Answer: Label imbalance occurs when the classes in a dataset are not equally represented. Solutions: oversampling minority classes, undersampling majority classes, and using class weights in the loss function.

69. What are group normalization and layer normalization?
Answer: Group normalization normalizes activations within groups of channels and is effective for small batch sizes. Layer normalization normalizes across all features of a single data point and is common in NLP tasks.

70. What are hyperparameter optimization techniques in deep learning?
Answer: Grid search, random search, Bayesian optimization, and Hyperband or population-based training. Tools like Optuna and Ray Tune help automate this process.

71. What is knowledge distillation, and why is it useful?
Answer: Knowledge distillation transfers knowledge from a large, complex model (teacher) to a smaller, faster model (student). It improves inference speed while retaining high accuracy.

72. What is the difference between BatchNorm, LayerNorm, and InstanceNorm?
Answer: BatchNorm normalizes over a mini-batch of samples. LayerNorm normalizes across the features of a single sample. InstanceNorm normalizes across spatial dimensions for each sample and is commonly used in style transfer.
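To illustrate the skip connections from Q61, here is a minimal ResNet-style residual block in PyTorch. The channel count, kernel sizes, and input shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of a residual (skip-connection) block (Q61).
class ResidualBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)     # skip connection: add the input back

block = ResidualBlock()
x = torch.randn(2, 64, 32, 32)         # batch of 2 feature maps
print(block(x).shape)                  # torch.Size([2, 64, 32, 32])
```

Because the block only has to learn a residual on top of the identity, gradients can flow through the `+ x` path even when the convolutional path saturates.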
73. What is spectral normalization, and why is it used in GANs?
Answer: Spectral normalization constrains the Lipschitz constant of the discriminator by normalizing its weight matrices. It stabilizes training and prevents mode collapse.

74. What is the SWA (Stochastic Weight Averaging) technique?
Answer: SWA averages weights from multiple SGD steps during training. It improves generalization by converging to flat minima in the loss landscape.

75. What is the Softmax bottleneck problem in language models?
Answer: The Softmax bottleneck limits the expressiveness of language models due to its restricted output distribution. Techniques like adaptive Softmax and Mixture of Softmaxes help address this issue.

76. How does the transformer architecture handle long sequences efficiently?
Answer: Transformers use self-attention mechanisms that process sequences in parallel, unlike RNNs. They can model long-range dependencies without sequential computation.

77. What is a mixture of experts (MoE) model?
Answer: An MoE model combines several sub-models (experts) and uses a gating mechanism to assign weights to each expert for a given input. It is computationally efficient for scaling large models.

78. What are the main differences between Mask R-CNN and Faster R-CNN?
Answer: Faster R-CNN detects objects and generates bounding boxes. Mask R-CNN extends Faster R-CNN by adding a mask head for pixel-wise segmentation.

79. What is the purpose of cosine annealing in learning rate scheduling?
Answer: Cosine annealing gradually decreases the learning rate following a cosine curve. It helps achieve better convergence by encouraging the model to settle into a minimum slowly.

80. What is the difference between active learning and semi-supervised learning?
Answer: Active learning identifies the most informative samples to label from an unlabeled pool. Semi-supervised learning combines a small labeled dataset with a large unlabeled dataset to improve performance.

81. What is the purpose of gradient clipping, and when is it used?
Answer: Gradient clipping limits the gradient magnitude to prevent exploding gradients; it is commonly used in RNNs and deep networks. It stabilizes training when gradients become excessively large.

82. What is focal loss, and why is it useful?
Answer: Focal loss is designed to address class imbalance by down-weighting the loss for well-classified examples and focusing on hard-to-classify examples. Formula: FL(p_t) = -(1 - p_t)^gamma * log(p_t), where gamma controls the focusing effect.

83. What is dilated convolution, and how does it differ from standard convolution?
Answer: Dilated convolution increases the receptive field without increasing the number of parameters by introducing spaces between kernel elements. It is useful in tasks like semantic segmentation.

84. What is weight regularization, and how does it work?
Answer: Weight regularization reduces overfitting by penalizing large weights. L1 regularization adds |w| to the loss. L2 regularization (weight decay) adds w^2 to the loss.

85. What are the advantages of using mixed precision training?
Answer: It reduces memory usage, increases training speed, and achieves comparable accuracy by using lower precision (e.g., FP16) for calculations and higher precision (e.g., FP32) for key operations.

86. How does early stopping prevent overfitting?
Answer: Early stopping halts training when performance on a validation set stops improving. It prevents the model from overfitting to the training data by stopping at an optimal point.
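Here is a minimal implementation of the binary focal loss formula quoted in Q82. The gamma value of 2.0 and the example logits are illustrative assumptions, not values from the notes.

```python
import torch

# Minimal sketch of binary focal loss (Q82):
# FL(p_t) = -(1 - p_t)^gamma * log(p_t).
def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    probs = torch.sigmoid(logits)
    # p_t is the predicted probability assigned to the true class.
    p_t = torch.where(targets == 1, probs, 1 - probs)
    loss = -((1 - p_t) ** gamma) * torch.log(p_t.clamp(min=1e-8))
    return loss.mean()

logits = torch.tensor([2.5, -1.0, 0.3])    # raw scores
targets = torch.tensor([1.0, 0.0, 1.0])    # binary labels
print(focal_loss(logits, targets))         # confident correct samples contribute little
```

With gamma = 0 this reduces to ordinary cross-entropy; larger gamma shifts the loss toward the hard, misclassified examples.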
87. What is the difference between supervised pretraining and self-supervised pretraining?
Answer: Supervised pretraining: pretraining on a labeled dataset before fine-tuning on a specific task. Self-supervised pretraining: pretraining using self-generated labels without human annotations, commonly used in NLP and vision.

88. What is neural architecture search (NAS)?
Answer: NAS is an automated process for finding optimal neural network architectures. Techniques include reinforcement learning, evolutionary algorithms, and gradient-based methods.

89. What is an encoder-decoder architecture?
Answer: An encoder-decoder is used in sequence-to-sequence tasks. The encoder compresses the input into a latent representation, and the decoder generates the output from that representation. Examples: translation and summarization.

90. What is the purpose of cosine similarity in NLP tasks?
Answer: Cosine similarity measures the similarity between two vectors by computing the cosine of the angle between them. It is commonly used for comparing word or sentence embeddings.

91. What is the difference between Seq2Seq models with and without attention?
Answer: Without attention: the entire input is encoded into a fixed-length vector, limiting performance on long sequences. With attention: the model dynamically focuses on relevant parts of the input sequence for better performance.

92. How does transfer learning benefit small datasets?
Answer: Transfer learning leverages features learned from a large dataset, reducing the need for extensive data. It avoids overfitting and improves generalization on small datasets.

93. What are transposed convolutions, and where are they used?
Answer: Transposed (deconvolutional) convolutions increase spatial resolution and are often used in generative tasks like image super-resolution or semantic segmentation.

94. What are GNNs (Graph Neural Networks), and where are they applied?
Answer: GNNs work on graph-structured data, propagating information between nodes. Applications: social networks, molecular analysis, and recommendation systems.

95. What is the purpose of masked language models (MLMs)?
Answer: MLMs, like BERT, predict missing words in a sentence by masking parts of the input. This bidirectional understanding improves performance on NLP tasks.

96. What is layer-wise learning rate scaling?
Answer: Layer-wise learning rate scaling assigns different learning rates to different layers, often smaller rates for pre-trained layers and larger rates for newly added layers.

97. What is knowledge graph embedding?
Answer: Knowledge graph embedding represents the entities and relationships in a knowledge graph as low-dimensional vectors. Applications: question answering and recommendation systems.

98. What is a feature pyramid network (FPN)?
Answer: An FPN builds a multi-scale feature hierarchy by combining low-resolution, semantically strong features with high-resolution, spatially precise features. It is common in object detection.

99. What is the difference between a sparse and a dense layer?
Answer: A sparse layer uses sparse matrices to save memory and computational resources. A dense layer is fully connected, requiring more resources but capturing all feature interactions.

100. What is capsule routing in capsule networks?
Answer: Capsule routing ensures that lower-layer capsules send their outputs to higher-layer capsules based on agreement scores. This process preserves spatial hierarchies in the data.
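As a small illustration of the layer-wise learning rates in Q96, here is a sketch using PyTorch optimizer parameter groups. The stand-in "backbone" and "head" modules and the two learning rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of layer-wise learning rates (Q96) via parameter groups:
# a pre-trained backbone gets a small rate, a new head gets a larger one.
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # stands in for a pre-trained encoder
head = nn.Linear(64, 10)                                   # newly added classifier

optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},  # gentle updates for pre-trained layers
    {"params": head.parameters(), "lr": 1e-3},      # larger updates for the new head
])

for group in optimizer.param_groups:
    print(group["lr"])
```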
101. What is the vanishing gradient problem, and how is it mitigated?
Answer: The vanishing gradient problem occurs when gradients become extremely small during backpropagation, preventing effective weight updates in earlier layers. This often happens in deep networks with activation functions like sigmoid or tanh. Mitigation strategies: 1. ReLU activation functions: ReLU avoids vanishing gradients by having a constant gradient for positive values. 2. Batch normalization: normalizes layer inputs to stabilize and maintain gradients. 3. Residual connections: allow gradients to flow directly through skip connections in deep networks (e.g., ResNet).

102. Explain the concept of teacher forcing in RNNs. Why is it useful?
Answer: Teacher forcing is a technique used in sequence-to-sequence models where the actual target output is used as input to the next time step during training, instead of the predicted output. Advantage: it speeds up convergence by providing ground-truth inputs. Challenges: it introduces exposure bias (the discrepancy between training and inference), and at inference the model may struggle without ground-truth inputs; scheduled sampling can gradually reduce the reliance on teacher forcing.

103. What is the difference between label smoothing and hard labels?
Answer: Hard labels assign a one-hot encoding to the target classes (e.g., [1, 0, 0]). Label smoothing modifies hard labels by assigning a small probability to incorrect classes to make the model less confident in its predictions, for example [0.9, 0.05, 0.05]. Advantages of label smoothing: improves generalization by preventing overconfidence and mitigates overfitting, especially on noisy data.

104. What is knowledge distillation, and how is it applied?
Answer: Knowledge distillation transfers knowledge from a large, complex model (teacher) to a smaller, simpler model (student). How it works: the student model is trained to mimic the teacher's softened output probabilities instead of the hard labels. Loss function:
L = (1 - a) * cross_entropy(y, z_student) + a * KL_divergence(softmax(z_teacher / T), softmax(z_student / T))
where T is the temperature and a balances the two loss terms. Applications: deploying efficient models on resource-constrained devices, and model compression.

105. What is gradient centralization, and why is it used?
Answer: Gradient centralization normalizes gradients by subtracting their mean before updating the weights. Benefits: improves optimization stability, helps models converge faster, and reduces variance in gradients, especially in deep networks. It is commonly used in conjunction with optimizers like SGD or Adam.

106. What is the difference between transductive and inductive learning?
Answer: Transductive learning predicts labels only for the given test data, without generalizing to unseen data; example: graph-based semi-supervised learning. Inductive learning learns a general function or model that can make predictions on unseen data; example: most deep learning models like CNNs or RNNs.

107. What are adversarial examples, and how do you defend against them?
Answer: Adversarial examples are inputs deliberately perturbed to deceive a model into making incorrect predictions while appearing unchanged to humans. Defenses: 1. Adversarial training: train the model on adversarially perturbed data. 2. Gradient masking: obfuscate gradients to make it harder for attackers to compute perturbations. 3. Input preprocessing: techniques like JPEG compression or Gaussian noise addition can reduce adversarial effects.
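The following sketch implements the distillation loss quoted in Q104 in PyTorch. The temperature T, mixing weight a, and random logits are illustrative assumptions; the notes' formula omits the T^2 scaling that many implementations add to the KL term, so the sketch follows the notes as written.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the knowledge distillation loss from Q104.
def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, a: float = 0.5):
    # Hard-label term: ordinary cross-entropy against the true labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between softened teacher and student
    # distributions (kl_div expects log-probabilities as its first argument).
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean")
    return (1 - a) * ce + a * kl

student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```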
108. What are transformer models, and how do they differ from RNNs?
Answer: Transformer models use self-attention mechanisms to process sequences, unlike RNNs, which process inputs sequentially. Key differences: 1. Parallelism: transformers process all input tokens simultaneously, while RNNs process them one at a time. 2. Long-term dependencies: transformers capture long-range dependencies better using attention. 3. Efficiency: transformers are more efficient on GPUs due to parallelization but require more memory. Examples: BERT, GPT, T5.

109. What is the purpose of positional encoding in transformers?
Answer: Positional encoding allows transformers, which lack inherent sequence awareness, to incorporate the order of tokens in a sequence. Formula for sinusoidal encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where pos is the position, i is the dimension index, and d is the embedding size.

110. What is the concept of layer normalization, and how does it differ from batch normalization?
Answer: Layer normalization normalizes inputs across the features within a single training example and is commonly used in NLP and transformers. Formula: y = (x - mean) / sqrt(variance + epsilon). Batch normalization normalizes inputs across the batch for each feature and is common in CNNs. Differences: batch normalization depends on the batch size while layer normalization does not, and layer normalization is more effective in sequence-based models.

111. What are attention mechanisms, and why are they important?
Answer: Attention mechanisms allow models to focus on specific parts of the input sequence when making predictions, assigning varying importance (weights) to different tokens or elements. Types of attention: 1. Self-attention: captures relationships within a single sequence. 2. Cross-attention: used in sequence-to-sequence models to relate input and output sequences. Importance: captures long-range dependencies, enhances interpretability by showing what the model is focusing on, and forms the backbone of transformer architectures.

112. What are capsule networks, and how do they differ from traditional CNNs?
Answer: Capsule networks are designed to model spatial hierarchies by encoding the pose and orientation of features in addition to their presence. Key differences: capsules are groups of neurons that represent the probability and parameters of detected objects, and capsules communicate with higher-level capsules through iterative routing-by-agreement, unlike the fixed pooling in CNNs. Advantages: better at understanding hierarchical relationships and more robust to changes in orientation and spatial distortions.

113. What is the difference between data augmentation and data synthesis?
Answer: Data augmentation modifies existing data to increase diversity (e.g., flipping, cropping, adding noise) and is commonly used for regularization. Data synthesis generates entirely new data points from a model (e.g., GANs, VAEs). Use cases: augmentation improves robustness without drastically altering the dataset size; synthesis is useful for creating data in underrepresented categories.

114. What are GANs, and how do they work?
Answer: Generative Adversarial Networks (GANs) consist of two models: a generator, which creates fake data, and a discriminator, which distinguishes between real and fake data. Training process: the generator learns to create realistic data by fooling the discriminator, the discriminator improves by distinguishing fake from real data, and both models play a minimax game. Loss function:
min_G max_D E[log(D(real))] + E[log(1 - D(fake))]
Applications: image generation, style transfer, and super-resolution.
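To make the sinusoidal formula from Q109 concrete, here is a minimal NumPy implementation. The sequence length and (even) embedding size are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of sinusoidal positional encoding (Q109).
def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cos
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)     # (50, 16); added element-wise to the token embeddings
```

Because each dimension oscillates at a different frequency, every position gets a distinct pattern, and relative offsets correspond to fixed linear transformations of the encoding.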
115. What is the role of the softmax function in neural networks?
Answer: The softmax function converts raw scores (logits) into probabilities, ensuring they sum to 1. It is commonly used in the output layer of classification tasks. Formula: softmax(x_i) = exp(x_i) / sum_j exp(x_j). Advantages: provides interpretable class probabilities, highlights the most likely class while suppressing others, and supports loss calculation with cross-entropy.

116. What are variational autoencoders (VAEs), and how are they different from standard autoencoders?
Answer: Variational autoencoders are probabilistic models that learn a latent representation as a distribution (mean and variance) rather than a deterministic vector. Differences: standard autoencoders compress the input into fixed latent vectors, whereas VAEs use a probabilistic approach to generate diverse outputs. Loss function: L = reconstruction_loss + KL_divergence(latent || prior). Applications: image synthesis, anomaly detection, and latent space exploration.

117. What is transfer learning, and why is it effective?
Answer: Transfer learning involves reusing a pre-trained model on a related task to improve performance and reduce training time. Effectiveness: pre-trained models like ResNet and BERT have already learned general features, reducing the need for large labeled datasets, and fine-tuning adapts these features to specific tasks. Examples: using ImageNet-trained CNNs for medical imaging, or adapting BERT for sentiment analysis.

118. What is the difference between gradient clipping and gradient normalization?
Answer: Gradient clipping limits the magnitude of gradients to prevent exploding gradients:
if ||gradient|| > threshold: gradient = gradient * (threshold / ||gradient||)
Gradient normalization adjusts gradients by dividing them by their norm, ensuring consistency across layers. Use cases: clipping is common in RNNs with vanishing/exploding gradients; normalization is used to keep optimization smooth.

119. How do you calculate the receptive field of a convolutional layer?
Answer: The receptive field is the area of input pixels that influence a particular output feature. Formula for layer n: R_n = R_(n-1) + (K_n - 1) * S_(n-1), where R_n is the receptive field, K_n is the kernel size of the current layer, and S_(n-1) is the product of the strides of all preceding layers (the cumulative stride). Importance: it determines the spatial context captured by a convolutional layer.

120. What are the limitations of backpropagation?
Answer: Vanishing/exploding gradients can hinder optimization in deep networks. High computation cost: it requires significant memory and computation for large networks. Dependence on labeled data: backpropagation requires labeled datasets, which can be expensive to acquire. Non-convexity: optimization often converges to local minima or saddle points. Solutions: advanced optimizers (Adam, RMSProp), initialization techniques, and regularization strategies.
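Here is a small sketch of the receptive-field recurrence from Q119 applied to a stack of convolutional layers. The example kernel sizes and strides are illustrative assumptions, not values from the notes.

```python
# Minimal sketch of the receptive-field calculation (Q119).
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, in input-to-output order."""
    rf = 1            # a single input pixel sees itself
    jump = 1          # cumulative stride (spacing between output positions)
    for kernel, stride in layers:
        rf = rf + (kernel - 1) * jump
        jump = jump * stride
    return rf

# Three 3x3 convolutions, the second with stride 2:
print(receptive_field([(3, 1), (3, 2), (3, 1)]))   # -> 9 input pixels
```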

Amar Sharma, AI Engineer. Follow me on LinkedIn for more informative content.