Total 40 Questions and Answers Added Now!
Gen AI
1) What is attention dropout in Transformer models? Why is it important during training
and how does it affect generalization?
2) Compare LoRA (Low-Rank Adaptation) and Prefix Tuning. In what scenarios would one
outperform the other?
Answer: LoRA introduces low-rank learnable matrices into attention layers, enabling efficient
fine-tuning with fewer parameters. Prefix Tuning, on the other hand, prepends learnable vectors
(prefixes) to the input sequence, influencing the model's behavior via soft prompts.
● LoRA is ideal for applications requiring high precision and where access to model
internals (weights) is available.
● Prefix Tuning is better when fine-tuning access is limited and soft control via prompts is
sufficient.
LoRA generally outperforms Prefix Tuning in terms of convergence and quality, but Prefix
Tuning offers easier integration with API-only models.
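A minimal PyTorch sketch of the LoRA idea (the LoRALinear class, rank, and alpha below are illustrative assumptions, not any library's API): the pretrained weight stays frozen and a trainable low-rank update B·A is added on top of it.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T   (only A and B are trained)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(2, 10, 768))   # same output shape as the wrapped layer
```

In practice this wrapping is applied to the attention projection matrices, usually through a library such as PEFT rather than by hand.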
3) What is the difference between cosine similarity and dot product in vector search?
Why does the choice matter in RAG systems?
Answer: Cosine similarity measures only the angle between two vectors (it normalizes away magnitude), while the dot product is sensitive to both direction and magnitude.
● Use cosine similarity when embedding vectors may have inconsistent magnitudes
(e.g., across domains).
● Use dot product when embeddings are uniformly scaled (like BERT or dense passage
retrieval embeddings) and speed is a priority.
Choice affects relevance ranking in retrieval and ultimately the quality of generation.
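A small NumPy illustration of why the choice matters (the toy vectors are made up): with unnormalized embeddings, a high-magnitude but less relevant vector can win under dot product while cosine similarity still prefers the closer direction.

```python
import numpy as np

def dot_score(a, b):
    return float(np.dot(a, b))

def cosine_score(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 2.0, 3.0])
doc_a = np.array([1.0, 2.0, 3.0])        # same direction, modest magnitude
doc_b = np.array([20.0, 2.0, 0.0])       # different direction, large magnitude

print(dot_score(query, doc_a), dot_score(query, doc_b))        # 14.0 vs 24.0: doc_b wins on magnitude
print(cosine_score(query, doc_a), cosine_score(query, doc_b))  # 1.0 vs ~0.32: doc_a wins on direction
```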
4) How do large models like GPT-4 handle context length beyond training limits using
attention mechanisms like ALiBi or Position Interpolation?
Answer: ALiBi (Attention with Linear Biases) adds fixed, head-specific linear penalties to attention scores based on query-key distance, allowing extrapolation to longer sequences without learned position embeddings. Position Interpolation instead rescales (compresses) position indices so that a longer sequence maps back into the position range seen during training, typically followed by light fine-tuning.
These techniques allow pretrained models with limited context windows (e.g., 2K tokens) to
generalize to longer contexts (e.g., 16K+ tokens) without full retraining, enabling long-form
reasoning and document summarization tasks.
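A hedged sketch of how ALiBi-style biases can be computed (the slope schedule follows the geometric sequence described in the ALiBi paper; the helper below is illustrative, not taken from any model's source code).

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # head-specific slopes: a geometric sequence, e.g. 1/2, 1/4, ..., 1/256 for 8 heads
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # (j - i), clamped so that only keys in the past receive a distance penalty
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)
    return slopes[:, None, None] * distance.float()       # shape: (num_heads, seq_len, seq_len)

bias = alibi_bias(num_heads=8, seq_len=6)
# added to the attention logits before the softmax (together with the causal mask):
# scores = q @ k.transpose(-2, -1) / d_head ** 0.5 + bias
print(bias[0])   # head 0: zero on the diagonal, increasingly negative for distant past tokens
```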
5) Explain Mixture of Experts (MoE) in LLMs. What are the benefits and challenges of
using sparse activation in production models?
Answer: MoE involves splitting a model into multiple sub-models (experts), only activating a
subset during each forward pass. This allows scaling to billions of parameters with lower
computational cost.
Benefits:
● Large total parameter counts with roughly constant per-token compute.
● Experts can specialize in different domains or input types.
Challenges:
● Routing and load balancing across experts.
● Higher memory footprint and communication overhead in distributed serving.
● Less stable training than dense models.
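A toy top-2 routing layer to make the sparse-activation idea concrete (the class name and hyperparameters are illustrative; production MoE layers use fused, load-balanced kernels rather than this explicit dispatch loop).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)        # learned gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                     # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.k, dim=-1)              # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):                            # naive dispatch loop, for clarity
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
y = moe(torch.randn(16, 512))    # each token touches only 2 of the 8 experts
```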
6) How would you design a multi-tenant LLM API platform for enterprise clients to ensure
security, isolation, and performance?
7) Describe the trade-offs between GPU inference and using optimized CPU runtimes like
ONNX Runtime or Intel OpenVINO for LLM deployment.
Answer:
● GPU Inference:
○ Pros: High throughput, parallelization, ideal for large models.
○ Cons: Expensive, may require batching to amortize cost.
● CPU Inference (ONNX/OpenVINO):
○ Pros: Lower cost, simpler infra, good for edge/local deployments.
○ Cons: Lower throughput, unsuitable for models >1B parameters unless
quantized.
Choose GPU for real-time, high-load applications; CPU for cost-effective, on-device, or
small-scale tasks.
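A hedged ONNX Runtime sketch of the CPU path, assuming a model has already been exported to model.onnx and that its graph takes an input named input_ids (both are assumptions for illustration).

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",                              # assumed pre-exported model file
    providers=["CPUExecutionProvider"],        # CPU runtime; GPU would use CUDAExecutionProvider
)

input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
outputs = session.run(None, {"input_ids": input_ids})   # run all graph outputs
print(outputs[0].shape)                                  # e.g. per-token logits
```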
8) Explain how to implement a caching strategy for LLM outputs in a production pipeline
to reduce latency and cost.
Answer: Cache completed responses keyed by a hash of the prompt plus decoding parameters (temperature, max tokens) in Redis or an in-memory store, with TTL-based invalidation. Exact-match caching covers high-frequency Q&A; semantic caching (matching on embedding similarity) extends hits to paraphrased queries. This reduces latency and per-token cost and improves scalability.
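A sketch of the exact-match variant with redis-py; generate is a placeholder for the real model call, and the key scheme (prompt plus decoding parameters) is one reasonable choice, not a fixed convention.

```python
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def cached_completion(prompt: str, params: dict, generate, ttl_seconds: int = 3600) -> str:
    # key = hash of prompt + decoding params, so different temperatures don't collide
    key = "llm:" + hashlib.sha256(
        json.dumps({"prompt": prompt, "params": params}, sort_keys=True).encode()
    ).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit.decode()                  # cache hit: no model call, near-zero latency
    response = generate(prompt, **params)    # cache miss: call the model
    r.setex(key, ttl_seconds, response)      # expire stale answers after the TTL
    return response
```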
9) How would you design a feedback loop system that lets human users improve the
output of a generative AI model over time?
10) What are the key considerations when using GenAI for multilingual support in a
customer service chatbot?
Answer:
11) How do you implement knowledge distillation to compress a large LLM without losing
much performance? Answer: Knowledge distillation involves training a smaller model
(student) to replicate the behavior of a larger pre-trained model (teacher). In LLMs, this means
feeding the same input to both models and minimizing the loss between their output
distributions (typically logits). Key components include: temperature-scaled soft targets from the teacher, a KL-divergence (or soft cross-entropy) loss between teacher and student distributions, and usually a weighted hard-label cross-entropy term on the ground-truth tokens.
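A minimal sketch of that combined distillation loss in PyTorch (the temperature and weighting values below are illustrative).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft term: match the teacher's temperature-softened distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for the temperature
    # hard term: ordinary cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(4, 1000, requires_grad=True)   # (batch, vocab)
teacher_logits = torch.randn(4, 1000)
labels = torch.randint(0, 1000, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```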
12) Can you walk through the challenges and solutions in integrating an LLM into a
real-time production environment with low latency constraints? Answer: Key challenges
include: per-token generation latency, throughput under concurrent load, GPU memory pressure, and cold starts. Typical solutions: quantization and distillation to shrink the model, continuous batching and KV-cache reuse on the serving side, streaming partial responses, caching frequent queries, and autoscaling with sensible timeouts.
13) How does the Mixture of Experts (MoE) model architecture optimize large-scale LLM
inference? Answer: MoE introduces sparsity into LLMs by activating only a subset of the total
model’s “experts” (sub-networks) for each input. Benefits include:
● Scalability: total parameter count can grow enormously while per-token computational cost stays roughly constant.
● Flexibility: tokens are routed to different experts based on input semantics.
● Inference Efficiency: only 2–4 experts are active per forward pass, allowing massive models to be served efficiently.
Key challenges: load balancing across experts and routing inefficiencies.
14) How would you reduce hallucinations in a RAG-based GenAI system? Answer:
● Content quality: Improve the retrieval database with accurate, up-to-date documents.
● Retriever tuning: Use domain-specific embedding models.
● Prompt design: Explicitly condition the model to only answer based on retrieved docs.
● Verification: Include a post-response verification module or confidence score.
● User feedback loop: Collect and utilize feedback to refine retrieval and prompt policies.
15) How can you fine-tune a base model like LLaMA or Mistral on a specialized dataset
using LoRA, and what are the trade-offs? Answer: LoRA (Low-Rank Adaptation) injects
trainable low-rank matrices into specific layers of the transformer, enabling fine-tuning with fewer
parameters. Steps: load the base model (e.g., LLaMA or Mistral), define a LoRA config (rank, scaling alpha, dropout, target attention/MLP modules), train only the injected matrices on the specialized dataset, then either merge the adapters into the base weights or keep them as a small swappable file. Trade-offs: memory and compute requirements drop sharply and the base model stays intact, but quality can trail full fine-tuning under large domain shift, and results are sensitive to the chosen rank and target modules.
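A hedged sketch with Hugging Face transformers and peft; the checkpoint name and the target module names are examples and vary by architecture.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base_id = "mistralai/Mistral-7B-v0.1"            # example checkpoint
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (model-specific names)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of the base model
# ...then train with the usual Trainer or a custom loop on the specialized dataset.
```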
16) In what situations would you use adapters or prompt-tuning instead of full fine-tuning
of a large language model? Answer:
● Adapters: Ideal when updating large models for multiple tasks without altering the base
model. Example: multi-task settings across departments.
● Prompt-tuning: Best for quick experimentation or when compute is limited.
Use these approaches when:
● You need modularity across tasks.
● Computational resources are limited.
● You want to keep the base model unchanged for compatibility.
17) Explain token-level versus sentence-level perplexity and how these metrics impact
GenAI evaluation. Answer:
● Token-level perplexity: Measures how well a model predicts the next token. Lower is
better. Useful for language modeling.
● Sentence-level perplexity: perplexity computed over all the tokens in a sentence (the exponential of the mean per-token negative log-likelihood). More interpretable as a measure of sentence coherence.
Implication: lower perplexity doesn't always mean better output quality; complement it with metrics like BLEU, ROUGE, and human evaluation.
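A small sketch of the computation: perplexity is the exponential of the mean per-token negative log-likelihood, so a sentence-level score is just this quantity restricted to one sentence's tokens (random logits are used purely for shape illustration).

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (seq_len, vocab), targets: (seq_len,)
    nll = F.cross_entropy(logits, targets, reduction="mean")   # mean negative log-likelihood
    return float(torch.exp(nll))

logits = torch.randn(12, 1000)
targets = torch.randint(0, 1000, (12,))
print(perplexity(logits, targets))   # lower means the model found these tokens less surprising
```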
19) How do you implement an RLHF (Reinforcement Learning with Human Feedback)
pipeline practically and at scale? Answer:
21) What is a Bag-of-Words (BoW) model and what are its limitations in NLP? Answer:
BoW represents text by the frequency of words, disregarding grammar and word order. It's
simple and fast but fails to capture context, semantics, or word similarity. This often leads to
sparse and high-dimensional vectors, making it ineffective for tasks needing word meaning or
order.
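A quick scikit-learn illustration of the sparse, order-free representation on a toy corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
bow = CountVectorizer()
X = bow.fit_transform(docs)            # sparse document-term count matrix
print(bow.get_feature_names_out())
print(X.toarray())                     # raw counts only: word order is completely lost
```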
22) How does TF-IDF improve upon Bag-of-Words? Answer: TF-IDF (Term
Frequency-Inverse Document Frequency) weighs words based on how important they are to a
document relative to a corpus. It reduces the influence of common words like "the" and
highlights unique terms. However, it still ignores context and semantics, and suffers in synonym
handling.
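The same toy corpus with TfidfVectorizer, showing that words appearing in every document ("the", "sat", "on") receive a lower inverse-document-frequency than distinctive ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
# per-term IDF values: shared terms get the minimum, distinctive terms get more
print(dict(zip(tfidf.get_feature_names_out(), tfidf.idf_.round(2))))
```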
23) What are word embeddings and how do they address BoW limitations? Answer: Word
embeddings like Word2Vec, GloVe, and FastText represent words in continuous vector space,
capturing semantic similarity. Words with similar meanings have similar vectors. This enables
context-aware similarity and downstream learning, but static embeddings still lack contextual
understanding (e.g., “bank” in river vs finance).
24) Explain the architecture of Word2Vec and its training objective. Answer: Word2Vec
has two main architectures: CBOW (predict center word from context) and Skip-gram (predict
context from center word). It uses a shallow neural network to learn embeddings by maximizing
the probability of observed word-context pairs. Negative sampling or hierarchical softmax aids
efficient training.
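A hedged gensim sketch on a toy corpus: sg=1 selects Skip-gram (sg=0 would be CBOW) and negative=5 enables negative sampling; real training needs a far larger corpus.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)

print(model.wv["cat"].shape)         # (50,) learned embedding
print(model.wv.most_similar("cat"))  # nearest neighbours in the (toy) vector space
```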
25) What are the limitations of Word2Vec and GloVe? Answer: Both generate a single
vector per word, regardless of context. Thus, polysemy (same word, different meanings) is not
captured. Also, they require pre-processing and large corpora. GloVe is
matrix-factorization-based and has challenges with rare word handling.
26) How does FastText improve over Word2Vec? Answer: FastText represents words as a
sum of character n-grams, enabling the model to handle rare words and capture subword
information (e.g., prefixes/suffixes). This improves handling of out-of-vocabulary (OOV) words
and better generalization for morphologically rich languages.
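A short gensim FastText sketch: because vectors are built from character n-grams, even a word never seen in training (here "catlike") still gets an embedding.

```python
from gensim.models import FastText

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

print(model.wv["catlike"].shape)   # OOV word handled via its subword n-grams
```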
27) What is the architecture of an RNN, and how was it used in early NLP models?
Answer: Recurrent Neural Networks (RNNs) process sequences by maintaining a hidden state
updated with each input. They were used in sequence labeling, text generation, and translation.
However, they suffer from vanishing gradients, making long-term dependencies hard to learn.
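A minimal PyTorch sketch: nn.RNN carries a hidden state that is updated token by token.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=100, hidden_size=128, batch_first=True)
x = torch.randn(4, 20, 100)      # (batch, seq_len, embedding_dim)
outputs, h_n = rnn(x)            # outputs: (4, 20, 128); h_n: final hidden state per layer
```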
28) How did LSTM and GRU address RNN limitations? Answer: LSTMs and GRUs
introduce gating mechanisms to control information flow, allowing them to remember
longer-term dependencies. LSTM has input, output, and forget gates; GRU simplifies this with
update and reset gates. They significantly improved performance in tasks like translation and
sentiment analysis.
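In PyTorch the gated variants are drop-in replacements; only the recurrent cell (and its state) changes.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 20, 100)
lstm_out, (h_n, c_n) = nn.LSTM(100, 128, batch_first=True)(x)   # LSTM adds a separate cell state c_n
gru_out, h_n = nn.GRU(100, 128, batch_first=True)(x)            # GRU: fewer gates, no cell state
```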
29) What are the key drawbacks of LSTM-based language models that led to the
development of Transformers? Answer: LSTMs are sequential, limiting parallelism and
leading to longer training times. They also struggle with very long dependencies and scalability.
Transformers addressed this with self-attention mechanisms, allowing full context visibility and
parallel training.
30) Compare CNNs and RNNs for text classification before Transformers. Answer: CNNs
capture local n-gram features via convolutional filters and are more parallelizable than RNNs.
RNNs capture sequential dependencies but are slower. CNNs often outperform RNNs in
classification but underperform in generative or sequence prediction tasks. Neither captures
long-range dependencies as effectively as Transformers.
31) What is the difference between masked language models (like BERT) and autoregressive models (like GPT), and when is each preferred?
Answer:
● Masked Language Models (MLMs) like BERT are bidirectional and trained to predict
randomly masked tokens within a sentence. For example, in the sentence “The cat sat
on the [MASK],” the model learns to predict “mat” by attending to both left and right
contexts.
● Autoregressive models like GPT are unidirectional (usually left-to-right) and trained to
predict the next token in a sequence, e.g., “The cat sat on the” → “mat.”
● MLMs are better for understanding tasks (e.g., classification, QA), while AR models are
better for generation (e.g., story writing, dialogue).
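A quick contrast sketch with transformers pipelines (the checkpoint names are common small defaults, used here purely as examples).

```python
from transformers import pipeline

# MLM: fills a blank using context on both sides
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The cat sat on the [MASK].")[0]["token_str"])

# Autoregressive: continues the text strictly left to right
generate = pipeline("text-generation", model="gpt2")
print(generate("The cat sat on the", max_new_tokens=3)[0]["generated_text"])
```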
32) What is multi-head attention, and why is it used in Transformers?
Answer:
Multi-head attention allows the model to focus on different parts of a sequence simultaneously
from multiple representation subspaces. It consists of multiple attention heads, each with its
own learnable projection matrices for queries, keys, and values. This:
● Improves the model’s ability to capture different types of relationships (e.g., syntactic,
semantic).
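A compact PyTorch sketch: the built-in module already projects queries, keys, and values per head and concatenates the per-head results.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 16, 512)           # (batch, seq_len, d_model)
out, attn_weights = mha(x, x, x)      # self-attention: queries = keys = values = x
print(out.shape, attn_weights.shape)  # (2, 16, 512) and (2, 16, 16)
```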
33) What is layer normalization, and why is it important in Transformer training?
Answer:
Layer normalization normalizes the inputs across the features (as opposed to across the
batch in batch normalization), stabilizing training by keeping activations in a consistent range at every token position, smoothing gradient flow, and reducing sensitivity to batch size.
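A one-liner check in PyTorch: LayerNorm normalizes over the feature dimension independently for each token.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(512)
x = torch.randn(2, 16, 512)
y = ln(x)
print(y.mean(dim=-1).abs().max(), y.std(dim=-1).mean())   # per-position mean ≈ 0, std ≈ 1
```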
35) What are subword tokenization methods (e.g., Byte Pair Encoding), and
how do they impact model performance?
Answer:
Subword tokenization (e.g., Byte Pair Encoding, WordPiece, UnigramLM) splits rare or
unknown words into smaller, frequent sub-units. For example, “unhappiness” → “un”, “happi”,
“ness”.
Advantages: a compact, fixed-size vocabulary; no out-of-vocabulary tokens, since rare words decompose into known pieces; and better sharing of statistics across morphologically related words, which generally improves performance on rare terms.
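A toy, self-contained sketch of the core BPE training step: repeatedly count adjacent symbol pairs and merge the most frequent one into a new symbol (real tokenizers also track merge ranks and byte-level fallbacks).

```python
from collections import Counter

corpus = [list("unhappiness"), list("happiness"), list("unhappy")]

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))          # count adjacent symbol pairs
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for _ in range(5):                           # each step fuses the currently most frequent pair
    corpus = merge(corpus, most_frequent_pair(corpus))
print(corpus[0])                             # "unhappiness" as a sequence of learned sub-units
```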
36) What is the difference between static and contextual word embeddings?
Answer:
Static embeddings (e.g., Word2Vec, GloVe) assign a single vector to each word, regardless of
context. This means “bank” in “river bank” and “bank account” gets the same embedding.
Contextual embeddings, as generated by models like ELMo, BERT, or GPT, assign word
vectors that change based on surrounding words. For example:
● “She sat on the bank.” → a different vector for “bank” than in “She opened a bank account.”
These embeddings capture semantic and syntactic nuances, improving performance in tasks
like NER, QA, and coreference resolution.
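A hedged transformers sketch showing the effect (the checkpoint name is an example and the exact similarity value depends on the model): the same surface word gets different vectors in different sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]                 # (seq_len, hidden)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = vector_for("She sat on the bank of the river.", "bank")
v2 = vector_for("She opened a bank account.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))   # well below 1.0: the vectors are context-dependent
```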
37) What is the role of the position-wise feed-forward network (FFN) in a transformer block?
Answer:
Each transformer block has a position-wise feed-forward network (FFN) after the multi-head
attention. It consists of two linear transformations with a ReLU or GELU non-linearity in
between: FFN(x) = W2 · f(W1·x + b1) + b2, where f is the ReLU or GELU non-linearity.
Its role:
● Helps the model learn compositional patterns not captured by attention alone.
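A short PyTorch sketch: the same two-layer network is applied independently at every position.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):      # x: (batch, seq_len, d_model); identical weights at every position
        return self.net(x)

y = PositionwiseFFN()(torch.randn(2, 16, 512))   # output shape matches the input
```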
38) What is the difference between self-attention and cross-attention?
Answer:
● Self-attention: The queries, keys, and values all come from the same sequence. Used in both encoder and decoder to capture intra-sequence dependencies.
● Cross-attention: The queries come from one sequence (e.g., decoder states) while the keys and values come from another (e.g., the encoder output), letting the decoder condition on the source.
Use cases: self-attention appears in every Transformer layer; cross-attention appears in encoder-decoder models for tasks like translation and summarization.
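For contrast, a cross-attention call with the same PyTorch module, using hypothetical decoder states as queries and encoder output as keys and values.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
decoder_states = torch.randn(2, 10, 512)    # queries (target side)
encoder_output = torch.randn(2, 32, 512)    # keys and values (source side)
out, _ = attn(decoder_states, encoder_output, encoder_output)   # out: (2, 10, 512)
```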
39) What is causal masking in autoregressive Transformers, and why is it required?
Answer:
Causal masking (or autoregressive masking) ensures that a token at position i can only
attend to tokens at positions ≤ i. This is done by masking out future tokens in the attention
matrix using a triangular mask. It enforces left-to-right generation, critical for autoregressive
models like GPT. Without it, the model would "cheat" by looking at future tokens when it
processes a full sequence during training.
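A minimal PyTorch construction of such a mask: positions above the diagonal are set to negative infinity so the softmax assigns them zero attention weight.

```python
import torch

seq_len = 5
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(mask)   # 0 on and below the diagonal, -inf strictly above it
# scores = q @ k.transpose(-2, -1) / d_head ** 0.5 + mask   # applied before the softmax
```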
40) What are the challenges and methods in scaling LLMs beyond 100B
parameters?
Answer:
Challenges:
● Memory and compute cost of training, plus communication overhead across thousands of accelerators.
● Training instability at scale and the need for enormous, high-quality datasets.
● Rising inference cost and latency once deployed.
Methods:
● Data, tensor, and pipeline parallelism, with sharded optimizer states (ZeRO-style).
● Mixed-precision training and gradient checkpointing.
● Sparsity via Mixture of Experts and scaling-law-guided choices of model and data size.