
Gen AI Interview Questions

1) What is attention dropout in Transformer models? Why is it important during training
and how does it affect generalization?

Answer: Attention dropout is a regularization technique applied to the attention weights in Transformer models. During training, a fraction of the attention weights is randomly set to zero. This prevents the model from relying too heavily on specific tokens and promotes robustness. It helps prevent overfitting, especially in large models, by encouraging distributed representations. Without attention dropout, models may memorize patterns, leading to poor generalization on unseen data.
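
A minimal sketch (assuming PyTorch; the function is ours) of where dropout is applied inside scaled dot-product attention:

import torch
import torch.nn.functional as F

def attention_with_dropout(q, k, v, dropout_p=0.1, training=True):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)
    # Attention dropout: randomly zero a fraction of the attention weights during training
    weights = F.dropout(weights, p=dropout_p, training=training)
    return weights @ v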

2) Compare LoRA (Low-Rank Adaptation) and Prefix Tuning. In what scenarios would one
outperform the other?

Answer: LoRA introduces low-rank learnable matrices into the attention (and optionally other) weight matrices, enabling efficient fine-tuning with far fewer trainable parameters. Prefix Tuning, on the other hand, prepends learnable prefix vectors to the attention keys and values at each layer, steering the model's behavior via soft prompts while keeping the base weights frozen.

● LoRA is ideal for applications requiring high precision and where access to the model's weights is available.
● Prefix Tuning is better when modifying the base weights is not possible and soft control via learned prompts is sufficient.

LoRA generally outperforms Prefix Tuning in convergence and output quality, but Prefix Tuning leaves the base model untouched, which makes it easier to serve many tasks from a single shared model.
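
A conceptual sketch (assuming PyTorch; class and variable names are ours) of the LoRA update, where a frozen weight is augmented with a trainable low-rank product B·A:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)            # frozen pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the trainable low-rank update
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling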

3) What is the difference between cosine similarity and dot product in vector search?
Why does the choice matter in RAG systems?

Answer:

● Dot product measures raw magnitude and directional alignment.
● Cosine similarity normalizes vectors to remove the impact of magnitude, focusing
solely on direction.

In RAG (Retrieval-Augmented Generation) systems:

●​ Use cosine similarity when embedding vectors may have inconsistent magnitudes
(e.g., across domains).
●​ Use dot product when embeddings are uniformly scaled (like BERT or dense passage
retrieval embeddings) and speed is a priority.

The choice affects relevance ranking in retrieval and, ultimately, the quality of generation.
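
A quick illustration (plain NumPy; the vectors are made up) of how the two scores can disagree when magnitudes differ:

import numpy as np

a = np.array([1.0, 1.0])   # query embedding
b = np.array([3.0, 3.0])   # same direction as a, larger magnitude
c = np.array([0.9, 1.1])   # slightly different direction, unit-ish scale

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.dot(a, b), cosine(a, b))  # dot product rewards b's magnitude; cosine is exactly 1.0
print(np.dot(a, c), cosine(a, c))  # cosine still ranks c very close to a despite the smaller dot product

Normalizing embeddings up front makes the two scores equivalent, which is why many vector stores keep unit-length vectors.
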
4) How do large models like GPT-4 handle context length beyond training limits using
attention mechanisms like ALiBi or Position Interpolation?

Answer: ALiBi (Attention with Linear Biases) introduces fixed linear positional biases in
attention scores, allowing extrapolation to longer sequences.

Position Interpolation instead rescales (compresses) position indices so that longer sequences fall within the positional range seen during training, extending usable context at inference time, typically after a brief fine-tuning step.

These techniques allow pretrained models with limited context windows (e.g., 2K tokens) to
generalize to longer contexts (e.g., 16K+ tokens) without full retraining, enabling long-form
reasoning and document summarization tasks.
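
A rough sketch (assuming PyTorch) of how ALiBi-style biases can be added to attention scores; the slope schedule below follows the paper's geometric scheme for power-of-two head counts:

import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Head-specific slopes: 2^(-8*(h+1)/num_heads) for head h
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = i - j, i.e. how far key j lies behind query i (future positions clamp to 0
    # and are removed by the causal mask anyway)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)
    # Shape (heads, seq, seq); added to the attention scores before softmax
    return -slopes[:, None, None] * distance

# Usage sketch: scores = q @ k.transpose(-2, -1) / d**0.5 + alibi_bias(heads, seq_len)
# followed by the usual causal mask; no positional embeddings are added to the inputs.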

5) Explain Mixture of Experts (MoE) in LLMs. What are the benefits and challenges of
using sparse activation in production models?

Answer: MoE involves splitting a model into multiple sub-models (experts), only activating a
subset during each forward pass. This allows scaling to billions of parameters with lower
computational cost.

Benefits:

● Lower inference cost per token
● Larger capacity models with fewer FLOPs

Challenges:

● Routing inefficiencies and imbalance
● Load balancing across GPUs
●​ Increased engineering complexity

Used in Switch Transformer, GShard, and Google's GLaM.
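
A toy sketch (assuming PyTorch; names and sizes are ours) of top-k routing to illustrate sparse activation; real implementations add load-balancing losses and distribute experts across devices:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)       # only top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():                             # run each expert only on its routed tokens
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out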

6) How would you design a multi-tenant LLM API platform for enterprise clients to ensure
security, isolation, and performance?

Answer: Key design components:

● Authentication & Authorization: Role-based access using OAuth or JWT
● Request Isolation: Use namespaces per tenant; containerize model instances or
sessions
●​ Rate Limiting & Quotas: Per-tenant limits to avoid noisy neighbors
●​ Monitoring: Per-tenant logging and observability (e.g., Prometheus, Grafana)
●​ Billing Hooks: Usage-based cost tracking
●​ Model Customization: Enable tenant-specific prompts or adapters

7) Describe the trade-offs between GPU inference and using optimized CPU runtimes like
ONNX Runtime or Intel OpenVINO for LLM deployment.

Answer:

●​ GPU Inference:
○​ Pros: High throughput, parallelization, ideal for large models.
○​ Cons: Expensive, may require batching to amortize cost.
●​ CPU Inference (ONNX/OpenVINO):
○​ Pros: Lower cost, simpler infra, good for edge/local deployments.
○​ Cons: Lower throughput, unsuitable for models >1B parameters unless
quantized.

Choose GPU for real-time, high-load applications; CPU for cost-effective, on-device, or
small-scale tasks.

8) Explain how to implement a caching strategy for LLM outputs in a production pipeline
to reduce latency and cost.

Answer: Use a combination of:

● Prompt Hashing: Hash prompts to use as keys
● LRU Cache: Store recent outputs for repeated queries
●​ Chunk-level Caching: Cache segments of responses for retrieval-based augmentation
●​ Semantic Similarity Caching: Use embeddings to reuse similar responses

Use Redis or in-memory caches. Helps with high-frequency Q&A and improves scalability.
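
A minimal sketch of prompt hashing with an in-process LRU cache (call_llm is a placeholder for the real model call; a Redis cache with a TTL would play the same role across instances):

import hashlib
import json
from functools import lru_cache

def call_llm(prompt: str) -> str:
    # Placeholder for the real model call (API request or local inference)
    return f"[response to: {prompt[:40]}]"

def prompt_key(prompt: str, model: str, params: dict) -> str:
    # Deterministic cache key: hash the prompt plus anything that changes the output
    payload = json.dumps({"prompt": prompt, "model": model, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

@lru_cache(maxsize=10_000)   # in-process LRU; evicts least-recently-used entries
def cached_generate(key: str, prompt: str) -> str:
    return call_llm(prompt)

# key = prompt_key("What is LoRA?", "my-model", {"temperature": 0})
# answer = cached_generate(key, "What is LoRA?")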

9) How would you design a feedback loop system that lets human users improve the
output of a generative AI model over time?

Answer: Key components:

● UI for Feedback Collection: Thumbs up/down, comments
● Storage & Labeling: Store inputs, outputs, feedback tags
●​ Ranking Model (e.g., Reward Model): Fine-tune or reinforce model based on feedback
●​ Retraining Pipeline: Periodic updates based on labeled data
●​ A/B Testing: Validate changes before full deployment

10) What are the key considerations when using GenAI for multilingual support in a
customer service chatbot?

Answer:

● Language Coverage: Ensure training data includes target languages
● Code-Switching Support: Handle mixed-language inputs
●​ Translation Quality: Balance between direct multilingual model use vs. translating to
English before processing
●​ Locale-specific Customization: Cultural tone, idioms, formal/informal variants
●​ Latency Considerations: Real-time inference across multiple languages
●​ Fallback Mechanisms: When LLMs fail, integrate human-in-the-loop or canned
responses

11) How do you implement knowledge distillation to compress a large LLM without losing
much performance? Answer: Knowledge distillation involves training a smaller model
(student) to replicate the behavior of a larger pre-trained model (teacher). In LLMs, this means
feeding the same input to both models and minimizing the loss between their output
distributions (typically logits). Key components include:

● Teacher model: A large LLM like GPT-3.
● Student model: A smaller transformer model.
● Loss function: Usually a combination of cross-entropy on the hard labels and a distillation loss against the teacher's soft targets.
● Training: Use temperature scaling and potentially intermediate-layer supervision.

Benefits include reduced inference time and memory usage. Challenges include ensuring that the student maintains reasoning and factual capabilities.
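
A minimal sketch of the combined distillation loss described above (assuming PyTorch; the temperature and 0.5 weighting are illustrative defaults):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-smoothed distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale as in Hinton et al.
    # Hard targets: standard cross-entropy on the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard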

12) Can you walk through the challenges and solutions in integrating an LLM into a
real-time production environment with low latency constraints? Answer: Key challenges
include:

● Inference speed: Transformer-based models are computationally intensive.
● Cold starts: Dynamic scaling services may introduce latency.
● Model size: Large models strain memory and CPU/GPU budgets.

Solutions:

●​ Use quantized or distilled models (e.g., int8, 4-bit quantization).
●​ Serve models using optimized runtimes (ONNX, TensorRT, vLLM).
●​ Deploy on GPU-enabled inference servers with autoscaling.
●​ Use caching strategies, approximate nearest neighbor (ANN) search, and prompt
optimization to reduce calls.

13) How does the Mixture of Experts (MoE) model architecture optimize large-scale LLM
inference? Answer: MoE introduces sparsity into LLMs by activating only a subset of the total
model’s “experts” (sub-networks) for each input. Benefits include:

● Scalability: Total parameter count can grow enormously while per-token compute stays roughly constant.
● Flexibility: Tokens are routed to different experts based on input semantics.
● Inference Efficiency: Only a few experts (typically the top 1–2) are active per forward pass, allowing massive models to be used efficiently.

Key challenge: Load balancing and routing inefficiencies.

14) Describe your strategy for debugging and improving hallucinations in a production-grade RAG pipeline. Answer:

●​ Content quality: Improve the retrieval database with accurate, up-to-date documents.
●​ Retriever tuning: Use domain-specific embedding models.
●​ Prompt design: Explicitly condition the model to only answer based on retrieved docs.
●​ Verification: Include a post-response verification module or confidence score.
●​ User feedback loop: Collect and utilize feedback to refine retrieval and prompt policies.

15) How can you fine-tune a base model like LLaMA or Mistral on a specialized dataset
using LoRA, and what are the trade-offs? Answer: LoRA (Low-Rank Adaptation) injects
trainable low-rank matrices into specific layers of the transformer, enabling fine-tuning with fewer
parameters. Steps:

● Load the pre-trained LLaMA/Mistral checkpoint.
● Insert LoRA modules into the attention layers.
● Freeze the base weights; train only the LoRA layers.
● Merge LoRA into the base weights post-training (optional).

Trade-offs:

● Pros: Lower compute and memory cost, fast iteration.
● Cons: Might underfit highly domain-specific tasks compared to full fine-tuning.
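
A condensed sketch of these steps using the Hugging Face PEFT library (the model name, rank, and target modules are illustrative assumptions, and exact arguments can vary across library versions):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)       # base weights frozen, LoRA layers trainable
model.print_trainable_parameters()

# Train with the usual Trainer / custom loop on the specialized dataset, then optionally:
# model = model.merge_and_unload()         # fold LoRA back into the base weights for serving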

16) In what situations would you use adapters or prompt-tuning instead of full fine-tuning
of a large language model? Answer:
●​ Adapters: Ideal when updating large models for multiple tasks without altering the base
model. Example: multi-task settings across departments.
● Prompt-tuning: Best for quick experimentation or when compute is limited.

Use these approaches when:

● You need modularity.
● Computational resources are limited.
● You want to maintain base-model compatibility.

17) Explain token-level versus sentence-level perplexity and how these metrics impact
GenAI evaluation. Answer:

● Token-level perplexity: Measures how well a model predicts each next token. Lower is better. Useful for language modeling.
● Sentence-level perplexity: Average perplexity across the tokens of a sentence. More interpretable as a measure of sentence coherence.

Implication: Lower perplexity doesn't always mean better output quality; complement it with metrics like BLEU, ROUGE, and human evaluation.
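
A short sketch (assuming the Hugging Face transformers library, with GPT-2 as a stand-in model) of computing sentence-level perplexity as the exponential of the mean token-level negative log-likelihood:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sentence_perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the average next-token cross-entropy
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()            # perplexity = exp(mean token-level NLL)

print(sentence_perplexity("The cat sat on the mat."))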

18) Describe a technique to make embeddings generated by transformer models more efficient and domain-specific. Answer:

● Technique: Domain-adaptive pretraining (DAPT) followed by contrastive learning.
● Use a domain corpus to continue pretraining.
●​ Apply contrastive loss to bring semantically similar samples closer in the embedding
space.
●​ Use quantization or product quantization for efficient storage and ANN search.
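
A compact sketch of an in-batch contrastive (InfoNCE-style) loss for the step above (assuming PyTorch; the encoder producing the paired embeddings is not shown):

import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb, positive_emb, temperature=0.05):
    # anchor_emb, positive_emb: (batch, dim); row i of each forms a positive pair
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature                  # cosine similarities as logits
    targets = torch.arange(a.size(0), device=a.device)
    # Each anchor should score highest against its own positive (the diagonal)
    return F.cross_entropy(logits, targets)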

19) How do you implement an RLHF (Reinforcement Learning with Human Feedback)
pipeline practically and at scale? Answer:

● Step 1: Supervised fine-tune a base model using human-preferred responses.
● Step 2: Train a reward model from human ranking data.
● Step 3: Use PPO (Proximal Policy Optimization) to further fine-tune the LLM based on reward signals.

Scaling tips:

● Collect human feedback in batches via labeling platforms.
● Optimize the training loop using distributed training and checkpointing.
● Use scalable libraries like TRL (from Hugging Face).
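
A minimal sketch of the pairwise (Bradley-Terry style) loss commonly used for the Step 2 reward model (assuming PyTorch; the reward model is assumed to emit one scalar per response):

import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # reward_chosen / reward_rejected: (batch,) scalar scores for the preferred
    # and dispreferred responses to the same prompt.
    # Push the chosen response to score higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
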
20) How do you prevent prompt injection attacks in a deployed LLM-based application?
Answer:

● Sanitize inputs: Remove suspicious patterns or meta-instructions.
● Use prompt engineering: Isolate system and user prompts with strict context windows.
●​ Hard code boundaries: Limit model access to external tools.
●​ Use retrieval filters: Prevent injection through vector store poisoning.
●​ Post-processing: Validate outputs using regex, LLM self-check, or human approval in
critical apps.
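
A small sketch of input sanitization plus prompt/role isolation (the patterns, message structure, and system text are illustrative assumptions, not a complete defense):

import re

SUSPICIOUS = [
    r"ignore (all )?previous instructions",
    r"you are now",
    r"system prompt",
]

def sanitize(user_input: str) -> str:
    cleaned = user_input
    for pattern in SUSPICIOUS:
        # Neutralize obvious meta-instructions before they reach the model
        cleaned = re.sub(pattern, "[removed]", cleaned, flags=re.IGNORECASE)
    return cleaned

def build_messages(user_input: str) -> list[dict]:
    # Keep system and user content in separate, clearly delimited roles
    return [
        {"role": "system", "content": "Answer only from the provided documents."},
        {"role": "user", "content": sanitize(user_input)},
    ]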

21) What is a Bag-of-Words (BoW) model and what are its limitations in NLP? Answer:
BoW represents text by the frequency of words, disregarding grammar and word order. It's
simple and fast but fails to capture context, semantics, or word similarity. This often leads to
sparse and high-dimensional vectors, making it ineffective for tasks needing word meaning or
order.

22) How does TF-IDF improve upon Bag-of-Words? Answer: TF-IDF (Term
Frequency-Inverse Document Frequency) weighs words based on how important they are to a
document relative to a corpus. It reduces the influence of common words like "the" and
highlights unique terms. However, it still ignores context and semantics and handles synonyms poorly.
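
A quick comparison of the two representations (assuming scikit-learn; the toy corpus is made up):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
]

bow = CountVectorizer().fit_transform(corpus)       # raw term counts (Bag-of-Words)
tfidf = TfidfVectorizer().fit_transform(corpus)     # counts reweighted by inverse document frequency

# Common words like "the" dominate the BoW counts but are down-weighted by TF-IDF.
print(bow.toarray()[0])
print(tfidf.toarray()[0].round(2))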

23) What are word embeddings and how do they address BoW limitations? Answer: Word
embeddings like Word2Vec, GloVe, and FastText represent words in continuous vector space,
capturing semantic similarity. Words with similar meanings have similar vectors. This enables
context-aware similarity and downstream learning, but static embeddings still lack contextual
understanding (e.g., “bank” in river vs finance).

24) Explain the architecture of Word2Vec and its training objective. Answer: Word2Vec
has two main architectures: CBOW (predict center word from context) and Skip-gram (predict
context from center word). It uses a shallow neural network to learn embeddings by maximizing
the probability of observed word-context pairs. Negative sampling or hierarchical softmax aids
efficient training.
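
A small example (assuming Gensim; the toy corpus and hyperparameters are illustrative) of training Skip-gram embeddings:

from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "king", "ruled", "the", "kingdom"],
]

# sg=1 selects Skip-gram (predict context from the center word); sg=0 would be CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])                   # learned embedding for "cat"
print(model.wv.most_similar("cat", topn=2))  # nearest neighbors in the embedding space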

25) What are the limitations of Word2Vec and GloVe? Answer: Both generate a single
vector per word, regardless of context. Thus, polysemy (same word, different meanings) is not
captured. Also, they require pre-processing and large corpora. GloVe is
matrix-factorization-based and has challenges with rare word handling.

26) How does FastText improve over Word2Vec? Answer: FastText represents words as a
sum of character n-grams, enabling the model to handle rare words and capture subword
information (e.g., prefixes/suffixes). This improves handling of out-of-vocabulary (OOV) words
and better generalization for morphologically rich languages.

27) What is the architecture of an RNN, and how was it used in early NLP models?
Answer: Recurrent Neural Networks (RNNs) process sequences by maintaining a hidden state
updated with each input. They were used in sequence labeling, text generation, and translation.
However, they suffer from vanishing gradients, making long-term dependencies hard to learn.

28) How did LSTM and GRU address RNN limitations? Answer: LSTMs and GRUs
introduce gating mechanisms to control information flow, allowing them to remember
longer-term dependencies. LSTM has input, output, and forget gates; GRU simplifies this with
update and reset gates. They significantly improved performance in tasks like translation and
sentiment analysis.

29) What are the key drawbacks of LSTM-based language models that led to the
development of Transformers? Answer: LSTMs are sequential, limiting parallelism and
leading to longer training times. They also struggle with very long dependencies and scalability.
Transformers addressed this with self-attention mechanisms, allowing full context visibility and
parallel training.

30) Compare CNNs and RNNs for text classification before Transformers. Answer: CNNs
capture local n-gram features via convolutional filters and are more parallelizable than RNNs.
RNNs capture sequential dependencies but are slower. CNNs often outperform RNNs in
classification but underperform in generative or sequence prediction tasks. Neither captures
long-range dependencies as effectively as Transformers.

31) What is the role of positional encoding in transformer-based models, and why is it necessary?
Answer:​
Transformers do not have an inherent sense of word order since they process tokens in parallel
(unlike RNNs). To incorporate the sequence order, positional encodings are added to the input
embeddings. These encodings are either learned (as in BERT) or sinusoidal (as in the original
Transformer paper). Sinusoidal encodings use sine and cosine functions of different frequencies
to capture relative and absolute positions. Without positional encoding, the model would treat
the input tokens as a bag of words, losing syntactic and sequential relationships crucial for
understanding context.
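
A minimal NumPy implementation of the sinusoidal encodings from the original Transformer paper:

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions use cosine
    return pe                                           # added to the input embeddings

print(sinusoidal_positional_encoding(4, 8).round(2))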

32) How do masked language models (e.g., BERT) differ from autoregressive models (e.g., GPT) in terms of training objectives?

Answer:

●​ Masked Language Models (MLMs) like BERT are bidirectional and trained to predict
randomly masked tokens within a sentence. For example, in the sentence “The cat sat
on the [MASK],” the model learns to predict “mat” by attending to both left and right
contexts.​

●​ Autoregressive models like GPT are unidirectional (usually left-to-right) and trained to
predict the next token in a sequence, e.g., “The cat sat on the” → “mat.”​

●​ MLMs are better for understanding tasks (e.g., classification, QA), while AR models are
better for generation (e.g., story writing, dialogue).​

33) Explain the concept of multi-head attention. Why is it superior to single-head attention?

Answer:​
Multi-head attention allows the model to focus on different parts of a sequence simultaneously
from multiple representation subspaces. It consists of multiple attention heads, each with its
own learnable projection matrices for queries, keys, and values. This:

●​ Improves the model’s ability to capture different types of relationships (e.g., syntactic,
semantic).​

●​ Increases representational capacity without increasing computation dramatically.​

●​ Helps with gradient flow and training stability.​


Single-head attention would be limited to a single focus and may not capture complex
dependencies.​
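
A short usage sketch with PyTorch's built-in nn.MultiheadAttention (shapes and head count are illustrative):

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 64)                   # (batch, seq_len, embed_dim)
# Self-attention: queries, keys, and values are all the same sequence
out, attn_weights = mha(x, x, x)
print(out.shape, attn_weights.shape)         # (2, 10, 64) and (2, 10, 10)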

34) Describe the significance of layer normalization in stabilizing transformer training.

Answer:​
Layer normalization normalizes the inputs across the features (as opposed to across the
batch in batch normalization), stabilizing training by:

●​ Reducing internal covariate shift.​

●​ Ensuring consistent distribution of activations.​

●​ Improving gradient flow, especially in deep architectures like transformers.​


Transformers typically apply layer norm before or after each sublayer (e.g., attention or
feed-forward), enabling efficient convergence during training.​

35) What are subword tokenization methods (e.g., Byte Pair Encoding), and
how do they impact model performance?

Answer:​
Subword tokenization (e.g., Byte Pair Encoding, WordPiece, UnigramLM) splits rare or
unknown words into smaller, frequent sub-units. For example, “unhappiness” → “un”, “happi”,
“ness”.​
Advantages:

●​ Reduces out-of-vocabulary (OOV) issues.​

●​ Balances between word-level and character-level models.​

●​ Leads to a smaller and more efficient vocabulary.​

●​ Enhances handling of morphologically rich languages.​


Subword tokenization allows for robust generalization, particularly for unseen or
compound words during inference.​
36) Explain the idea of contextual embeddings. How are they different from
static embeddings like Word2Vec?

Answer:​
Static embeddings (e.g., Word2Vec, GloVe) assign a single vector to each word, regardless of
context. This means “bank” in “river bank” and “bank account” gets the same embedding.

Contextual embeddings, as generated by models like ELMo, BERT, or GPT, assign word
vectors that change based on surrounding words. For example:

●​ “She sat on the bank.” → different vector for “bank” than in​

●​ “He deposited money in the bank.”​

These embeddings capture semantic and syntactic nuances, improving performance in tasks
like NER, QA, and coreference resolution.
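
A short illustration (assuming the Hugging Face transformers library and bert-base-uncased; the helper is ours) that the vector for "bank" shifts with context:

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (seq_len, hidden_dim)
    # Locate the first occurrence of the token "bank" and return its contextual vector
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("She sat on the bank of the river.")
v2 = bank_vector("He deposited money in the bank.")
print(torch.cosine_similarity(v1, v2, dim=0))   # below 1.0: same word, different contextual vectors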

37) What is the role of the feed-forward neural network layer in a transformer block?

Answer:​
Each transformer block has a position-wise feed-forward network (FFN) after the multi-head
attention. It consists of two linear transformations with a ReLU or GELU non-linearity in
between:

FFN(x) = max(0, xW1 + b1)W2 + b2

Its role:

●​ Adds non-linearity and transformation capacity.​

●​ Operates independently on each position, enriching token representations.​

●​ Helps the model learn compositional patterns not captured by attention alone.​
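
The same FFN in code form (a minimal PyTorch sketch; the 4x hidden expansion is the common convention, not a requirement):

import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # xW1 + b1
            nn.ReLU(),                  # max(0, .); GELU is also common
            nn.Linear(d_ff, d_model),   # (.)W2 + b2
        )

    def forward(self, x):
        # Applied independently at every position in the sequence
        return self.net(x)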

38) Compare cross-attention and self-attention mechanisms. Where is each used?
Answer:

●​ Self-attention: The queries, keys, and values all come from the same sequence. Used
in both encoder and decoder to capture intra-sequence dependencies.​

●	Cross-attention: In decoders (e.g., in encoder-decoder models like T5 or BART), queries come from the decoder's output so far, and keys/values come from the encoder.
This allows the decoder to attend to the encoder’s output, useful for translation or
summarization.​

Use cases:

●​ Self-attention: Language modeling, classification.​

●​ Cross-attention: Sequence transduction tasks like translation, summarization.​

39) How does causal masking work in transformer decoders?

Answer:​
Causal masking (or autoregressive masking) ensures that a token at position i can only
attend to tokens at positions ≤ i. This is done by masking out future tokens in the attention
matrix using a triangular mask. It enforces left-to-right generation, critical for autoregressive
models like GPT. Without it, the model would "cheat" by looking at future tokens during training
or inference.
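
A small sketch (assuming PyTorch) of building and applying a triangular causal mask to attention scores:

import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)               # raw attention scores

# Strictly upper-triangular entries correspond to future positions
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))     # future tokens get zero weight after softmax

weights = torch.softmax(scores, dim=-1)
print(weights.round(decimals=2))                     # row i attends only to positions <= i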

40) What are the challenges and methods in scaling LLMs beyond 100B
parameters?

Answer:​
Challenges:

●​ Memory constraints on GPUs or TPUs.​

●​ Slower training and inference.​

●​ Communication bottlenecks in distributed setups.​

●​ Environmental cost (energy and CO2 footprint).​


●​ Alignment and controllability issues.​

Methods:

●​ Model parallelism: Splitting weights across devices.​

●​ Pipeline parallelism: Splitting layers sequentially across devices.​

●​ Mixture of Experts (MoE): Activating sparse subsets of weights per input.​

●	Gradient checkpointing: Reduces memory usage by re-computing intermediate activations.

●​ Quantization & Pruning: Reduces memory and compute usage.​

●​ Efficient pretraining datasets (e.g., curated, deduplicated).​

●​ Specialized hardware like H100s or custom LLM accelerators.​
