NLP Sheets
1. List the applications of NLP.
Other applications:
• E-commerce platforms:
o Product description extraction, review analysis (e.g., Amazon).
• Healthcare, finance, and law
• Companies such as Arria:
o Automating report generation, legal document analysis, financial forecasting.
• Spelling and grammar correction:
o Grammarly, Microsoft Word, Google Docs.
• IBM Watson: An AI system built using NLP techniques that competed on the "Jeopardy!" quiz show, won $1 million,
and outperformed human champions.
• Educational tools:
o Automated scoring (e.g., GRE)
o Plagiarism detection (e.g., Turnitin)
o Language learning apps (e.g., Duolingo)
• Knowledge bases:
o Google Knowledge Graph for search and question answering.
2. Explain the following NLP tasks
• Language modeling:
o Predicting the next word in a sequence based on previous words.
o Used in speech recognition, machine translation, and spelling correction.
• Text classification:
o Categorizing text into predefined classes (e.g., spam detection, sentiment analysis).
• Information extraction:
o Extracting structured information from unstructured text (e.g., extracting names or events from emails).
• Information retrieval:
o Finding relevant documents from a large collection (e.g., search engines).
• Conversational agent:
o Building systems that can converse with humans (e.g., chatbots, voice assistants).
• Text summarization:
o Creating concise summaries of long documents while retaining key information.
• Question answering:
o Building systems that can answer questions posed in natural language.
• Machine translation:
o Translating text from one language to another (e.g., Google Translate).
• Topic modeling:
o Identifying latent topics in a large collection of documents (e.g., identifying themes).
3. What are the building blocks of language and their applications?
• Phonemes:
o Smallest units of sound in a language.
o Used in speech recognition and text-to-speech systems.
• Morphemes and lexemes:
o Smallest units of meaning.
o Used in tokenization, stemming, and part-of-speech tagging.
• Syntax:
o Rules for constructing grammatically correct sentences.
o Used in parsing and sentence structure analysis.
• Context:
o Meaning derived from semantics and pragmatics.
o Used in tasks like sarcasm detection, summarization, and topic modeling.
4. What makes NLP challenging?
• Ambiguity: Words and sentences can have multiple meanings depending on context.
• Common knowledge: Humans rely on implicit knowledge that machines lack.
• Creativity: Language includes creative elements like poetry, metaphors, and idioms.
• Diversity across languages: No direct mapping between vocabularies of different languages.
• Complexity of human language: Syntax, semantics, and pragmatics make language processing difficult for
machines.
5. How are NLP, ML, and DL related?
• AI: Broad field aiming to build systems that perform tasks requiring human intelligence.
• ML: Subfield of AI that learns patterns from data without explicit programming.
• DL: Subfield of ML based on neural networks to model complex patterns.
• NLP: Subfield of AI focused on language processing, often using ML and DL techniques.
6. Describe the early heuristics-based approaches to NLP.
• Rule-Based Systems: Early NLP systems relied on handcrafted rules and resources like dictionaries and
thesauruses.
• Lexicon-Based Sentiment Analysis: Uses counts of positive and negative words to deduce sentiment.
• Knowledge Bases: WordNet (synonyms, hyponyms, meronyms), Open Mind Common Sense.
• Regex and CFG: Regular expressions and context-free grammars for text analysis.
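As a quick illustration of the regex idea, a few lines of Python can pull structured pieces out of raw text (the pattern and the sample sentence below are made up for illustration, not from the source):

```python
# Minimal sketch of rule-based text analysis with a regular expression.
import re

text = "Contact support@example.com or sales@example.org before 31.12.2025."
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['support@example.com', 'sales@example.org']
```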
7. Explain briefly Naive Bayes, Support Vector Machine, Hidden Markov Model, and Conditional
Random Fields approaches
• Naive Bayes: A probabilistic classifier based on Bayes’ theorem. Assumes feature independence. Used in text
classification (see the sketch after this list).
• Support Vector Machine (SVM): A classifier that finds the optimal decision boundary between classes. Used
in text classification.
• Hidden Markov Model (HMM): A statistical model for sequential data. Used in part-of-speech tagging.
• Conditional Random Fields (CRF): A sequential classifier that considers context. Used in named entity
recognition and part-of-speech tagging.
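A hedged sketch of the first two classifiers above with scikit-learn; the toy texts and labels are invented for illustration only:

```python
# Naive Bayes vs. a linear SVM on a tiny, made-up spam dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10am tomorrow",
         "cheap loans act now", "lunch with the project team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

nb = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)
svm = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

print(nb.predict(["free prize now"]))             # likely [1]
print(svm.predict(["project meeting tomorrow"]))  # likely [0]
```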
8. What is the difference between RNN and LSTM NN?
• RNN: Processes sequential data but struggles with long-term dependencies due to the vanishing gradient
problem.
• LSTM: A variant of RNN that uses memory cells to retain long-term context, making it more effective for
longer sequences.
CNNs for text:
• CNNs process text by converting each word into a d-dimensional word vector; an n-word sentence becomes an n × d matrix.
o n: number of words in the sentence, d: size of the word vectors.
• This matrix can be treated much like an image and modeled by a CNN.
• Convolution filters capture local patterns (e.g., n-grams).
• Pooling layers condense features, and fully connected layers classify the text.
• This makes CNNs effective for tasks like sentiment analysis and text classification.
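A minimal tf.keras sketch of this idea; layer sizes and hyperparameters are illustrative, not from the source:

```python
# 1D-CNN text classifier: embed words into an n x d matrix, slide n-gram
# filters over it, pool the features, and classify.
from tensorflow.keras import layers, models

vocab_size, d, n = 10000, 100, 50   # vocabulary size, vector size, words per text
model = models.Sequential([
    layers.Input(shape=(n,)),                 # n word indices per text
    layers.Embedding(vocab_size, d),          # -> n x d matrix of word vectors
    layers.Conv1D(128, kernel_size=3, activation="relu"),  # ~3-gram filters
    layers.GlobalMaxPooling1D(),              # condense features
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # e.g. positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```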
Transfer learning:
• Applying knowledge learned from one task to a different but related task.
• Example: Pre-training a large model (e.g., BERT) on a massive dataset, then fine-tuning it for specific NLP
tasks like text classification or question answering.
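A small sketch of this idea with the Hugging Face transformers library (assumed to be installed; the model name, label count, and single training example are illustrative):

```python
# Load a pre-trained BERT, attach a fresh classification head, then take one
# gradient step on a labeled example, which is the essence of fine-tuning.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)        # new head, randomly initialized

batch = tokenizer(["This movie was great!"], return_tensors="pt",
                  padding=True, truncation=True)
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)       # loss of the classification head
outputs.loss.backward()                       # gradients also flow into BERT
# In practice this runs inside a full training loop (or the Trainer API)
# over a task-specific dataset such as sentiment-labeled reviews.
```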
12. List the key reasons that make DL not suitable for all NLP tasks.
• DL models are data-hungry and tend to overfit on small datasets.
• They are less interpretable than simpler ML models.
• They are costly to train and deploy (large model sizes, latency, infrastructure needs).
• Domain knowledge and low-resource languages are hard to handle.
• Simpler models often give a better ROI for many business problems.
Chapter 2: NLP Pipeline
1. What are the key stages of a generic pipeline for NLP system development?
• Data acquisition: Collecting relevant data for the task.
• Text cleaning and pre-processing: Removing noise, tokenization, lowercasing, stemming/lemmatization.
• Feature engineering: Converting text into numerical representations.
• Modeling: Training ML/DL models.
• Evaluation: Measuring performance with intrinsic and extrinsic metrics.
• Deployment: Putting the model into production.
• Monitoring and updating: Tracking performance and retraining as needed.
5. Using a dot (.) to segment sentences can cause problems. Explain how.
• Abbreviations (e.g., "Dr.", "Mr.").
• Ellipses ("...").
• Decimals ("3.14").
• URLs/Domains ("example.com").
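A small sketch contrasting a naive dot split with NLTK's sentence tokenizer (NLTK and its "punkt" data are assumed to be installed; the sample sentence is illustrative):

```python
# Naive splitting on "." breaks on abbreviations, decimals, and domains;
# a trained sentence tokenizer handles most of these cases.
import nltk

text = "Dr. Smith paid $3.14 at example.com. He left at 5 p.m."
print(text.split("."))           # fragments at 'Dr.', '3.14', 'example.com', ...
print(nltk.sent_tokenize(text))  # roughly two sentences
```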
Stemming:
• Reduces words to base form by removing suffixes
• Doesn't always produce linguistically correct words
• Example: "cars" → "car", "revolution" → "revolut"
(incorrect)
Lemmatization:
• Maps words to their base dictionary form (lemma)
• Requires linguistic knowledge
• Example: "better" → "good", "was" → "be"
8. What is the difference between code mixing and transliteration?
• Code mixing: Using multiple languages in one sentence/phrase
o (e.g., Singlish: "We makan at kopitiam").
• Transliteration: Writing words from one language using another language's script
o (e.g., "namaste" for नमस्ते).
10. Explain the feature engineering for classical NLP versus DL-based NLP
Classical NLP: Handcrafted features (e.g., TF-IDF, n-grams) fed to ML models (interpretable).
DL-based NLP: Raw text → dense embeddings (e.g., word2vec) → hidden layers → output (less interpretable).
11. How to combine heuristics directly or indirectly with the ML model?
Two approaches (plus stacking):
• Direct: Use heuristic outputs as additional features for the ML model (e.g., a spam-word count in email classification); see the sketch after this list.
• Indirect: Pre-process inputs with heuristics to filter data before the ML model (e.g., filter obvious spam before classification).
• Stacking: Feed one model’s output as input to another in a sequential fashion (e.g., logistic regression on top of Naive Bayes output).
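A hedged sketch of the direct approach; the spam-word list, texts, and labels below are invented for illustration:

```python
# Append a heuristic feature (count of known spam words) to TF-IDF features
# before training a classifier.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

SPAM_WORDS = {"free", "winner", "prize", "cheap"}
texts = ["free prize winner", "team meeting notes",
         "cheap cheap offer", "quarterly report"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X_text = vec.fit_transform(texts)
spam_counts = np.array([[sum(w in SPAM_WORDS for w in t.split())] for t in texts])
X = hstack([X_text, spam_counts])   # heuristic output becomes one extra feature

clf = LogisticRegression().fit(X, labels)
print(clf.predict(hstack([vec.transform(["free cheap prize"]),
                          np.array([[3]])])))  # likely [1]
```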
13. Which modeling technique can be used in the following cases: small data, large data,
poor data quality, and good data quality?
• Small data: Rule-based systems, traditional ML (less data-hungry).
• Large data: DL models, which can exploit large amounts of data.
• Poor data quality: Requires more cleaning and pre-processing; simpler, more robust models.
• Good data quality: Off-the-shelf or more advanced models can be applied directly.
14. What is the difference between intrinsic and extrinsic evaluation?
• Intrinsic:
o Measures model quality using ML metrics (e.g., precision, recall, F1) on a test set.
• Extrinsic:
o Measures real-world impact; uses business metrics.
o Requires human/real-world testing.
o Example: Time saved by a spam filter.
15. What are the metrics that can be used in: classification, measuring model quality,
information retrieval, prediction, machine translation, and summarization tasks?
• Classification: Accuracy, precision, recall, F1, AUC
• Model quality (language modeling): Perplexity
• Information retrieval: MRR, MAP
• Prediction: RMSE, MAPE
• Machine translation: BLEU, METEOR
• Summarization: ROUGE
16. Describe the deploying, monitoring, and updating phases of the NLP pipeline.
• Deployment: Put the model into production, typically as a web service/API integrated with the larger system.
• Monitoring: Continuously track model performance on live data (dashboards, logging).
• Updating: Retrain and refine the model as new data and feedback arrive.
17. Explain how the NLP pipeline differs from one language to another.
• High-resource languages (e.g., English): Use pre-trained models.
• Low-resource languages (e.g., Swahili): Manual labeling, morphological analyzers.
• CJK languages (Chinese/Japanese/Korean): Special tokenization (no spaces).
• Morphologically rich languages (e.g., Turkish): Handle complex word forms.
18. Describe the NLP pipeline for ranking tickets in a ticketing system by Uber.
1. Data sources: Ticket text, trip data, ticket metadata.
2. Pre-processing: Tokenization, lowercasing, stop word removal, lemmatization.
3. Feature engineering:
o Bag-of-words → TF-IDF/LSI → cosine similarity with historical solutions.
4. Modeling: Binary classifier ranks solutions; top 3 displayed.
5. Evaluation: MRR (intrinsic), cost savings (extrinsic).
6. Deployment & updates: Continuous testing, model refinement.
Chapter 3: Text Representation
1. List the four categories of text representation techniques.
• Basic Vectorization Approaches:
o Includes one-hot encoding, bag of words (BoW), bag of n-grams, and TF-IDF.
o Simple methods to convert text into numerical vectors.
• Distributed Representations:
o Uses neural networks to create dense, low-dimensional representations (e.g., Word2vec, fastText, Doc2vec).
o Captures semantic relationships between words.
• Universal Language Representation:
o Contextual representations using advanced neural models (e.g., ELMo, BERT).
o Accounts for word meaning based on context.
• Handcrafted Features:
o Domain-specific features manually designed for specific NLP tasks.
o Incorporates expert knowledge for tasks like text complexity or essay scoring.
3. Using D1, D2, D3, and D4, find their representations using one-hot encoding, bag of words, bag
of n-grams, and TF-IDF.
• D1: "Dog bites man"
• D2: "Man bites dog"
• D3: "Dog eats meat"
• D4: "Man eats food"
• Corpus Vocabulary: [dog, bites, man, eats, meat, food]
o (6 words, mapped as dog=1, bites=2, man=3, eats=4, meat=5, food=6).
• One-Hot Encoding :
o Each word is a 6-dimensional binary vector with a single 1 at its ID index.
o D1: Dog bites man → [[1,0,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0]]
o D2: Man bites dog → [[0,0,1,0,0,0], [0,1,0,0,0,0], [1,0,0,0,0,0]]
o D3: Dog eats meat → [[1,0,0,0,0,0], [0,0,0,1,0,0], [0,0,0,0,1,0]]
o D4: Man eats food → [[0,0,1,0,0,0], [0,0,0,1,0,0], [0,0,0,0,0,1]]
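A tiny sketch that reproduces these one-hot vectors from the vocabulary order above:

```python
# One-hot encode each word as a 6-dimensional binary vector.
vocab = ["dog", "bites", "man", "eats", "meat", "food"]

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab.index(word.lower())] = 1
    return vec

print([one_hot(w) for w in "Dog bites man".split()])
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
```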
• Bag of Words (BoW) :
o A 6-dimensional vector counting word occurrences in each document.
o D1: Dog bites man → [1,1,1,0,0,0]
o D2: Man bites dog → [1,1,1,0,0,0]
o D3: Dog eats meat → [1,0,0,1,1,0]
o D4: Man eats food → [0,0,1,1,0,1]
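The same BoW vectors can be produced with scikit-learn's CountVectorizer (a recent scikit-learn is assumed); note that it orders columns alphabetically, so the column order differs from the hand-ordered vocabulary above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
bow = CountVectorizer()
X = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(X.toarray())                  # one 6-dimensional count vector per document
```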
• Bag of N-Grams :
o Vocabulary of bigrams: [dog bites, bites man, man bites, bites dog, dog eats, eats meat, man eats, eats food].
o Each document is an 8-dimensional vector of bigram frequencies.
o D1: Dog bites man → [1,1,0,0,0,0,0,0]
o D2: Man bites dog → [0,0,1,1,0,0,0,0]
o D3: Dog eats meat → [0,0,0,0,1,1,0,0]
o D4: Man eats food → [0,0,0,0,0,0,1,1]
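The bigram representation follows from the same vectorizer with ngram_range=(2, 2):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
bigrams = CountVectorizer(ngram_range=(2, 2))
X = bigrams.fit_transform(docs)
print(bigrams.get_feature_names_out())  # the 8 bigrams of the corpus
print(X.toarray())                      # one 8-dimensional vector per document
```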
• TF-IDF :
o Combines Term Frequency (TF) and Inverse Document Frequency (IDF).
o TF-IDF scores :
▪ dog: 0.136, bites: 0.17, man: 0.136, eats: 0.17, meat: 0.17, food: 0.17
o D1: Dog bites man → [0.136, 0.17, 0.136, 0, 0, 0]
o D2: Man bites dog → [0.136, 0.17, 0.136, 0, 0, 0]
o D3: Dog eats meat → [0.136, 0, 0, 0.17, 0.17, 0]
o D4: Man eats food → [0, 0, 0.136, 0.17, 0, 0.17]
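A TfidfVectorizer sketch for comparison; scikit-learn uses a smoothed IDF and L2-normalizes each row, so its numbers will not match the hand-computed scores above exactly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))  # rarer words (meat, food) get the highest weights
```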
4. Explain the difference between
• (a) distributional similarity and distributional hypothesis :
o Similarity: Meaning derived from context (e.g., "bank" in "river bank" vs. "money bank").
o Hypothesis: Words in similar contexts have similar meanings (e.g., "dog" and "cat").
• (b) distributional representation and distributed representation :
o Distributional Representation:
▪ High-dimensional, sparse vectors based on word co-occurrence in contexts.
▪ Derived from a co-occurrence matrix.
▪ Examples: One-hot, BoW, n-grams, TF-IDF.
o Distributed Representation:
▪ Low-dimensional, dense vectors that compress distributional representations.
▪ Uses neural networks to capture semantic relationships.
▪ Examples: Word2vec, fastText, Doc2vec embeddings.
6. Explain with an example the architectural variants of Word2vec: CBOW and SkipGram.
• Continuous Bag of Words (CBOW):
o Predict the center word given context words.
o Example:
▪ Sentence: "The quick brown fox jumps over the lazy dog."
▪ Window (k=2): ["quick", "brown", "fox", "jumps", "over"]
▪ Input (X): ["quick", "brown", "jumps", "over"]
▪ Output (Y): "fox"
o Neural Network Architecture:
• SkipGram:
o Predict context words given the center word.
o Example:
▪ Same sliding window as CBOW, but now:
▪ Input (X): "fox"
▪ Output (Y): ["quick", "brown", "jumps", "over"]
• Difference:
o CBOW: Predicts the center word from context; faster, good for frequent words.
o SkipGram: Predicts context words from the center word; slower, better for rare words.
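A gensim sketch of both variants (gensim 4.x API assumed; the toy corpus is far too small for meaningful vectors and only illustrates the sg flag):

```python
from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
             ["dog", "bites", "man"],
             ["man", "bites", "dog"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # CBOW
skip = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # SkipGram

print(cbow.wv["fox"][:5])           # first values of the 50-dim vector for "fox"
print(skip.wv.most_similar("dog"))  # nearest neighbours under SkipGram
```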
7. How can the OOV (Out-of-Vocabulary) problem be solved?
• Problem: OOV words are those not present in the training vocabulary, causing issues in representations.
• Solutions:
o Ignore OOV Words:
▪ Exclude OOV words from feature extraction.
o Random Initialization :
▪ Assign random vectors (components between -0.25 and +0.25) to OOV words.
o Subword Information :
▪ Use algorithms like fastText, which represent words as character n-grams.
▪ Example: For "gregarious" (OOV), break it into character n-grams (gre, reg, ega, …, ous) and combine their embeddings; see the sketch after this list.
o Retrain Model :
▪ Expand vocabulary to include new words and retrain the model.
▪ Feasible but computationally expensive.
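A gensim FastText sketch of the subword solution (gensim 4.x assumed; corpus and sizes are illustrative):

```python
from gensim.models import FastText

sentences = [["dog", "bites", "man"], ["man", "eats", "food"], ["dog", "eats", "meat"]]
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

# "gregarious" never occurs in the corpus, but a vector is still composed
# from its character n-grams (gre, reg, ega, ..., ous).
print(model.wv["gregarious"].shape)  # (50,)
```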
8. What is the difference between Word2vec and Doc2vec?
• Word2vec: Learns dense vectors for individual words from their contexts.
• Doc2vec: Extends Word2vec with a paragraph/document vector, so it can produce an embedding for an entire text of arbitrary length.
9. What are the important aspects to keep in mind while using word embeddings?
• Bias in Training Data:
o Embeddings reflect biases in training data (e.g., “Apple” closer to “Microsoft” than “orange” in tech-heavy corpora).
o Mitigate biases to ensure fair NLP model performance.
• Scalability (Large File Sizes):
o Pre-trained embeddings (e.g., Word2vec ~4.5 GB) require significant storage.
o Use in-memory databases like Redis with caching to manage scaling and latency.
• Context Limitations:
o Embeddings may not capture domain-specific nuances (e.g., sarcasm detection).
o Combine with domain-specific features for better performance.
• Evolving Field:
o Neural text representation is rapidly advancing; new models (e.g., BERT) may outperform older ones.
o Consider ROI, business needs, and infrastructure constraints before adopting.
Chapter 4: Text Classification
1. What is the difference between binary, multi-class, and multi-label classification?
• Binary Classification:
o Two classes (e.g., spam vs. non-spam email).
o Each document belongs to exactly one class.
• Multi-Class Classification:
o More than two classes (e.g., sentiment as positive, negative, neutral).
o Each document belongs to exactly one class from set C
• Multi-Label Classification:
o One or more labels per document (e.g., a news article labeled as "sports" and "soccer").
o Labels are a subset of C; a document can have none, one, several, or all of the labels.
4. Classification can be done without the text classification pipeline. Explain how.
• Lexicon-Based Sentiment Analysis:
o Uses predefined lists of positive/negative words to classify text.
o No machine learning; relies on heuristics or rules.
o Example: A tweet with more positive words (e.g., “great”) is classified as positive (see the sketch after this list).
• Existing APIs:
o Use off-the-shelf APIs for generic tasks like sentiment analysis or category classification.
o No need to train a model; APIs provide pre-trained classifiers.
• Benefits:
o Quick MVP deployment.
o Provides a baseline for evaluation.
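A minimal sketch of the lexicon-based idea; the word lists are tiny and illustrative, real lexicons are much larger:

```python
# Count positive vs. negative words from a predefined lexicon; no ML involved.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def lexicon_sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("The food was great and the service excellent"))  # positive
```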
5. What is the use of a confusion matrix?
• Example: A classifier may perform well on non-relevant articles but struggle with relevant ones due to class imbalance.
• Use: Highlights errors (e.g., false positives/negatives) and class-specific performance.
6. List the potential reasons for poor classifier performance.
1. Large, Sparse Feature Vectors: Too many features introduce noise and sparsity, hindering learning.
2. Class Imbalance: Skewed data biases the model toward the majority class.
3. Suboptimal Algorithm: The chosen algorithm may not suit the dataset.
4. Poor Pre-Processing/Feature Extraction: Ineffective text cleaning or feature representation.
5. Untuned Parameters: Classifier hyperparameters need optimization.
10. List the steps for converting training and test data into a format suitable for the
neural network.
1. Tokenize Texts: Convert texts into word index vectors using a tokenizer.
2. Pad Sequences: Ensure all sequences are the same length by padding with zeros.
3. Map to Embeddings: Convert word indices to embedding vectors using a pre-trained or trainable embedding matrix.
4. Input to Neural Network: Use the resulting vectors as input to the neural network’s embedding layer.
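A tf.keras sketch of these four steps, using the older but still available keras preprocessing utilities (vocabulary size, sequence length, and texts are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models

train_texts = ["dog bites man", "man eats food", "dog eats meat"]

tokenizer = Tokenizer(num_words=10000)        # 1. texts -> word index vectors
tokenizer.fit_on_texts(train_texts)
seqs = tokenizer.texts_to_sequences(train_texts)
padded = pad_sequences(seqs, maxlen=20)       # 2. pad to a fixed length

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Embedding(10000, 100),             # 3. indices -> embedding vectors
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),    # 4. rest of the network
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```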
11. Which technique is better for text classification, CNN or LSTM, and why?
• CNNs:
o Strengths: Learn key bag-of-words/n-grams features; faster to train; less data-hungry.
o Best For: Smaller datasets or when speed is critical.
• LSTMs:
o Strengths: Capture sequential context (word order); suitable for language modeling.
o Weaknesses: Slower to train; data-hungry; may underperform on small datasets.
• Which is Better?:
o No universal winner; depends on dataset and task.
o CNNs often better for smaller datasets or speed; LSTMs for sequential context with large data.
o Experiment with both and tune hyperparameters (e.g., layers, epochs).
14. Give some options to explore when no labels exist for a dataset.
• Use Existing APIs/Libraries:
o Map API categories to relevant classes.
• Use Public Datasets:
o Adapt datasets like 20 Newsgroups to train a classifier.
• Weak Supervision:
o Create rules to bootstrap a dataset.
• Active Learning:
o Use tools like Prodigy to label key instances interactively.
• Feedback Integration:
o Use implicit or explicit signals to refine the model.
15. Describe the pipeline for building a classifier when there is no training data.
1. Start with a Baseline:
o Use a public API, public dataset , or weak supervision to create an initial model.
2. Deploy in Production:
o Apply the baseline model to real-world data.
3. Collect Feedback:
o Gather explicit and implicit signals on model performance.
4. Refine with Active Learning:
o Identify low-confidence predictions, label them, and retrain the model.
5. Iterate:
o Collect more data over time, transition to sophisticated models as data grows.
6. Monitor and Improve:
o Continuously update the model based on new data and feedback.