NLP Sheets

These revision sheets cover Natural Language Processing (NLP) in question-and-answer form. Chapter 1 provides an overview of NLP, including real-world applications like email filtering, voice assistants, and machine translation, key NLP tasks such as language modeling and text classification, the challenges of NLP, the relationship between NLP, machine learning (ML), and deep learning (DL), and various modeling techniques. Later chapters cover the NLP pipeline and data acquisition methods, text representation techniques, and text classification.


Chapter 1: NLP: A Primer

1. List examples of real-world applications of NLP


• Email platforms:
o Spam classification, priority inbox, calendar event extraction, auto-complete (e.g., Gmail, Outlook).
• Voice-based assistants:
o Apple Siri, Google Assistant, Amazon Alexa, Microsoft Cortana.
• Modern search engines:
o Query understanding, query expansion, question answering, information retrieval (e.g., Google, Bing).
• Machine translation:
o Google Translate, Microsoft Bing Translator, Amazon Translate.

Other applications:

• Social media feed analysis:
o Companies analyze their social media feeds to build a better and deeper understanding of the voice of their customers.

• E-commerce platforms:
o Product description extraction, review analysis (e.g., Amazon).
• Healthcare, finance, and law:
o Companies such as Arria automate report generation, legal document analysis, and financial forecasting.
• Spelling and grammar correction:
o Grammarly, Microsoft Word, Google Docs.
• IBM Watson: An AI system built using NLP techniques that competed on the "Jeopardy!" quiz show and won $1 million,
outperforming human champions.
• Educational tools:
o Automated scoring (e.g., GRE)
o plagiarism detection (e.g., Turnitin)
o language learning apps (e.g., Duolingo)
• Knowledge bases:
o Google Knowledge Graph for search and question answering.

2. Explain the following NLP tasks

• Language modeling:
o Predicting the next word in a sequence based on previous words.
o Used in speech recognition, machine translation, and spelling correction.
• Text classification:
o Categorizing text into predefined classes (e.g., spam detection, sentiment analysis).
• Information extraction:
o Extracting structured information from unstructured text (e.g., extracting names or events from emails).
• Information retrieval:
o Finding relevant documents from a large collection (e.g., search engines).
• Conversational agent:
o Building systems that can converse with humans (e.g., chatbots, voice assistants).
• Text summarization:
o Creating concise summaries of long documents while retaining key information.
• Question answering:
o Building systems that can answer questions posed in natural language.
• Machine translation:
o Translating text from one language to another (e.g., Google Translate).
• Topic modeling:
o Identifying latent topics in a large collection of documents (e.g., identifying themes).

3. What are the building blocks of language and their applications?

• Phonemes:
o Smallest units of sound in a language.
o Used in speech recognition and text-to-speech systems.
• Morphemes and lexemes:
o Smallest units of meaning.
o Used in tokenization, stemming, and part-of-speech tagging.
• Syntax:
o Rules for constructing grammatically correct sentences.
o Used in parsing and sentence structure analysis.
• Context:
o Meaning derived from semantics and pragmatics.
o Used in tasks like sarcasm detection, summarization, and topic modeling.

4. Why is NLP challenging?

• Ambiguity: Words and sentences can have multiple meanings depending on context.
• Common knowledge: Humans rely on implicit knowledge that machines lack.
• Creativity: Language includes creative elements like poetry, metaphors, and idioms.
• Diversity across languages: No direct mapping between vocabularies of different languages.
• Complexity of human language: Syntax, semantics, and pragmatics make language processing difficult for
machines.

5. How are NLP, ML, and DL related?

• AI: Broad field aiming to build systems that perform tasks requiring human intelligence.
• ML: Subfield of AI that learns patterns from data without explicit programming.
• DL: Subfield of ML based on neural networks to model complex patterns.
• NLP: Subfield of AI focused on language processing, often using ML and DL techniques.

6. Describe the heuristics-based NLP

• Rule-Based Systems: Early NLP systems relied on handcrafted rules and resources like dictionaries and
thesauruses.
• Lexicon-Based Sentiment Analysis: Uses counts of positive and negative words to deduce sentiment.
• Knowledge Bases: WordNet (synonyms, hyponyms, meronyms), Open Mind Common Sense.
• Regex and CFG: Regular expressions and context-free grammars for text analysis.
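A minimal sketch of the regex idea (the spam keywords and pattern below are illustrative assumptions, not from the course material):

import re

# Rule-based heuristic: flag a message as spam if it matches simple keyword/currency patterns.
SPAM_PATTERN = re.compile(r"\b(free|winner|lottery)\b|\$\d+", re.IGNORECASE)

def looks_like_spam(text: str) -> bool:
    return bool(SPAM_PATTERN.search(text))

print(looks_like_spam("You are a WINNER, claim your $1000 prize"))  # True
print(looks_like_spam("Meeting moved to 3 pm"))                     # False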

7. Explain briefly Naive Bayes, Support Vector Machine, Hidden Markov Model, and Conditional
Random Fields approaches

• Naive Bayes: A probabilistic classifier based on Bayes’ theorem. Assumes feature independence. Used in text
classification.
• Support Vector Machine (SVM): A classifier that finds the optimal decision boundary between classes. Used
in text classification.
• Hidden Markov Model (HMM): A statistical model for sequential data. Used in part-of-speech tagging.
• Conditional Random Fields (CRF): A sequential classifier that considers context. Used in named entity
recognition and part-of-speech tagging.
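A minimal scikit-learn sketch contrasting Naive Bayes and SVM on the same TF-IDF features (the toy texts and labels are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["free lottery winner", "meeting at 3 pm", "claim your prize now", "project status update"]
labels = ["spam", "ham", "spam", "ham"]

X = TfidfVectorizer().fit_transform(texts)    # sparse TF-IDF feature vectors
nb = MultinomialNB().fit(X, labels)           # probabilistic, assumes feature independence
svm = LinearSVC().fit(X, labels)              # finds a separating decision boundary
print(nb.predict(X[:1]), svm.predict(X[:1]))  # both predict the class of the first text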

8. What is the difference between RNN and LSTM NN?

• RNN: Processes sequential data but struggles with long-term dependencies due to the vanishing gradient
problem.
• LSTM: A variant of RNN that uses memory cells to retain long-term context, making it more effective for
longer sequences.

9. How can CNNs be used for text processing?

• CNNs process text by converting each word into a word vector; a sentence of n words then forms an n ✕ d matrix.
o n: number of words in the sentence, d: size of the word vectors.
• This matrix can be treated like an image and modeled by a CNN.
• Convolution filters capture local patterns (e.g., n-grams).
• Pooling layers condense features, and fully connected layers classify the text.
• This makes CNNs effective for tasks like sentiment analysis and text classification.
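A minimal Keras sketch of this idea (the vocabulary size, embedding dimension, and sentence length are assumed values):

from tensorflow.keras import layers, models

vocab_size, d, n = 10000, 100, 50              # assumed vocabulary size, embedding size, max sentence length
model = models.Sequential([
    layers.Input(shape=(n,)),                  # a sentence as a sequence of n word indices
    layers.Embedding(vocab_size, d),           # each sentence becomes an n x d matrix of word vectors
    layers.Conv1D(128, 3, activation="relu"),  # filters slide over 3-word windows (n-gram-like patterns)
    layers.GlobalMaxPooling1D(),               # pooling condenses the detected features
    layers.Dense(1, activation="sigmoid"),     # fully connected layer for binary classification (e.g., sentiment)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()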

10. Describe the concept of transfer learning

• Applying knowledge learned from one task to a different but related task.
• Example: Pre-training a large model (e.g., BERT) on a massive dataset, then fine-tuning it for specific NLP
tasks like text classification or question answering.

11. Give the architecture of autoencoder

• Input Layer: Takes in text data.


• Hidden Layer (Encoder): Compresses the input into a dense vector representation.
• Output Layer (Decoder): Reconstructs the input from the compressed representation.
• Purpose: Used for unsupervised learning of feature representations.
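A minimal Keras sketch of this architecture (the input and bottleneck sizes are assumptions, e.g., a bag-of-words input):

from tensorflow.keras import layers, models

input_dim, latent_dim = 1000, 32                                   # assumed input size and bottleneck size
inputs = layers.Input(shape=(input_dim,))                          # input layer: text as a fixed-size vector
encoded = layers.Dense(latent_dim, activation="relu")(inputs)      # encoder: compress to a dense representation
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)   # decoder: reconstruct the input
autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")  # trained to reproduce its own input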

12. List the key reasons that make DL not suitable for all NLP tasks

• Overfitting on Small Datasets: DL models require large amounts of data.


• Few-Shot Learning: DL struggles with learning from very few examples.
• Domain Adaptation: DL models trained on one domain may not generalize well to another.
• Interpretability: DL models are often black boxes, making it hard to explain predictions.
• Cost: DL models are expensive to train and deploy.
• On-Device Deployment: DL models may not fit on devices with limited memory and power.

13. Explain the flow of conversation agents

• Speech recognition: Converts speech to text.


• Natural language understanding: Analyzes text for sentiment, entities, and intent.
• Dialog management: Determines the user’s intent and decides the next action.
• Response generation: Generates a response (e.g., retrieving information or performing an action).

Chapter 2: NLP Pipeline
1. What are the key stages of a generic pipeline for NLP system development?
• Data acquisition: Collecting relevant data for the task

• Text cleaning: Removing noise and non-textual information


• Pre-processing: Converting text to canonical form (tokenization, normalization, etc.)
• Feature engineering: Creating numerical representations of text
• Modeling: Building and training ML/DL models
• Evaluation: Measuring model performance
• Deployment: Integrating model into production
• Monitoring and model updating: Continuous improvement

2. How can we get data required for training an NLP technique?


• Public datasets (e.g., Google Dataset Search).

• Scraping data (e.g., forums, Stack Overflow).


• Product intervention (instrumenting products to collect data).
• Data augmentation (creating synthetic data).

3. List the different data augmentation methods?


• Synonym replacement (replace words with synonyms).

• Back translation (translate to another language and back).


• TF-IDF-based word replacement (replace words based on importance).
• Bigram flipping (swap adjacent word pairs).
• Replacing entities (swap names, locations, etc.).
• Adding noise (simulate typos, keyboard errors).
• Advanced techniques (Snorkel, EDA, NLPAug, Active Learning).
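A minimal sketch of synonym replacement (the synonym table is a toy assumption; libraries such as NLPAug automate this):

import random

SYNONYMS = {"good": ["great", "fine"], "movie": ["film"], "slow": ["sluggish"]}  # toy synonym table

def synonym_replace(sentence, p=0.5):
    # Each word with a known synonym is replaced with probability p, producing a new training example.
    words = sentence.split()
    return " ".join(
        random.choice(SYNONYMS[w]) if w in SYNONYMS and random.random() < p else w
        for w in words
    )

print(synonym_replace("the movie was good but slow"))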
4. Data can be collected from PDF files, HTML pages, and images. How can this data be
cleaned based on its source?
• HTML Parsing (BeautifulSoup, Scrapy).

• PDF Extraction (PyPDF, PDFMiner, OCR for scanned PDFs).


• Image Text Extraction (Tesseract OCR).
• Unicode Normalization (handling emojis, symbols).
• Spelling Correction (Microsoft Spell Check API, pyenchant).
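A minimal BeautifulSoup sketch for the HTML case (the HTML string is a made-up example):

from bs4 import BeautifulSoup

html = "<html><body><h1>NLP</h1><p>Text cleaning example.</p><script>var x = 1;</script></body></html>"
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):             # drop non-textual elements
    tag.decompose()
print(soup.get_text(separator=" ", strip=True))   # -> "NLP Text cleaning example."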

5. Using the dot (.) to segment sentences can cause problems. Explain how.
• Abbreviations (e.g., "Dr.", "Mr.").

• Ellipses ("...").
• Decimals ("3.14").
• URLs/Domains ("example.com").
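A minimal NLTK sketch showing why a trained sentence tokenizer is preferred over splitting on "." (assumes NLTK and its punkt model are installed):

import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

text = "Dr. Smith paid $3.14 at example.com. Then he left... Did he return?"
print(sent_tokenize(text))
# A trained tokenizer handles abbreviations like "Dr." and decimals like "3.14";
# a naive text.split(".") would break at every dot.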

6. What are the frequent steps in the data pre-processing phase?


• Basic: Sentence segmentation, word tokenization.

• Common: Stop word removal, lowercasing, removing digits/punctuation.


• Advanced: Stemming, lemmatization, normalization, POS tagging.

7. With examples, explain the differences between stemming and lemmatization.

Stemming:
• Reduces words to base form by removing suffixes
• Doesn't always produce linguistically correct words
• Example: "cars" → "car", "revolution" → "revolut"
(incorrect)

Lemmatization:
• Maps words to their base dictionary form (lemma)
• Requires linguistic knowledge
• Example: "better" → "good", "was" → "be"
8. What is the difference between code mixing and transliteration?
• Code mixing: Using multiple languages in one sentence/phrase
o (e.g., Singlish: "We makan at kopitiam").
• Transliteration: Writing words from one language using another language's script
o (e.g., "namaste" for नमस्ते).

9. Describe the concept coreference resolution.


• Identifying all expressions that refer to the same entity in a text.
• Links pronouns/noun phrases to their referents
o (e.g., "John left. He was angry." → "He" = "John").

10. Explain the feature engineering for classical NLP versus DL-based NLP

Classical NLP:

• Handcrafted features based on domain knowledge


• Features are interpretable
• Example: Counting positive/negative words for sentiment
analysis

DL-based NLP:

• Raw text is fed directly to model


• Model learns features automatically
• Features are not interpretable
• Uses embeddings (word2vec, GloVe, etc.)

• In short:
o Classical NLP: Handcrafted features (e.g., TF-IDF, n-grams) fed to ML models (interpretable).
o DL: Raw text → dense embeddings (e.g., word2vec) → hidden layers → output (less interpretable).

11. How to combine heuristics directly or indirectly with the ML model?

Two approaches:
• Direct: Use heuristic outputs as features (e.g., spam word count in email classification).
• Indirect: Pre-process inputs with heuristics to filter data before ML (e.g., filter obvious spam before ML).

12. What is the difference between models ensembling and stacking?


• Ensembling: Combines predictions from multiple models in parallel (e.g., by voting or averaging).

• Stacking: Feeds one model's output as input to another, in a sequential approach (e.g., a logistic
regression on top of Naive Bayes outputs).

13. Which modeling technique can be used in the following cases: small data, large data,
poor data quality, and good data quality?
• Small data: Rule-based systems, traditional ML (less data hungry)

• Large data: Deep learning, richer feature sets, unsupervised learning.


• Poor data quality: More cleaning/pre-processing needed
• Good data quality: Can use off-the-shelf algorithms/APIs

14. What is the difference between intrinsic and extrinsic evaluation?


• Intrinsic:
o Measures model performance using ML metrics (e.g., accuracy, precision, recall).
o Automated evaluation.
o Example: Spam detection precision/recall.
• Extrinsic:
o Measures real-world impact using business metrics.
o Requires human/real-world testing.
o Example: Time saved by a spam filter.

15. What are the metrics that can be used in classification, measuring model quality,
information retrieval, prediction, machine translation, and summarization tasks?
• Classification: Accuracy, precision, recall, F1, AUC

• Model quality: RMSE, MAPE


• Information retrieval: MRR, MAP, recall@K
• Prediction: RMSE, MAPE
• Machine translation: BLEU, METEOR
• Summarization: ROUGE

16. Describe the deployment, monitoring, and updating phases of the NLP pipeline.

• Deployment: Integrating into production (e.g., web service).


• Monitoring: Tracking performance (dashboards, logs).
• Updating: Retraining with new data.

17. Explain how the NLP pipeline differs from one language to another.
• High-resource languages (e.g., English): Use pre-trained models.
• Low-resource languages (e.g., Swahili): Manual labeling, morphological analyzers.
• CJK languages (Chinese/Japanese/Korean): Special tokenization (no spaces).
• Morphologically rich languages (e.g., Turkish): Handle complex word forms.

18. Describe the NLP pipeline for ranking tickets in a ticketing system by Uber.
1. Data sources: Ticket text, trip data, ticket metadata.
2. Pre-processing: Tokenization, lowercasing, stop word removal, lemmatization.
3. Feature engineering:
o Bag-of-words → TF-IDF/LSI → cosine similarity with historical solutions.
4. Modeling: Binary classifier ranks solutions; top 3 displayed.
5. Evaluation: MRR (intrinsic), cost savings (extrinsic).
6. Deployment & updates: Continuous testing, model refinement.

Chapter 3: Text Representation
1. List the four categories of text representation techniques.
• Basic Vectorization Approaches:
o Includes one-hot encoding, bag of words (BoW), bag of n-grams, and TF-IDF.
o Simple methods to convert text into numerical vectors.
• Distributed Representations:
o Uses neural networks to create dense, low-dimensional representations (e.g., Word2vec, fastText, Doc2vec).
o Captures semantic relationships between words.
• Universal Language Representation:
o Contextual representations using advanced neural models (e.g., ELMo, BERT).
o Accounts for word meaning based on context.
• Handcrafted Features:
o Domain-specific features manually designed for specific NLP tasks.
o Incorporates expert knowledge for tasks like text complexity or essay scoring.

2. Describe the concept of vector space models.


• A mathematical model representing text units (e.g., words, phrases, documents) as vectors in a high-dimensional space
• Known as the Vector Space Model (VSM) or term vector model.
• Used for information retrieval, classification, clustering.
• Similarity is measured using cosine similarity: cos(θ) = (A · B) / (‖A‖ ‖B‖), where A and B are the vectors being compared.
• Captures linguistic properties for ML tasks.

3. Using D1, D2, D3, and D4, find their representations using one-hot encoding, bag of words, bag of n-grams, and TF-IDF.
• D1: "Dog bites man"
• D2: "Man bites dog"
• D3: "Dog eats meat"
• D4: "Man eats food"
• Corpus Vocabulary: [dog, bites, man, eats, meat, food]
o (6 words, mapped by their position in the vocabulary: dog=1, bites=2, man=3, eats=4, meat=5, food=6).
• One-Hot Encoding :
o Each word is a 6-dimensional binary vector with a single 1 at its ID index.
o D1: Dog bites man → [[1,0,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0]]
o D2: Man bites dog → [[0,0,1,0,0,0], [0,1,0,0,0,0], [1,0,0,0,0,0]]
o D3: Dog eats meat → [[1,0,0,0,0,0], [0,0,0,1,0,0], [0,0,0,0,1,0]]
o D4: Man eats food → [[0,0,1,0,0,0], [0,0,0,1,0,0], [0,0,0,0,0,1]]
• Bag of Words (BoW) :
o A 6-dimensional vector counting word occurrences in each document.
o D1: Dog bites man → [1,1,1,0,0,0]
o D2: Man bites dog → [1,1,1,0,0,0]
o D3: Dog eats meat → [1,0,0,1,1,0]
o D4: Man eats food → [0,0,1,1,0,1]
• Bag of N-Grams :
o Vocabulary of bigrams: [dog bites, bites man, man bites, bites dog, dog eats, eats meat, man eats, eats food].
o Each document is an 8-dimensional vector of bigram frequencies.
o D1: Dog bites man → [1,1,0,0,0,0,0,0]
o D2: Man bites dog → [0,0,1,1,0,0,0,0]
o D3: Dog eats meat → [0,0,0,0,1,1,0,0]
o D4: Man eats food → [0,0,0,0,0,0,1,1]
• TF-IDF :
o Combines Term Frequency (TF) and Inverse Document Frequency (IDF).

o TF-IDF scores :
▪ dog: 0.136, bites: 0.17, man: 0.136, eats: 0.17, meat: 0.17, food: 0.17
o D1: Dog bites man → [0.136, 0.17, 0.136, 0, 0, 0]
o D2: Man bites dog → [0.136, 0.17, 0.136, 0, 0, 0]
o D3: Dog eats meat → [0.136, 0, 0, 0.17, 0.17, 0]
o D4: Man eats food → [0, 0, 0.136, 0.17, 0, 0.17]
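A scikit-learn sketch of the same four documents (note: sklearn orders its vocabulary alphabetically, so the column order and exact TF-IDF values differ from the hand-worked vectors above):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())        # bag-of-words counts
print(bow.get_feature_names_out())              # alphabetical vocabulary

bigrams = CountVectorizer(ngram_range=(2, 2))   # bag of bigrams
print(bigrams.fit_transform(docs).toarray())

tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())      # TF-IDF weights (sklearn uses a smoothed IDF and L2 normalization)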

4. Explain the difference between
• (a) distributional similarity and distributional hypothesis :
o Similarity: Meaning derived from context (e.g., "bank" in "river bank" vs. "money bank").
o Hypothesis: Words in similar contexts have similar meanings (e.g., "dog" and "cat").
• (b) distributional representation and distributed representation :
o Distributional Representation:
▪ High-dimensional, sparse vectors based on word co-occurrence in contexts.
▪ Derived from a co-occurrence matrix.
▪ Examples: One-hot, BoW, n-grams, TF-IDF.
o Distributed Representation:
▪ Low-dimensional, dense vectors that compress distributional representations.
▪ Uses neural networks to capture semantic relationships.
▪ Examples: Word2vec, fastText, Doc2vec embeddings.

5. Describe the word embedding concept with an example of its use.


• Definition:
o Word embeddings are dense, low-dimensional vectors (50-500 dimensions) that capture semantic
relationships based on distributional similarity.
o Learned using neural networks (e.g., Word2vec) to represent words in a vector space where similar
words cluster together.
• Example of Use:
o Task: Finding words similar to “beautiful” using pre-trained Word2vec embeddings.
from gensim.models import KeyedVectors  # runnable sketch, assuming gensim and the pre-trained GoogleNews vectors
w2v_model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
w2v_model.most_similar("beautiful")
# Output: [("gorgeous", 0.83), ("lovely", 0.81), ...] (ranked by cosine similarity)
o Use Case: Semantic analysis and text classification.

6. Explain with an example the architectural variants of Word2vec: CBOW and SkipGram.
• Continuous Bag of Words (CBOW):
o Predict the center word given context words.
o Example:

▪ Sentence: "The quick brown fox jumps over the lazy dog."
▪ Window (k=2): ["quick", "brown", "fox", "jumps", "over"]
▪ Input (X): ["quick", "brown", "jumps", "over"]
▪ Output (Y): "fox"

o Neural Network Architecture:

▪ Input Layer: One-hot encoded context words.


▪ Hidden Layer: Sum/Average of word embeddings.
▪ Output Layer: Softmax predicts the center word.

• SkipGram:
o Predict context words given the center word.
o Example:
▪ Same sliding window as CBOW, but now:

▪ Input (X): Center word ("fox").


▪ Output (Y): All context words (["quick", "brown", "jumps", "over"]).

o Neural Network Architecture:

▪ Input Layer: One-hot encoded center word.


▪ Hidden Layer: Word embedding.
▪ Output Layer: Softmax predicts each context word.

• Difference:

• CBOW: Predicts the center word from its context; faster; works well for frequent words.
• SkipGram: Predicts context words from the center word; slower; works better for rare words.
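A minimal gensim sketch of training both variants on a toy corpus (the corpus and hyperparameters are assumptions):

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # sg=1 -> SkipGram
print(cbow.wv["fox"].shape, skipgram.wv["fox"].shape)   # both learn 50-dimensional embeddings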

7. How can the OOV (out-of-vocabulary) problem be solved?
• Problem: OOV words are those not present in the training vocabulary, which causes issues when building representations.
• Solutions:
o Ignore OOV Words:
▪ Exclude OOV words from feature extraction.
o Random Initialization :
▪ Assign random vectors (components between -0.25 and +0.25) to OOV words.
o Subword Information :
▪ Use algorithms like fastText, which represent words as character n-grams.
▪ Example: For "gregarious" (OOV), break it into character n-grams (gre, reg, ega, …, ous) and combine their embeddings.
o Retrain Model :
▪ Expand vocabulary to include new words and retrain the model.
▪ Feasible but computationally expensive.
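A minimal gensim fastText sketch of the subword idea (the toy corpus and parameters are assumptions):

from gensim.models import FastText

sentences = [["dogs", "are", "gregarious", "animals"], ["cats", "are", "independent"]]
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)
print(model.wv["gregariousness"][:5])   # an OOV word still gets a vector, built from its character n-grams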

8. What is the difference between Doc2vec and Word2vec?


• Word2vec:
o Learns embeddings for individual words based on their context.
o Representations are context-independent; each word has a fixed vector.
• Doc2vec:
o Learns embeddings for entire texts (phrases, sentences, documents).
o Incorporates context by learning a unique “paragraph vector” for each text alongside shared word
vectors.
• Difference:

• Word2vec: Learns word-level embeddings; ignores word order; used for word similarity.
• Doc2vec: Learns document-level embeddings; captures context within a document; used for document classification.

9. What are the important aspects to keep in mind while using word embeddings?
• Bias in Training Data:
o Embeddings reflect biases in training data (e.g., “Apple” closer to “Microsoft” than “orange” in tech-heavy corpora).
o Mitigate biases to ensure fair NLP model performance.
• Scalability (Large File Sizes):
o Pre-trained embeddings (e.g., Word2vec ~4.5 GB) require significant storage.
o Use in-memory databases like Redis with caching to manage scaling and latency.
• Context Limitations:
o Embeddings may not capture domain-specific nuances (e.g., sarcasm detection).
o Combine with domain-specific features for better performance.
• Evolving Field:
o Neural text representation is rapidly advancing; new models (e.g., BERT) may outperform older ones.
o Consider ROI, business needs, and infrastructure constraints before adopting.

11. How can high-dimensional data be represented visually?


• Technique: t-SNE (t-distributed Stochastic Neighbor Embedding).
• Purpose:
o Reduces high-dimensional data to 2D or 3D for visualization.
o Preserves data distributions from high-dimensional to low-dimensional space.
• Applications:
o MNIST Digits
o Word Embeddings
o Word Analogies
o Document Embeddings
• Tools: TensorBoard’s Embedding Projector helps explore feature quality and relationships visually.
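A minimal scikit-learn/matplotlib sketch (the embeddings here are random stand-ins for real word or document vectors):

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X = np.random.rand(200, 300)   # stand-in for 200 embeddings of dimension 300
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5)
plt.title("t-SNE projection of embeddings")
plt.show()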

12. With an example, explain the use of handcrafted feature representations.


• Definition:
o Custom features incorporating domain-specific knowledge for specific NLP tasks.
o Used when general-purpose embeddings (e.g., Word2vec) are insufficient.
• Examples:
o TextEvaluator: Helps teachers select grade-appropriate reading materials by assessing text complexity.
▪ Measures syntactic complexity, concreteness, and other domain-specific metrics.
o Automated Essay Scoring: Features for evaluating essay coherence and grammar in GRE/TOEFL exams.
o Spelling/Grammar Correction: Features for detecting specific error types in tools like Grammarly.
• Use Case:
o Combined with vectorization/embeddings for hybrid approaches.

Chapter 4: Text Classification
1. What is the difference between binary, multi-class, and multi-label classification?
• Binary Classification:
o Two classes (e.g., spam vs. non-spam email).
o Each document belongs to exactly one class.
• Multi-Class Classification:
o More than two classes (e.g., sentiment as positive, negative, neutral).
o Each document belongs to exactly one class from the set of classes C.
• Multi-Label Classification:
o One or more labels per document (e.g., a news article labeled as "sports" and "soccer").
o Labels are a subset of C; a document can have none, one, multiple, or all of the labels.

2. Give some applications of text classification.


• Content Classification and Organization:
o Tagging news, blogs, product reviews; organizing emails (e.g., Gmail’s personal, social, promotions tabs).
• Customer Support:
o Identifying actionable social media posts .
• E-Commerce:
o Sentiment analysis of product reviews; aspect-based sentiment analysis.
• Other Applications:
o Language identification (e.g., Google Translate).
o Authorship attribution.
o Triaging mental health forum posts.
o Segregating fake news from real news.

3. Describe the pipeline for building text classification systems.


1. Collect Labeled Dataset: Gather data suitable for the task.
2. Split Dataset: Divide into training, validation (development), and test sets; select evaluation metric(s).
3. Feature Extraction: Transform raw text into feature vectors (e.g., BoW, embeddings).
4. Train Classifier: Use feature vectors and labels to train a model.
5. Evaluate Model: Benchmark performance on the test set using metrics (e.g., accuracy, F1 score).
6. Deploy and Monitor: Deploy the model in production and monitor real-world performance.
o Iterate steps 3-5 to tune features, algorithms, and hyperparameters.

4. Classification can be done without the text classification pipeline. Explain how.
• Lexicon-Based Sentiment Analysis:
o Uses predefined lists of positive/negative words to classify text.
o No machine learning; relies on heuristics or rules.
o Example: A tweet with more positive words (e.g., “great”) is classified as positive.
• Existing APIs:
o Use off-the-shelf APIs for generic tasks like sentiment analysis or category classification.
o No need to train a model; APIs provide pre-trained classifiers.
• Benefits:
o Quick MVP deployment.
o Provides a baseline for evaluation.

5. Describe with an example the confusion matrix of a classifier.


• A table showing predicted vs. actual class labels to evaluate classifier performance.
• Example: Naive Bayes classifier on the “Economic News Article Tone and Relevance” dataset (binary:
relevant vs. non-relevant).
o Confusion Matrix :
▪ Non-relevant (predicted correctly): High accuracy (86%).
▪ Relevant (predicted correctly): Low accuracy (42%).

o Conclusion: The classifier performs well on non-relevant articles but struggles with relevant ones due to class imbalance.
• Use: Highlights errors (e.g., false positives/negatives) and class-specific performance.

6. List the potential reasons for poor classifier performance.
1. Large, Sparse Feature Vectors: Too many features introduce noise and sparsity, hindering learning.
2. Class Imbalance: Skewed data biases the model toward the majority class.
3. Suboptimal Algorithm: The chosen algorithm may not suit the dataset.
4. Poor Pre-Processing/Feature Extraction: Ineffective text cleaning or feature representation.
5. Untuned Parameters: Classifier hyperparameters need optimization.

7. How to solve the class imbalance problem of a dataset?


• Oversampling: Increase instances of the minority class.
• Undersampling: Reduce instances of the majority class.
• Weight Balancing: Adjust classifier weights to prioritize minority classes.
• Tools: Use libraries like Imbalanced-Learn for sampling methods.
• Example: Logistic regression with class_weight="balanced" boosts minority class weights.
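A minimal scikit-learn sketch of weight balancing (the toy texts and labels are made up; class 1 is the minority):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["refund not received", "great product", "love it", "fast delivery", "item arrived broken"]
labels = [1, 0, 0, 0, 1]                     # imbalanced: only two complaints
X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression(class_weight="balanced").fit(X, labels)   # upweights the minority class
print(clf.predict(X))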

8. What is the difference between generative and discriminative classifiers?


• Generative Classifier (e.g., Naive Bayes):
o Models the joint probability P(X, Y) of features X and labels Y.
o Chooses the class with the highest probability.
• Discriminative Classifier (e.g., Logistic Regression, SVM):
o Models the conditional probability P(Y|X) directly.
o Aims to find a decision boundary between classes.

9. How to use word embeddings as features for text classification?


• Load a pre-trained embedding model (e.g., Word2vec GoogleNews-vectors).
• Pre-process text (e.g., tokenization, lowercasing).
• Create feature vectors by averaging embeddings of words in a text (ignore OOV words).
• Train a classifier (e.g., logistic regression) using these vectors as an input.
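A minimal sketch of averaging pre-trained embeddings into document features (assumes gensim and the GoogleNews Word2vec file are available):

import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed(text):
    vectors = [w2v[w] for w in text.lower().split() if w in w2v]   # skip OOV words
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

features = np.vstack([embed(t) for t in ["dog bites man", "man eats food"]])
# `features` can now be fed to a classifier such as logistic regression.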

10. List the steps for converting training and test data into a format suitable for the
neural network.
1. Tokenize Texts: Convert texts into word index vectors using a tokenizer.
2. Pad Sequences: Ensure all sequences are the same length by padding with zeros.
3. Map to Embeddings: Convert word indices to embedding vectors using a pre-trained or trainable embedding matrix.
4. Input to Neural Network: Use the resulting vectors as input to the neural network’s embedding layer.
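A minimal Keras preprocessing sketch of these steps (the texts and maxlen value are assumptions):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_texts = ["dog bites man", "man eats food"]
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_texts)                            # build the word index on training data only
sequences = tokenizer.texts_to_sequences(train_texts)          # texts -> lists of word indices
padded = pad_sequences(sequences, maxlen=10, padding="post")   # zero-pad to a fixed length
print(padded)   # ready for an embedding layer that maps indices to vectors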

11. Which technique is better for text classification, CNN or LSTM, and why?
• CNNs:
o Strengths: Learn key bag-of-words/n-grams features; faster to train; less data-hungry.
o Best For: Smaller datasets or when speed is critical.
• LSTMs:
o Strengths: Capture sequential context (word order); suitable for language modeling.
o Weaknesses: Slower to train; data-hungry; may underperform on small datasets.
• Which is Better?:
o No universal winner; depends on dataset and task.
o CNNs often better for smaller datasets or speed; LSTMs for sequential context with large data.
o Experiment with both and tune hyperparameters (e.g., layers, epochs).

12. How can text classification models be interpreted?


• Need for Interpretability: Explain predictions for transparency or debugging.
• Tool: Lime (Local Interpretable Model-agnostic Explanations).
o Approximates a black-box model with a linear model locally around a test instance.
o Outputs weighted features influencing the prediction.
• Use Cases: Justify decisions, inform feature selection, improve model reliability.
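A minimal Lime sketch on a toy sentiment classifier (the texts, labels, and class names are made up):

from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great movie, loved it", "terrible plot, boring", "wonderful acting", "awful and dull"]
labels = [1, 0, 1, 0]
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
exp = explainer.explain_instance("boring but wonderful acting", pipe.predict_proba, num_features=4)
print(exp.as_list())   # words with weights showing how each pushed the prediction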

13. How to solve no training and less training data problems?


• No Training Data:
o Manual Labeling: Use domain experts to label data.
o Weak Supervision: Use patterns/rules to label data ; tools like Snorkel.
o Crowdsourcing: Use platforms like Amazon Mechanical Turk or Figure Eight for large-scale labeling.
• Less Training Data:
o Active Learning:
▪ Train with available data, identify low-confidence predictions, label those, and retrain.
o Domain Adaptation/Transfer Learning:
▪ Fine-tune a pre-trained model on target domain’s unlabeled data, then train on labeled data.

14. Give some options to explore when no labels exist for a dataset.
• Use Existing APIs/Libraries:
o Map API categories to relevant classes.
• Use Public Datasets:
o Adapt datasets like 20 Newsgroups to train a classifier.
• Weak Supervision:
o Create rules to bootstrap a dataset.
• Active Learning:
o Use tools like Prodigy to label key instances interactively.
• Feedback Integration:
o Use implicit or explicit signals to refine the model.

15. Describe the pipeline for building a classifier when there is no training data.
1. Start with a Baseline:
o Use a public API, public dataset , or weak supervision to create an initial model.
2. Deploy in Production:
o Apply the baseline model to real-world data.
3. Collect Feedback:
o Gather explicit and implicit signals on model performance.
4. Refine with Active Learning:
o Identify low-confidence predictions, label them, and retrain the model.
5. Iterate:
o Collect more data over time, transition to sophisticated models as data grows.
6. Monitor and Improve:
o Continuously update the model based on new data and feedback.

