NLP Sheets
1. List the applications of NLP.
Other applications:
• E-commerce platforms:
o Product description extraction, review analysis (e.g., Amazon).
• Healthcare, finance, and law
• Companies such as Arria:
o Automating report generation, legal document analysis, financial forecasting.
• Spelling and grammar correction:
o Grammarly, Microsoft Word, Google Docs.
• IBM Watson: An AI system built using NLP techniques that competed on the "Jeopardy!" quiz show, won $1 million,
and outperformed human champions.
• Educational tools:
o Automated scoring (e.g., GRE)
o Plagiarism detection (e.g., Turnitin)
o Language learning apps (e.g., Duolingo)
• Knowledge bases:
o Google Knowledge Graph for search and question answering.
2. Explain the following NLP tasks
• Language modeling:
o Predicting the next word in a sequence based on previous words.
o Used in speech recognition, machine translation, and spelling correction.
• Text classification:
o Categorizing text into predefined classes (e.g., spam detection, sentiment analysis).
• Information extraction:
o Extracting structured information from unstructured text (e.g., extracting names or events from emails).
• Information retrieval:
o Finding relevant documents from a large collection (e.g., search engines).
• Conversational agent:
o Building systems that can converse with humans (e.g., chatbots, voice assistants).
• Text summarization:
o Creating concise summaries of long documents while retaining key information.
• Question answering:
o Building systems that can answer questions posed in natural language.
• Machine translation:
o Translating text from one language to another (e.g., Google Translate).
• Topic modeling:
o Identifying latent topics in a large collection of documents (e.g., identifying themes).
3. What are the building blocks of language and their applications?
• Phonemes:
o Smallest units of sound in a language.
o Used in speech recognition and text-to-speech systems.
• Morphemes and lexemes:
o Smallest units of meaning.
o Used in tokenization, stemming, and part-of-speech tagging.
• Syntax:
o Rules for constructing grammatically correct sentences.
o Used in parsing and sentence structure analysis.
• Context:
o Meaning derived from semantics and pragmatics.
o Used in tasks like sarcasm detection, summarization, and topic modeling.
4. What makes NLP challenging?
• Ambiguity: Words and sentences can have multiple meanings depending on context.
• Common knowledge: Humans rely on implicit knowledge that machines lack.
• Creativity: Language includes creative elements like poetry, metaphors, and idioms.
• Diversity across languages: No direct mapping between vocabularies of different languages.
• Complexity of human language: Syntax, semantics, and pragmatics make language processing difficult for
machines.
5. How are NLP, ML, and DL related?
• AI: Broad field aiming to build systems that perform tasks requiring human intelligence.
• ML: Subfield of AI that learns patterns from data without explicit programming.
• DL: Subfield of ML based on neural networks to model complex patterns.
• NLP: Subfield of AI focused on language processing, often using ML and DL techniques.
6. Describe the early heuristics-based approaches to NLP.
• Rule-Based Systems: Early NLP systems relied on handcrafted rules and resources like dictionaries and
thesauruses.
• Lexicon-Based Sentiment Analysis: Uses counts of positive and negative words to deduce sentiment.
• Knowledge Bases: WordNet (synonyms, hyponyms, meronyms), Open Mind Common Sense.
• Regex and CFG: Regular expressions and context-free grammars for text analysis.
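As a quick illustration of the regex idea, a few lines of Python can pull structured pieces out of raw text (the pattern and the sample sentence below are made up for illustration, not from the source):

```python
# Minimal sketch of rule-based text analysis with a regular expression.
import re

text = "Contact support@example.com or sales@example.org before 31.12.2025."
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['support@example.com', 'sales@example.org']
```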
7. Explain briefly Naive Bayes, Support Vector Machine, Hidden Markov Model, and Conditional
Random Fields approaches
• Naive Bayes: A probabilistic classifier based on Bayes’ theorem. Assumes feature independence. Used in text
classification (see the sketch after this list).
• Support Vector Machine (SVM): A classifier that finds the optimal decision boundary between classes. Used
in text classification.
• Hidden Markov Model (HMM): A statistical model for sequential data. Used in part-of-speech tagging.
• Conditional Random Fields (CRF): A sequential classifier that considers context. Used in named entity
recognition and part-of-speech tagging.
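A hedged sketch of the first two classifiers above with scikit-learn; the toy texts and labels are invented for illustration only:

```python
# Naive Bayes vs. a linear SVM on a tiny, made-up spam dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10am tomorrow",
         "cheap loans act now", "lunch with the project team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

nb = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)
svm = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

print(nb.predict(["free prize now"]))             # likely [1]
print(svm.predict(["project meeting tomorrow"]))  # likely [0]
```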
8. What is the difference between RNN and LSTM NN?
• RNN: Processes sequential data but struggles with long-term dependencies due to the vanishing gradient
problem.
• LSTM: A variant of RNN that uses memory cells to retain long-term context, making it more effective for
longer sequences.
CNNs for text:
• CNNs process text by converting each word into a d-dimensional word vector; an n-word sentence becomes an n × d matrix.
o n: number of words in the sentence, d: size of the word vectors.
• This matrix can be treated much like an image and modeled by a CNN.
• Convolution filters capture local patterns (e.g., n-grams).
• Pooling layers condense features, and fully connected layers classify the text.
• This makes CNNs effective for tasks like sentiment analysis and text classification.
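A minimal tf.keras sketch of this idea; layer sizes and hyperparameters are illustrative, not from the source:

```python
# 1D-CNN text classifier: embed words into an n x d matrix, slide n-gram
# filters over it, pool the features, and classify.
from tensorflow.keras import layers, models

vocab_size, d, n = 10000, 100, 50   # vocabulary size, vector size, words per text
model = models.Sequential([
    layers.Input(shape=(n,)),                 # n word indices per text
    layers.Embedding(vocab_size, d),          # -> n x d matrix of word vectors
    layers.Conv1D(128, kernel_size=3, activation="relu"),  # ~3-gram filters
    layers.GlobalMaxPooling1D(),              # condense features
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # e.g. positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```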
Transfer learning:
• Applying knowledge learned from one task to a different but related task.
• Example: Pre-training a large model (e.g., BERT) on a massive dataset, then fine-tuning it for specific NLP
tasks like text classification or question answering.
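A small sketch of this idea with the Hugging Face transformers library (assumed to be installed; the model name, label count, and single training example are illustrative):

```python
# Load a pre-trained BERT, attach a fresh classification head, then take one
# gradient step on a labeled example, which is the essence of fine-tuning.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)        # new head, randomly initialized

batch = tokenizer(["This movie was great!"], return_tensors="pt",
                  padding=True, truncation=True)
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)       # loss of the classification head
outputs.loss.backward()                       # gradients also flow into BERT
# In practice this runs inside a full training loop (or the Trainer API)
# over a task-specific dataset such as sentiment-labeled reviews.
```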
12. List the key reasons that make DL not suitable for all NLP tasks.
• DL models are data-hungry and tend to overfit on small datasets.
• They are less interpretable than simpler ML models.
• They are costly to train and deploy (large model sizes, latency, infrastructure needs).
• Domain knowledge and low-resource languages are hard to handle.
• Simpler models often give a better ROI for many business problems.
Chapter 2: NLP Pipeline
1. What are the key stages of a generic pipeline for NLP system development?
• Data acquisition: Collecting relevant data for the task.
• Text cleaning and pre-processing: Removing noise, tokenization, lowercasing, stemming/lemmatization.
• Feature engineering: Converting text into numerical representations.
• Modeling: Training ML/DL models.
• Evaluation: Measuring performance with intrinsic and extrinsic metrics.
• Deployment: Putting the model into production.
• Monitoring and updating: Tracking performance and retraining as needed.
5. Using a dot (.) to segment sentences can cause problems. Explain how.
• Abbreviations (e.g., "Dr.", "Mr.").
• Ellipses ("...").
• Decimals ("3.14").
• URLs/Domains ("example.com").
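A small sketch contrasting a naive dot split with NLTK's sentence tokenizer (NLTK and its "punkt" data are assumed to be installed; the sample sentence is illustrative):

```python
# Naive splitting on "." breaks on abbreviations, decimals, and domains;
# a trained sentence tokenizer handles most of these cases.
import nltk

text = "Dr. Smith paid $3.14 at example.com. He left at 5 p.m."
print(text.split("."))           # fragments at 'Dr.', '3.14', 'example.com', ...
print(nltk.sent_tokenize(text))  # roughly two sentences
```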
Stemming:
• Reduces words to base form by removing suffixes
• Doesn't always produce linguistically correct words
• Example: "cars" → "car", "revolution" → "revolut"
(incorrect)
Lemmatization:
• Maps words to their base dictionary form (lemma)
• Requires linguistic knowledge
• Example: "better" → "good", "was" → "be"
8. What is the difference between code mixing and transliteration?
• Code mixing: Using multiple languages in one sentence/phrase
o (e.g., Singlish: "We makan at kopitiam").
• Transliteration: Writing words from one language using another language's script
o (e.g., "namaste" for नमस्ते).
10. Explain the feature engineering for classical NLP versus DL-based NLP
Classical NLP: Handcrafted features (e.g., TF-IDF, n-grams) fed to ML models (interpretable).
DL-based NLP: Raw text → dense embeddings (e.g., word2vec) → hidden layers → output (less interpretable).
11. How to combine heuristics directly or indirectly with the ML model?
Two approaches (plus stacking):
• Direct: Use heuristic outputs as additional features for the ML model (e.g., a spam-word count in email classification); see the sketch after this list.
• Indirect: Pre-process inputs with heuristics to filter data before the ML model (e.g., filter obvious spam before classification).
• Stacking: Feed one model’s output as input to another in a sequential fashion (e.g., logistic regression on top of Naive Bayes output).
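A hedged sketch of the direct approach; the spam-word list, texts, and labels below are invented for illustration:

```python
# Append a heuristic feature (count of known spam words) to TF-IDF features
# before training a classifier.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

SPAM_WORDS = {"free", "winner", "prize", "cheap"}
texts = ["free prize winner", "team meeting notes",
         "cheap cheap offer", "quarterly report"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X_text = vec.fit_transform(texts)
spam_counts = np.array([[sum(w in SPAM_WORDS for w in t.split())] for t in texts])
X = hstack([X_text, spam_counts])   # heuristic output becomes one extra feature

clf = LogisticRegression().fit(X, labels)
print(clf.predict(hstack([vec.transform(["free cheap prize"]),
                          np.array([[3]])])))  # likely [1]
```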
13. Which modeling technique can be used in the following cases: small data, large data,
poor data quality, and good data quality?
• Small data: Rule-based systems, traditional ML (less data-hungry).
• Large data: DL models, which can exploit large amounts of data.
• Poor data quality: Requires more cleaning and pre-processing; simpler, more robust models.
• Good data quality: Off-the-shelf or more advanced models can be applied directly.
14. What is the difference between intrinsic and extrinsic evaluation?
• Intrinsic:
o Measures model quality using ML metrics (e.g., precision, recall, F1) on a test set.
• Extrinsic:
o Measures real-world impact; uses business metrics.
o Requires human/real-world testing.
o Example: Time saved by a spam filter.
15. What are the metrics that can be used in: classification, measuring model quality,
information retrieval, prediction, machine translation, and summarization tasks?
• Classification: Accuracy, precision, recall, F1, AUC
• Model quality (language modeling): Perplexity
• Information retrieval: MRR, MAP
• Prediction: RMSE, MAPE
• Machine translation: BLEU, METEOR
• Summarization: ROUGE
16. Describe the deploying, monitoring, and updating phases of the NLP pipeline.
• Deployment: Put the model into production, typically as a web service/API integrated with the larger system.
• Monitoring: Continuously track model performance on live data (dashboards, logging).
• Updating: Retrain and refine the model as new data and feedback arrive.
17. Explain how the NLP pipeline differs from one language to another.
• High-resource languages (e.g., English): Use pre-trained models.
• Low-resource languages (e.g., Swahili): Manual labeling, morphological analyzers.
• CJK languages (Chinese/Japanese/Korean): Special tokenization (no spaces).
• Morphologically rich languages (e.g., Turkish): Handle complex word forms.
18. Describe the NLP pipeline for ranking tickets in a ticketing system by Uber.
1. Data sources: Ticket text, trip data, ticket metadata.
2. Pre-processing: Tokenization, lowercasing, stop word removal, lemmatization.
3. Feature engineering:
o Bag-of-words → TF-IDF/LSI → cosine similarity with historical solutions.
4. Modeling: Binary classifier ranks solutions; top 3 displayed.
5. Evaluation: MRR (intrinsic), cost savings (extrinsic).
6. Deployment & updates: Continuous testing, model refinement.
Chapter 3: Text Representation
1. List the four categories of text representation techniques.
• Basic Vectorization Approaches:
o Includes one-hot encoding, bag of words (BoW), bag of n-grams, and TF-IDF.
o Simple methods to convert text into numerical vectors.
• Distributed Representations:
o Uses neural networks to create dense, low-dimensional representations (e.g., Word2vec, fastText, Doc2vec).
o Captures semantic relationships between words.
• Universal Language Representation:
o Contextual representations using advanced neural models (e.g., ELMo, BERT).
o Accounts for word meaning based on context.
• Handcrafted Features:
o Domain-specific features manually designed for specific NLP tasks.
o Incorporates expert knowledge for tasks like text complexity or essay scoring.
3. Using D1, D2, D3, and D4, find their representations using one-hot encoding, bag of words, bag
of n-grams, and TF-IDF.
• D1: "Dog bites man"
• D2: "Man bites dog"
• D3: "Dog eats meat"
• D4: "Man eats food"
• Corpus Vocabulary: [dog, bites, man, eats, meat, food]
o (6 words, mapped as dog=1, bites=2, man=3, eats=4, meat=5, food=6).
• One-Hot Encoding :
o Each word is a 6-dimensional binary vector with a single 1 at its ID index.
o D1: Dog bites man → [[1,0,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0]]
o D2: Man bites dog → [[0,0,1,0,0,0], [0,1,0,0,0,0], [1,0,0,0,0,0]]
o D3: Dog eats meat → [[1,0,0,0,0,0], [0,0,0,1,0,0], [0,0,0,0,1,0]]
o D4: Man eats food → [[0,0,1,0,0,0], [0,0,0,1,0,0], [0,0,0,0,0,1]]
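A tiny sketch that reproduces these one-hot vectors from the vocabulary order above:

```python
# One-hot encode each word as a 6-dimensional binary vector.
vocab = ["dog", "bites", "man", "eats", "meat", "food"]

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab.index(word.lower())] = 1
    return vec

print([one_hot(w) for w in "Dog bites man".split()])
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
```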
• Bag of Words (BoW) :
o A 6-dimensional vector counting word occurrences in each document.
o D1: Dog bites man → [1,1,1,0,0,0]
o D2: Man bites dog → [1,1,1,0,0,0]
o D3: Dog eats meat → [1,0,0,1,1,0]
o D4: Man eats food → [0,0,1,1,0,1]
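The same BoW vectors can be produced with scikit-learn's CountVectorizer (a recent scikit-learn is assumed); note that it orders columns alphabetically, so the column order differs from the hand-ordered vocabulary above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
bow = CountVectorizer()
X = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(X.toarray())                  # one 6-dimensional count vector per document
```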
• Bag of N-Grams :
o Vocabulary of bigrams: [dog bites, bites man, man bites, bites dog, dog eats, eats meat, man eats, eats food].
o Each document is an 8-dimensional vector of bigram frequencies.
o D1: Dog bites man → [1,1,0,0,0,0,0,0]
o D2: Man bites dog → [0,0,1,1,0,0,0,0]
o D3: Dog eats meat → [0,0,0,0,1,1,0,0]
o D4: Man eats food → [0,0,0,0,0,0,1,1]
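The bigram representation follows from the same vectorizer with ngram_range=(2, 2):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
bigrams = CountVectorizer(ngram_range=(2, 2))
X = bigrams.fit_transform(docs)
print(bigrams.get_feature_names_out())  # the 8 bigrams of the corpus
print(X.toarray())                      # one 8-dimensional vector per document
```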
• TF-IDF :
o Combines Term Frequency (TF) and Inverse Document Frequency (IDF).
o TF-IDF scores :
▪ dog: 0.136, bites: 0.17, man: 0.136, eats: 0.17, meat: 0.17, food: 0.17
o D1: Dog bites man → [0.136, 0.17, 0.136, 0, 0, 0]
o D2: Man bites dog → [0.136, 0.17, 0.136, 0, 0, 0]
o D3: Dog eats meat → [0.136, 0, 0, 0.17, 0.17, 0]
o D4: Man eats food → [0, 0, 0.136, 0.17, 0, 0.17]
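A TfidfVectorizer sketch for comparison; scikit-learn uses a smoothed IDF and L2-normalizes each row, so its numbers will not match the hand-computed scores above exactly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))  # rarer words (meat, food) get the highest weights
```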
4. Explain the difference between
• (a) distributional similarity and distributional hypothesis :
o Similarity: Meaning derived from context (e.g., "bank" in "river bank" vs. "money bank").
o Hypothesis: Words in similar contexts have similar meanings (e.g., "dog" and "cat").
• (b) distributional representation and distributed representation :
o Distributional Representation:
▪ High-dimensional, sparse vectors based on word co-occurrence in contexts.
▪ Derived from a co-occurrence matrix.
▪ Examples: One-hot, BoW, n-grams, TF-IDF.
o Distributed Representation:
▪ Low-dimensional, dense vectors that compress distributional representations.
▪ Uses neural networks to capture semantic relationships.
▪ Examples: Word2vec, fastText, Doc2vec embeddings.
6. Explain with an example the architectural variants of Word2vec: CBOW and SkipGram.
• Continuous Bag of Words (CBOW):
o Predict the center word given context words.
o Example:
▪ Sentence: "The quick brown fox jumps over the lazy dog."
▪ Window (k=2): ["quick", "brown", "fox", "jumps", "over"]
▪ Input (X): ["quick", "brown", "jumps", "over"]
▪ Output (Y): "fox"
o Neural Network Architecture:
• SkipGram:
o Predict context words given the center word.
o Example:
▪ Same sliding window as CBOW, but now:
▪ Input (X): "fox"
▪ Output (Y): ["quick", "brown", "jumps", "over"]
• Difference:
o CBOW: Predicts the center word from context; faster, good for frequent words.
o SkipGram: Predicts context words from the center word; slower, better for rare words.
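A gensim sketch of both variants (gensim 4.x API assumed; the toy corpus is far too small for meaningful vectors and only illustrates the sg flag):

```python
from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
             ["dog", "bites", "man"],
             ["man", "bites", "dog"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # CBOW
skip = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # SkipGram

print(cbow.wv["fox"][:5])           # first values of the 50-dim vector for "fox"
print(skip.wv.most_similar("dog"))  # nearest neighbours under SkipGram
```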
7. How can the OOV (Out-of-Vocabulary) problem be solved?
• Problem: OOV words are those not present in the training vocabulary, causing issues in representations.
• Solutions:
o Ignore OOV Words:
▪ Exclude OOV words from feature extraction.
o Random Initialization :
▪ Assign random vectors (components between -0.25 and +0.25) to OOV words.
o Subword Information :
▪ Use algorithms like fastText, which represent words as character n-grams.
▪ Example: For "gregarious" (OOV), break it into character n-grams (gre, reg, ega, …, ous) and combine their embeddings; see the sketch after this list.
o Retrain Model :
▪ Expand vocabulary to include new words and retrain the model.
▪ Feasible but computationally expensive.
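A gensim FastText sketch of the subword solution (gensim 4.x assumed; corpus and sizes are illustrative):

```python
from gensim.models import FastText

sentences = [["dog", "bites", "man"], ["man", "eats", "food"], ["dog", "eats", "meat"]]
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

# "gregarious" never occurs in the corpus, but a vector is still composed
# from its character n-grams (gre, reg, ega, ..., ous).
print(model.wv["gregarious"].shape)  # (50,)
```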
8. What is the difference between Word2vec and Doc2vec?
• Word2vec: Learns dense vectors for individual words from their contexts.
• Doc2vec: Extends Word2vec with a paragraph/document vector, so it can produce an embedding for an entire text of arbitrary length.
9. What are the important aspects to keep in mind while using word embeddings?
• Bias in Training Data:
o Embeddings reflect biases in training data (e.g., “Apple” closer to “Microsoft” than “orange” in tech-heavy corpora).
o Mitigate biases to ensure fair NLP model performance.
• Scalability (Large File Sizes):
o Pre-trained embeddings (e.g., Word2vec ~4.5 GB) require significant storage.
o Use in-memory databases like Redis with caching to manage scaling and latency.
• Context Limitations:
o Embeddings may not capture domain-specific nuances (e.g., sarcasm detection).
o Combine with domain-specific features for better performance.
• Evolving Field:
o Neural text representation is rapidly advancing; new models (e.g., BERT) may outperform older ones.
o Consider ROI, business needs, and infrastructure constraints before adopting.
Chapter 4: Text Classification
1. What is the difference between binary, multi-class, and multi-label classification?
• Binary Classification:
o Two classes (e.g., spam vs. non-spam email).
o Each document belongs to exactly one class.
• Multi-Class Classification:
o More than two classes (e.g., sentiment as positive, negative, neutral).
o Each document belongs to exactly one class from set C
• Multi-Label Classification:
o One or more labels per document (e.g., a news article labeled as "sports" and "soccer").
o Labels are a subset of C; a document can have none, one, several, or all of the labels.
4. Classification can be done without the text classification pipeline. Explain how.
• Lexicon-Based Sentiment Analysis:
o Uses predefined lists of positive/negative words to classify text.
o No machine learning; relies on heuristics or rules.
o Example: A tweet with more positive words (e.g., “great”) is classified as positive (see the sketch after this list).
• Existing APIs:
o Use off-the-shelf APIs for generic tasks like sentiment analysis or category classification.
o No need to train a model; APIs provide pre-trained classifiers.
• Benefits:
o Quick MVP deployment.
o Provides a baseline for evaluation.
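A minimal sketch of the lexicon-based idea; the word lists are tiny and illustrative, real lexicons are much larger:

```python
# Count positive vs. negative words from a predefined lexicon; no ML involved.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def lexicon_sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("The food was great and the service excellent"))  # positive
```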
5. What is the use of a confusion matrix?
• Example: A classifier may perform well on non-relevant articles but struggle with relevant ones due to class imbalance.
• Use: Highlights errors (e.g., false positives/negatives) and class-specific performance.
6. List the potential reasons for poor classifier performance.
1. Large, Sparse Feature Vectors: Too many features introduce noise and sparsity, hindering learning.
2. Class Imbalance: Skewed data biases the model toward the majority class.
3. Suboptimal Algorithm: The chosen algorithm may not suit the dataset.
4. Poor Pre-Processing/Feature Extraction: Ineffective text cleaning or feature representation.
5. Untuned Parameters: Classifier hyperparameters need optimization.
10. List the steps for converting training and test data into a format suitable for the
neural network.
1. Tokenize Texts: Convert texts into word index vectors using a tokenizer.
2. Pad Sequences: Ensure all sequences are the same length by padding with zeros.
3. Map to Embeddings: Convert word indices to embedding vectors using a pre-trained or trainable embedding matrix.
4. Input to Neural Network: Use the resulting vectors as input to the neural network’s embedding layer.
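A tf.keras sketch of these four steps, using the older but still available keras preprocessing utilities (vocabulary size, sequence length, and texts are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models

train_texts = ["dog bites man", "man eats food", "dog eats meat"]

tokenizer = Tokenizer(num_words=10000)        # 1. texts -> word index vectors
tokenizer.fit_on_texts(train_texts)
seqs = tokenizer.texts_to_sequences(train_texts)
padded = pad_sequences(seqs, maxlen=20)       # 2. pad to a fixed length

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Embedding(10000, 100),             # 3. indices -> embedding vectors
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),    # 4. rest of the network
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```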
11. Which technique is better for text classification, CNN or LSTM, and why?
• CNNs:
o Strengths: Learn key bag-of-words/n-grams features; faster to train; less data-hungry.
o Best For: Smaller datasets or when speed is critical.
• LSTMs:
o Strengths: Capture sequential context (word order); suitable for language modeling.
o Weaknesses: Slower to train; data-hungry; may underperform on small datasets.
• Which is Better?:
o No universal winner; depends on dataset and task.
o CNNs often better for smaller datasets or speed; LSTMs for sequential context with large data.
o Experiment with both and tune hyperparameters (e.g., layers, epochs).
14. Give some options to explore when no labels exist for a dataset.
• Use Existing APIs/Libraries:
o Map API categories to relevant classes.
• Use Public Datasets:
o Adapt datasets like 20 Newsgroups to train a classifier.
• Weak Supervision:
o Create rules to bootstrap a dataset.
• Active Learning:
o Use tools like Prodigy to label key instances interactively.
• Feedback Integration:
o Use implicit or explicit signals to refine the model.
15. Describe the pipeline for building a classifier when there is no training data.
1. Start with a Baseline:
o Use a public API, public dataset , or weak supervision to create an initial model.
2. Deploy in Production:
o Apply the baseline model to real-world data.
3. Collect Feedback:
o Gather explicit and implicit signals on model performance.
4. Refine with Active Learning:
o Identify low-confidence predictions, label them, and retrain the model.
5. Iterate:
o Collect more data over time, transition to sophisticated models as data grows.
6. Monitor and Improve:
o Continuously update the model based on new data and feedback.