What Is NLP?
NLP stands for Natural Language Processing, a field at the intersection of computer
science, linguistics, and artificial intelligence. It is the technology that enables
machines to understand, analyse, manipulate, and interpret human language. It helps
developers organize knowledge for performing tasks such as translation, automatic
summarization, Named Entity Recognition (NER), speech recognition, relationship
extraction, and topic segmentation.
History of NLP
(1940-1960) - Focused on Machine Translation (MT)
1948 - The first recognisable NLP application was introduced at Birkbeck College, London.
1950s - There was a conflicting view between linguistics and computer science. During this
period, Chomsky wrote his first book, Syntactic Structures, and claimed that language is
generative in nature.
In 1957, Chomsky also introduced the idea of Generative Grammar, a rule-based description
of syntactic structures.
(1960-1980) - Flavored with Artificial Intelligence (AI)
Case Grammar
Case Grammar was developed by the linguist Charles J. Fillmore in 1968.
Case Grammar describes how languages such as English express the relationship
between nouns and verbs through prepositions.
In Case Grammar, case roles can be defined to link certain kinds of verbs and
objects.
For example: "Neha broke the mirror with the hammer". Here, case grammar
identifies Neha as the agent, the mirror as the theme, and the hammer as the
instrument.
1980 - Current
Until the 1980s, natural language processing systems were based on complex
sets of hand-written rules. After 1980, NLP began to adopt machine learning
algorithms for language processing.
Advantages of NLP
o NLP helps users ask questions about any subject and get a direct
response within seconds.
o NLP offers exact answers to a question; it does not return
unnecessary or unwanted information.
o NLP helps computers communicate with humans in their own
languages.
o It is very time efficient.
o Most companies use NLP to improve the efficiency and accuracy of
documentation processes and to identify information in large
databases.
Disadvantages of NLP
A list of disadvantages of NLP is given below:
NLP Libraries
Scikit-learn: It provides a wide range of algorithms for building machine
learning models in Python.
Natural Language Toolkit (NLTK): NLTK is a comprehensive toolkit covering a wide range
of NLP techniques.
SpaCy: SpaCy is an open-source NLP library which is used for Data Extraction,
Data Analysis, Sentiment Analysis, and Text Summarization.
1. Sentiment Analysis
Definition: Sentiment Analysis is the process of determining the emotional tone or sentiment
behind a series of words. It is often used to understand the opinions, attitudes, and emotions
expressed in a text.
Types:
1. VADER (Valence Aware Dictionary and sEntiment Reasoner)
VADER is a lexicon and rule-based tool designed specifically to perform sentiment analysis
on text. It is particularly effective for social media data due to its sensitivity to linguistic
nuances, such as emoticons, slang, acronyms, and capitalization.
• Key Features:
o Lexicon-based: Uses a predefined dictionary of words with associated
sentiment scores.
o Rule-based heuristics: Accounts for contextual sentiment based on grammar
and syntax.
o Outputs: Returns sentiment polarity (positive, neutral, negative) and a
compound score ranging from -1 (most negative) to +1 (most positive).
o Lightweight and easy to use, ideal for quick sentiment analysis tasks.
• Applications:
o Analyzing customer reviews, tweets, or other short text.
o Social media monitoring and brand sentiment tracking.
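A minimal VADER sketch using NLTK's built-in implementation (assumes the vader_lexicon resource can be downloaded; exact scores are illustrative):
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I absolutely LOVE this phone!!! :)")
print(scores)  # e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}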
2. TextBlob
TextBlob is a Python library for processing textual data, offering simple APIs for common
natural language processing (NLP) tasks, including sentiment analysis.
• Key Features:
o Sentiment Analysis: Provides polarity (range -1 to 1) and subjectivity (range
0 to 1).
▪ Polarity: Measures the sentiment as positive or negative.
▪ Subjectivity: Indicates the level of personal opinion versus factual
information.
o Supports text classification, tokenization, and part-of-speech tagging.
o Built on the Natural Language Toolkit (NLTK), making it extendable and
reliable.
• Applications:
o Sentiment analysis for blogs, articles, and longer texts.
o General-purpose text analysis and preprocessing for NLP projects.
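A minimal TextBlob sketch (assumes the textblob package is installed):
from textblob import TextBlob

blob = TextBlob("The movie was surprisingly good, though a bit long.")
print(blob.sentiment.polarity)      # value in [-1, 1]
print(blob.sentiment.subjectivity)  # value in [0, 1]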
3. Transformers (BERT)
BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art
transformer-based model developed by Google. It is designed for deep understanding of
language context by processing text bidirectionally.
• Key Features:
o Pre-trained on massive datasets: Ensures high accuracy and understanding of
complex language patterns.
o Bidirectional processing: Considers both left and right context in sentences.
o Fine-tuning: Allows adaptation to specific tasks like sentiment analysis,
question answering, and text classification.
o Supports multilingual text processing.
• Applications:
o Sentiment analysis for complex, nuanced text.
o Advanced NLP tasks like named entity recognition (NER), machine
translation, and text summarization.
o Ideal for large datasets requiring high precision and contextual understanding.
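A minimal sketch using the Hugging Face transformers pipeline, which loads a transformer model fine-tuned for sentiment analysis (the exact default model depends on the library version):
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a pretrained fine-tuned model
result = classifier("The plot was predictable, but the acting saved the film.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99}]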
2. Optical Character Recognition (OCR)
Definition: OCR is the process of converting different types of documents (e.g., scanned
paper documents, PDFs) into editable and searchable data. It plays a crucial role in digitizing
printed or handwritten text, making it accessible for further processing or analysis.
How OCR Works
1. Image Acquisition:
o The document is scanned or captured as an image (in formats like PNG, JPEG,
TIFF, or PDF).
o Quality of the image is crucial for accurate OCR. Key factors include
resolution, contrast, and clarity.
2. Preprocessing:
o Image preprocessing techniques are applied to enhance quality and prepare it
for text recognition:
▪ Noise reduction: Removes speckles or distortions.
▪ Binarization: Converts the image to black and white for easier text
detection.
▪ Skew correction: Aligns the image properly.
▪ Normalization: Standardizes brightness and contrast.
3. Segmentation:
o The OCR system identifies and separates different elements in the image, such
as:
▪ Lines of text.
▪ Words.
▪ Characters.
4. Feature Extraction:
o The system analyzes the shape, structure, and patterns of each character.
Common methods include:
▪ Template matching: Comparing characters to a predefined database of
fonts and shapes.
▪ Feature-based recognition: Identifying unique characteristics like
loops, lines, or curves in characters.
5. Recognition:
o Characters are matched to corresponding symbols or letters using algorithms
like:
▪ Pattern recognition.
▪ Machine learning models.
▪ Neural networks (for modern OCR systems).
6. Post-Processing:
o Contextual analysis is applied to improve accuracy:
▪ Correcting errors using dictionaries.
▪ Identifying words in context to eliminate unlikely results.
Types:
Tools:
• Tesseract
• OCR.space
• Adobe Acrobat OCR
import pytesseract
from PIL import Image

# Open the scanned image and run Tesseract OCR on it
image = Image.open('example.png')
text = pytesseract.image_to_string(image)
print(text)
3. Text Categorization
Types:
• Topic Classification: Categorizing texts into topics like sports, politics, etc.
• Sentiment Classification: Classifying sentiment as positive, negative, neutral.
• Spam Detection: Classifying messages as spam or not spam.
Key Components of Text Categorization
1. Text Representation:
o Bag of Words (BoW): Represents text as a collection of words, disregarding
grammar and order.
o TF-IDF (Term Frequency-Inverse Document Frequency): Captures the
importance of words based on frequency and uniqueness in the dataset.
o Word Embeddings: Uses techniques like Word2Vec, GloVe, or FastText to
represent text in dense vector forms.
o Sentence Embeddings: Captures context and meaning at the sentence level
using models like BERT or Sentence-BERT.
2. Algorithms and Models:
o Rule-Based Models: Uses predefined rules to classify text. Suitable for
simple, well-defined tasks.
o Machine Learning Models:
▪ Naive Bayes: Probabilistic approach based on Bayes' theorem.
▪ Support Vector Machines (SVM): Effective for high-dimensional
spaces like text data.
▪ Decision Trees and Random Forests: Handles categorical text
features well.
o Deep Learning Models:
▪ Recurrent Neural Networks (RNNs): Captures sequential
dependencies in text.
▪ Convolutional Neural Networks (CNNs): Identifies patterns for
classification tasks.
▪ Transformer-Based Models: State-of-the-art models like BERT,
GPT, and RoBERTa handle context-rich and complex texts effectively.
3. Categories of Text Classification:
o Binary Classification: Assigns one of two categories (e.g., spam or not
spam).
o Multi-Class Classification: Assigns one label from multiple categories (e.g.,
classifying news articles into sports, politics, or technology).
o Multi-Label Classification: Assigns multiple labels to a single piece of text
(e.g., tagging a movie review as both "romantic" and "comedy").
4. Feature Engineering:
o Tokenization, stemming, lemmatization, and removal of stopwords are
common preprocessing steps.
o N-grams are used to capture sequential relationships.
Applications of Text Categorization
1. Email Filtering:
o Categorizing emails as spam, promotional, or important.
2. Sentiment Analysis:
o Analyzing customer feedback to classify sentiment as positive, negative, or
neutral.
3. News Categorization:
o Organizing news articles into predefined categories like politics, sports, or
entertainment.
4. Social Media Monitoring:
o Analyzing tweets or posts for brand mentions, sentiments, or specific topics.
5. Document Organization:
o Classifying legal documents, research papers, or resumes into relevant
categories.
6. Chatbots and Customer Support:
o Classifying user queries to route them to the appropriate department or
automated response.
7. Healthcare Applications:
o Classifying patient records, medical reports, or symptoms into relevant
categories.
Challenges in Text Categorization
1. High Dimensionality:
o Text data often involves a vast number of unique words or features.
2. Ambiguity and Context Dependency:
o Words can have multiple meanings based on the context.
3. Class Imbalance:
o Some categories may have significantly more data than others.
4. Dynamic Nature of Text:
o Language evolves over time with new words, phrases, and trends.
5. Multilingual Texts:
o Handling mixed-language or non-English text increases complexity.
Techniques to Address These Challenges
1. Data Augmentation:
o Expanding datasets by paraphrasing or translating text.
2. Transfer Learning:
o Using pre-trained models like BERT or GPT for domain-specific fine-tuning.
3. Regularization:
o Preventing overfitting in machine learning models.
4. Cross-Validation:
o Evaluating models effectively to ensure generalization.
5. Active Learning:
o Incorporating user feedback to improve model accuracy iteratively.
Tools and Libraries for Text Categorization
1. Scikit-Learn:
o Offers algorithms like Naive Bayes, SVM, and Random Forest for text
classification.
2. NLTK (Natural Language Toolkit):
o A comprehensive library for text preprocessing and classification.
3. SpaCy:
o Efficient for text processing and feature extraction.
4. Transformers (Hugging Face):
o Provides pre-trained models like BERT, RoBERTa, and GPT for advanced
text classification.
5. TensorFlow and PyTorch:
o Frameworks for building custom deep learning models.
Workflow of Text Categorization
1. Data Collection:
o Gather raw text data from sources like websites, emails, or databases.
2. Preprocessing:
o Clean and preprocess the text (tokenization, stopword removal, etc.).
3. Feature Extraction:
o Convert text into numerical representations (TF-IDF, embeddings, etc.).
4. Model Selection and Training:
o Train a machine learning or deep learning model on labeled data.
5. Evaluation:
o Assess model performance using metrics like accuracy, precision, recall, and
F1-score.
6. Deployment:
o Use the trained model for real-world categorization tasks.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Example data
texts = ["I love programming", "Python is great", "I hate bugs", "I enjoy learning new things"]
labels = ["positive", "positive", "negative", "positive"]

# Convert text to bag-of-words features and split into train/test sets
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

# Train classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
# Test prediction
predictions = classifier.predict(X_test)
print(predictions)
4. Word Prediction
Definition: Word Prediction refers to the process of predicting the next word in a sequence of
words based on context. It's often used in applications like text messaging, search engines,
and coding assistants.
How Word Prediction Works:
1. Context Understanding:
o The algorithm analyzes the surrounding text (context) to predict the most
likely next word.
o Context includes grammar, semantics, and statistical patterns.
2. Probabilistic Modeling:
o Word prediction often relies on probabilities, estimating which word is most
likely based on the input.
3. Feature Representation:
o Words are converted into numerical forms, such as:
▪ One-hot encoding: Represents each word as a binary vector.
▪ Word embeddings: Dense vector representations (e.g., Word2Vec,
GloVe).
▪ Contextual embeddings: Dynamic embeddings that consider the
sentence context (e.g., BERT, GPT).
Approaches to Word Prediction
1. Rule-Based Systems:
o Early systems used predefined grammatical rules.
o Limited flexibility and scalability.
2. Statistical Language Models:
o Predict the probability of a sequence of words.
o N-grams: Predict the next word based on the previous n words.
▪ Example: A trigram model predicts the next word using the last two
words.
o Limitations: Fixed window size and inability to capture long-term
dependencies.
3. Neural Networks:
o Deep learning models predict words based on complex patterns in data.
Common Architectures:
Applications of Word Prediction
1. Typing Assistance:
o Predictive text in smartphones and word processors.
o Example: Autocomplete suggestions in search engines.
2. Chatbots and Virtual Assistants:
o Predict responses in conversational systems like Siri, Alexa, and ChatGPT.
3. Text Generation:
o Used in creative applications like story writing and content generation.
4. Machine Translation:
o Predicts the next word in the target language during translation.
5. Spell and Grammar Correction:
o Suggests the correct word in case of typos or grammatical errors.
Models Used for Word Prediction
1. N-gram Models:
o Simple and effective for small datasets.
o Example: Google’s early search autocomplete.
2. Transformer-Based Models:
o GPT (Generative Pre-trained Transformer):
▪ Predicts the next word or sequence based on context.
o BERT (Bidirectional Encoder Representations from Transformers):
▪ Predicts missing words in a sentence and understands context
bidirectionally.
o T5 (Text-to-Text Transfer Transformer):
▪ Converts all NLP tasks into a text-to-text format, including word
prediction.
3. RNNs and LSTMs:
o Commonly used in earlier NLP models for sequential tasks.
Challenges in Word Prediction
1. Ambiguity:
o Multiple valid predictions for a given context.
▪ Example: "I need a" could lead to "break," "drink," or "ride."
2. Out-of-Vocabulary (OOV) Words:
o Difficulties in predicting uncommon or newly coined words.
3. Computational Complexity:
o Deep models require significant computational resources, especially for large
vocabularies.
4. Context Length:
o Capturing very long-term dependencies remains challenging, though
transformers have made significant progress.
Input:
"I love to read books about"
Predicted Output:
"science," "history," "technology," or "adventure."
Example (Using GPT-2 from Hugging Face in Python):
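A minimal sketch with the Hugging Face text-generation pipeline (assumes the transformers package and the pretrained gpt2 weights are available; generated continuations will vary between runs):
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
# Sample three possible continuations of the prompt
outputs = generator("I love to read books about", max_new_tokens=5, num_return_sequences=3, do_sample=True)
for out in outputs:
    print(out["generated_text"])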
5. Speech Recognition
Definition: Speech Recognition is the process of converting spoken language into text.
How Speech Recognition Works:
1. Audio Input:
o The system captures audio signals through a microphone or other audio
device.
o The audio is represented as a waveform with varying amplitudes over time.
2. Preprocessing:
o Converts the raw audio signal into a suitable format for analysis:
▪ Noise Reduction: Removes background noise to enhance clarity.
▪ Normalization: Standardizes the audio volume.
▪ Framing and Windowing: Divides the audio signal into short
overlapping segments for processing.
3. Feature Extraction:
o Extracts relevant characteristics from the audio for recognition:
▪ Mel-Frequency Cepstral Coefficients (MFCCs): Captures
frequency-related features of human speech.
▪ Spectrograms: Visual representations of the audio signal's frequency
over time.
▪ Log-Mel Spectrograms: Used in advanced models like deep learning
for better accuracy.
4. Acoustic Modeling:
o Maps the audio features to phonemes (basic units of sound in speech).
o Models like Hidden Markov Models (HMMs) or deep neural networks
(DNNs) are used for this task.
5. Language Modeling:
o Predicts the most likely word sequence based on grammar, syntax, and
context.
o Language models include n-grams, statistical models, and neural network-
based models.
6. Decoding:
o Combines acoustic and language models to output the final text transcription.
7. Post-Processing:
o Applies corrections to improve accuracy:
▪ Spelling correction.
▪ Punctuation insertion.
▪ Contextual adjustments.
Types of Speech Recognition Systems
1. Speaker-Dependent Systems:
o Trained for specific individuals.
o Used in applications like voice biometrics.
2. Speaker-Independent Systems:
o Works for any speaker without prior training.
o Common in general-purpose voice assistants.
3. Continuous Speech Recognition:
o Recognizes natural, flowing speech.
o Handles pauses and changes in tone.
4. Isolated Word Recognition:
o Recognizes one word at a time, requiring distinct pauses between words.
5. Multilingual Speech Recognition:
o Supports multiple languages and mixed-language input.
Applications of Speech Recognition
1. Voice Assistants:
o Systems like Siri, Alexa, and Google Assistant rely on ASR for understanding
commands.
2. Transcription Services:
o Converts spoken content into text for meetings, lectures, or legal proceedings.
3. Accessibility Tools:
o Helps individuals with disabilities by enabling voice-to-text communication.
4. Call Centers:
o Automates responses and analyzes customer sentiment during calls.
5. Language Learning:
o Provides pronunciation feedback for learners.
6. Healthcare:
o Assists in dictating and transcribing medical notes.
7. IoT and Smart Devices:
o Enables voice control for smart home systems.
Future Trends in Speech Recognition
1. Contextual Awareness:
o Enhancing systems to better understand context and intent.
2. Personalization:
o Adapting to individual users' accents, preferences, and environments.
3. Real-Time Processing:
o Faster and more efficient recognition for live applications.
4. Multimodal Integration:
o Combining speech with other inputs like gestures or text for richer
interactions.
5. Edge Computing:
o Running ASR locally on devices to enhance privacy and reduce latency.
6. Enhanced Multilingual Capabilities:
o Seamlessly supporting multiple languages in a single conversation.
import speech_recognition as sr

# Load a WAV file and transcribe it with Google's free web API
recognizer = sr.Recognizer()
with sr.AudioFile('audio.wav') as source:
    audio = recognizer.record(source)
text = recognizer.recognize_google(audio)
print(text)
6. Machine Translation
Definition: Machine Translation (MT) is the automatic translation of text or speech from one
language to another using computer software.
Types:
Tools:
from transformers import MarianMTModel, MarianTokenizer

# Assumed pretrained English-to-French MarianMT model (any MarianMT model works similarly)
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
# Translate
text = "Hello, how are you?"
inputs = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(inputs)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text)
7. Text Preprocessing
Text preprocessing involves the following tasks:
1. Text Cleaning:
o Removes unwanted elements from text that are not useful for analysis.
▪ Lowercasing: Converts all text to lowercase to ensure uniformity.
▪ Example: "Hello World!" → "hello world!"
▪ Removing Punctuation: Eliminates symbols like !, ., ,, etc.
▪ Example: "Hello, world!" → "hello world"
▪ Removing Numbers: Excludes digits if they are not relevant.
▪ Example: "I have 2 cats." → "I have cats."
▪ Removing Special Characters: Deletes symbols like @, #, $, etc.
▪ Example: "Welcome #2023!" → "Welcome"
2. Tokenization:
o Splits text into smaller units, like words or sentences.
▪ Word Tokenization: Divides text into individual words.
▪ Example: "I love NLP." → ["I", "love", "NLP"]
▪ Sentence Tokenization: Divides text into sentences.
▪ Example: "I love NLP. It's fascinating!" → ["I love NLP.", "It's
fascinating!"]
3. Stopword Removal:
o Removes common words (e.g., "is," "and," "the") that add little value to the
analysis.
▪ Example: "The dog is cute." → ["dog", "cute"]
4. Stemming:
o Reduces words to their root form by removing suffixes.
▪ Example: "running," "runner," "ran" → "run"
o Tools: Porter Stemmer, Snowball Stemmer.
5. Lemmatization:
o Converts words to their base or dictionary form using linguistic rules.
▪ Example: "running," "ran" → "run"
o Unlike stemming, lemmatization ensures that the root word is meaningful.
o Tools: WordNet Lemmatizer.
6. Removing Non-Alphanumeric Words:
o Removes words that contain non-alphabetic characters.
▪ Example: "I love NLP123!" → ["I", "love", "NLP"]
7. Handling Negations:
o Converts negations into meaningful forms.
▪ Example: "I don't like this." → ["I", "do_not", "like", "this"]
8. Text Normalization:
o Standardizes text to a consistent format.
▪ Expanding Contractions: Converts "don't" to "do not."
▪ Spell Correction: Corrects misspelled words.
▪ Example: "recieve" → "receive"
▪ Removing Accents: Normalizes text by removing accents.
▪ Example: "Café" → "Cafe"
9. Word Embedding Preparation:
o Converts text into numerical representations.
▪ Example techniques: Bag of Words, TF-IDF, Word2Vec, GloVe.
10. Handling Rare and Frequent Words:
o Removes words that occur too frequently (e.g., "and," "the") or too rarely.
▪ Rare words may add noise, and overly frequent words might not carry
unique information.
11. Removing URLs, Emails, or HTML Tags:
o Cleans up web-related content.
▪ Example: "Visit us at http://example.com!" → "Visit us"
12. Sentence Segmentation:
o Splits long text into meaningful sentences to make processing easier.
13. Padding and Truncating:
o Ensures consistent input size by adding padding or truncating long text.
▪ Used in deep learning models.
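A minimal sketch combining several of these steps with NLTK (assumes the punkt, stopwords, and wordnet resources can be downloaded; resource names may vary slightly between NLTK versions):
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

text = "Visit us at http://example.com! The 2 cats ARE running, don't miss them."

# Cleaning: lowercase, strip URLs, then drop punctuation and digits
text = text.lower()
text = re.sub(r'http\S+', '', text)
text = re.sub(r'[^a-z\s]', ' ', text)

# Tokenization, stopword removal, and lemmatization
tokens = word_tokenize(text)
tokens = [t for t in tokens if t not in stopwords.words('english')]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]
print(tokens)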
Tools for Text Preprocessing
1. Python Libraries:
o NLTK (Natural Language Toolkit): Provides tools for tokenization,
stemming, lemmatization, and stopword removal.
o SpaCy: Efficient for preprocessing large-scale text data.
o TextBlob: Easy-to-use library for text cleaning and basic NLP tasks.
o Gensim: Useful for topic modeling and word vector generation.
o TfidfVectorizer (Scikit-learn): Converts text into TF-IDF vectors.
2. Regex (Regular Expressions):
o Allows pattern matching for cleaning text.
3. Open-Source Pretrained Models:
o Hugging Face Transformers: Handles advanced preprocessing for
transformer-based models like BERT and GPT.
Applications of Text Preprocessing
1. Sentiment Analysis:
o Prepares social media posts or reviews for sentiment classification.
2. Text Classification:
o Cleans and structures data for categorizing emails, news articles, or
documents.
3. Topic Modeling:
o Helps in clustering and identifying topics in large datasets.
4. Machine Translation:
o Prepares text for language translation tasks.
5. Chatbots:
o Enables chatbots to understand user input effectively.
Challenges in Text Preprocessing
1. Language Variability:
o Handling diverse languages, slang, or regional phrases.
2. Context Loss:
o Aggressive cleaning, like stopword removal, can lose meaningful context.
3. Ambiguity:
o Words with multiple meanings (e.g., "bank" as a riverbank or financial
institution).
4. Scaling for Large Datasets:
o Preprocessing large-scale text data can be computationally intensive.
5. Dynamic Content:
o Handling ever-changing text content like social media trends.
8. Tokenization
Tokenization is the process of breaking down text into smaller units called tokens. These
tokens can be words, phrases, or even characters, depending on the level of granularity
required. Tokenization is a fundamental step in Natural Language Processing (NLP) and text
analysis, as it prepares raw text data for further processing and modelling.
Types of Tokenization
1. Word Tokenization:
o Splits text into individual words.
o Example:
▪ Input: "I love NLP."
▪ Tokens: ["I", "love", "NLP"]
2. Sentence Tokenization:
o Divides text into sentences.
o Example:
▪ Input: "I love NLP. It's fascinating!"
▪ Tokens: ["I love NLP.", "It's fascinating!"]
3. Subword Tokenization:
o Splits text into smaller units, such as subwords or syllables.
o Example (using Byte-Pair Encoding, BPE):
▪ Input: "unbelievable"
▪ Tokens: ["un", "believ", "able"]
o Common in transformer-based models like BERT and GPT to handle rare or
unknown words.
4. Character Tokenization:
o Breaks text into individual characters.
o Example:
▪ Input: "hello"
▪ Tokens: ["h", "e", "l", "l", "o"]
Tokenization Techniques
1. Rule-Based Tokenization:
o Uses predefined rules, such as splitting on spaces, punctuation, or predefined
delimiters.
o Simple but prone to errors with contractions, abbreviations, or mixed-language
text.
2. Regex-Based Tokenization:
o Uses regular expressions to define patterns for splitting text.
o Example: Splitting text based on non-alphanumeric characters.
3. Whitespace Tokenization:
o Splits text based on spaces.
o Example:
▪ Input: "Tokenization is important."
▪ Tokens: ["Tokenization", "is", "important"]
4. Subword Tokenization (BPE, WordPiece):
o Combines frequent subword units to handle unknown or rare words
efficiently.
o Example:
▪ "transformers" → ["transform", "##ers"] (WordPiece tokenization used
in BERT).
5. Library-Based Tokenization:
o Libraries like NLTK, SpaCy, or Hugging Face provide robust tokenization
methods that handle edge cases like contractions or special characters.
Challenges in Tokenization
1. Ambiguity:
o Phrases like "New York" should ideally be treated as a single token, but
simple tokenizers might split them.
2. Language Dependency:
o Different languages have different tokenization needs:
▪ English: Space-separated words.
▪ Chinese/Japanese: Requires character or subword tokenization.
3. Special Cases:
o Handling numbers, URLs, abbreviations, and contractions.
▪ Example: "it's" → ["it", "'s"]
4. Compound Words:
o Languages like German often use compound words that need specific
handling.
▪ Example: "SchwarzwälderKirschtorte" → ["Schwarzwälder", "Kirsch",
"torte"]
Applications of Tokenization
1. Text Analysis:
o Tokenization is a preprocessing step for sentiment analysis, topic modeling, or
keyword extraction.
2. Machine Translation:
o Splits text into units that can be translated accurately.
3. Text Generation:
o Helps models like GPT predict tokens and generate coherent text.
4. Search Engines:
o Enables indexing and searching by breaking down documents into tokens.
import nltk
nltk.download('punkt')
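The snippet above only downloads the tokenizer models; a minimal sketch of word and sentence tokenization with NLTK follows:
from nltk.tokenize import word_tokenize, sent_tokenize

text = "I love NLP. It's fascinating!"
print(word_tokenize(text))  # ['I', 'love', 'NLP', '.', 'It', "'s", 'fascinating', '!']
print(sent_tokenize(text))  # ['I love NLP.', "It's fascinating!"]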
9. Lemmatization
Lemmatization is the process of reducing a word to its base or dictionary form, known as a
lemma, while ensuring that the resulting word remains linguistically meaningful. Unlike
stemming, which crudely chops off word endings, lemmatization considers the context and
grammar, ensuring that the base form belongs to a valid word class (e.g., noun, verb,
adjective).
1. Morphological Analysis:
o Lemmatization uses a language's vocabulary and morphological rules to
determine the lemma of a word.
2. Context Sensitivity:
o Lemmatizers consider the word’s part of speech (POS) to determine the
correct lemma.
▪ Example:
▪ "running" as a verb → "run"
▪ "better" as an adjective → "good"
3. Tools and Techniques:
o Lemmatizers rely on linguistic databases such as WordNet or pretrained
models for context-aware processing.
Examples of Lemmatization
Word POS (Part of Speech) Lemma
running Verb run
runs Verb run
better Adjective good
geese Noun goose
children Noun child
studies Verb study
studies Noun study
Applications of Lemmatization
1. Search Engines:
o Normalizes words to improve search results.
▪ Example: Searching for "studying" also retrieves results for "study."
2. Text Classification:
o Reduces vocabulary size by grouping variations of the same word.
3. Information Retrieval:
o Enhances retrieval systems by linking related words.
4. Chatbots and NLP Systems:
o Enables bots to understand different forms of a word.
▪ Example: "helping," "helped," and "helps" → "help."
5. Sentiment Analysis:
o Improves sentiment detection by standardizing word forms.
Challenges in Lemmatization
1. Ambiguity:
o Words with multiple lemmas require context for accurate resolution.
▪ Example: "bark" (tree vs. dog sound).
2. Language-Specific Complexity:
o Requires a deep understanding of the grammar and morphology of the
language.
3. Performance Overhead:
o Lemmatization is computationally intensive compared to stemming.
4. Vocabulary Limitations:
o Limited by the linguistic database used (e.g., WordNet).
Advantages of Lemmatization
Disadvantages of Lemmatization
1. Resource-Intensive:
o Slower than stemming due to linguistic analysis.
2. Language Dependency:
o Requires language-specific rules and databases.
3. Not Always Necessary:
o For tasks like simple text categorization, stemming may suffice.
from nltk.stem import WordNetLemmatizer  # requires the NLTK 'wordnet' resource

lemmatizer = WordNetLemmatizer()
word = "better"
lemma = lemmatizer.lemmatize(word, pos="a")  # 'a' is for adjective
print(lemma)  # prints: good
10. Stemming
Stemming is the process of reducing a word to its base or root form, often by removing
suffixes or prefixes. Unlike lemmatization, stemming is a rule-based, heuristic method that
does not consider the context or part of speech of a word. The resulting "stem" may not
always be a linguistically valid word, but it serves as a representative base for various word
forms.
Stemming applies predefined rules to strip affixes (prefixes and suffixes) from words to
extract their root form. For example:
• Running → Run
• Studies → Studi
• Happily → Happili
Stemmers often use algorithms to handle common patterns of word endings, such as
removing -ing, -ed, or -ly. However, they do not always produce meaningful or valid
words.
Examples of Stemming
Word Stem
playing play
played play
player play
studies studi
cats cat
happiness happi
Common Stemming Algorithms
1. Porter Stemmer:
o One of the most widely used stemming algorithms.
o Reduces words using a series of rules to handle suffixes.
o Example:
▪ "running" → "run"
▪ "studies" → "studi"
2. Lancaster Stemmer:
o A more aggressive algorithm that reduces words more drastically.
o Example:
▪ "happiness" → "hap"
▪ "running" → "run"
3. Snowball Stemmer:
o An improved version of the Porter Stemmer, designed to work with multiple
languages.
o Example:
▪ "playing" → "play"
▪ "happier" → "happi"
4. Regex-Based Stemmer:
o Uses regular expressions to strip affixes.
o Example:
▪ Removing -ing, -ly, or -ed from words.
Applications of Stemming
1. Search Engines:
o Improves search accuracy by matching documents containing variations of a
word.
▪ Example: Searching for "runs" also retrieves results for "run" and
"running."
2. Text Classification:
o Reduces vocabulary size by grouping similar words.
3. Sentiment Analysis:
o Combines words with similar meanings to analyze overall sentiment.
4. Topic Modeling:
o Helps cluster related terms during topic identification.
Advantages of Stemming
Disadvantages of Stemming
1. Over-Stemming:
o Reduces words too aggressively, leading to loss of meaning.
▪ Example: "universal" and "university" → "univers"
2. Under-Stemming:
o Fails to reduce words that share the same root.
▪ Example: "data" and "database" are treated as separate stems.
3. Context Ignorance:
o Does not consider word usage or grammar, which can lead to inaccurate
results compared to lemmatization.
4. Language Dependency:
o Works better for languages with simple morphological structures like English.
word = "running"
stemmed_word = stemmer.stem(word)
print(stemmed_word)
Why lemmatization is often preferred over stemming:
1. Preserves Meaning
2. Context-Aware
3. Grammar Preservation
Worked Example: Tokenization, Stemming, and Lemmatization
Consider the paragraph:
"The cats are running across the field because they want to catch the mice that are running too fast."
Now, we'll tokenize, stem, and lemmatize the words from the paragraph and compare the results.
1. Tokenization
• The
• cats
• are
• running
• across
• the
• field
• because
• they
• want
• to
• catch
• the
• mice
• that
• are
• running
• too
• fast
2. Stemming
Stemming reduces words to their root form by chopping off prefixes or suffixes,
often resulting in words that are not actual dictionary forms. For stemming,
we'll use the Porter Stemmer algorithm.
Stemmed words:
• The → The
• cats → cat
• are → are
• running → run
• across → across
• the → the
• field → field
• because → becaus
• they → they
• want → want
• to → to
• catch → catch
• the → the
• mice → mice
• that → that
• are → are
• running → run
• too → too
• fast → fast
Notice how some words like "because" were reduced to "becaus," which is not a
valid word, and "running" was stemmed to "run."
3. Lemmatization
Lemmatized words:
• The → The
• cats → cat
• are → be (as auxiliary verb)
• running → run
• across → across
• the → the
• field → field
• because → because
• they → they
• want → want
• to → to
• catch → catch
• the → the
• mice → mouse (plural to singular)
• that → that
• are → be
• running → run
• too → too
• fast → fast
Here, "running" was correctly lemmatized to "run", and "mice" was lemmatized
to "mouse", which is more accurate. Additionally, the auxiliary verb "are" was
correctly identified and lemmatized to "be".
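A minimal NLTK sketch that reproduces this comparison (assumes the punkt, wordnet, and averaged_perceptron_tagger resources are available; the POS-to-WordNet mapping is a simplification added here for illustration, and exact outputs may differ slightly, e.g. the stemmer lowercases tokens):
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('punkt'); nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')

text = ("The cats are running across the field because they want to "
        "catch the mice that are running too fast.")
tokens = word_tokenize(text)

# Stemming with the Porter Stemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# POS-aware lemmatization: map Penn Treebank tags to WordNet POS categories
def wn_pos(tag):
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, wn_pos(tag)) for t, tag in nltk.pos_tag(tokens)]

for token, stem, lemma in zip(tokens, stems, lemmas):
    print(f"{token:10} stem: {stem:12} lemma: {lemma}")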
Comparison of Results:
Key Observations:
Term Frequency (TF)
Term Frequency (TF) measures how often a term appears in a document, usually normalized
by the document's length: TF(t, d) = (count of t in d) / (total terms in d).
Consider a document d:
"I love NLP and I love machine learning."
The document has 8 tokens; "love" appears twice, so TF("love", d) = 2/8 = 0.25, while
TF("NLP", d) = 1/8 = 0.125.
Applications of Term Frequency
1. Information Retrieval:
o Used in search engines to rank documents based on term relevance.
2. Text Classification:
o TF is used as a feature in classification models to identify key terms.
3. Sentiment Analysis:
o Helps identify frequent terms that may indicate sentiment polarity.
4. Keyword Extraction:
o Determines which terms are most relevant in a document.
TF as Part of TF-IDF
Implementation in Python
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus
documents = [
    "I love NLP and I love machine learning",
    "NLP is fun and exciting",
]

# Create CountVectorizer and compute raw term counts (term frequencies)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(X.toarray())
Output:
['and' 'exciting' 'fun' 'is' 'learning' 'love' 'machine' 'nlp']
[[1 0 0 0 1 2 1 1]
 [1 1 1 1 0 0 0 1]]
(Note that CountVectorizer lowercases the text and drops single-character tokens such as "I" by default.)
Best Practices
Inverse Document Frequency (IDF) is a statistical measure used in text analysis and
Natural Language Processing (NLP) to evaluate how important a term is within a corpus of
documents. Unlike Term Frequency (TF), which measures the frequency of a term in a
single document, IDF considers the rarity of a term across an entire corpus. It helps
downweight commonly occurring terms and highlight terms that are unique or rare. A
common formulation is IDF(t) = log(N / df(t)), where N is the number of documents in the
corpus and df(t) is the number of documents containing the term t.
Consider a corpus of 5 documents (the corpus used in the code example below). The term
"love" appears in 3 of the 5 documents, so IDF("love") = log(5/3) ≈ 0.51, while "future"
appears in only 1 document, so IDF("future") = log(5/1) ≈ 1.61. The rarer term therefore
receives a higher weight.
Applications of IDF
1. TF-IDF:
o IDF is often combined with Term Frequency (TF) to form the TF-IDF metric,
which highlights terms that are both frequent in a document and rare across
the corpus.
2. Search Engines:
o Used to rank documents by relevance in response to a query.
o Terms with higher IDF values contribute more to the relevance score.
3. Text Classification:
o Identifies distinctive features for categorizing documents.
4. Keyword Extraction:
o Highlights rare and important terms in a text.
5. Clustering:
o Helps group similar documents by identifying shared rare terms.
Advantages of IDF
Limitations of IDF
1. Corpus Dependency:
o IDF values depend on the specific corpus; a term may have a high IDF in one
corpus but not in another.
2. Assumes Static Corpus:
o Adding or removing documents can change IDF values, requiring
recalculation.
3. Sensitive to Rare Terms:
o Terms that appear in only one document may receive disproportionately high
IDF values.
4. Ignores Semantic Context:
o IDF is purely statistical and does not account for the meaning or relationships
between terms.
Using Scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus
documents = [
    "I love NLP",
    "NLP is amazing",
    "I love machine learning",
    "Machine learning is the future",
    "I love learning"
]

# Initialize TfidfVectorizer and fit it to the corpus
vectorizer = TfidfVectorizer(use_idf=True)
vectorizer.fit(documents)

# Inspect the learned (smoothed) IDF value for each term
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(term, round(idf, 3))
Output:
amazing 2.099
future 2.099
is 1.693
learning 1.405
love 1.405
machine 1.693
nlp 1.693
the 2.099
(These are scikit-learn's smoothed IDF values, computed as ln((1 + N) / (1 + df)) + 1; rarer terms receive higher scores.)
Best Practices
1. Preprocessing Text:
o Remove stopwords, punctuation, and irrelevant characters before calculating
IDF.
2. Handling Sparse Data:
o Use dimensionality reduction techniques like Truncated SVD to manage
sparse TF-IDF matrices.
3. Domain-Specific Vocabulary:
o Tailor the corpus to include domain-specific terms for better relevance.
4. Combine with Other Features:
o Use IDF alongside embeddings or contextual features for richer analysis.
Feature Extraction
Feature Extraction is the process of transforming raw data into a set of measurable and
meaningful features that can be used by machine learning algorithms to perform tasks such as
classification, clustering, or regression. In Natural Language Processing (NLP), it involves
converting unstructured text data into numerical representations while retaining its semantic
meaning and essential characteristics.
• Machine learning models require numerical inputs, but raw text is inherently
unstructured.
• Feature extraction simplifies and summarizes the data, highlighting the most relevant
aspects for a given task.
• It reduces dimensionality, improving model efficiency and performance.
Common feature extraction techniques include (among others):
• Word Embeddings
• N-grams
• Topic Modeling
• Feature Hashing
Tools and Libraries for Feature Extraction
1. Scikit-learn
2. Gensim
3. SpaCy
4. Hugging Face Transformers:
5. NLTK (Natural Language Toolkit):
o Useful for tokenization, POS tagging, and basic preprocessing.
Applications of Feature Extraction
1. Text Classification:
o Features are used to classify emails (e.g., spam vs. not spam) or sentiment
(positive/negative).
2. Clustering:
o Groups similar documents or sentences based on extracted features.
3. Information Retrieval:
o Extracts meaningful keywords or entities for search engines.
4. Machine Translation:
o Encodes sentences for translating from one language to another.
5. Recommendation Systems:
o Analyzes text features for personalized recommendations (e.g., books,
articles).
6. Chatbots:
o Extracts user intent and entities to generate appropriate responses.
N-grams are contiguous sequences of n items (typically words or tokens) from a given
text. The concept is commonly used in text analysis and natural language processing (NLP)
tasks to capture word relationships and context.
1. Unigram
• Definition: A unigram is a single word (or token).
• Example:
o Sentence: "I love NLP"
o Unigrams: ["I", "love", "NLP"]
2. Bigram
• Definition: A bigram consists of two consecutive words (or tokens).
• Use Case: Captures relationships between adjacent words, useful for tasks like part-
of-speech tagging and shallow context understanding.
• Example:
o Sentence: "I love NLP"
o Bigrams: ["I love", "love NLP"]
3. Trigram
• Definition: A trigram consists of three consecutive words (or tokens).
• Example:
o Sentence: "I love NLP very much"
o Trigrams: ["I love NLP", "love NLP very", "NLP very much"]
Expanded Example
Applications of N-grams
1. Unigrams:
o Simple text classification.
o Sentiment analysis with a bag-of-words approach.
2. Bigrams:
o Phrase detection.
o Spell-checking (e.g., detecting "New York" as a phrase instead of two separate
words).
3. Trigrams:
o Language modeling.
o Text prediction and generation (e.g., autocompletion).
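A minimal sketch generating unigrams, bigrams, and trigrams with NLTK's ngrams helper (assumes the punkt tokenizer data is available):
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("I love natural language processing")

# n = 1, 2, 3 produce unigrams, bigrams, and trigrams respectively
for n in (1, 2, 3):
    print(list(ngrams(tokens, n)))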
Q) Manually calculate TF-IDF for a small corpus with 2-3 documents
Corpus:
Example:
• TF Only:
o "is" would have the highest weight across all documents due to its frequent
appearance, even though it carries little semantic value.
• TF-IDF:
o Words like "amazing," "NLP," and "learning" receive higher weights because
they are less frequent and more meaningful for distinguishing the documents.
Conclusion
Using only TF can lead to feature vectors dominated by common terms, resulting in poor
model performance and reduced interpretability. By incorporating IDF, the TF-IDF approach
enhances the relevance of terms, making it a cornerstone for effective feature extraction in
text analysis and Natural Language Processing tasks.
Spam Filtering
Spam Filtering is the process of identifying and separating unwanted or irrelevant messages
(spam) from legitimate ones. Spam filtering is widely used in email systems, messaging apps,
and social media platforms to reduce user exposure to malicious or irrelevant content.
1. Data Collection:
o Collect a dataset of labeled emails or messages (spam and non-spam/ham).
o Example dataset:
▪ Spam: "Win a free iPhone! Click here."
▪ Ham: "Let's meet for lunch tomorrow."
2. Preprocessing:
o Clean and normalize the text to prepare it for analysis.
▪ Convert text to lowercase.
▪ Remove stopwords, punctuation, and special characters.
▪ Tokenize the text into words or n-grams.
3. Feature Extraction:
o Convert text into numerical features using techniques like:
▪ Bag of Words (BoW): Count occurrences of words.
▪ TF-IDF (Term Frequency-Inverse Document Frequency):
Highlight important terms while downweighting common ones.
4. Model Training:
o Train a classification model to distinguish spam from ham using extracted
features.
o Common algorithms:
▪ Naive Bayes
▪ Logistic Regression
▪ Support Vector Machines (SVM)
▪ Decision Trees or Random Forests
▪ Deep learning models (e.g., LSTMs, transformers)
5. Evaluation:
o Assess model performance using metrics such as:
▪ Accuracy
▪ Precision
▪ Recall
▪ F1-score
6. Deployment:
o Integrate the trained model into email systems or messaging platforms for
real-time spam detection.
Dataset Example:
data = [
("Win a free iPhone! Click here now.", "spam"),
("Your bill payment is due tomorrow.", "ham"),
("Congratulations! You have won a $1,000 gift card.", "spam"),
("Let’s catch up for coffee this weekend.", "ham"),
("Claim your free vacation package now!", "spam"),
]
Steps:
1. Import Libraries:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
2. Prepare Data, Train, and Evaluate:
df = pd.DataFrame(data, columns=["message", "label"])
X_train, X_test, y_train, y_test = train_test_split(df["message"], df["label"], test_size=0.4, random_state=42)
# Convert messages to TF-IDF features and train a Naive Bayes classifier
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))
Evaluation Metrics
1. Accuracy:
o Measures the proportion of correctly classified messages.
2. Precision:
o Measures the proportion of correctly identified spam messages out of all
predicted spam messages.
3. Recall:
o Measures the proportion of actual spam messages correctly identified.
4. F1-Score:
o Combines precision and recall into a single metric.
Limitations of TF-IDF for Spam Filtering
1. Loss of Context:
o TF-IDF does not capture word order or semantic relationships.
2. Static Weights:
o Weights are computed based on the training corpus and do not adapt to
changes in spam patterns.
3. Sparse Representation:
o Large corpora result in sparse TF-IDF matrices, which may require
dimensionality reduction.
Conclusion
Using TF-IDF for spam filtering enhances the feature extraction process by emphasizing rare
and distinctive terms while reducing the influence of common words. Combined with a
machine learning classifier like Naive Bayes, TF-IDF enables accurate and efficient spam
detection. While it has limitations in capturing semantic context, it remains a powerful tool
for identifying and filtering spam messages.
1. Dataset Preparation
data = [
("Win a free iPhone! Click here now.", "spam"),
("Your bill payment is due tomorrow.", "ham"),
("Congratulations! You have won a $1,000 gift card.", "spam"),
("Let’s catch up for coffee this weekend.", "ham"),
("Claim your free vacation package now!", "spam"),
]
2. Data Preprocessing
4. Train-Test Split
• Split the data into training and testing sets to evaluate the model's performance.
6. Model Evaluation
• Evaluate the model using metrics like accuracy, precision, recall, and F1-score.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
# Example data
data = [
("Win a free iPhone! Click here now.", "spam"),
("Your bill payment is due tomorrow.", "ham"),
("Congratulations! You have won a $1,000 gift card.", "spam"),
("Let’s catch up for coffee this weekend.", "ham"),
("Claim your free vacation package now!", "spam"),
]
# Convert to DataFrame
df = pd.DataFrame(data, columns=["message", "label"])
X = df["message"]
y = df["label"]
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Feature extraction with TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
5. Train a Model
# Use Naive Bayes Classifier
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
6. Make Predictions
y_pred = model.predict(X_test_tfidf)
# Evaluation Metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Modeling Output
Assume the dataset was split into training and testing as:
• Training:
o "Win a free iPhone! Click here now." → spam
o "Your bill payment is due tomorrow." → ham
o "Congratulations! You have won a $1,000 gift card." → spam
• Testing:
o "Let’s catch up for coffee this weekend." → ham
o "Claim your free vacation package now!" → spam
Example Output:
Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00         1
        spam       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
Limitations of TF-IDF
1. Loss of Context:
o TF-IDF does not consider the word order or semantic relationships between
words.
2. Static Weights:
o TF-IDF weights are static and do not adapt to changing corpus characteristics.
3. Sparse Matrix:
o For large corpora, the TF-IDF matrix can be very sparse, leading to
computational inefficiency.
Applications of TF-IDF
1. Spam Filtering:
o Identifies spam messages based on distinctive keywords (e.g., "win," "free,"
"offer").
2. Text Classification:
o Classifies documents into categories such as news topics or product reviews.
3. Sentiment Analysis:
o Identifies positive or negative sentiment in customer feedback.
4. Search Engines:
o Improves search relevance by prioritizing documents with important query
terms.
5. Keyword Extraction:
o Extracts significant keywords from a document or webpage.
Conclusion
TF-IDF is a powerful feature extraction method for converting text into a numerical format
that machine learning models can use effectively. By combining TF-IDF with classification
algorithms, you can build robust models for tasks like spam filtering, sentiment analysis, and
text classification.
Extracting customer opinions and classifying them as positive, negative, or neutral
refers to:
Answer: Sentiment Analysis (also known as Opinion Mining).
Q) What is NLP? Explain the different levels of NLP.
Answer: Natural Language Processing (NLP) is a field of artificial intelligence that focuses
on the interaction between computers and humans through natural language. The goal of NLP
is to enable computers to understand, interpret, and generate human language in a valuable
way. The different levels of NLP include:
1. Lexical Analysis: Analyzing the structure of words and their composition from
characters.
2. Syntactic Analysis (Parsing): Involves analyzing words in a sentence for grammar
and arranging words in a manner that shows the relationships among the words.
3. Semantic Analysis: Determines the meanings of words and how sentences are
composed, ensuring that the interpretations are sensible.
4. Discourse Integration: The meaning of a sentence may depend on the sentences that
precede it and might influence those that follow.
5. Pragmatic Analysis: Deals with the effective use of language in context and the
strategies used to influence the conversation.
Q) Differentiate between Stemming and Lemmatization.
Answer:
• Stemming: This is the process of reducing a word to its word stem by stripping
affixes (suffixes and prefixes). Stemming is typically used to bring variations of
words to a common base form; however, the result is often not a valid dictionary
word, which can distort meaning and spelling.
• Lemmatization: It involves the use of a vocabulary and morphological analysis of
words, aiming to remove inflectional endings only and to return the base or dictionary
form of a word, which is known as the lemma. It is more sophisticated than stemming
and uses a lexical resource such as WordNet to ensure that the root word belongs to
the language.
Q) Explain the TF-IDF feature extraction concept. Compute the TF-IDF for the term
"Computer" in a document where the term frequency (TF) of "Computer" is 5 and its
inverse document frequency (IDF) is 2.
Answer:
TF-IDF weights a term by multiplying how often it occurs in a document (TF) by how rare
it is across the corpus (IDF): TF-IDF(t, d) = TF(t, d) × IDF(t). Terms that are frequent in a
document but rare in the corpus receive the highest weights.
TF-IDF = TF × IDF = 5 × 2 = 10
This means the TF-IDF score for "Computer" in this document is 10.
UNIT 2
POS (Parts-of-Speech) Tagging
Parts-of-speech tagging is a linguistic activity in Natural Language Processing (NLP)
wherein each word in a document is given a particular part of speech (adverb, adjective,
verb, etc.) or grammatical category. By adding a layer of syntactic and semantic information
to the words, this procedure makes it easier to comprehend a sentence's structure and
meaning.
In NLP applications, POS tagging is useful for machine translation, named entity
recognition, and information extraction, among other things. It also helps resolve ambiguity
for words with multiple meanings and reveals a sentence's grammatical structure.
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

# Sample text
text = "NLTK is a powerful library for natural language processing."
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))
1. The Penn Treebank Tagset: This is one of the most widely used tagsets
for English. It includes detailed parts of speech like NN (noun, singular),
NNS (noun plural), VBP (verb, present tense, not 3rd person singular), JJ
(adjective), etc.
2. The Universal POS Tagset: Simplified compared to the Penn Treebank
Tagset, the Universal Tagset focuses on broad categories and is used in
multilingual tagging contexts. It includes tags like NOUN, VERB, ADJ
(adjective), ADV (adverb), PRON (pronoun), DET (determiner),
ADP (adposition), NUM (numeral), CONJ (conjunction), and PRT
(particle).
3. The Brown Corpus Tagset: One of the earliest tagsets, used in the
Brown Corpus, a collection of text samples from a wide variety of
sources, designed to be a representative mix of modern American
English. It includes a mix of simple and complex tags, incorporating
elements like verb forms and tense.
4. The CLAWS Tagset (the Constituent Likelihood Automatic Word-
tagging System): Used particularly in the British National Corpus
(BNC), this tagset is highly detailed and is used for automated tagging of
texts based on a hidden Markov model. It has several subcategories for
each part of speech to capture the nuances in the language.
POS tagging is typically performed using algorithms that may involve rule-
based systems, machine learning, or a combination of both. The choice of tagset
often depends on the specific requirements of the application, such as the level
of granularity needed in the linguistic analysis, the language of the text (even
within English, different dialects might be better served by different tagsets),
and the computational resources available.
Named Entity Recognition
Ambiguity in NER
• For a person, the category definition is intuitively quite clear, but
for computers, there is some ambiguity in classification. Let’s
look at some ambiguous examples:
o England (Organization) won the 2019 world cup
vs The 2019 world cup happened in England
(Location).
How Named Entity Recognition (NER) works?
The working of Named Entity Recognition is discussed below:
• The NER system analyses the entire input text to identify and
locate the named entities.
• The system then identifies the sentence boundaries by
considering capitalization rules. It recognizes the end of the
sentence when a word starts with a capital letter, assuming it
could be the beginning of a new sentence. Knowing sentence
boundaries aids in contextualizing entities within the text,
allowing the model to understand relationships and meanings.
• NER can be trained to classify entire documents into different
types, such as invoices, receipts, or passports. Document
classification enhances the versatility of NER, allowing it to
adapt its entity recognition based on the specific characteristics
and context of different document types.
• NER employs machine learning algorithms, including supervised
learning, to analyze labeled datasets. These datasets contain
examples of annotated entities, guiding the model in recognizing
similar entities in new, unseen data.
• Through multiple training iterations, the model refines its
understanding of contextual features, syntactic structures, and
entity patterns, continuously improving its accuracy over time.
• The model’s ability to adapt to new data allows it to handle
variations in language, context, and entity types, making it more
robust and effective.
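A minimal NER sketch with spaCy (assumes the en_core_web_sm model has been installed, e.g. via python -m spacy download en_core_web_sm):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google was founded by Larry Page and Sergey Brin")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Google ORG, Larry Page PERSON, Sergey Brin PERSON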
Named Entity Recognition (NER) Methods
Vectorization
Vectorization is the process of converting text data into numerical
vectors. In the context of Natural Language Processing (NLP),
vectorization transforms words, phrases, or entire documents into a
format that can be understood and processed by machine learning
models. These numerical representations capture the semantic meaning
and contextual relationships of the text, allowing algorithms to perform
tasks such as classification, clustering, and prediction.
Word Embeddings
Word embeddings are dense vector representations of words in a
continuous vector space, where semantically similar words are located
closer to each other. These embeddings capture the context of a word, its
syntactic role, and semantic relationships with other words, leading to
better performance in various NLP tasks.
Advantages:
• Captures semantic meaning and relationships between words.
• Dense representations are computationally efficient.
• Handles out-of-vocabulary words (especially with FastText).
Disadvantages:
• Requires large corpora for training high-quality embeddings.
• May not capture complex linguistic nuances in all contexts.
How removing stop words affects NLP models:
1. CBOW (Continuous Bag of Words)
Explanation: CBOW is a model used in natural language processing to predict a target word
from a set of context words surrounding it. This model is part of the Word2Vec approach to
word embeddings, where words are represented as vectors in a continuous vector space.
CBOW tends to predict the probability of a word given a context—a reverse application of
the common Bag of Words model, which involves representing text data as a set of words
along with their frequency of occurrence.
Example: In the sentence "The cat sat on the ___", CBOW would try to predict the word
"mat" using the context words 'the', 'cat', 'sat', 'on', 'the'.
2. Word Cloud
Explanation: A word cloud is a visual representation of text data where the size of each word
indicates its frequency or importance in the corpus. It's often used for exploring data,
understanding key themes, and presenting text data in an accessible format.
Example: In analyzing customer reviews for a product, a word cloud could highlight
frequently mentioned words like "quality", "price", "delivery", which helps in quickly
perceiving customer sentiments and concerns.
3. Word2Vec
Example: After training on a large corpus, words like "king" and "queen" will have similar
vector representations as both share contextual similarities.
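A minimal Word2Vec sketch with Gensim (the tiny toy corpus below is an illustrative assumption; meaningful embeddings require a large corpus):
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# Train a small skip-gram model (sg=1); vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv["king"][:5])           # first 5 dimensions of the "king" vector
print(model.wv.most_similar("king"))  # nearest neighbours in the toy vector space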
4. GloVe (Global Vectors)
Example: Similar to Word2Vec, in GloVe, words like "Paris" and "France" will be closer
together in the vector space than "Paris" and "banana".
5. ELMo (Embeddings from Language Models)
Explanation: ELMo is a deep contextualized word representation that models both complex
characteristics of word use (like syntax and semantics), and how these uses vary across
linguistic contexts (i.e., to model polysemy). ELMo representations are functions of the entire
input sentence, differentiated from traditional word embeddings like Word2Vec or GloVe.
Example: ELMo can understand that "stick" in "walking stick" and "stick the photo"
represent different meanings, based on context.
6. Topic Modeling
7. Latent Dirichlet Allocation (LDA)
Example: If applied to a set of news articles, LDA might find topics such as "politics,"
"sports," and "economy," based on the distribution of words.
8. Applications of LDA
9. POS Tagging
Explanation: POS tagging tutorials typically involve teaching how to use various NLP
libraries to assign part-of-speech tags to each word in a given text. Common libraries include
NLTK, spaCy, and Stanford NLP.
Example: Given the sentence "Apple is looking at buying U.K. startup for $1 billion", a POS
tagger will label "Apple" as a proper noun, "is" as a verb, "looking" as a verb, etc.
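A minimal POS tagging sketch with spaCy on the same sentence (assumes the en_core_web_sm model is installed):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.tag_)  # coarse and fine-grained tags per token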
10. Named Entity Recognition (NER)
Explanation: NER is an NLP task of identifying and classifying named entities in text into
predefined categories such as the names of persons, organizations, locations, expressions of
times, quantities, monetary values, percentages, etc.
Example: In the sentence "Google was founded by Larry Page and Sergey Brin", an NER
model would identify "Google" as an Organization, "Larry Page" and "Sergey Brin" as
Persons.
These techniques and models are foundational to modern NLP applications and are widely
used in various real-world applications such as voice-operated GPS systems, customer
service automation, and organizing large datasets.