NLP Unit 1
1. Introduction to Natural Language Processing (NLP)
NLP bridges the gap between human communication and computer understanding. Unlike
programming languages, natural languages are highly ambiguous, complex, and
context-dependent, making NLP a challenging yet essential area of research and
application.
Goals of NLP:
1. Understanding Language: Parsing and comprehending the syntax and semantics of human language.
2. Translation and Interpretation: Converting text or speech from one language to another.
3. Knowledge Extraction: Deriving structured information from unstructured text (e.g., named entity recognition, relation extraction).
Key Components of NLP:
1. Morphological Analysis: Breaking words into morphemes (e.g., root words, affixes)
to understand grammatical roles.
2. Syntactic Analysis (Parsing): Analyzing sentence structure using grammar rules
(e.g., constituency and dependency parsing).
3. Semantic Analysis: Understanding the meaning of individual words and sentences
(e.g., word sense disambiguation).
4. Pragmatic Analysis: Interpreting meaning based on context, speaker intention, and
conversational implicature.
5. Discourse Analysis: Studying how sentences are connected to form coherent
narratives or arguments.
6. Speech Processing: Includes both speech recognition (ASR) and speech synthesis
(TTS) to handle spoken language.
Common NLP Tasks:
● Tokenization
● Sentiment Analysis
● Machine Translation
● Question Answering
● Text Summarization
● Coreference Resolution
Approaches to NLP:
1. Rule-Based Methods: Traditional NLP systems use hand-crafted linguistic rules and
grammars.
2. Statistical NLP: Involves probabilistic models trained on large corpora (e.g., Hidden
Markov Models, Naive Bayes).
3. Deep Learning-based NLP: Leverages neural networks, including RNNs, LSTMs, and Transformers (e.g., BERT, GPT), to model complex language representations and generate human-like text.
2. Applications of Natural Language Processing (NLP)
1. Machine Translation
NLP is widely used in automatic language translation, where text or speech is translated
from one language to another. Modern systems like Google Translate, DeepL, and
Amazon Translate use deep learning models such as Transformers to achieve high-quality
translations by preserving context, syntax, and semantics.
2. Sentiment Analysis
Sentiment analysis determines the emotional tone behind a body of text. It is heavily used
in social media monitoring, customer feedback analysis, and brand reputation
management to classify opinions as positive, negative, or neutral.
3. Chatbots and Virtual Assistants
NLP enables intelligent dialogue systems such as chatbots and voice-based virtual
assistants (e.g., Siri, Alexa, Google Assistant) to understand user queries and generate
appropriate responses. These systems combine intent detection, entity recognition, and
context tracking.
4. Search Engines and Information Retrieval
Search engines use NLP to interpret user queries and fetch relevant documents. NLP helps
in query expansion, spelling correction, and semantic matching, improving the accuracy
and relevance of search results.
● Example: Google understanding a question like “What is the capital of France?” and
directly answering “Paris.”
5. Text Summarization
Text summarization automatically condenses a long document into a shorter version that preserves its key information.
● Two types:
○ Extractive: selects the most important sentences directly from the source text.
○ Abstractive: generates new sentences that paraphrase the main ideas.
6. Named Entity Recognition (NER)
NER identifies and categorizes entities such as names of people, organizations, locations,
dates, etc., in a given text. It is crucial for information extraction and knowledge graph
construction.
7. Speech Recognition and Synthesis
NLP, combined with speech technologies, enables Automatic Speech Recognition (ASR)
and Text-to-Speech (TTS) systems. These are used in voice assistants, dictation
software, and language learning apps.
● Example: Converting spoken commands into text or reading out emails aloud.
8. Text Classification
NLP helps classify text into categories such as spam vs. non-spam, topic classification,
or news categories. Machine learning models are trained to detect patterns in text.
● Example: Email services like Gmail filtering spam using NLP-based classifiers.
9. Text Similarity and Plagiarism Detection
NLP models are used to measure semantic similarity between documents, helping in
detecting plagiarism, duplicate content, and paraphrasing.
3. Pros and Cons of NLP
✅ Pros of NLP
1. Efficient Processing of Large Volumes of Text
NLP automates the analysis of massive textual datasets, which would be time-consuming or
impossible for humans.
2. Improved Human-Computer Interaction
NLP enables more natural interaction with machines through chatbots, voice assistants, and
intelligent agents.
● Example: Siri, Alexa, and Google Assistant understanding and responding to human
speech.
3. Language Translation
NLP has enabled real-time, multi-language communication through tools like Google
Translate and DeepL, helping overcome language barriers.
4. Sentiment and Opinion Insights
NLP helps in gauging public sentiment from social media, product reviews, or political
speeches, enabling better decision-making.
5. Cost Savings and Automation
Automated chatbots and virtual agents reduce the need for human intervention, cutting down
costs and increasing availability.
6. Accessibility
NLP-based speech recognition and text-to-speech systems assist people with disabilities
(e.g., visually impaired users or those with motor difficulties).
7. Improved Search and Information Retrieval
NLP improves search relevance by understanding intent and context in user queries.
● Example: Answering the question “What is the capital of France?” directly with
“Paris” rather than listing websites.
❌ Cons of NLP
1. Ambiguity in Natural Language
Natural languages are inherently ambiguous. Words may have multiple meanings depending
on context, which makes accurate interpretation difficult.
● Example: The word “bank” could mean a financial institution or the side of a river.
2. Contextual Understanding is Limited
Many NLP models still struggle with understanding deep semantics, sarcasm, idioms, or
cultural references.
● Example: Detecting sarcasm in a review like “Oh great, the phone broke on the first
day!” can be challenging.
3. High Data and Resource Requirements
Modern NLP systems, especially deep learning models, require vast amounts of labeled
data and computational resources for training.
4. Bias in Training Data
NLP models may inadvertently learn and amplify biases present in training data, leading to
discriminatory or offensive outputs.
5. Privacy Concerns
Processing sensitive textual data (like emails, health records, or chat logs) raises serious
privacy and data protection issues.
6. Limited Support for Low-Resource Languages
Many NLP systems are optimized for high-resource languages like English. Low-resource
languages, dialects, and regional vernaculars are often underrepresented.
7. Overreliance on Automated Systems
People may rely too heavily on NLP systems (e.g., machine translation or grammar
checkers), which may still produce errors and lead to misunderstandings or misinformation.
4. Levels/Phases of NLP
1. Lexical Analysis
Definition
Lexical analysis involves breaking a text down into individual words, phrases, or tokens while removing unnecessary elements like punctuation and stopwords. It focuses on word-level processing.
Key Tasks
● Tokenization, stopword removal, and normalization of word forms.
Example
● "The cats are sleeping." → ["The", "cats", "are", "sleeping"]
2. Syntactic Analysis (Parsing)
Definition
Syntactic analysis checks whether a sentence follows the correct grammatical structure by analyzing word arrangements using parsing techniques.
Key Tasks
● Constituency and dependency parsing, grammar checking.
Example
● Incorrect Syntax: "She go to school." ❌
● Corrected Sentence: "She goes to school." ✔
3. Semantic Analysis
Definition
Semantic analysis focuses on understanding the actual meaning of words and sentences by considering their relationships and context.
Key Tasks
● Word sense disambiguation, semantic role labeling.
Example
● In "He deposited money in the bank," the word "bank" is resolved to a financial institution rather than a riverbank.
4. Discourse Integration
Definition
Discourse integration ensures that sentences are connected logically and derive meaning based on the overall conversation or document.
Key Tasks
● Coreference resolution, tracking how each sentence depends on those before it.
Example
● In "John went to the store. He bought milk," the pronoun "He" refers to "John."
5. Pragmatic Analysis
Definition
Pragmatic analysis interprets the intended meaning of language in context, taking into account speaker intention and real-world knowledge.
Key Tasks
● Intent detection, interpreting indirect speech acts and conversational implicature.
Example
● "Can you pass the salt?" is understood as a request, not a question about ability.
Regular Expressions in NLP
A Regular Expression (regex) is a formal language used to define search patterns within
text. In Natural Language Processing (NLP), regular expressions are crucial for tasks
involving pattern matching, text preprocessing, tokenization, and information
extraction. Regex provides a compact and flexible way of identifying specific strings or
character patterns within larger bodies of text.
Regular expressions are composed of special symbols and operators that define a
pattern. These patterns are then used to match or extract specific types of data from text.
Components of Regular Expressions:
1. Literals
○ Ordinary characters that match themselves exactly.
○ Examples: the pattern cat matches the string "cat".
2. Metacharacters
○ Special symbols with reserved meanings that control how matching is done.
○ Examples: . (any single character), ^ (start of string), $ (end of string), [ ] (character class).
○ Quantifiers control repetition:
■ *: 0 or more times
■ +: 1 or more times
■ ?: 0 or 1 time
○ () groups characters.
Applications of Regex in NLP:
1. Tokenization
○ Splitting text into words, sentences, or sub-words using regex-based delimiters, as shown in the sketch below.
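As a quick illustration, here is a minimal regex tokenization sketch using Python's built-in re module; the sample sentence is made up for demonstration.

```python
# Minimal regex tokenization sketch using Python's built-in re module.
import re

text = "Dr. Smith paid $4.50 for 2 apples!"  # sample text (made up)
# \w+ matches runs of word characters; [^\w\s] matches a single
# punctuation/symbol character, so punctuation becomes its own token.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Dr', '.', 'Smith', 'paid', '$', '4', '.', '50', 'for', '2', 'apples', '!']
```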
Morphological Analysis
Introduction
Morphological Analysis in Natural Language Processing (NLP) refers to the study of the
internal structure of words and how they are formed from smaller meaningful units called
morphemes. It plays a foundational role in various NLP applications such as text
normalization, part-of-speech tagging, machine translation, and information retrieval.
What is Morphology?
● Morphology is the branch of linguistics that deals with the structure of words.
● Morphemes are the smallest meaningful units of language and are of two types:
○ Free morphemes can stand alone as words (e.g., "book", "run").
○ Bound morphemes cannot stand alone and must be attached (e.g., "-ing", "-ed", "un-").
Types of Morphology
1. Inflectional Morphology
○ Modifies a word to express grammatical features such as tense or number without changing its part of speech.
○ Examples: "walk" → "walked", "cat" → "cats"
2. Derivational Morphology
○ Creates new words by adding prefixes or suffixes, often changing the part of
speech or core meaning.
○ Examples: "happy" → "happiness", "teach" → "teacher", "do" → "undo"
Techniques for Morphological Analysis:
1. Lemmatization: reduces a word to its dictionary base form (lemma).
2. Stemming: strips affixes to obtain a crude stem.
Steps in Morphological Analysis:
1. Tokenization
2. Lemmatization/Stemming
Applications of Morphological Analysis:
● Machine Translation
Challenges in Morphological Analysis:
1. Ambiguity: the same surface form can have multiple analyses (e.g., "leaves" as the plural of "leaf" or a form of the verb "leave").
Tokenization
Introduction
Tokenization is the process of splitting raw text into smaller units called tokens. It is usually the first step in an NLP pipeline.
What is a Token?
A token is a single meaningful unit of text, such as a word, punctuation mark, number, or sub-word fragment.
Types of Tokenization
1. Word Tokenization: splits text into individual words.
2. Sentence Tokenization: splits text into sentences.
○ Example:
Input: “NLP is exciting. It is used in many applications.”
Output: [“NLP is exciting.”, “It is used in many applications.”]
● English and similar languages: Space and punctuation are sufficient for
tokenization.
Approaches to Tokenization
1. Rule-based tokenization: relies on whitespace, punctuation, and regex patterns.
2. Statistical/learned tokenization (e.g., sub-word tokenizers):
○ More flexible and accurate for non-standard text (e.g., social media).
Challenges in Tokenization
● Multiword expressions: Phrases like "New York" may need to be treated as one
token.
Applications of Tokenization
● Text Classification
● Machine Translation
● Speech Recognition
● Information Extraction
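A short sketch of word- and sentence-level tokenization using NLTK's off-the-shelf tokenizers; this assumes the nltk package and its 'punkt' models are installed, and any comparable tokenizer library would work similarly.

```python
# Word and sentence tokenization with NLTK's pre-trained 'punkt' models.
import nltk
nltk.download("punkt", quiet=True)  # one-time download of tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is exciting. It is used in many applications."
print(sent_tokenize(text))  # ['NLP is exciting.', 'It is used in many applications.']
print(word_tokenize(text))  # ['NLP', 'is', 'exciting', '.', 'It', 'is', ...]
```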
Stemming
Definition
Stemming is the process of chopping off prefixes or suffixes from words to get to the base
form (stem), which may not necessarily be a valid word in the language.
Example:
● "connection", "connected", "connecting" → "connect"
● "studies" → "studi" (the stem need not be a valid word)
Importance of Stemming:
● Improves search relevance: Enables matching of queries with different word forms.
● Increases recall: Retrieves more relevant documents even if they contain word
variations.
Common Stemming Algorithms:
1. Porter Stemmer
○ One of the most widely used stemmers, developed by Martin Porter in 1980.
Advantages of Stemming
● Simple and fast to compute.
● Reduces vocabulary size, which helps indexing and search.
Disadvantages of Stemming
● Over-stemming: Unrelated words are reduced to the same stem. E.g., "universe" and "university" → "univers"
● Under-stemming: Words with the same root are not stemmed to the same form. E.g., "analysis" and "analyses"
Applications of Stemming
● Search engines and information retrieval
● Text mining and document clustering
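A minimal stemming sketch with NLTK's PorterStemmer; the word list is illustrative:

```python
# Stemming with NLTK's PorterStemmer: note that stems such as 'fli'
# and 'studi' are not valid English words.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connection", "connected", "running", "flies", "studies"]:
    print(word, "->", stemmer.stem(word))
# connection -> connect, connected -> connect, running -> run,
# flies -> fli, studies -> studi
```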
Lemmatization
Definition
Lemmatization is the process of grouping together different inflected forms of a word so they
can be analyzed as a single item called a lemma. The lemma is the dictionary form or base
form of the word.
Example:
● “better” → “good”
● “running” → “run”
How Lemmatization Works
● Input: A word along with its part-of-speech (POS) tag (e.g., noun, verb, adjective).
● Rules: Morphological rules are applied depending on the POS to return the base
form.
Types of Lemmatization
1. Rule-based lemmatization: applies hand-written morphological rules to strip inflections.
2. Dictionary-based lemmatization: looks words up in a lexical resource such as WordNet.
● Example:
○ “saw” as noun → lemma: “saw”
○ “saw” as verb → lemma: “see”
Advantages of Lemmatization
● Produces valid dictionary words, making results more interpretable than stems.
● More accurate than stemming because it uses context (POS) and vocabulary.
Disadvantages of Lemmatization
● Slower and more resource-intensive than stemming.
● Requires POS information and a lexical resource such as WordNet.
Applications of Lemmatization
● Sentiment Analysis
Considers different word forms as the same base word to improve accuracy.
● Machine Translation
Reduces complexity by mapping variants to base forms.
● Question Answering Systems
Normalizes input for better matching.
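A minimal lemmatization sketch with NLTK's WordNetLemmatizer; this assumes the wordnet corpus is downloaded, and the pos argument uses WordNet's tags ('n' noun, 'v' verb, 'a' adjective):

```python
# Lemmatization with NLTK's WordNetLemmatizer: the POS tag changes the result.
import nltk
nltk.download("wordnet", quiet=True)  # one-time download of the WordNet corpus
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("saw", pos="n"))      # saw (noun form unchanged)
print(lemmatizer.lemmatize("saw", pos="v"))      # see
```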
Term Frequency (TF)
Definition
Term Frequency (TF) is a numerical measure that indicates how often a specific word
appears in a document relative to the total number of words in that document. It is used in
Natural Language Processing (NLP) to analyze the importance of words in a text.
Formula
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
Explanation
● The more frequently a word appears in a document, the higher its TF value.
● However, common words like "the," "is," and "and" may have high TF values but are
not necessarily important for distinguishing documents.
● TF alone does not consider how common a word is across multiple documents,
which is why it is often combined with Inverse Document Frequency (IDF) in the
TF-IDF approach.
Example
Consider a document containing 8 words in total, where "is" and "an" each appear once:
Word    Frequency   Total Words   TF    Value
is      1           8             1/8   0.125
an      1           8             1/8   0.125
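A tiny sketch computing TF by hand; the 8-word sentence is hypothetical, chosen so that "is" and "an" each appear once, matching the table above:

```python
# Computing term frequency (TF) by hand for a hypothetical 8-word document.
from collections import Counter

doc = "NLP is an exciting and very useful field".split()  # 8 tokens (made up)
counts = Counter(doc)
total = len(doc)
for word in ["is", "an"]:
    print(word, counts[word] / total)  # 1/8 = 0.125 for each
```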
Inverse Document Frequency (IDF)
Definition
Inverse Document Frequency (IDF) measures how important a term is across a collection of documents: terms that occur in many documents carry little distinguishing information.
Formula
IDF(t) = log(N / n_t)
Where:
● N = total number of documents in the collection
● n_t = number of documents containing the term t
Explanation
● If a word appears in many documents, its IDF value is low (less important).
● If a word appears in fewer documents, its IDF value is high (more important).
● Common words like "the," "is," and "and" have low IDF values, while domain-specific
words have higher IDF values.
Example
● The word "the" appears in all documents, so its IDF is 0 (not useful for distinguishing
documents).
● The word "algorithm" appears in only one document, so it has the highest IDF (most
important for classification).
Use of IDF:
1. TF-IDF Calculation: IDF is combined with Term Frequency (TF) to assign weights to words.
TF-IDF (Term Frequency-Inverse Document Frequency)
Definition
TF-IDF is a weighting scheme that scores how important a word is to a particular document within a collection, combining how often the word occurs in the document with how rare it is across documents.
Formula
TF-IDF(t, d) = TF(t, d) × IDF(t)
Where:
● TF(t, d) = term frequency of t in document d
● IDF(t) = inverse document frequency of t across the collection
○ Words with higher TF-IDF values are more important for document classification.
Example Calculation
If TF("algorithm", d) = 0.125 and the collection has 3 documents of which only one contains "algorithm", then IDF = log(3/1) ≈ 0.48, so TF-IDF ≈ 0.125 × 0.48 ≈ 0.06. (The 3-document collection is assumed here for illustration.)
1. Text Classification – Used to convert text into feature vectors for machine learning
models.
2. Information Retrieval – Search engines rank documents based on TF-IDF scores.
3. Topic Modeling – Identifies keywords in documents to determine topics.
4. Spam Detection – Helps filter out spam emails by analyzing word importance.
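As an illustration, here is a minimal TF-IDF sketch using scikit-learn's TfidfVectorizer; the three toy documents are invented, and note that sklearn's variant uses a smoothed IDF rather than the plain log(N / n_t) formula above:

```python
# TF-IDF feature vectors with scikit-learn (uses a smoothed IDF internally).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a new algorithm for text classification",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # sparse matrix: (3 docs x vocab)
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(tfidf.toarray().round(2))             # per-document TF-IDF weights
```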
11. Explain Parts of Speech Tagging in Natural Language Processing.
Introduction
Parts of Speech (POS) Tagging, also called Grammatical Tagging or POS Labeling, is a
fundamental task in Natural Language Processing (NLP) that involves assigning a part of
speech to each word in a sentence. Parts of speech indicate the grammatical category of a
word based on its definition and context, such as noun, verb, adjective, adverb, pronoun,
preposition, conjunction, etc. POS tagging is essential for understanding the syntactic
structure and meaning of sentences and serves as a foundational step for many advanced
NLP tasks.
Definition
POS tagging is the task of automatically assigning each word in a sentence a tag indicating its grammatical category, based on both the word itself and its context.
Example:
● "The/DT quick/JJ brown/JJ fox/NN jumps/VBZ" (using Penn Treebank tags)
The exact tagset may vary depending on the language and tagging scheme (e.g., Penn
Treebank, Universal POS tags).
Challenges in POS Tagging
● Ambiguity: Many words are ambiguous and can serve multiple POS categories depending on context.
○ Examples: "book" can be a noun ("read a book") or a verb ("book a ticket").
● Unknown words: Words not seen during training (e.g., new words, typos).
HMM-Based Tagging
● Given a sequence of words w_1, w_2, ..., w_n, the goal is to find the most probable tag sequence t_1, t_2, ..., t_n for these words.
● The Viterbi algorithm is commonly used to find the best tag sequence efficiently, as sketched below.
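A compact Viterbi sketch for a two-tag HMM; all transition and emission probabilities below are hand-made toy values for illustration, not trained parameters:

```python
# Viterbi decoding for a toy 2-state HMM POS tagger (toy probabilities).
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"dogs": 0.4, "bark": 0.1},
          "VERB": {"dogs": 0.05, "bark": 0.5}}

def viterbi(words):
    # V[t][s] = probability of the best tag path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s].get(words[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({}); back.append({})
        for s in states:
            # pick the best previous state leading into s
            prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s].get(words[t], 1e-6)
            back[t][s] = prev
    # backtrack from the best final state to recover the tag sequence
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["dogs", "bark"]))  # expected: ['NOUN', 'VERB']
```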
Evaluation Metrics
● Accuracy: the percentage of tokens assigned the correct tag.
Applications of POS Tagging
● Machine Translation
● Sentiment Analysis
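In practice, an off-the-shelf tagger can be used directly; here is a short sketch with NLTK's default perceptron tagger (Penn Treebank tagset), assuming the required NLTK resources are downloaded:

```python
# POS tagging with NLTK's pre-trained averaged perceptron tagger.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#  ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
```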
Named Entity Recognition (NER)
NER is the process of detecting proper nouns and specialized phrases in text and labeling them with entity types. This involves two main steps:
1. Detecting the boundaries of entity mentions in the text.
2. Classifying each mention into a type such as PERSON, ORGANIZATION, or LOCATION.
Example:
Sentence: “Barack Obama was the president of the United States.”
Output: “Barack Obama” → PERSON, “United States” → LOCATION
Importance of NER
● Widely used in domains like finance, healthcare, legal documents, and social media
analysis.
Challenges in NER
● Ambiguity: Some words can refer to different entity types depending on context.
Approaches to NER
1. Rule-Based Approaches
○ Use hand-crafted patterns and gazetteers (lists of known entities).
2. Machine Learning-Based Approaches (e.g., Conditional Random Fields)
○ Requires feature engineering (e.g., word shape, POS tags, context words).
3. Deep Learning-Based Approaches
○ Neural sequence models (e.g., BiLSTM-CRF, Transformer-based models) that learn features automatically.
● Sequence Labeling: Assigning labels to each token indicating the entity type and
boundary using BIO (Beginning, Inside, Outside) or similar schemes.
Evaluation Metrics
● F1-Score: Harmonic mean of precision and recall, commonly used to assess NER
performance.
Applications of NER
● Information extraction and knowledge graph construction
● Question answering and semantic search
● Analyzing documents in finance, healthcare, and legal domains
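A short NER sketch using spaCy's small English model; this assumes spaCy is installed and the model has been downloaded via `python -m spacy download en_core_web_sm`:

```python
# Named entity recognition with spaCy's pre-trained English pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was the president of the United States.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Barack Obama -> PERSON
# the United States -> GPE  (spaCy's tag for countries, cities, states)
```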
N-grams
Definition
An N-gram is a contiguous sequence of N items (usually words) from a text: N = 1 gives unigrams, N = 2 bigrams, and N = 3 trigrams.
N-grams serve as a foundational tool in many NLP applications because they help capture
local word dependencies and contextual information without requiring complex syntax or
semantic analysis.
How N-grams Work
An N-gram model slides a window of N consecutive words across the text.
Example:
Sentence: “Natural Language Processing is fun”
● Unigrams (N=1): “Natural”, “Language”, “Processing”, “is”, “fun”
● Bigrams (N=2): “Natural Language”, “Language Processing”, “Processing is”, “is fun”
● Trigrams (N=3): “Natural Language Processing”, “Language Processing is”, “Processing is fun”
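A minimal sketch that generates these N-grams in a few lines of plain Python (no external libraries):

```python
# Generating N-grams by sliding a window of size n over the token list.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Natural Language Processing is fun".split()
print(ngrams(tokens, 1))  # 5 unigrams
print(ngrams(tokens, 2))  # 4 bigrams, e.g., ('Natural', 'Language')
print(ngrams(tokens, 3))  # 3 trigrams, e.g., ('Natural', 'Language', 'Processing')
```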
Applications of N-grams
● Language modeling and next-word prediction (e.g., autocomplete)
● Spelling correction and text classification
Advantages of N-grams
● Simple to compute and easy to interpret
● Capture useful local word-order information
Disadvantages of N-grams
● Lack of Long-Range Context: N-grams capture only local context and fail to
understand distant dependencies in sentences.
● Fixed Context Window: The context window is fixed to size N, which may not reflect
true linguistic dependencies.
● High Memory Usage: Large N-grams require significant storage and processing
power.
Smoothing Techniques
To handle data sparsity and unseen N-grams, smoothing methods are applied:
● Add-One (Laplace) Smoothing: Adds one count to every N-gram to avoid zero
probabilities.
Why is Smoothing Needed?
● When building statistical models like N-gram language models, the model estimates
the probability of word sequences based on their frequency counts in a training
corpus.
● However, many valid word sequences may not appear in the training data,
especially as the length of N-grams increases.
● Without smoothing, such unseen sequences get a probability of zero, which causes
the model to fail when encountering new data.
Objective of Smoothing
● Ensure all possible N-grams have a non-zero probability, thus making the model
capable of handling novel inputs.
Common Smoothing Techniques
1. Add-One (Laplace) Smoothing
● The simplest technique: adds one to the count of every N-gram (including unseen ones).
● Formula (unigram case):
P(w) = (count(w) + 1) / (N + V), where N is the total number of observed tokens and V is the vocabulary size.
Example
Consider a toy corpus with the following word counts:
Word    Frequency
"dog"   3
"barks" 2
"loud"  1
Let’s calculate the smoothed unigram probability of each word using Laplace Smoothing, and also compute the probability of a new unseen word: "cat".
Step 1: Totals
● N = 3 + 2 + 1 = 6 observed tokens; V = 4 word types (including "cat").
Step 2: Smoothed Probabilities
● P("dog") = (3 + 1) / (6 + 4) = 0.4
● P("barks") = (2 + 1) / (6 + 4) = 0.3
● P("loud") = (1 + 1) / (6 + 4) = 0.2
● P("cat") = (0 + 1) / (6 + 4) = 0.1
Step 3: Interpretation
● Even though "cat" was not observed in the corpus, Laplace Smoothing assigns it
a non-zero probability of 0.1.
● The smoothing redistributed probability mass across both seen and unseen words.
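The same calculation as a small Python sketch, using the counts from the table above:

```python
# Laplace (add-one) smoothed unigram probabilities for the toy corpus above.
counts = {"dog": 3, "barks": 2, "loud": 1}
vocab = set(counts) | {"cat"}     # V = 4 word types, including unseen "cat"
N = sum(counts.values())          # N = 6 observed tokens

def laplace_prob(word):
    # P(w) = (count(w) + 1) / (N + V)
    return (counts.get(word, 0) + 1) / (N + len(vocab))

for word in ["dog", "barks", "loud", "cat"]:
    print(word, laplace_prob(word))   # 0.4, 0.3, 0.2, 0.1
```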
2. Add-k Smoothing
● A generalization of add-one smoothing that adds a smaller fractional count k (e.g., 0.5) instead of 1: P(w) = (count(w) + k) / (N + kV).
3. Good-Turing Smoothing
● Re-estimates the probability of unseen events using the frequency of events seen only once.
4. Kneser-Ney Smoothing
● Uses discounting to reduce the probability of seen N-grams and reallocates that
probability to unseen N-grams, taking into account the diversity of contexts in which
words appear.
● Particularly good at handling rare words and capturing the probability of novel word
combinations.
Advantages of Smoothing
● Prevents zero probabilities, thus enabling the language model to assign a probability to any input sequence.
Applications of Smoothing
● Language modeling, speech recognition, machine translation, and spelling correction.
Phases of the NLP Pipeline
1. Data Acquisition
The first step in any NLP pipeline is collecting relevant textual data. This data can come from multiple sources such as:
● Web scraping and public APIs
● Open datasets and text corpora
● Internal documents, logs, and user-generated content
The quality and quantity of data acquired significantly influence the performance of the NLP
system.
2. Text Cleaning
Raw text is often noisy and unstructured. Cleaning is done to remove inconsistencies and
irrelevant elements, such as:
● Punctuation marks
● HTML tags
● Special characters
● Extra spaces
Cleaning ensures that the text is normalized and ready for consistent processing.
3. Text Preprocessing
This step breaks down and simplifies the cleaned text to make it machine-readable. Key
tasks include:
● Tokenization: Splitting text into words or sentences.
● Lowercasing: Normalizing letter case.
● Stopword Removal: Eliminating common, less informative words (e.g., the, is, a).
● Stemming/Lemmatization: Reducing words to their base forms.
Preprocessing transforms text into a standard form suitable for feature extraction.
4. Feature Engineering
In this step, text data is converted into numerical representations that can be fed into machine learning models. Common methods include:
● Bag of Words (BoW)
● TF-IDF
● Word embeddings (e.g., Word2Vec, GloVe) and contextual embeddings
This step is critical for capturing semantic and syntactic relationships in the text.
5. Model Building
Once features are extracted, suitable models are selected and trained for specific NLP tasks. Depending on the goal, different algorithms are used (see the sketch below):
● Classical machine learning models (e.g., Naive Bayes, SVM, Logistic Regression)
● Deep learning models (e.g., RNNs, LSTMs, Transformers)
The choice of model depends on the complexity of the task, data size, and expected
performance.
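A minimal sketch of this step using scikit-learn, chaining TF-IDF feature extraction with a Naive Bayes classifier; the four training texts and their labels are toy data for illustration:

```python
# Model building: TF-IDF features + Naive Bayes in a scikit-learn Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at 10 am",
         "free offer click here", "project update attached"]   # toy data
labels = ["spam", "ham", "spam", "ham"]

model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))  # likely ['spam']
```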
6. Evaluation
After model training, evaluation is done to assess the model’s accuracy and generalization. Standard metrics include:
● Accuracy, Precision, Recall, and F1-score
● Task-specific metrics such as BLEU for translation or perplexity for language models
7. Deployment
Once the model is finalized, it is deployed into production environments where it can serve real users. Deployment involves:
● Exposing the model through an API or application interface
● Integrating it with existing systems and scaling to handle real traffic
Deployment ensures that the model delivers NLP capabilities to end-users in real time.
8. Monitoring and Model Updating
The final step involves continuous monitoring of the deployed model's performance in a real-world setting. This includes:
● Tracking prediction quality and detecting data drift
● Collecting user feedback
● Retraining or updating the model as language and usage evolve
Monitoring ensures the NLP system remains robust, reliable, and up to date.