
Chapter – 1

Introduction To Natural Language Processing

# Topic – 01 :- Introduction To NLP


➢ Definition :- Natural Language Processing (NLP) is a subfield of
Artificial Intelligence (AI) that focuses on enabling computers to
understand, interpret, generate, and respond to human language in a
meaningful way. It integrates computational linguistics, machine
learning, and deep learning techniques to process and analyze natural
language data.

➢ Applications of NLP :-
1. Text Processing: Tokenization, stopword removal, stemming,
lemmatization.
2. Speech Recognition: Converts spoken language into text (e.g., Siri,
Google Assistant).
3. Machine Translation: Automated language translation (e.g., Google
Translate).
4. Sentiment Analysis: Determines the emotional tone of text (e.g.,
product reviews, social media monitoring).
5. Chatbots and Virtual Assistants: AI-driven conversational agents
(e.g., ChatGPT, Alexa).
6. Information Retrieval: Search engines and document indexing
(e.g., Google Search).
➢ Challenges Of NLP :-
1. Ambiguity: Words and sentences may have multiple
interpretations.
2. Context Understanding: Difficulty in grasping contextual and
idiomatic expressions.
3. Resource Scarcity: Limited labeled datasets for
underrepresented languages.
4. Scalability: Handling large-scale real-time processing
efficiently.

➢ Components of NLP :- NLP consists of two major components:


1. Natural Language Understanding (NLU) :- Natural Language
Understanding (NLU) is the component of NLP that enables
machines to comprehend and interpret human language by
extracting meaning from text or speech.
Key Tasks in NLU
a) Lexical Semantics: Understanding the meaning of words and
phrases.
b) Syntactic Parsing: Analyzing sentence structure (POS
tagging, dependency parsing).
c) Semantic Analysis: Identifying relationships between words
and extracting meaning.
d) Named Entity Recognition (NER): Identifying entities like
people, places, and organizations.
e) Coreference Resolution: Linking pronouns and noun phrases
to their references.
f) Intent Recognition: Identifying user intent in chatbot and
AI assistant applications.

Example of NLU in Action

Input: "Book a flight from New York to London tomorrow."

• Intent: Flight Booking
• Entities:
o Departure: "New York"
o Destination: "London"
o Date: "tomorrow"

2. Natural Language Generation (NLG) :- Natural Language Generation (NLG) is the process of converting structured data or machine-understood representations into human-readable text or speech.
Key Steps in NLG
a) Content Determination: Selecting relevant information to
convey.
b) Text Structuring: Organizing information into a coherent
structure.
c) Sentence Planning: Constructing grammatically correct
sentences.
d) Lexical Selection: Choosing appropriate words and phrases.
e) Linguistic Realization: Generating final text output.

Example of NLG in Action

Input Data: Temperature: 25°C, Weather: Sunny


Output: "The weather is sunny today with a temperature of
25°C." Relation Between NLU and NLG
• NLU helps the machine understand human input.
• NLG helps the machine generate meaningful responses.
• Both components work together in AI-driven
applications like chatbots, virtual assistants, and
automated reporting.

For example, in a chatbot:

• NLU processes the user's query → "What's the weather like today?"
• NLG generates a response → "It's sunny with a temperature of 25°C."
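
A toy sketch of this NLU → NLG loop in Python. The keyword-based intent matcher and the hard-coded weather data are illustrative stand-ins, not a real NLU model or weather service:

```python
# A toy NLU -> NLG pipeline (illustrative stand-ins only).
def understand(query: str) -> str:
    """NLU step: map the user's text to an intent (keyword-based toy)."""
    return "get_weather" if "weather" in query.lower() else "unknown"

def generate(intent: str, data: dict) -> str:
    """NLG step: realize structured data as a sentence via a template."""
    if intent == "get_weather":
        return f"It's {data['condition']} with a temperature of {data['temp_c']}°C."
    return "Sorry, I didn't understand that."

weather = {"condition": "sunny", "temp_c": 25}  # stand-in for a weather lookup
print(generate(understand("What's the weather like today?"), weather))
# -> It's sunny with a temperature of 25°C.
```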

# Topic – 02 :- Stages Of NLP


Natural Language Processing (NLP) follows a structured pipeline consisting of various stages. These stages transform raw text into meaningful insights that machines can process and understand. The key stages of NLP are:

1. Lexical Analysis (Tokenization & Morphological Processing)

Objective: Break down text into individual words, phrases, or tokens.

Key Processes:

• Tokenization: Splitting text into words or sentences.
o Example: "The cat sat on the mat." → ["The", "cat", "sat", "on", "the", "mat", "."]
• Stemming: Reducing words to their root form.
o Example: "running" → "run"
• Lemmatization: Mapping words to their dictionary form.
o Example: "better" → "good"
• Stopword Removal: Removing common words like "is," "the," and "in."

2. Syntactic Analysis (Parsing)

Objective: Analyze sentence structure and grammar to identify relationships between words.

Key Processes:

• Part-of-Speech (POS) Tagging: Assigns categories like noun, verb, adjective to words.
o Example: "The cat sat." → [("The", DET), ("cat", NOUN), ("sat", VERB)]
• Dependency Parsing: Identifies relationships between words in a sentence (see the sketch after this list).
o Example: "The boy plays football." → "boy" → subject of "plays", "football" → object of "plays".
• Constituency Parsing: Breaks down a sentence into sub-phrases using a parse tree.
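
As a concrete illustration of dependency parsing, here is a minimal sketch with spaCy, assuming spaCy and its small English model are installed (python -m spacy download en_core_web_sm):

```python
# Dependency parsing with spaCy (assumes en_core_web_sm is downloaded).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The boy plays football.")

for token in doc:
    # token.dep_ is the dependency label; token.head is the word it attaches to.
    # "boy" is typically nsubj (subject) of "plays"; "football" is dobj (object).
    print(f"{token.text:10} {token.dep_:8} head: {token.head.text}")
```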

3. Semantic Analysis

Objective: Extract meaning from text by understanding relationships between words and phrases.

Key Processes:

• Word Sense Disambiguation (WSD): Determines the correct meaning of words based on context (see the sketch after this list).
o Example: "The bank is on the river" (bank = riverbank, not financial institution).
• Named Entity Recognition (NER): Identifies proper nouns like names, places, dates, and organizations.
o Example: "Apple Inc. was founded in 1976 by Steve Jobs." → ["Apple Inc." (ORG), "1976" (DATE), "Steve Jobs" (PERSON)]
• Semantic Role Labeling (SRL): Identifies who did what to whom, when, and where in a sentence.
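
A minimal WSD sketch using the simplified Lesk algorithm that ships with NLTK, assuming nltk plus its punkt and wordnet data are installed. Lesk is a gloss-overlap heuristic, so it will not always select the intended sense:

```python
# Word sense disambiguation with NLTK's simplified Lesk algorithm.
from nltk import word_tokenize
from nltk.wsd import lesk

context = word_tokenize("The bank is on the river")
sense = lesk(context, "bank", "n")  # consider only noun senses of "bank"

# Prints the chosen WordNet synset and its gloss. Lesk is a heuristic,
# so the selected sense may not always be the riverbank one.
print(sense, "->", sense.definition())
```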

4. Discourse Integration

Objective: Understand the relationships between sentences in a document or conversation.

Key Processes:

• Coreference Resolution: Identifies when different words refer to the same entity.
o Example: "John went to the store. He bought milk." (He = John).
• Discourse Analysis: Examines the flow and coherence of multiple sentences.
• Intent Recognition: Determines user intent in chatbots and AI assistants.

5. Pragmatic Analysis

Objective: Interpret language based on context, speaker intent, and real-world knowledge.

Key Processes:

• Context Understanding: Interpreting meaning based on external factors.
o Example: "Can you open the window?" (A request, not a literal question about ability).
• Speech Act Theory: Understanding actions performed through language (e.g., requests, commands).
• Deixis Resolution: Identifying references like "this," "that," "here," and "there."
• Implicature Handling: Understanding hidden meanings (e.g., sarcasm, indirect speech).
# Topic – 03 :- Steps In NLP
NLP involves a series of processing steps to analyze and interpret human
language. Below are the key steps involved:

1. Tokenization

Objective: Split text into individual words or sentences for further processing.

Types of Tokenization:

• Word Tokenization: Splits text into words.
o Example: "The cat sat on the mat." → ["The", "cat", "sat", "on", "the", "mat", "."]
• Sentence Tokenization: Splits text into sentences.
o Example: "I love NLP. It's amazing!" → ["I love NLP.", "It's amazing!"]

Importance:

• Enables further processing like POS tagging and parsing.
• Helps in understanding sentence structure.
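
A minimal tokenization sketch with NLTK, assuming nltk and its punkt tokenizer data are installed (nltk.download('punkt')):

```python
# Word and sentence tokenization with NLTK.
from nltk import sent_tokenize, word_tokenize

print(sent_tokenize("I love NLP. It's amazing!"))
# ["I love NLP.", "It's amazing!"]

print(word_tokenize("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```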

2. Lemmatization & Stemming

Objective: Reduce words to their root or base form.

Stemming:

• Removes affixes (prefixes/suffixes) to obtain the root form.

• Uses heuristic rules, sometimes leading to incorrect words.

• Example:
o "running" → "run"

o "studies" → "studi" (incorrect root)

Lemmatization:

• Maps words to their dictionary (lemma) form using linguistic rules.
• Example:
o "running" → "run"
o "better" → "good"

Comparison:

Feature    Stemming              Lemmatization
Approach   Rule-based            Dictionary-based
Accuracy   Lower                 Higher
Example    "studies" → "studi"   "studies" → "study"

Importance:

• Helps in reducing word variations.

• Improves accuracy in text analysis.
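
A minimal sketch contrasting the two with NLTK, assuming nltk and its wordnet data are installed:

```python
# Stemming vs. lemmatization with NLTK.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # "studi" (heuristic, not a real word)
print(lemmatizer.lemmatize("studies"))  # "study" (dictionary form)

# The lemmatizer needs a part of speech for irregular forms;
# pos="a" marks "better" as an adjective, so it maps to "good".
print(lemmatizer.lemmatize("better", pos="a"))  # "good"
```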

3. Part-of-Speech (POS) Tagging

Objective: Assign grammatical categories (noun, verb, adjective, etc.) to words.

Example:

• "The quick brown fox jumps over the lazy dog."
o The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog/NOUN ./.

Importance:

• Helps in understanding sentence structure.

• Useful for parsing and dependency analysis.
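
A minimal sketch with NLTK, assuming nltk plus its punkt and averaged_perceptron_tagger data. Note that nltk.pos_tag returns Penn Treebank tags (DT, JJ, VBZ, ...) rather than the coarse DET/ADJ/VERB labels used above:

```python
# POS tagging with NLTK (Penn Treebank tagset).
import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(tokens))
# Typical output:
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#  ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
#  ('dog', 'NN'), ('.', '.')]
```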

4. Named Entity Recognition (NER)

Objective: Identify and classify named entities in text (people, places, organizations, dates, etc.).

Example:

• "Apple Inc. was founded in 1976 by Steve Jobs in California."
o Apple Inc. → ORG (Organization)
o 1976 → DATE
o Steve Jobs → PERSON
o California → GPE (Geopolitical Entity)

Types of Named Entities:

Category        Example
PERSON          "Elon Musk", "Albert Einstein"
ORG             "Google", "NASA"
GPE (Location)  "India", "New York"
DATE            "2024", "January 1st"
MONEY           "$100", "₹5000"
Importance:

• Extracts key information from text.

• Used in search engines, chatbots, and AI applications.
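
A minimal NER sketch with spaCy, assuming spaCy and its small English model are installed (python -m spacy download en_core_web_sm); it emits the same ORG/DATE/PERSON/GPE labels as the example above:

```python
# Named entity recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded in 1976 by Steve Jobs in California.")

for ent in doc.ents:
    # ent.label_ is the entity type: ORG, DATE, PERSON, GPE, ...
    print(ent.text, "->", ent.label_)
```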

5. Checking (Grammar & Spelling Correction)

Objective: Detect and correct spelling and grammatical errors in text.

Types of Checks:

• Spelling Correction: Fixes misspelled words.

o Example: "I am lerning NLP" → "I am learning NLP"

• Grammar Checking: Ensures correct sentence structure.
o Example: "He go to school" → "He goes to school"
• Context-Based Corrections: Identifies homophones and word misuses.
o Example: "Their going to the park" → "They're going to the park"

Techniques Used:

• Rule-Based Systems: Uses predefined grammar rules.

• Machine Learning Models: Uses AI to detect and correct


mistakes.

Importance:

• Enhances text readability and accuracy.


• Used in applications like Microsoft Word, Grammarly, and Google Docs.

# Topic – 04 :- Argmax Computation and Multiclass Classification Problem

The argmax function in NLP is used to determine the index or label with
the highest probability from a set of possible choices. It is commonly used
in classification tasks, where the model predicts the most probable category
for an input.

Example in NLP:

Consider a text classification task where a model predicts the category of a sentence among three classes: Sports, Politics, and Technology. The model outputs the following probability scores:

Class        Probability
Sports       0.2
Politics     0.7
Technology   0.1

Applying the argmax function:

argmax([0.2,0.7,0.1]) = 1

Since Politics (index 1) has the highest probability (0.7), the model
predicts this category.
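
The same selection can be written in one line with NumPy (any array library with an argmax would do):

```python
# Selecting the most probable class with argmax.
import numpy as np

classes = ["Sports", "Politics", "Technology"]
probs = np.array([0.2, 0.7, 0.1])  # model scores for one input

pred = int(np.argmax(probs))  # index of the highest probability -> 1
print(classes[pred])          # -> "Politics"
```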

Where is Argmax Used in NLP?

• Text Classification: Predicting the category of text.
• Named Entity Recognition (NER): Assigning entity labels like PERSON, ORG, DATE.
• Part-of-Speech (POS) Tagging: Selecting the most likely POS tag for a word.
• Machine Translation: Choosing the best word at each step in sequence decoding.
• Speech Recognition: Selecting the most probable word sequence from audio.

Multiclass Classification Problem in NLP

A multiclass classification problem in NLP is where a model assigns an input (text, sentence, document) to one of several possible categories. Unlike binary classification, which has only two classes (e.g., spam vs. not spam), multiclass classification has more than two classes.

Examples in NLP:

1. Sentiment Analysis:
o Classifying a review as Positive, Negative, or Neutral.
2. Topic Classification:
o Categorizing news articles into Politics, Sports, Business, Entertainment, etc.
3. Intent Recognition in Chatbots:
o Identifying user intent such as "Book a Flight," "Check Weather," "Order Food".
4. Language Identification:
o Detecting whether a sentence is in English, French, Spanish, etc.

Example:

Consider a text classification model predicting the topic of an article. The model outputs probabilities:

Class        Probability
Sports       0.3
Politics     0.5
Technology   0.2

Applying argmax:

argmax([0.3,0.5,0.2]) = 1

The model predicts Politics as the category.

Evaluation Metrics for Multiclass Classification:

• Accuracy: Measures the percentage of correctly classified examples.

• Precision, Recall, and F1-Score: Evaluates performance per class.

• Confusion Matrix: Shows actual vs. predicted classes.
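
These metrics can be computed directly with scikit-learn; a minimal sketch follows (the true/predicted labels below are made-up stand-ins, assuming scikit-learn is installed):

```python
# Multiclass evaluation with scikit-learn (labels below are made up).
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = ["Politics", "Sports", "Politics", "Tech", "Sports", "Tech"]
y_pred = ["Politics", "Sports", "Tech", "Tech", "Sports", "Politics"]

print(accuracy_score(y_true, y_pred))        # fraction of correct predictions
print(confusion_matrix(y_true, y_pred))      # actual vs. predicted counts
print(classification_report(y_true, y_pred)) # per-class precision/recall/F1
```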

# Topic – 05 :- Term Weighting


Term weighting is a technique in Natural Language Processing (NLP) used
to assign importance (weights) to words or terms in a document. It helps
in text classification, information retrieval, and document similarity by
emphasizing important words while reducing the influence of less significant
ones.

Two common approaches are:

1. Bag of Words (BoW)

2. Term Frequency-Inverse Document Frequency (TF-IDF)

➢ Bag Of Words Approach :- Bag of Words (BoW) is a simple representation of text where a document is converted into a vector of word counts without considering order or grammar.
• A vocabulary is created from all unique words across documents.
• Each document is represented as a vector, with values indicating word frequencies.

Consider two documents:

• Doc 1: "NLP is amazing"

• Doc 2: "NLP and AI are related"

Vocabulary: ["NLP", "is", "amazing", "and", "AI", "are", "related"]

Term      Doc 1  Doc 2
NLP       1      1
is        1      0
amazing   1      0
and       0      1
AI        0      1
are       0      1
related   0      1

Each document is now represented as a numerical vector:

• Doc 1 → [1, 1, 1, 0, 0, 0, 0]
• Doc 2 → [1, 0, 0, 1, 1, 1, 1]
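
A minimal pure-Python sketch that reproduces the vectors above (no library needed; scikit-learn's CountVectorizer automates the same idea):

```python
# Bag-of-Words vectors built by hand, reproducing the table above.
docs = ["NLP is amazing", "NLP and AI are related"]

# Vocabulary: unique words in order of first appearance.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# Each document becomes a vector of word counts over the vocabulary.
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]
print(vocab)    # ['NLP', 'is', 'amazing', 'and', 'AI', 'are', 'related']
print(vectors)  # [[1, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 1, 1, 1]]
```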

Advantages of BoW:

✅ Simple and easy to implement
✅ Effective for text classification and spam filtering
✅ Works well with traditional Machine Learning models

Limitations of BoW:

❌ Ignores word order and context
❌ Gives equal importance to all words (even common ones like "is" and "the")
❌ Produces large, sparse matrices

➢ Term Frequency-Inverse Document Frequency (TF-IDF) :-

TF-IDF improves BoW by reducing the weight of common words and increasing the weight of rare but important words across documents.

Formula:

1. Term Frequency (TF):
Measures how often a word appears in a document.
TF(w) = (Number of times word w appears in the document) / (Total number of words in the document)
2. Inverse Document Frequency (IDF):
Gives lower importance to common words appearing in many documents.
IDF(w) = log(Total number of documents / Number of documents containing word w)
(The worked example below uses a base-10 logarithm.)
3. TF-IDF Score:
TF-IDF(w) = TF(w) × IDF(w)
Example:
Consider three documents:
1. Doc 1: "NLP is amazing"
2. Doc 2: "NLP and AI are related"
3. Doc 3: "NLP is used in AI"

Step 1: Calculate Term Frequency (TF)

Term      Doc 1 (TF)  Doc 2 (TF)  Doc 3 (TF)
NLP       1/3         1/5         1/5
is        1/3         0           1/5
amazing   1/3         0           0
and       0           1/5         0
AI        0           1/5         1/5

Step 2: Calculate IDF

Term      No. of Docs Containing Term  IDF Calculation  IDF Score
NLP       3                            log(3/3)         0
is        2                            log(3/2)         0.18
amazing   1                            log(3/1)         0.48
AI        2                            log(3/2)         0.18
Step 3: Compute TF-IDF Score

For Doc 1:

Term      TF Score  IDF Score  TF-IDF Score
NLP       1/3       0          0
is        1/3       0.18       0.06
amazing   1/3       0.48       0.16

Thus, "amazing" gets the highest importance since it appears in only one document.
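
A minimal pure-Python sketch of this computation. The worked example uses a base-10 logarithm (log(3/2) ≈ 0.18), so math.log10 is used here; libraries such as scikit-learn use a smoothed natural-log IDF, so their scores differ:

```python
# TF-IDF by hand, matching the worked example (base-10 log for IDF).
import math

docs = [
    "NLP is amazing".split(),
    "NLP and AI are related".split(),
    "NLP is used in AI".split(),
]

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word, docs):
    df = sum(1 for doc in docs if word in doc)  # document frequency
    return math.log10(len(docs) / df)

for word in ["NLP", "is", "amazing"]:
    score = tf(word, docs[0]) * idf(word, docs)
    print(f"{word:8} TF-IDF in Doc 1 = {score:.2f}")
# NLP 0.00, is 0.06, amazing 0.16 -- matching the table above.
```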

Advantages of TF-IDF:
✅ Reduces the importance of common words
✅ Highlights rare but meaningful words
✅ Improves information retrieval (Google, search engines)

Limitations of TF-IDF:
❌ Does not capture word meaning (e.g., synonyms "big" and "large"
are treated separately)
❌ Fails to consider word order and semantics

Comparison: BoW vs. TF-IDF

• Weighting: BoW is based on term frequency (TF), the frequency of a word in a document. TF-IDF is based on TF × IDF, the frequency of a word adjusted by how rare it is in the corpus.
• Consideration of Common Words: BoW treats all words equally, including very common ones. TF-IDF down-weights common words (like "the," "is") that appear frequently across all documents.
• Focus: BoW captures the frequency of terms in a specific document. TF-IDF captures the importance of terms relative to the document and the corpus.
• Feature Representation: BoW creates a vector of term frequencies, often sparse. TF-IDF creates a weighted vector based on both term frequency and inverse document frequency, often resulting in more meaningful feature vectors.
• Context and Meaning: BoW ignores context and word order. TF-IDF implicitly considers context through IDF, which adjusts term importance.
• Effectiveness for Rare Terms: BoW gives rare terms no special handling; they are just counted. TF-IDF gives higher weight to rare terms that appear in fewer documents.
• Computational Complexity: BoW is simpler and faster to compute. TF-IDF is more computationally intensive due to the need to calculate IDF over the whole corpus.
• Use Case: BoW is useful for simple tasks or when context is not as important. TF-IDF is useful for tasks that require distinguishing important terms, such as document classification or information retrieval.

# Topic – 06 :- Syntactic Collocations

Syntactic collocation in Natural Language Processing (NLP) refers to the tendency of certain words to appear together based on syntactic structures or patterns. These word pairs or groups are often more likely to co-occur in specific syntactic configurations (like noun + adjective, verb + noun, or preposition + noun) than others. In other words, syntactic collocations are based on the grammatical relationships and structures between words that make them more likely to be found in certain sequences within sentences.

For example:

1. Verb + Noun Collocation: "Make a decision" (instead of "do a decision").
2. Adjective + Noun Collocation: "Strong coffee" (instead of "powerful coffee").
3. Preposition + Noun Collocation: "In time" (instead of "on time").

In NLP, syntactic collocations are important because they help improve the accuracy of tasks such as:

• Parsing: Identifying the grammatical structure of sentences.
• Machine Translation: Translating phrases that maintain the correct syntactic relationships between words in different languages.
• Text Generation: Creating grammatically correct sentences by ensuring syntactic collocations are respected.
• Information Retrieval: Understanding the syntactic relationships in queries to return the most relevant documents.

Syntactic collocation is different from semantic collocation, which focuses on words that frequently appear together because they share a similar meaning (e.g., "fast car," where "fast" and "car" are related semantically). Syntactic collocations depend more on the grammatical patterns that govern how words combine.
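
Collocation candidates are typically surfaced statistically. A minimal sketch with NLTK's PMI-based bigram finder follows, assuming nltk and its punkt data are installed; the two-sentence corpus is a toy stand-in. Statistical association only proposes candidates; the grammatical criteria discussed next decide whether a pair is a true collocation:

```python
# Surfacing bigram collocation candidates with NLTK's PMI measure.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("She made a decision to drink strong coffee. "
        "He made a decision to arrive in time.")
tokens = nltk.word_tokenize(text.lower())

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep bigrams seen at least twice

# Rank the remaining bigrams by pointwise mutual information (PMI);
# here ("made", "a"), ("a", "decision"), ("decision", "to") survive.
print(finder.nbest(BigramAssocMeasures().pmi, 3))
```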

The terms non-compositionality, non-substitutability, and non-modifiability are important characteristics often associated with syntactic collocations in Natural Language Processing (NLP). These criteria help define why certain word combinations are treated as "collocations" rather than just ordinary word pairs. Let's explore each of these in detail:

1. Non-compositionality:

• Definition: A collocation is said to be non-compositional when the meaning of the entire phrase cannot be derived from the meanings of the individual words. In other words, the meaning of the whole expression is not simply the sum of the meanings of the individual words in the combination.
• Example: "Kick the bucket" is a common idiomatic expression (a syntactic collocation) that means "to die." The meaning of this phrase cannot be derived by just looking at the meanings of "kick" and "bucket," because it has a figurative meaning.
• Significance in NLP: This is a crucial criterion because syntactic collocations often involve combinations of words that carry meanings that go beyond the literal interpretation. Recognizing such expressions helps NLP systems distinguish between standard word pairs and phrases that may require specialized handling (such as idioms or metaphors).

2. Non-substitutability:

• Definition: Non-substitutability refers to the fact that the words in a collocation cannot be freely replaced with synonyms or other words without affecting the naturalness or grammaticality of the phrase.
• Example: In the collocation "make a decision," the verb "make" is typically used, and replacing it with a synonym like "do" (e.g., "do a decision") sounds awkward or incorrect.
• Significance in NLP: This characteristic is vital for NLP systems because it helps them recognize that certain word combinations are fixed in form. Even if words have similar meanings, they may not always be interchangeable in syntactic collocations due to specific syntactic or idiomatic constraints.

3. Non-modifiability:

• Definition: Non-modifiability refers to the fact that collocations often resist modification by adding other words (especially in a way that would alter their natural syntactic structure). In other words, the collocating words cannot be easily modified or expanded without disrupting their naturalness.
• Example: The phrase "strong coffee" is a syntactic collocation. While you might say "very strong coffee" or "black strong coffee," the core adjective "strong" typically cannot be replaced by just any adjective, like "powerful" (e.g., "powerful coffee" is less natural or uncommon).
• Significance in NLP: This criterion helps identify collocations that have a rigid syntactic structure, and this is essential for tasks such as machine translation or language generation, where the preservation of collocations is important. Collocations often follow specific rules about what can or cannot modify them.
