NLP Notes
SHRDLU
LUNAR
5. Language Translation
Issue: NLP often requires access to large amounts of text data, which can
include sensitive or personal information.
Impact: Potential privacy violations or security risks, especially if data is
mishandled or improperly anonymized.
Phases in NLP
1. Tokenization
o Definition: The process of splitting text into smaller units, called
tokens, which can be words, phrases, or symbols.
o Example: In the sentence "The cat sat on the mat," tokenization
would result in the tokens: ["The", "cat", "sat", "on", "the", "mat"].
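The tokenization step above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: it assumes a token is a run of word characters and that punctuation is dropped, as in the example. The function names tokenize and tokenize_keep_punct are hypothetical.

```python
import re

def tokenize(text):
    """Split text into word tokens, discarding punctuation and white space."""
    return re.findall(r"\w+", text)

def tokenize_keep_punct(text):
    """Variant that keeps each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The cat sat on the mat,"))
# ['The', 'cat', 'sat', 'on', 'the', 'mat']
print(tokenize_keep_punct("The cat sat on the mat,"))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', ',']
```

Whether punctuation is kept depends on the later phases: sentence boundary detection, for instance, needs the punctuation tokens.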
WORDS and Their Components:
Word components:
1) Tokens
2) Lexemes
3) Morphemes
4) Typology
Tokens:
1. Word tokens
2. Character tokens
3. Subword tokens
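The three token granularities can be compared on the same input. The sketch below is illustrative only: the subword splitter uses a simple greedy longest-match rule over a small hypothetical vocabulary (TOY_VOCAB), which is an assumption made for this example and not the algorithm of any particular tokenizer.

```python
def word_tokens(text):
    """Word tokens: split on white space."""
    return text.split()

def char_tokens(word):
    """Character tokens: every character is its own token."""
    return list(word)

def subword_tokens(word, vocab):
    """Subword tokens: greedily take the longest vocabulary piece that
    matches at the current position; fall back to a single character."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:          # no vocabulary piece matches here
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only to make the example work.
TOY_VOCAB = {"un", "happi", "ness", "token", "ization"}

print(word_tokens("the cat sat"))                # ['the', 'cat', 'sat']
print(char_tokens("cat"))                        # ['c', 'a', 't']
print(subword_tokens("unhappiness", TOY_VOCAB))  # ['un', 'happi', 'ness']
```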
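Morpheme:
Definition: The smallest meaningful unit of language, one that cannot be
divided further. Morphemes can be classified into two main types: free
morphemes, which can stand alone as words (e.g., "cat"), and bound
morphemes, which must attach to another morpheme (e.g., the plural "-s").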
Definition of Lexeme
Example of a Lexeme
Lexeme: "run"
Inflected Forms:
o "run" (base form)
o "runs" (third person singular present)
o "running" (present participle)
o "ran" (past tense)
All these forms are tokens of the same lexeme "run," but they differ in
grammatical context.
Lexeme: "happy"
Inflected Forms:
o "happy" (base form)
o "happier" (comparative)
o "happiest" (superlative)
o "unhappy" (derived form)
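The inflected forms above ("happy," "happier," "happiest") all belong to the
lexeme "happy," which represents the concept of being pleased or content.
"Unhappy" is formed by derivation (the prefix "un-") rather than inflection,
so it is usually treated as a separate but related lexeme.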
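The relation between tokens and lexemes can be sketched as a lookup table. This is a toy illustration built only from the inflected forms listed above; the names LEXEME_FORMS and lexeme_of are hypothetical, and derived forms such as "unhappy" are left out of the table. A real system would use a lemmatizer rather than a hand-written dictionary.

```python
# Each lexeme maps to the inflected forms listed in the notes.
LEXEME_FORMS = {
    "run": ["run", "runs", "running", "ran"],
    "happy": ["happy", "happier", "happiest"],
}

# Invert the table so that every surface form points to its lexeme.
FORM_TO_LEXEME = {
    form: lexeme
    for lexeme, forms in LEXEME_FORMS.items()
    for form in forms
}

def lexeme_of(token):
    """Return the lexeme for a token, or None if the form is unknown."""
    return FORM_TO_LEXEME.get(token.lower())

print(lexeme_of("Running"))   # run
print(lexeme_of("happiest"))  # happy
print(lexeme_of("mat"))       # None
```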
Semantic Analysis:
• Semantic analysis is concerned with the meaning representation.
• It mainly focuses on the literal meaning of words, phrases, and sentences.
• Named Entity Recognition (NER):
Example:
Pragmatic Analysis:
Understanding the underlying intention behind a statement (e.g., whether
it is a question, a command, or a request).
Example:
• Example (chunking):
• Consider the sentence: "The quick brown fox jumps over the lazy dog."
• Tokenization: ["The", "quick", "brown", "fox", "jumps", "over", "the",
"lazy", "dog"]
• POS Tagging: ["The/DT", "quick/JJ", "brown/JJ", "fox/NN",
"jumps/VBZ", "over/IN", "the/DT", "lazy/JJ", "dog/NN"]
• Chunking:
   o NP: [The/DT quick/JJ brown/JJ fox/NN]
   o VP: [jumps/VBZ]
   o PP: [over/IN the/DT lazy/JJ dog/NN]
• Here, the sentence is divided into three main chunks: a noun phrase, a
verb phrase, and a prepositional phrase. The tags follow the Penn Treebank
tagset, in which "jumps" (third-person singular present) is VBZ.
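The chunking step above can be sketched as a small rule-based chunker over already-tagged tokens. The rules are assumptions made for this illustration (NP = optional DT, any JJ, one or more NN; VP = a run of VB tags; PP = IN followed by an NP), and the function name chunk is hypothetical. It is far simpler than a trained or grammar-driven chunker.

```python
def chunk(tagged):
    """Group (word, tag) pairs into NP, VP and PP chunks with simple rules."""
    n = len(tagged)

    def match_np(start):
        """Return the end index of an NP (DT? JJ* NN+) beginning at start,
        or start itself if no NP begins there."""
        j = start
        if j < n and tagged[j][1] == "DT":
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        return k if k > j else start

    chunks = []
    i = 0
    while i < n:
        tag = tagged[i][1]
        if tag == "IN":                      # PP: preposition + noun phrase
            end = match_np(i + 1)
            if end > i + 1:
                chunks.append(("PP", tagged[i:end]))
                i = end
                continue
        end = match_np(i)                    # NP
        if end > i:
            chunks.append(("NP", tagged[i:end]))
            i = end
            continue
        if tag.startswith("VB"):             # VP: run of verb tags
            j = i
            while j < n and tagged[j][1].startswith("VB"):
                j += 1
            chunks.append(("VP", tagged[i:j]))
            i = j
            continue
        i += 1                               # token outside any chunk
    return chunks

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN")]
for label, words in chunk(tagged):
    print(label, [w for w, _ in words])
# NP ['The', 'quick', 'brown', 'fox']
# VP ['jumps']
# PP ['over', 'the', 'lazy', 'dog']
```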
Different approaches to segmentation for different languages:
Example
• Chinese documents have no white space between words, so word boundaries
cannot be read directly off the sentence. The text is segmented by
examining it character by character to decide where each word begins
and ends.
• English documents separate words with white space, so a sentence can be
segmented into words by splitting on the white space.
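The contrast between the two languages can be shown in a short sketch. The function names are hypothetical and the Chinese sentence is an illustrative input (it means "I like cats"). Note that splitting into characters only produces the candidate units: a real Chinese segmenter must still decide which adjacent characters form one word.

```python
def segment_english(sentence):
    """English: white space already marks the word boundaries."""
    return sentence.split()

def candidate_units_chinese(sentence):
    """Chinese: no white space, so start from individual characters.
    A segmenter must then group characters into words."""
    return [ch for ch in sentence if not ch.isspace()]

print(segment_english("I like cats"))
# ['I', 'like', 'cats']
print(candidate_units_chinese("我喜欢猫"))
# ['我', '喜', '欢', '猫']  (the word for "like" spans the two middle characters)
```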
The main aim of segmentation is to decide whether the boundary between
two tokens should be marked as a sentence boundary or a topic boundary.
Two types of segmentation
1) Sentence boundary detection
2) Topic boundary detection
Sentence boundary detection:
• It is also called sentence segmentation.
• It is the process of segmenting a sequence of words into sentence units.
• In written English, sentences can usually be identified by two cues: a
sentence begins with an uppercase letter (e.g., A, B) and ends with a
sentence-final punctuation mark (?, ., !).
• These cues are not reliable on their own: capital letters also begin
proper nouns and abbreviations in the middle of a sentence.
• Punctuation marks, such as the periods in abbreviations, also appear
inside sentences.
Example of Sentence Boundary Detection:
Consider the following text:
"Dr. Smith went to New York. He met with Mr. Johnson at 3:00 p.m.
They discussed the project, and then he returned to L.A."
Sentence Boundaries:
1. "Dr. Smith went to New York."
• "Dr." ends with a period, but it is not a sentence boundary because
it's an abbreviation. The actual boundary is after "York."
2. "He met with Mr. Johnson at 3:00 p.m."
• "Mr." ends with a period, but it is not a sentence boundary because
it's an abbreviation. The final period of "p.m." does double duty: it
closes the abbreviation and also ends the sentence, since the next
word ("They") starts a new sentence.
3. "They discussed the project, and then he returned to L.A."
• The periods in "L.A." belong to the abbreviation. The final period
also marks the sentence boundary, because it is the end of the text.
Example Breakdown:
1. Input Text:
• "Dr. Smith went to New York. He met with Mr. Johnson at 3:00
p.m. They discussed the project, and then he returned to L.A."
2. Detected Sentences:
• "Dr. Smith went to New York."
• "He met with Mr. Johnson at 3:00 p.m."
• "They discussed the project, and then he returned to L.A."
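The breakdown above can be reproduced with a small rule-based splitter. This is a heuristic sketch under stated assumptions, not a robust detector: a period, question mark, or exclamation mark is a candidate boundary when it is followed by white space and an uppercase letter (or by the end of the text), and the candidate is rejected when the word is a title abbreviation. The TITLES set and the function name split_sentences are hypothetical.

```python
import re

# Hypothetical list of title abbreviations that never end a sentence.
TITLES = {"dr.", "mr.", "mrs.", "ms.", "prof."}

# Candidate boundary: ., ? or ! followed by white space + uppercase letter,
# or by the end of the text.
CANDIDATE = re.compile(r"[.?!](?=\s+[A-Z]|\s*$)")

def split_sentences(text):
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in TITLES:
            continue                      # "Dr.", "Mr.": not a boundary
        sentences.append(text[start:end].strip())
        start = end
    rest = text[start:].strip()
    if rest:                              # trailing text without punctuation
        sentences.append(rest)
    return sentences

text = ("Dr. Smith went to New York. He met with Mr. Johnson at 3:00 p.m. "
        "They discussed the project, and then he returned to L.A.")
for sentence in split_sentences(text):
    print(sentence)
# Dr. Smith went to New York.
# He met with Mr. Johnson at 3:00 p.m.
# They discussed the project, and then he returned to L.A.
```

The heuristic still fails when an abbreviation such as "p.m." is followed by a capitalized proper noun in the same sentence, which is one reason machine learning methods are also used for this task.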
Topic boundary detection:
• It is also called discourse segmentation or text segmentation.
• Topic segmentation is the process of dividing text or speech into
homogeneous blocks, each covering a single topic.
Applications
1) Text extraction and retrieval
2) Text summarization.
Identification
Text segmentation clues:
• Headlines
• Paragraph breaks
• Example of Topic Boundary Detection:
• Consider the following passage:
"The global economy is facing significant challenges. Inflation rates
are rising, and many countries are struggling to recover from the effects
of the pandemic. Governments are implementing various measures to
stabilize their economies.

In the tech industry, innovation continues at a rapid pace. Companies are
investing heavily in artificial intelligence and machine learning. These
technologies are expected to revolutionize industries from healthcare to
finance."
• Detected Topic Boundaries:
• First Topic:
• "The global economy is facing significant challenges. Inflation
rates are rising, and many countries are struggling to recover
from the effects of the pandemic. Governments are
implementing various measures to stabilize their economies."
• This segment discusses economic challenges and government
responses.
• Topic Boundary:
• A boundary is detected before the sentence starting with "In the
tech industry," which signals a shift from the economy to
technology.
• Second Topic:
• "In the tech industry, innovation continues at a rapid pace.
Companies are investing heavily in artificial intelligence and
machine learning. These technologies are expected to
revolutionize industries from healthcare to finance."
• This segment discusses technological innovation and its impact.
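One simple way to approximate the detection above is to treat paragraph breaks as candidate boundaries and mark a topic boundary wherever adjacent paragraphs share almost no vocabulary. The sketch below uses Jaccard word overlap; this is a lexical-overlap heuristic chosen for illustration, not topic modeling or clustering. The stopword list, the threshold of 0.1, and all function names are assumptions.

```python
import re

# Hypothetical, deliberately tiny stopword list for this illustration.
STOPWORDS = {"the", "is", "are", "and", "to", "from", "of", "in", "at",
             "a", "an", "their", "these", "many"}

def content_words(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS}

def jaccard(a, b):
    """Overlap of two word sets: shared words / all words (0.0 if empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def topic_boundaries(paragraphs, threshold=0.1):
    """Return every index i where paragraph i starts a new topic, i.e.
    where its overlap with paragraph i - 1 falls below the threshold."""
    word_sets = [content_words(p) for p in paragraphs]
    return [i for i in range(1, len(word_sets))
            if jaccard(word_sets[i - 1], word_sets[i]) < threshold]

paragraphs = [
    "The global economy is facing significant challenges. Inflation rates "
    "are rising, and many countries are struggling to recover from the "
    "effects of the pandemic. Governments are implementing various "
    "measures to stabilize their economies.",
    "In the tech industry, innovation continues at a rapid pace. Companies "
    "are investing heavily in artificial intelligence and machine learning. "
    "These technologies are expected to revolutionize industries from "
    "healthcare to finance.",
]
print(topic_boundaries(paragraphs))  # [1]
```

The result [1] means the second paragraph opens a new topic, matching the boundary before "In the tech industry." Word overlap misses gradual topic shifts and paraphrases, which is where the semantic methods come in.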
Feature | Sentence Boundary Detection | Topic Boundary Detection
--------|-----------------------------|-------------------------
Definition | Identifies the boundaries between individual sentences in a text. | Identifies the boundaries where topics change within a document.
Primary Goal | To segment text into sentences for further processing. | To segment text into coherent sections based on topic changes.
Focus | Sentence-level segmentation. | Topic-level segmentation.
Techniques Used | Punctuation detection, capitalization rules, machine learning. | Topic modeling, semantic analysis, clustering algorithms.
Complexity | Relatively straightforward, with well-defined rules. | More complex, requires understanding of context and semantics.
Applications | Text preprocessing, speech recognition, machine translation. | Document summarization, content indexing, topic-based retrieval.
Example | Splitting "The cat sat. It was hungry." into two sentences. | Identifying the shift from a discussion on "climate change" to "economic policies" in a report.
Challenges | Handling abbreviations, ellipses, and ambiguous punctuation. | Detecting subtle or gradual topic transitions within a text.
Input Granularity | Works on paragraphs or streams of text to identify sentences. | Works on entire documents or long text to identify topic shifts.
Output | A list of sentences. | Segments or chunks of text, each representing a distinct topic.
Generative Sequence Classification Methods: