NLP Notes


NLP definition

• Natural Language Processing (NLP) is a branch of artificial intelligence
that deals with the interaction between computers and humans using
natural language.
• Natural language processing (NLP) is a branch of artificial intelligence
(AI) that enables computers to comprehend, generate, and manipulate
human language.
• NLP lies at the intersection of computer science, artificial intelligence,
and human language (linguistics).

Early NLP systems


1) ELIZA 2) SYSTRAN 3) TAUM-METEO
4) SHRDLU 5) LUNAR

SHRDLU

SHRDLU is a program written by Terry Winograd in 1968-70. It lets users
communicate with the computer in natural language and have it move objects
in a simulated blocks world. It can handle instructions such as "pick up the
green ball" and also answer questions like "What is inside the black box?"
The main importance of SHRDLU is that it shows that syntax, semantics, and
reasoning about the world can be combined to produce a system that
understands natural language.

LUNAR

LUNAR is the classic example of a natural language database interface system.
It used ATNs (augmented transition networks) and Woods' procedural semantics.
It was capable of translating elaborate natural language expressions into
database queries and handled 78% of requests without errors.

NLP Applications:
1. Question Answering system
2. Spam Detection
3. Sentiment Analysis
4. Machine Translation
5. Spelling Correction
6. Speech Recognition
7. Chatbot
8. Information Extraction

NLU & NLG in NLP:

• NLU (Natural Language Understanding)
• NLG (Natural Language Generation)

Natural Language Understanding (NLU) involves transforming human
language into a machine-readable format.
It helps the machine understand and analyse human language by extracting
information such as keywords, emotions, relations, and semantics from large
volumes of text.
It produces non-linguistic outputs from natural language inputs.

Natural Language Generation (NLG) acts as a translator that converts
computerized data into a natural language representation. It mainly involves
text planning, sentence planning, and text realization.
It produces natural language outputs from non-linguistic inputs.

NLU is generally considered harder than NLG.

Advantages of NLP

1. Enhanced Communication with Machines

• Advantage: NLP allows for more natural and intuitive interaction
between humans and machines, enabling users to communicate in their
own language rather than needing to learn specific commands or coding.
• Example: Voice assistants like Siri and Alexa can understand spoken
language and perform tasks based on user requests.

2. Automated Text Analysis

• Advantage: NLP can automatically process and analyse large volumes of
text data, extracting valuable insights and patterns without manual
intervention.
• Example: Analysing customer reviews to determine overall sentiment
towards a product or service.

3. Improved Customer Service

• Advantage: NLP-powered chatbots and virtual assistants can handle
customer inquiries around the clock, providing quick and consistent
responses without human intervention.
• Example: Automated customer support systems that can resolve common
issues or route complex queries to the appropriate human agents.

4. Efficient Information Retrieval

• Advantage: NLP enhances search engines and databases by allowing
them to understand the context and intent behind user queries, leading to
more accurate and relevant search results.
• Example: Google Search's ability to understand natural language queries
and provide answers that match the intent behind the search.

5. Language Translation

• Advantage: NLP enables accurate and real-time translation of text or
speech between different languages, breaking down language barriers and
facilitating global communication.
• Example: Google Translate's ability to translate web pages, documents,
and conversations in multiple languages.

Disadvantages of NLP

1. Ambiguity and Complexity of Language

• Issue: Human language is inherently complex and ambiguous, with
words having multiple meanings and sentences having various possible
interpretations. NLP systems often struggle to interpret these nuances
accurately.
• Impact: Misunderstandings in context, incorrect responses, or failure to
grasp the true intent behind a statement.

2. Data Privacy Concerns

• Issue: NLP often requires access to large amounts of text data, which can
include sensitive or personal information.
• Impact: Potential privacy violations or security risks, especially if data is
mishandled or improperly anonymized.

Phases in NLP

Lexical Analysis or Morphological Analysis

• The first phase of NLP is lexical analysis.
• This phase converts the given sentence into a stream of lexemes/morphemes.

Lexical analysis is a crucial step in Natural Language Processing (NLP) that
converts a sequence of characters (text) into a sequence of tokens. These
tokens are meaningful units that represent words, phrases, or symbols in the
text. Lexical analysis serves as the foundation for further processing in NLP
tasks. Here's a breakdown of the key aspects of lexical analysis:

1. Tokenization
o Definition: The process of splitting text into smaller units, called
tokens, which can be words, phrases, or symbols.
o Example: In the sentence "The cat sat on the mat," tokenization
would result in the tokens: ["The", "cat", "sat", "on", "the", "mat"].
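
The tokenization step can be sketched in a few lines of Python. This is a minimal illustration rather than a production tokenizer: the function name `tokenize` and the regular expression are assumptions made for this sketch, and, unlike the example above, it keeps each punctuation mark as a separate token.

```python
import re

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores (a word);
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```

Real tokenizers also have to deal with contractions, hyphenated words, numbers, and URLs, which this pattern does not attempt.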
WORDS and Their Components:

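A word is a meaningful unit of a sentence.

Word components:

1) Tokens 2) Lexemes

3) Morphemes 4) Typology

Tokens:

1. Word tokens
2. Character tokens
3. Subword tokens

Example: "I love reading newspapers." can be split into word tokens
(I, love, reading, newspapers), character tokens (I, l, o, v, e, ...), or
subword tokens (for example, read + ing).

• This process is called tokenization.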

Morpheme:

• Definition: The smallest meaningful unit of language that cannot be
further divided. Morphemes can be classified into two main types:

o Free Morphemes: Morphemes that can stand alone as words
(e.g., "book," "run").

o Bound Morphemes: Morphemes that cannot stand alone and must
be attached to other morphemes
(e.g., prefixes like "un-" in "unhappy," or suffixes like "-ing" in
"running").

Breaking a word into its morphemes is called morphological analysis.
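
As a rough illustration of splitting a word into free and bound morphemes, the sketch below strips known prefixes and suffixes around a root found in a small lexicon. The affix lists, the lexicon, and the function name `segment` are hypothetical and chosen only to cover the examples above; real morphological analysers use far richer rules and dictionaries.

```python
# Hypothetical toy tables; real systems use much larger resources.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ing", "ness", "ed", "ly", "s")
FREE_MORPHEMES = {"book", "run", "happy", "kind", "play"}

def segment(word):
    """Return the morphemes of word, e.g. ['un-', 'happy'] or ['run', '-ing']."""
    word = word.lower()
    if word in FREE_MORPHEMES:
        return [word]
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + SUFFIXES:
            if not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            candidates = [stem]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                candidates.append(stem[:-1])
            for root in candidates:
                if root in FREE_MORPHEMES:
                    parts = [prefix + "-"] if prefix else []
                    parts.append(root)
                    if suffix:
                        parts.append("-" + suffix)
                    return parts
    return [word]  # unknown word: leave it unsegmented

print(segment("unhappy"))   # ['un-', 'happy']
print(segment("running"))   # ['run', '-ing']
```

The bound morphemes are marked with a hyphen to show the side on which they attach to the root.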


Lexemes :
• They are the base or canonical form of words.
• Definition: A lexeme is a basic unit of meaning that represents a set of
word forms. It can be thought of as the "dictionary form" or "base form"
of a word.

Example: The base form of "running" and "run" is RUN; the base form of
"largest" and "larger" is LARGE.

Definition of Lexeme

• Lexeme: An abstract unit of meaning that represents a single word or
root, including all its inflections and derivations. It is not tied to a specific
grammatical form but rather denotes the underlying concept.

Example of a Lexeme

Consider the lexeme "run."

• Lexeme: "run"
• Inflected Forms:
o "run" (base form)
o "runs" (third person singular present)
o "running" (present participle)
o "ran" (past tense)

All these are word forms of the same lexeme "run," but they differ in
grammatical context.

Another Example: "happy"

• Lexeme: "happy"
• Related Forms:
o "happy" (base form)
o "happier" (comparative)
o "happiest" (superlative)
o "unhappy" (derived form)

Again, all these variations relate to the same underlying lexeme "happy," which
represents the concept of being pleased or content.

Reducing a word form to the base form of its lexeme is called lemmatization.
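
A toy lemmatizer makes the idea concrete: look up irregular forms in a table, otherwise strip an inflectional suffix and check whether the result is a known base form. The tables and the function name `lemmatize` below are hypothetical and cover only the examples in these notes; this is a sketch of the idea, not how any particular library does it.

```python
# Hypothetical toy resources covering only the examples in these notes.
IRREGULAR = {"ran": "run", "went": "go", "better": "good", "best": "good"}
LEMMAS = {"run", "large", "happy", "go", "good"}

def lemmatize(word):
    """Map an inflected word form to its base (dictionary) form if known."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in LEMMAS:
        return word
    for suffix in ("est", "ing", "er", "s"):
        if not word.endswith(suffix):
            continue
        stem = word[:-len(suffix)]
        candidates = [stem, stem + "e"]          # larg -> large
        if stem.endswith("i"):
            candidates.append(stem[:-1] + "y")   # happi -> happy
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.append(stem[:-1])         # runn -> run
        for candidate in candidates:
            if candidate in LEMMAS:
                return candidate
    return word  # unknown form: return unchanged

for form in ["runs", "running", "ran", "larger", "largest", "happiest"]:
    print(form, "->", lemmatize(form))
```

Derived forms such as "unhappy" are returned unchanged here, because the sketch only handles inflectional suffixes.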


Typology:

Syntactic Analysis (Parsing)–


• Syntactic analysis is used to check grammar and word arrangement, and to
show the relationships among the words.
• A sentence such as "The school goes to boy" is rejected by an English
syntactic analyser.
• "The boy goes to school" is accepted as a well-formed, meaningful sentence.

Semantic Analysis–
• Semantic analysis is concerned with meaning representation.
• It mainly focuses on the literal meaning of words, phrases, and sentences.
• Named Entity Recognition (NER): Identifying entities like people, places,
organizations, dates, etc., within the text.
o Input: "Barack Obama was born in Hawaii."
o Output: ["Barack Obama" - PERSON, "Hawaii" - LOCATION]
• Word Sense Disambiguation: Determining the correct meaning of a word
based on the context in which it is used.
o Input: "He went to the bank."
o Output: Depending on context, "bank" could mean a financial institution
or the side of a river.
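
One simple way to see how context can settle the meaning of "bank" is to count how many context words overlap with cue words for each sense (a much-simplified relative of the Lesk algorithm). The cue-word sets and the function name below are hypothetical and hand-picked for this illustration.

```python
import re

# Hypothetical cue words for each sense of "bank".
SENSES = {
    "financial institution": {"money", "deposit", "loan", "account", "cash", "withdraw"},
    "side of a river": {"river", "water", "fishing", "shore", "boat", "stream"},
}

def disambiguate_bank(sentence):
    """Return the sense whose cue words overlap most with the sentence, or None."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    # With no overlapping cue word, the context does not decide the sense.
    return best if scores[best] > 0 else None

print(disambiguate_bank("He went to the bank to deposit money."))
print(disambiguate_bank("He sat on the bank of the river fishing."))
print(disambiguate_bank("He went to the bank."))
```

For the bare sentence "He went to the bank." the function returns None, which mirrors the point above: without more context the word stays ambiguous.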
Discourse Integration–
• Discourse integration interprets sentences together, so that the meaning of
a sentence can depend on the sentence that precedes it.

Example:

Sentence 1: Ramu is in 4th standard.

Sentence 2: Ramu is a good boy.

Result: Ramu is in 4th standard, and he is a good boy.

Pragmatic Analysis:
Understanding the underlying intention behind a statement (e.g., whether
it's a question, command, or request).

Example:

Input: "Can you pass the salt?"

Output: Understanding that this is a request, not just a question.


Issues and Challenges:
1. Irregularity
2. Ambiguity
3. Productivity
1) Irregularity:
The phenomenon where certain words or word forms do not follow regular
patterns or rules in terms of their morphology or syntax.
It is a challenge for algorithms that rely on regular patterns.
Irregular verbs and nouns:
Words that do not follow the standard pattern of inflection.
Example (base form, past tense, past participle):
choose  chose  chosen -------- base word is "choose"
bite    bit    bitten -------- base word is "bite"
go      went   gone   -------- "went" is an irregular form of "go"
Exceptional inflection:
Comparative and superlative adjectives
• For comparative adjectives, the suffix -er is added, or the adjective is
preceded by "more".
• For superlative adjectives, the suffix -est is added, or the adjective is
preceded by "most".
Example: big, bigger, biggest
dark, darker, darkest
good, better, best
Here a purely rule-based algorithm would produce "gooder" and "goodest", but
the correct comparative and superlative forms are "better" and "best".
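
The sketch below shows why irregular forms need special handling: a regular suffix rule alone gets "good" wrong, while a small exception table consulted first gives the right answer. The function names, the exception table, and the rough consonant-doubling heuristic are assumptions made for this illustration; adjectives that take "more" and "most" are not handled.

```python
VOWELS = "aeiou"

# Exception table consulted before the regular rule (hypothetical, tiny).
IRREGULAR = {"good": ("better", "best"), "bad": ("worse", "worst")}

def regular_forms(adjective):
    """Apply only the regular -er / -est rule."""
    adj = adjective.lower()
    if adj.endswith("e"):
        return adj + "r", adj + "st"              # large -> larger, largest
    # Rough heuristic: double the final consonant after a single vowel (big -> bigg).
    if (len(adj) >= 3 and adj[-1] not in VOWELS + "wxy"
            and adj[-2] in VOWELS and adj[-3] not in VOWELS):
        adj += adj[-1]
    return adj + "er", adj + "est"

def comparative_forms(adjective):
    """Check the exception table first, then fall back to the regular rule."""
    return IRREGULAR.get(adjective.lower()) or regular_forms(adjective)

print(regular_forms("good"))      # ('gooder', 'goodest')  <- the rule alone is wrong
print(comparative_forms("good"))  # ('better', 'best')
print(comparative_forms("big"))   # ('bigger', 'biggest')
print(comparative_forms("dark"))  # ('darker', 'darkest')
```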
2) Ambiguity:
Ambiguity in Natural Language Processing (NLP) refers to situations where a
word, phrase, or sentence can have multiple meanings or interpretations.
Ambiguity arises in morphological processing and language processing.
2.1 Lexical Ambiguity
• Definition: Occurs when a word has multiple meanings.
• Example: The word "bank" can refer to a financial institution or the side
of a river.
• Challenge: The NLP model must use context to decide which meaning is
intended.
2.2 Syntactic Ambiguity
• Definition: Occurs when a sentence can be parsed in more than one way
due to its structure.
• Example: "I saw the man with the telescope." This could mean either the
speaker used a telescope to see the man or that the man had a telescope.
• Challenge: The model needs to determine the correct parse tree based on
context.
2.3 Semantic Ambiguity
• Definition: Happens when a sentence or phrase can have multiple
meanings.
• Example: "The chicken is ready to eat." This could mean the chicken is
ready to be eaten, or the chicken is ready to eat something.
• Challenge: The model must infer the meaning from the surrounding text.
3) Productivity:
• The ability to generate new words and word forms using productive rules.
• Example: "googol" means 1 followed by 100 zeros, and the new word
"Google" was coined from it.
Finding the structure of Documents:
• To understand chunking, it is important to first know what segmentation is.
• Extracting the structure of documents helps NLP tasks like parsing and
machine translation.
• Segmentation is the process of splitting the input text or speech into
tokens.
• Chunking in Natural Language Processing (NLP) is a process of dividing
a sentence into syntactically correlated parts, such as phrases (noun
phrases, verb phrases, etc.). It is also known as shallow parsing because,
unlike full parsing, which creates a complete syntactic structure for a
sentence, chunking focuses on identifying and grouping small segments
of the sentence.
Chunking Process:
• Tokenization: The text is first broken down into tokens (words or
punctuation).
• Part-of-Speech (POS) Tagging: Each token is labeled with its part of
speech (e.g., noun, verb, adjective).
• Chunking: Using rules or machine learning models, the tagged tokens
are grouped into chunks.

• Example:
• Consider the sentence: "The quick brown fox jumps over the lazy dog."
• Tokenization: ["The", "quick", "brown", "fox", "jumps", "over", "the",
"lazy", "dog"]
• POS Tagging: ["The/DT", "quick/JJ", "brown/JJ", "fox/NN",
"jumps/VBZ", "over/IN", "the/DT", "lazy/JJ", "dog/NN"]
• Chunking:
• NP: [The/DT quick/JJ brown/JJ fox/NN]
• VP: [jumps/VBZ]
• PP: [over/IN the/DT lazy/JJ dog/NN]
• Here, the sentence is divided into three main chunks: a noun phrase, a
verb phrase, and a prepositional phrase.
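
The rule-based variant of the chunking step can be sketched as a scan over already-tagged tokens. The three patterns used here (NP = optional DT, any JJ, one or more NN; VP = one or more verb tags; PP = IN followed by an NP) and the function name `chunk` are assumptions chosen to reproduce the example above; the tags are hand-assigned rather than produced by a tagger.

```python
def chunk(tagged):
    """Group (word, tag) pairs into NP, VP, and PP chunks using three simple rules."""
    n = len(tagged)

    def match_np(start):
        """Return the end index of an NP (DT? JJ* NN+) at start, or start if none."""
        j = start
        if j < n and tagged[j][1] == "DT":
            j += 1
        while j < n and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        return k if k > j else start

    chunks, i = [], 0
    while i < n:
        tag = tagged[i][1]
        if tag == "IN":                       # PP: preposition followed by an NP
            end = match_np(i + 1)
            if end > i + 1:
                chunks.append(("PP", tagged[i:end]))
                i = end
                continue
        end = match_np(i)                     # NP
        if end > i:
            chunks.append(("NP", tagged[i:end]))
            i = end
            continue
        if tag.startswith("VB"):              # VP: run of verb tags (VB, VBZ, ...)
            j = i
            while j < n and tagged[j][1].startswith("VB"):
                j += 1
            chunks.append(("VP", tagged[i:j]))
            i = j
            continue
        i += 1                                # token outside any chunk

    return chunks

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
for label, tokens in chunk(tagged):
    print(label, [f"{word}/{tag}" for word, tag in tokens])
```

Running it on the example sentence yields one NP, one VP, and one PP, in that order.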
Different approaches for segmenting different languages:
Example
• In Chinese documents, there are no white spaces between the words of a
sentence, so words are found by segmenting the text character by character.
• In English, a document can be segmented using the whitespace between
the words of a sentence.
• The main aim of segmentation is to decide whether to mark a boundary
between two tokens as a sentence boundary or a topic boundary.
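
For a language written without spaces, such as the Chinese case above, one common dictionary-based approach is forward maximum matching: at each position, take the longest dictionary word that starts there. The notes do not name a specific method, so this is offered only as an illustration; the tiny dictionary, the `max_len` parameter, and the function name are hypothetical.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: prefer the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)   # unknown single characters become tokens
                i += size
                break
    return words

# Hypothetical mini-dictionary for the sentence "I like natural language processing".
dictionary = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
print(max_match("我喜欢自然语言处理", dictionary))
# ['我', '喜欢', '自然语言', '处理']
```

For English, the equivalent first step is simply `text.split()`, since whitespace already marks most word boundaries.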
Two types of segmentation
1) Sentence boundary detection
2) Topic boundary detection
Sentence boundary detection :
• It is also called sentence segmentation.
• It is the process of segmenting a sequence of words into sentence units.
• In written English, a sentence typically begins with an uppercase letter
(e.g., A, B) and ends with a sentence-final punctuation mark (., ?, !).
• However, capital letters are also used for proper nouns and abbreviations
inside a sentence, so capitalization alone is not a reliable cue.
• Punctuation marks, such as the periods in abbreviations, are also used
inside sentences.
Example of Sentence Boundary Detection:
Consider the following text:
"Dr. Smith went to New York. He met with Mr. Johnson at 3:00 p.m.
They discussed the project, and then he returned to L.A."
Sentence Boundaries:
1. "Dr. Smith went to New York."
• "Dr." ends with a period, but it is not a sentence boundary because
it's an abbreviation. The actual boundary is after "York."
2. "He met with Mr. Johnson at 3:00 p.m."
• "Mr." and "p.m." both contain periods, but they are not sentence
boundaries. The sentence ends after "p.m."
3. "They discussed the project, and then he returned to L.A."
• "L.A." ends with a period, but it's part of an abbreviation, not a
sentence boundary. The sentence ends after "L.A."
Example Breakdown:
1. Input Text:
• "Dr. Smith went to New York. He met with Mr. Johnson at 3:00
p.m. They discussed the project, and then he returned to L.A."
2. Detected Sentences:
• "Dr. Smith went to New York."
• "He met with Mr. Johnson at 3:00 p.m."
• "They discussed the project, and then he returned to L.A."
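
The breakdown above can be reproduced with a small rule-based splitter. The rules are assumptions made for this sketch: a token ending in ".", "?", or "!" closes a sentence unless it is a title abbreviation such as "Dr." or "Mr.", and only if the next token starts with an uppercase letter or the text ends. The title list and the function name are hypothetical.

```python
# Titles that are followed by a name and so almost never end a sentence (hypothetical list).
TITLES = {"dr.", "mr.", "mrs.", "ms.", "prof."}

def split_sentences(text):
    """Split text into sentences using punctuation, a title list, and capitalization."""
    tokens = text.split()
    sentences, current = [], []
    for idx, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "?", "!")):
            continue
        if token.lower() in TITLES:
            continue                      # "Dr." or "Mr.": not a boundary
        next_token = tokens[idx + 1] if idx + 1 < len(tokens) else None
        # Boundary if the text ends here or the next token is capitalized.
        if next_token is None or next_token[0].isupper():
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = ("Dr. Smith went to New York. He met with Mr. Johnson at 3:00 p.m. "
        "They discussed the project, and then he returned to L.A.")
for sentence in split_sentences(text):
    print(sentence)
```

This heuristic still fails on cases like "at 3:00 p.m. Monday", where an abbreviation is followed by a capitalized word inside the same sentence, which is one reason practical systems use trained models.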
Topic boundary detection:
• It is also called discourse segmentation or text segmentation.
• Topic segmentation is the process of dividing text or speech into
topically homogeneous blocks.
Applications
1) Text extraction and retrieval
2) Text summarization
Identification
Text segmentation clues:
• Headlines
• Paragraph breaks
• Example of Topic Boundary Detection:
• Consider the following passage:
"The global economy is facing significant challenges. Inflation rates
are rising, and many countries are struggling to recover from the effects
of the pandemic. Governments are implementing various measures to
stabilize their economies.
In the tech industry, innovation continues at a rapid pace. Companies are
investing heavily in artificial intelligence and machine learning. These
technologies are expected to revolutionize industries from healthcare to
finance."
• Detected Topic Boundaries:
• First Topic:
• "The global economy is facing significant challenges. Inflation
rates are rising, and many countries are struggling to recover
from the effects of the pandemic. Governments are
implementing various measures to stabilize their economies."
• This segment discusses economic challenges and government
responses.
• Topic Boundary:
• A boundary is detected before the sentence starting with "In the
tech industry," which signals a shift from the economy to
technology.
• Second Topic:
• "In the tech industry, innovation continues at a rapid pace.
Companies are investing heavily in artificial intelligence and
machine learning. These technologies are expected to
revolutionize industries from healthcare to finance."
• This segment discusses technological innovation and its impact.
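
When no headline or paragraph break is available, one simple heuristic is lexical cohesion: neighbouring sentences on the same topic tend to share vocabulary, so the gap with the lowest word overlap is a candidate topic boundary. The sketch below is a simplified illustration of that idea, not the method of these notes; the stopword list, the sample sentences, and the function names are hypothetical, and the sample text was written so that words repeat within each topic.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "is", "and", "to", "in", "will", "a", "of", "are", "at"}

def bag_of_words(sentence):
    """Count the content words of a sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def find_topic_boundary(sentences):
    """Return the index of the sentence that starts the new topic.

    Assumes at least two sentences and exactly one topic shift.
    """
    bags = [bag_of_words(s) for s in sentences]
    gaps = [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]
    return gaps.index(min(gaps)) + 1

sentences = [
    "The economy is slowing and inflation is rising.",
    "Governments respond to inflation to protect the economy.",
    "Technology companies invest in machine learning.",
    "Machine learning will change technology companies.",
]
print(find_topic_boundary(sentences))  # 2: the third sentence starts the new topic
```

On real text, adjacent sentences often share few exact words even within a topic, so practical systems compare larger blocks of text or use semantic representations instead of raw word counts.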
Feature | Sentence Boundary Detection | Topic Boundary Detection
Definition | Identifies the boundaries between individual sentences in a text. | Identifies the boundaries where topics change within a document.
Primary Goal | To segment text into sentences for further processing. | To segment text into coherent sections based on topic changes.
Focus | Sentence-level segmentation. | Topic-level segmentation.
Techniques Used | Punctuation detection, capitalization rules, machine learning. | Topic modeling, semantic analysis, clustering algorithms.
Complexity | Relatively straightforward, with well-defined rules. | More complex, requires understanding of context and semantics.
Applications | Text preprocessing, speech recognition, machine translation. | Document summarization, content indexing, topic-based retrieval.
Example | Splitting "The cat sat. It was hungry." into two sentences. | Identifying the shift from a discussion on "climate change" to "economic policies" in a report.
Challenges | Handling abbreviations, ellipses, and ambiguous punctuation. | Detecting subtle or gradual topic transitions within a text.
Input Granularity | Works on paragraphs or streams of text to identify sentences. | Works on entire documents or long text to identify topic shifts.
Output | A list of sentences. | Segments or chunks of text, each representing a distinct topic.
Generative Sequence Classification Methods:
