Morphology is the study of the structure and formation of words. It focuses on how words are built from smaller units called morphemes, which are the smallest meaningful units of language. For example, the word "unhappiness" can be broken down into three morphemes: "un-", "happy", and "-ness".

Types of Morphology:
1. Inflectional Morphology: Changes a word's form to express different grammatical categories (e.g., tense, number, case). Example: "walk" → "walked".
2. Derivational Morphology: Creates new words by adding prefixes or suffixes. Example: "happy" → "unhappy".
| Parameter | Inflectional Morphology | Derivational Morphology |
|---|---|---|
| Function | Modifies a word to express grammatical categories (e.g., tense, number, case) without changing its core meaning. | Creates a new word by adding prefixes or suffixes, often changing the meaning and sometimes the grammatical category. |
| Word Category | Does not change the grammatical category of the word (e.g., a verb remains a verb). | Often changes the grammatical category of the word (e.g., noun to adjective). |
| Meaning | Retains the original meaning of the word, adding grammatical information. | Can change the meaning of the base word significantly. |
| Productivity | Highly regular and applies broadly across word classes (e.g., pluralization of nouns). | Less regular and can be limited to specific words or contexts. |
| Number of Morphemes | Usually adds a single morpheme (e.g., "-s", "-ed"). | May add one or more morphemes (e.g., "un-", "-ness"). |
| Position | Typically occurs at the end of a word (suffixes) in English (e.g., "-s", "-ed"). | Can occur at the beginning (prefixes) or end (suffixes) of a word (e.g., "pre-", "-ly"). |
| Obligatoriness | Required to convey correct grammatical meaning (e.g., tense, plurality). | Optional; used to create new words or change meaning. |
| Examples | "walk" → "walked" (past tense), "dog" → "dogs" (plural) | "happy" → "unhappy" (change in meaning), "joy" → "joyful" (noun to adjective) |
6.6 Sentiment Analysis

* Sentiment analysis is the process of detecting positive or negative sentiment in text.
* It is often used by businesses to detect sentiment in social data, gauge brand reputation, and understand customers.
* Since customers express their thoughts and feelings more openly than ever before, sentiment analysis is becoming an essential tool to monitor and understand that sentiment.
* Automatically analysing customer feedback, such as opinions in survey responses and social media conversations, allows brands to learn what makes customers happy or frustrated, so that they can tailor products and services to meet their customers' needs.
* For example, using sentiment analysis to automatically analyse 4,000+ reviews about your product could help you discover whether customers are happy about your pricing plans and customer service. (A minimal scoring sketch appears after the list of advantages below.)
Five advantages of a sentiment analysis system:
1. Market Insights
2. Improved Customer Engagement
3. Brand Monitoring
4. Enhanced Decision-Making
5. Automated Insights
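As a rough illustration of how a minimal lexicon-based sentiment scorer might work (the word lists, threshold, and function name below are invented for this sketch, not taken from any particular tool):

```python
# Minimal lexicon-based sentiment sketch (illustrative only; the word lists
# and the counting rule are made up, not from a real sentiment lexicon).
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "poor", "unhappy", "hate", "terrible"}

def sentiment_score(text: str) -> str:
    """Return 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment_score("The pricing plans are great and support is excellent"))  # positive
```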
Word Meaning Relationships

Hyponymy and Hypernymy
* Hyponymy: A word whose meaning is included in the meaning of another word.
  - Example: "Cat" is a hyponym of "animal" because all cats are animals.
* Hypernymy: A word whose meaning includes the meaning of another word.
  - Example: "Animal" is a hypernym of "cat" because it is a broader term that encompasses cats.

Meronymy and Holonymy
* Meronymy: A part-whole relationship.
  - Example: "Wheel" is a meronym of "car" because it is a part of a car.
* Holonymy: A whole-part relationship.
  - Example: "Car" is a holonym of "wheel" because it is the whole that contains the wheel.
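These relations can be explored programmatically; for example, NLTK's WordNet interface exposes hypernyms, hyponyms, meronyms and holonyms. A sketch, assuming the `nltk` package is installed and its `wordnet` corpus has been downloaded:

```python
# Exploring hyponymy/hypernymy and meronymy/holonymy with NLTK's WordNet.
# Assumes: pip install nltk  and  nltk.download('wordnet') have been run.
from nltk.corpus import wordnet as wn

cat = wn.synset("cat.n.01")
print(cat.hypernyms())                            # broader terms (hypernyms)
print(wn.synset("animal.n.01").hyponyms()[:5])    # a few more specific animals

car = wn.synset("car.n.01")
print(car.part_meronyms())                        # parts of a car
print(wn.synset("wheel.n.01").part_holonyms())    # wholes that contain a wheel
```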
Finite State Automata (FSA) play a significant role in morphological analysis in natural language
processing (NLP) by modeling the structure and formation of words. Here's a concise overview:

1. Modeling Morphological Rules: FSAs represent the rules governing how words are formed from
roots and affixes (prefixes, suffixes), capturing valid transitions for combining morphemes.

2. Word Segmentation: FSAs segment words into their constituent morphemes (e.g., "unhappiness" → "un-", "happy", "-ness"), helping to understand meanings and grammatical functions (a toy recognizer sketch follows this list).

3. Generating Word Forms: They can generate all possible inflected forms of a base word, like
"play" to "plays," “playing,” "played," which is essential for languages with rich inflections.

4. Recognizing Valid Word Forms: FSAs determine if a word conforms to a language's


morphological rules, useful for spell checking and grammar correction.

5. Efficiency: FSAs are computationally efficient, making them suitable for processing large
datasets and complex morphological structures.

6. Integration: FSAs can be combined with other models, like context-free grammars, to capture
both morphological and syntactic aspects of language.
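A toy illustration of the idea: the states, transition scheme, and tiny lexicon below are invented for this sketch; a real morphological FSA would be compiled from full lexicons and affix rules.

```python
# Toy finite-state recognizer for forms like "walk", "walks", "walked", "walking".
# ROOTS and SUFFIXES are hypothetical; real morphological FSAs are much larger.
ROOTS = {"walk", "play", "talk"}
SUFFIXES = {"", "s", "ed", "ing"}

def accepts(word: str) -> bool:
    """Simulate a two-state FSA: state 0 consumes a root, state 1 consumes a suffix."""
    for root in ROOTS:                                  # transition from state 0 to state 1
        if word.startswith(root) and word[len(root):] in SUFFIXES:
            return True                                 # accepting state reached
    return False

for w in ["walked", "playing", "talkz"]:
    print(w, accepts(w))   # True, True, False
```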
Part-of-Speech Tagging

Part-of-speech tagging (or just tagging for short) is the process of assigning a part-of-speech or other lexical class marker to each word in a corpus. Tags are also usually applied to punctuation markers; thus tagging for natural language is the same process as tokenization for computer languages, although tags for natural languages are much more ambiguous. The input to a tagging algorithm is a string of words and a specified tagset.

Example (tags written above the words):

VB   DT   NN
Book that flight.

Tagging algorithms consider the multiple possible tags for a single word and select the single most appropriate tag for that word. Tagging can be hard because it involves a lot of disambiguation; e.g., the word "book" can be considered a verb in "book that flight" and a noun in "please give me that book".
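For instance, NLTK's off-the-shelf tagger can be run on the example sentence. A sketch, assuming `nltk` is installed and its tokenizer and tagger resources (e.g. `punkt` and `averaged_perceptron_tagger`) have been downloaded:

```python
# POS tagging the example sentence with NLTK's default tagger.
import nltk

tokens = nltk.word_tokenize("Book that flight.")
print(nltk.pos_tag(tokens))
# Output is a list of (word, tag) pairs; the exact tags depend on the tagger model.
```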
1.5.1 Phonetic and Phonological Knowledge

* Phonetic knowledge is the knowledge of sound-symbol relations and sound patterns represented in a language.
* When a child is learning to talk and communicate, they develop phonemic awareness, which is an awareness of distinctive speech sounds, and they use phonemes (the smallest units of sound) to create words.
* The primary difference between phonological and phonemic awareness is that phonological awareness is the ability to recognise that words are made up of different sounds, whereas phonemic awareness is the ability to understand how sounds function in words.
* Examples of phonological knowledge: counting the number of syllables in a name, recognising alliteration, segmenting a sentence into words, and identifying the syllables in a word.
* Example of phonemic knowledge: counting the number of sounds in a word would be a phonemic awareness activity.
(8) Automatic Text Classification

* Automatic text classification is another fundamental application of NLP. It is the process of assigning tags to text according to its content and semantics. It allows for rapid, easy collection of information in the search phase.
* This NLP application can differentiate spam from non-spam based on its content.

1.9.1 Advantages of NLP

(i) Once implemented, NLP is less expensive and more time-efficient than employing a person.
(ii) NLP can also help businesses: it offers faster customer service response times, and customers can receive immediate answers to their questions.
(iii) Pre-trained learning models are available for developers to facilitate different applications of NLP; this makes them easy to implement.
(iv) Natural Language Processing is the practice of teaching machines to understand and interpret conversational inputs from humans.
(v) NLP can be used to establish communication channels between humans and machines.
(vi) The different implementations of NLP can help businesses and individuals save time, improve efficiency and increase customer satisfaction.

1.9.2 Disadvantages of NLP

(i) Training can be time-consuming. If a new model needs to be developed without the use of a pre-trained model, it can take weeks before achieving a high level of performance.
(ii) There is always a possibility of errors in predictions and results that needs to be taken into account.
(iii) NLP may not show context.
(iv) NLP may require more keystrokes.
(v) NLP does not adapt well to new domains and has limited functionality; an NLP system is typically built for a single, specific task.

Tools for NLP in Indian languages:

(i) Transliteration
* A transliteration technology allows users to type words as they would usually say them (e.g., typing 'rashtrabhasha' phonetically) instead of following strict, case-sensitive typing rules.
* Transliteration tools expect users to type words phonetically in English script, which allows users to communicate in the regional language of their choice.

(ii) Fonts Download
* The Technology Development for Indian Languages (TDIL) programme, initiated by the Department of Electronics and IT (DeitY), Govt. of India, has the objective of developing information processing tools that facilitate human-machine interaction in Indian languages and technologies to access multilingual knowledge resources.
* The fonts are being made available free to the public through language CDs and web downloads for the benefit of the masses.

(iii) Padma Plugin
* Padma is a technology for transforming Indic text between public and proprietary formats. It currently supports Telugu, Malayalam, Tamil, Devanagari (including Marathi), Gujarati, Bengali and Gurmukhi.
* Padma's goal is to bridge the gap between closed and open standards until Unicode support is widely available on all platforms.
* Padma automatically transforms Indic text encoded in proprietary formats into Unicode.
Applications of NLP

NLP has led to the automation of speech-related tasks and human interaction. Some applications of NLP include:

* Translation Tools: Tools such as Google Translate, Amazon Translate, etc. translate sentences from one language to another using NLP.
* Chatbots: Chatbots can be found on most websites and are a way for companies to deal with common queries quickly.
* Virtual Assistants: Virtual assistants like Siri, Cortana, Google Home, Alexa, etc. can not only talk to you but also understand commands given to them.
* Question Answering: Type in keywords to ask questions in natural language.
* Text Summarization: The process of summarizing important information from a source to produce a shortened version.
* Machine Translation: Use of computer applications to translate text or speech from one natural language to another. E.g., Google Translate for translating short, simple sentences; DeepL for translating longer and more complex texts.
2.1 TOKENIZATION

* Tokenization is a common task in Natural Language Processing (NLP). It is a fundamental step in both traditional NLP methods like Count Vectorizer and advanced deep-learning-based architectures like Transformers.
* Tokens are the building blocks of natural language.
* Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types: word, character, and subword (n-gram character) tokenization.
* For example, consider the sentence: "Never give up".
* The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens: Never-give-up. As each token is a word, this is an example of word tokenization.
* Similarly, tokens can be either characters or subwords. For example, consider "smarter":
  1. Character tokens: s-m-a-r-t-e-r
  2. Subword tokens: smart-er
* But then, is tokenization really necessary? Do we really need to do all of this?
Tokenization in NLP is the process of breaking text into smaller parts (tokens) like words or phrases. Here are the common types:

1. Word Tokenization: Splits text into individual words.
   Example: "I love NLP" → ["I", "love", "NLP"]
2. Sentence Tokenization: Splits text into sentences.
   Example: "I love NLP. It's fun." → ["I love NLP.", "It's fun."]
3. Subword Tokenization: Breaks words into smaller parts (subwords), especially useful for rare words.
   Example: "unbelievable" → ["un", "believ", "able"]
4. Character Tokenization: Splits text into individual characters.
   Example: "NLP" → ["N", "L", "P"]
5. Whitespace Tokenization: Splits text by spaces.
   Example: "Hello world" → ["Hello", "world"]
Stemming

* Stemming is the process of obtaining the word stem of a word.
* A word stem is the base form to which affixes are added to form new words, e.g., "Skip" → "Skip" + "-ed".
* Examples of stemming:
  - running → run
  - Cats → Cat
  - Programming → Program
  - Wolves → Wolf
  - Decreases → Decrease
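A quick way to try these mappings is NLTK's Porter stemmer (a sketch; note that a real stemmer often returns truncated, non-dictionary stems, e.g. "wolves" becomes "wolv" rather than "wolf"):

```python
# Stemming the examples above with NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "cats", "programming", "wolves", "decreases"]:
    print(word, "->", stemmer.stem(word))
# e.g. running -> run, cats -> cat; "wolves" stems to "wolv", not "wolf".
```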
| Stemming | Lemmatization |
|---|---|
| Stemming is faster because it chops words without knowing the context of the word in the given sentence. | Lemmatization is slower than stemming, but it knows the context of the word before proceeding. |
| It is a rule-based approach. | It is a dictionary-based approach. |
| Accuracy is lower. | Accuracy is higher compared to stemming. |
| When converting a word into its root form, stemming may create a non-existent word with no meaning. | Lemmatization always gives a valid dictionary word when converting into the root form. |
| Stemming is preferred when the meaning of the word is not important for analysis. Example: spam detection. | Lemmatization is recommended when the meaning of the word is important for analysis. Example: question answering. |
| For example: "Studies" → "Studi", "troubled" → "troubl" | For example: "Studies" → "Study", "troubled" → "trouble" |
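The contrast in the table can be checked directly (a sketch; assumes `nltk` with the `wordnet` corpus downloaded; exact outputs depend on the tools used):

```python
# Comparing a stemmer and a lemmatizer.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))               # 'studi'  (non-dictionary stem)
print(lemmatizer.lemmatize("studies"))       # 'study'  (valid dictionary word)
print(lemmatizer.lemmatize("was", pos="v"))  # 'be'     (uses dictionary/POS knowledge)
```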
1. Over-Stemming Error:
« The stemmer reduces words to the same root when they should remain distinct.

* Example:

* "University" and “Universal” both stemmed to “Univers.”

* Impact: Loss of precision; unrelated words are conflated.

2. Under-Stemming Error:
* The stemmer fails to reduce words to the same root when they should be combined.

* Example:

* "Connect" and “Connected” remain as separate stems.

* Impact: Loss of recall; related words are treated as different.


2.3.4 Applications of Lemmatization
* The process of lemmatization is used extensively in text mining. Text mining enables computers to extract relevant information from a particular set of text.
* Some of the other areas where lemmatization can be used are as follows:

1. Sentiment analysis
* Sentiment analysis refers to the analysis of people's messages, reviews or comments to understand how they feel about something. Before the text is analysed, it is lemmatized.

2. Information retrieval environments
* Lemmatization is used for mapping documents to common topics and displaying search results, for example when building indexes as the number of documents grows large.

3. Biomedicine
* Lemmatization can be used while morphologically analysing biomedical literature. The BioLemmatizer tool has been built for this purpose.
* It pulls lemmas based on the use of a word lexicon. If a word is not found in the lexicon, it applies rules that turn the word into a lemma.

4. Document clustering
* Document clustering (or text clustering) is a form of group analysis conducted on text documents.
* Topic extraction and rapid information retrieval are vital applications of it.
* Both stemming and lemmatization are used to reduce the number of tokens needed to convey the same information, which speeds up the entire process.
* After pre-processing is carried out, features are estimated by determining the frequency of each token, and then clustering methods are applied.

5. Search engines
* Search engines like Google make use of lemmatization so that they can provide better, more relevant results to their users.
* Lemmatization also allows search engines to display relevant results and even expand them to include other information that readers may find useful.
Applications of stemming:
1. Search Engines: Stemming helps search engines match user queries with documents by reducing words to their root forms, ensuring that different word variations (e.g., run, running) are treated as the same.
2. Text Classification: In text classification, stemming reduces vocabulary size by grouping similar words together, making machine learning models more efficient.
3. Sentiment Analysis: Stemming ensures that variations of a word (e.g., love, loving) are recognized as expressing the same sentiment, improving analysis accuracy.
4. Information Retrieval: By stemming words in documents and queries, information retrieval systems retrieve more relevant results despite word variations.
Lexicon in NLP:

* Lexicon: A collection of words and their meanings, used by NLP systems for understanding and
processing language.

* Purpose: Helps in identifying the correct meaning, grammatical features, or other linguistic
properties of words.

® Types:

* Lexical Entries: Words and their associated meanings, roots, or forms.

* Word Sense Disambiguation: Helps in identifying the correct sense or meaning of a word
based on context.

* Role in NLP: Used in tasks like POS tagging, machine translation, and morphological analysis.
N-gram Models in NLP
N-grams are contiguous sequences of n items (words, characters, etc.) from a given text or speech.
N-gram models are used to predict the probability of the next word based on the previous words.

Types of N-gram Models:


1. Unigram Model:
* Definition: Considers each word independently (a single word at a time).
* Example:
  - Sentence: "I love NLP"
  - Unigrams: ["I", "love", "NLP"]
* Use: Simple model for word prediction, but does not capture word order or context.
* Probability: P(w), the probability of a word occurring in the text.

2. Bigram Model:
* Definition: Considers pairs of consecutive words (2 words at a time).
* Example:
  - Sentence: "I love NLP"
  - Bigrams: ["I love", "love NLP"]
* Use: Captures some context by considering the relationship between adjacent words.
* Probability: P(w_n | w_(n-1)), the probability of a word occurring given the previous word.

3. Trigram Model:
* Definition: Considers triples of consecutive words (3 words at a time).
* Example:
  - Sentence: "I love NLP"
  - Trigrams: ["I love NLP"]
* Use: Captures more context by considering two previous words when predicting the next word.
* Probability: P(w_n | w_(n-2), w_(n-1)), the probability of a word occurring given the previous two words.
| Model | Number of Words | Context | Example |
|---|---|---|---|
| Unigram | 1 word | No context, independent words | ["I", "love", "NLP"] |
| Bigram | 2 consecutive words | Considers the previous word for prediction | ["I love", "love NLP"] |
| Trigram | 3 consecutive words | Considers the previous two words for prediction | ["I love NLP"] |
| N-gram | n consecutive words | Considers the previous n-1 words for prediction | ["I love NLP models"] |
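A minimal sketch of how bigram counts and probabilities could be estimated from a toy corpus (pure Python; the corpus and names are made up for illustration):

```python
# Estimating bigram probabilities P(w_n | w_(n-1)) from a toy corpus by MLE.
from collections import Counter

corpus = "i love nlp i love machine learning".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """MLE estimate: count(prev_word, word) / count(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("i", "love"))    # 1.0  ("i" is always followed by "love" here)
print(bigram_prob("love", "nlp"))  # 0.5  ("love" is followed by "nlp" or "machine")
```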
2.12 SMOOTHING

* Smoothing is the process of flattening a probability distribution implied by a language model so that all reasonable word sequences can occur with some probability.
* This involves broadening the distribution by redistributing weight from high-probability regions to zero-probability regions.
* Smoothing not only prevents zero probabilities; it also attempts to improve the accuracy of the model as a whole.
2.12.1 Laplace Smoothing

* We use maximum likelihood estimation (MLE) for training the parameters of an N-gram model.
* The problem with MLE is that it assigns zero probability to unknown (unseen) words, because MLE estimates counts only from a training corpus.
* If a word in the test set is not available in the training set, then the count of that particular word is zero, and that leads to zero probability.
* To eliminate this zero probability, we do smoothing.
* Smoothing takes some probability mass from the events seen in training and assigns it to unseen events.
* Add-1 smoothing (also called Laplace smoothing) is a simple smoothing technique that adds 1 to the count of every n-gram in the training set before normalizing the counts into probabilities.
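In formula form, add-1 smoothing replaces the MLE estimate C(w_(n-1) w_n) / C(w_(n-1)) with (C(w_(n-1) w_n) + 1) / (C(w_(n-1)) + V), where V is the vocabulary size. A small self-contained sketch, reusing the same kind of toy corpus as above (names are illustrative):

```python
# Add-1 (Laplace) smoothed bigram probability:
# P(w | prev) = (count(prev, w) + 1) / (count(prev) + V), where V = vocabulary size.
from collections import Counter

corpus = "i love nlp i love machine learning".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def laplace_bigram_prob(prev_word, word):
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + V)

print(laplace_bigram_prob("i", "love"))    # smoothed, slightly below the MLE value of 1.0
print(laplace_bigram_prob("nlp", "dogs"))  # an unseen bigram now gets a small non-zero probability
```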
3.1 SYNTAX ANALYSIS

* Syntax analysis or parsing is the second phase of a compiler.
* A lexical analyser can identify tokens with the help of regular expressions and pattern rules, but a lexical analyser cannot check the syntax of a given sentence, due to the limitations of regular expressions.
* Regular expressions cannot check balancing of tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG). CFG is a superset of regular grammar.
* (Fig. 3.1.1) The diagram implies that every regular grammar is also context-free. CFG is an important tool for describing the syntax of programming languages.
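As a small sketch of what a CFG looks like in practice, here is a toy grammar parsed with NLTK's chart parser (the grammar rules are invented for illustration, not taken from this text):

```python
# Defining a toy context-free grammar and parsing a sentence with NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'ball'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the ball".split()):
    print(tree)   # prints the parse tree, e.g. (S (NP (Det the) (N dog)) (VP ...))
```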
POS Tagging

* Part-of-Speech (POS) tagging is the process of converting a sentence into a list of words and then into a list of tuples, where each tuple has the form (word, tag).
* The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.

| Part of Speech | Tag |
|---|---|
| Noun | n |
| Verb | v |
| Adjective | a |
| Adverb | r |

* We also say that tagging is a kind of classification that may be defined as the automatic assignment of descriptions to tokens. The descriptor is called a tag, which may also represent semantic information.
* In simple words, POS tagging is the task of labeling each word in a sentence with its appropriate part of speech.
* The parts of speech mentioned include nouns, verbs, adverbs, pronouns, adjectives, conjunctions, and so on.
* Most POS tagging falls under rule-based POS tagging, stochastic POS tagging and transformation-based tagging.
3.1.3 Rule-based POS Tagging

* If a word has more than one possible tag, then rule-based taggers use hand-written rules to identify the correct tag.
* Rule-based tagging can handle ambiguity by analysing the linguistic features of a word as well as its preceding and following words. For example, if the preceding word is an article or adjective, then the word in question must be a noun.
* All such information in rule-based POS tagging is coded in the form of rules.
* These rules may be either:
  (i) context-pattern rules, or
  (ii) regular expressions compiled into finite-state automata, which are intersected with the lexically ambiguous sentence representation.
* Rule-based POS tagging can be visualised by its two-stage architecture:
  (i) First stage: a dictionary is used to assign each word a list of potential parts of speech.
  (ii) Second stage: the method uses a large list of hand-written disambiguation rules to narrow the list down to a single part of speech for each word.

3.1.3.1 Properties of Rule-based POS Tagging

We mention below the properties of rule-based POS taggers:
(i) These taggers are knowledge-driven taggers.
(ii) The rules in rule-based POS tagging are written manually.
(iii) There are around 1000 rules.
(iv) Smoothing and language modelling are defined explicitly in rule-based taggers.
3.1.4 Stochastic POS Tagging

* A stochastic model is a model that includes frequency or probability (statistics).
* Any approach to the problem of part-of-speech tagging that includes probability in the model is referred to as a stochastic tagger.
* The simplest stochastic tagger uses the following approach for POS tagging.

3.1.4.1 Properties of Stochastic POS Tagging

We mention below its properties:
(i) The POS tagging is based on the probability of a tag occurring.
(ii) A training corpus is required.
(iii) If a word does not exist in the corpus, then there is no probability for it.
(iv) A different testing corpus, other than the training corpus, is used.
(v) It is the simplest POS tagging because it chooses the most frequent tag associated with a word in the training corpus.
3.1.5 Transformation-based Tagging (TBL)

* Transformation-based tagging is also called Brill tagging.
* It is an instance of transformation-based learning: a rule-based algorithm for automatic POS tagging of the given text.
* TBL allows us to have linguistic knowledge in a readable form. It transforms one state to another state by using transformation rules.
* TBL can be thought of as a mixture of both the above-mentioned taggers: rule-based and stochastic.
* Like rule-based tagging, it is based on rules that specify which tags need to be assigned to which words.
* We can also see a similarity between the stochastic tagger and the transformation tagger: like stochastic tagging, it is a machine learning technique in which rules are automatically induced from data.
3.2.2 Difficulties/Challenges in POS Tagging

We mention below the difficulties and challenges in POS tagging:
(i) The main problem with POS tagging is ambiguity.
(ii) In English, many common words have multiple meanings and hence multiple POS tags.
(iii) The job of a POS tagger is to resolve this ambiguity accurately based on the context of use. For example, the word 'shot' can be a noun or a verb.
(iv) If a POS tagger gives poor accuracy, then this has an adverse effect on other tasks that follow. This is called downstream error propagation. To improve accuracy, POS tagging is combined with other processing. For example, joint POS tagging and dependency parsing is an approach that improves accuracy compared to independent modelling.
(v) Sometimes a word on its own can give useful clues. For example, 'the' is a determiner. The prefix 'un-' suggests an adjective, such as 'unfathomable'. The suffix '-ly' suggests an adverb, such as 'importantly'. Capitalisation can suggest a proper noun, such as 'Angeles'.
(vi) How a word can be tagged depends upon the neighbouring words and the possible tags that those words can have. Word probabilities also play a part in selecting the right tag to resolve ambiguity. For example, 'man' is rarely used as a verb and mostly used as a noun.
(vii) In a statistical approach, one can count tag frequencies of words in a tagged corpus and then assign the most probable tag. This is called unigram tagging.
Hidden Markov Model (HMM)

* An HMM is a hidden-variable model which can generate an observation using the Markov assumption.
* The hidden state is the variable which cannot be directly observed but can be inferred by observing one or more states, using the Markov assumption.
* The Markov assumption is the assumption that a hidden variable is dependent only on the previous hidden state.
* A Markov model is made up of two components: the state transitions and the hidden random variables that are conditioned on each other.
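To make the idea concrete, here is a compact Viterbi decoding sketch for an HMM POS tagger; the states, transition and emission probabilities below are invented toy numbers, not estimates from a real corpus.

```python
# Toy Viterbi decoding for an HMM POS tagger (illustrative numbers only).
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {
    "N": {"dogs": 0.4, "bark": 0.1, "run": 0.1},
    "V": {"dogs": 0.05, "bark": 0.5, "run": 0.4},
}

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    # V[t][s] = best probability of any path ending in state s at position t
    V = [{s: start_p[s] * emit_p[s].get(words[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(words[t], 1e-6), p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace back the best path from the most probable final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["dogs", "bark"]))   # e.g. ['N', 'V']
```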
3.4 ISSUES IN HMM POS TAGGING

(i) The main problem with HMM POS tagging is ambiguity.
(ii) The POS tagging is based on the probability of a tag occurring.
(iii) There is no probability for words that do not exist in the corpus.
(iv) It uses a different testing corpus, other than the training corpus.
3.4.2 Maximum Entropy Model

* In many systems, there is a time or state dependency. These systems evolve in time through a sequence of states, and the current state is influenced by past states.
* For example, there is a high chance of rain today if it rained yesterday. Other examples include stock prices, DNA sequencing, human speech, or words in a sentence.
* It may happen that we have observations but not the states. For example, we have a sequence of words but not the corresponding part-of-speech tags.
* In this case, we model the tags as states and use the observed words to predict the most probable sequence of tags. This is exactly what the Maximum Entropy Markov Model (MEMM) does.
* MEMM is a model that makes use of state-time dependencies. It uses predictions of the past and the current observation to make the current prediction.
3.5 CONDITIONAL RANDOM FIELD (CRF)

GQ. Explain CRF with applications.

* CRFs are applied in pattern recognition and used for structured prediction.
* A classifier predicts a label for a single sample without considering 'neighbouring' samples; a CRF can take context into account.
* To achieve this, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions.
* The type of graph used depends on the application. In natural language processing, "linear chain" CRFs are popular, in which each prediction depends only on its immediate neighbours.
* In image processing, the graph connects locations to nearby or similar locations to enforce that they receive similar predictions.
* Other examples where CRFs are used include labeling or parsing of:
  (i) sequential data for natural language processing,
  (ii) biological sequences,
  (iii) POS tagging,
  (iv) shallow parsing,
  (v) named entity recognition,
  (vi) gene finding,
  (vii) peptide critical functional region finding,
  (viii) object recognition, and
  (ix) image segmentation in computer vision.


| Parameter | Top-Down Parsing | Bottom-Up Parsing |
|---|---|---|
| Definition | Starts from the root (start symbol) and derives the input string. | Starts from the input string and attempts to construct the parse tree up to the root. |
| Direction of Parsing | Left-to-right, constructing the parse tree from the top (root) to the leaves. | Left-to-right, constructing the parse tree from the leaves to the root. |
| Approach | Derives strings by expanding production rules. | Reduces strings by applying production rules in reverse. |
| Focus | Tries to match the input string with the grammar rules recursively. | Tries to identify substrings in the input that match right-hand sides of grammar rules. |
| Intermediate Form | Starts with the start symbol and derives intermediate forms until the input string is matched. | Reduces substrings of the input to non-terminals until the start symbol is reached. |
| Efficiency | May explore unnecessary branches (backtracking required). | Typically more efficient, with fewer unnecessary explorations. |
| Use Cases | Used in recursive descent parsing and LL parsers. | Used in LR parsers, such as SLR, CLR, and LALR parsers. |
| Error Handling | Errors may be detected late in the process. | Errors are often detected early, as reductions fail. |
| Grammar Compatibility | Works better with grammars that are left-factored and do not have left recursion. | Handles a wider range of grammars, including those with left recursion. |
| Example | For string "abc", starts from S → A B C and derives "abc". | For string "abc", reduces "abc" to A, B, C and then to S. |
" 5.8.2 Formal Definition of PcFg
A probabilistic context-free grammar (PCFG) G is defined by a quintuple:

G = (M, T, R, S, P)

where
(i) M is the set of non-terminal symbols,
(ii) T is the set of terminal symbols,
(iii) R is the set of production rules,
(iv) S is the start symbol, and
(v) P is the set of probabilities on the production rules.
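A sketch of the quintuple in NLTK's notation; the rules and probabilities below are made-up toy values (note that the probabilities of rules sharing a left-hand side must sum to 1):

```python
# A toy PCFG: each production carries a probability, and productions with the
# same left-hand side sum to 1.0.  Parsed with NLTK's Viterbi (best-parse) parser.
import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP      [1.0]
NP -> Det N     [0.6] | 'John' [0.4]
VP -> V NP      [1.0]
Det -> 'the'    [1.0]
N -> 'apple'    [1.0]
V -> 'ate'      [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("John ate the apple".split()):
    print(tree)   # the highest-probability parse, with its probability
```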
| Parameter | Predictive Parser | Shift-Reduce Parser |
|---|---|---|
| Definition | A type of top-down parser that uses lookahead symbols to predict the next rule to apply. | A type of bottom-up parser that uses shifts and reductions to process the input. |
| Parsing Strategy | Top-down: starts with the start symbol and expands rules to match the input. | Bottom-up: starts with the input and reduces it to the start symbol. |
| Parsing Actions | Expands non-terminals using a predictive table and matches terminals with input. | Performs shift (push input to stack) and reduce (apply grammar rules) operations. |
| Grammar Compatibility | Requires LL(1) grammar (left-factored and no left recursion). | Handles a broader range of grammars, including LR grammars with left recursion. |
| Lookahead | Uses a single lookahead symbol to decide the next move. | Uses the stack and input symbols to decide actions, often without requiring lookahead. |
| Backtracking | No backtracking is needed if the grammar is LL(1). | No backtracking is needed; handles conflicts using parsing tables (e.g., LR). |
| Error Detection | Detects errors during expansion of non-terminals when no match is found. | Detects errors when no valid shift or reduce operation is possible. |
| Data Structures | Uses a parse table to guide rule selection. | Uses a stack to store symbols and intermediate results. |
| Efficiency | May be less efficient for complex grammars. | More efficient and robust for a wider range of grammars. |
| Use Cases | Typically used in recursive descent parsing for simpler grammars. | Used in LR parsers (SLR, CLR, LALR) for complex and ambiguous grammars. |


Statistical Machine Translation (SMT)

SMT uses Statistical methods to translate text by learning patterns from a bilingual corpus.

Translation is based on probabilities to find the most likely target sentence for a given source
sentence.

Translation Model estimates the probability of translating a source sentence to a target


sentence.

Language Model ensures fluency by estimating the likelihood of a target sentence in the target language.

Decoder combines the models to find the most probable translation.

Training data includes a parallel corpus (aligned sentences in source and target languages).

Alignments between words or phrases are learned statistically.

Probabilistic techniques like the Expectation-Maximization (EM) algorithm are used for
parameter estimation.

Outputs tend to be fluent but may lack semantic accuracy without a large and high-quality
training corpus.
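As a sketch of the standard noisy-channel formulation behind the decoder (a textbook equation, not quoted from this text): given a source sentence f, the best translation ê combines the translation model and the language model.

```latex
% Noisy-channel view of SMT: translation model times language model.
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \, P(f \mid e) \, P(e)
```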
The Porter Stemming Algorithm is a rule-based method for reducing words to their root form or stem.

It works by sequentially applying a set of predefined rules to remove common suffixes from words.

The algorithm consists of 5 phases, each applying a series of rules to shorten or remove suffixes.

The goal is to reduce words like "running," "runner," and "runs" to the same stem "run."

Example:
* "running" → "run"
* "runner" → "run"
* "runs" → "run"

The rules are based on suffix stripping and vowel/consonant patterns to avoid over-stemming.

It handles a wide variety of English word forms, including regular and irregular inflections.

Example:
* "connection" → "connect"
* "better" → "better" (no stemming, as it is an irregular form)
* "happiness" → "happi"

The algorithm is designed to be efficient and fast, and is often used in information retrieval and text mining applications.

While effective for many cases, it does not always produce linguistically correct stems and may create errors like over-stemming or under-stemming.
Potential Problems in Context-Free Grammar (CFG)
1. Agreement:

¢ Problem: CFGs struggle to enforce agreement (e.g., subject-verb agreement) between


words like “She walks” vs. “They walk”.

e Issue: CFGs don't track dependencies like number, gender, or person automatically,
requiring complex rules to handle agreements.

2. Sub-categorization:

* Problem: Verbs require specific arguments (e.g., "eat" needs a direct object, but "sleep"
does not).

« Issue: CFGs have difficulty encoding verb-specific argument requirements, leading to overly
complex rules.

3. Movement:

¢ Problem: Syntactic movement (e.g., “What did she eat?" or passive constructions) is hard to
represent in CFGs.

e Issue: CFGs are not designed to handle reordering of sentence components, requiring
additional mechanisms to account for movement.
4.1 INTRODUCTION TO SEMANTIC ANALYSIS

* Semantic analysis is the process of finding the meaning of text. It can direct computers to understand and interpret sentences, paragraphs, or whole documents by analysing their grammatical structure and identifying the relationships between the individual words of the sentence in a particular context.
* The purpose of a semantic analyser is to check the text for meaningfulness.
* The most important task of semantic analysis is to get the proper meaning of the sentence. For example, analyse the sentence "Govind is great". In this sentence, the speaker may be talking about Lord Govind or about a person whose name is Govind.

4.1.1 Use of Semantic Analysis

* Semantic analysis is used for extracting important information from text and helps computers approach human-level accuracy.
* It is used in tools like machine translation, chatbots, search engines and text analysis.
4.3 LEXICAL SEMANTICS

* Lexical semantics is a part of semantic analysis. It studies the meaning of individual words. That includes words, subwords, affixes (sub-units), compound words and phrases.
* All the words, sub-words etc. are collectively called lexical items.
* Thus lexical semantics is the relationship between lexical items, the meaning of sentences, and the syntax of sentences.
* The steps involved in lexical semantics are as follows:
  (i) Classification of lexical items like words, sub-words, affixes etc. is performed in lexical semantics.
  (ii) Decomposition of lexical items like words, sub-words, affixes etc. is performed in lexical semantics.
  (iii) Analysis of the differences and similarities between various lexical semantic structures.
4.4.3 Limitation of the Lexical Approach

GQ. What is a limitation of the Lexical Approach?

* While the lexical approach can be a quick way for students to pick up phrases, it does not produce much creativity.
* It can have the negative side effect of limiting people's responses to safe, fixed phrases.
* Since they don't have to build responses, they don't need to learn the intricacies of language.

4.4.4 Principle of the Lexical Approach

GQ. What is the principle of the Lexical Approach?

* The basic principle of the lexical approach is "Language is grammaticalised lexis, not lexicalised grammar".
* In other words, lexis is central in creating meanings, while grammar plays a subsidiary, managerial role.
* Corpus study, or corpus linguistics, is a rapidly growing methodology that uses the statistical analysis of large collections of written or spoken data to investigate linguistic phenomena.
* Corpus linguistics studies a language as it is expressed in its text corpus, its body of "real world" text.
* Corpus study maintains that a reliable analysis of a language is more practicable with corpora collected in the natural context of that language.
* The text-corpus method uses the body of texts written in any natural language and derives the set of abstract rules which govern that language.
* These results can be used to find the relationships between the subject language and other languages which have undergone a similar analysis.
* Corpora have not only been used for linguistics research; they have also been used to compile dictionaries (e.g., The American Heritage Dictionary of the English Language in 1969) and grammar guides, such as A Comprehensive Grammar of the English Language, published in 1985.

4.5.2 Corpus Approach

* The corpus approach utilizes a large and principled collection of naturally occurring texts as the basis for analysis.
* The defining characteristic of the corpus approach is the corpus itself.
* One may work with a written corpus, a spoken corpus, an academic spoken corpus, etc.
* In corpus linguistics, common analytical techniques are dispersion, frequency, clusters, keywords, concordance and collocation. These techniques can contribute to uncovering discourse practices.
* An example of a general corpus is the British National Corpus. Some corpora contain texts that are chosen from a particular variety of a language, for example from a particular dialect or a particular subject area. These corpora are sometimes called 'sublanguage corpora'.
4.10.3 Limitations of Lesk-Based Methods

(i) Lesk’s approach is very sensitive to the exact wording of definitions

(ii) The absence of a certain word can change the results considerably.
(iii) The algorithm determines overlaps only among the glosses of the senses being
considered.

(iv) Dictionary glosses are fairly short and do not provide sufficient vocabulary to relate
sense distinctions.
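The sensitivity to gloss wording is easy to see in a minimal sketch of the gloss-overlap idea (a simplified Lesk, using NLTK WordNet glosses; assumes the `wordnet` data has been downloaded, and omits the stopword removal and example sentences a fuller implementation would use):

```python
# Simplified Lesk: pick the sense whose dictionary gloss overlaps most with the context.
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_sentence):
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)          # count shared words
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simplified_lesk("bank", "I deposited money at the bank and checked my account")
print(sense, "-", sense.definition() if sense else "no sense found")
```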
Pragmatic Analysis in NLP
Considers speaker's intentions, social context, and real-world knowledge.

Resolves ambiguity in word or sentence meaning based on context.

Identifies the purpose of an utterance (e.g., request, command, or question).

Links pronouns or references to their antecedents.

Interprets implied meanings not explicitly stated.

Utilizes real-world facts to make sense of statements.

Used in chatbots, sentiment analysis, and virtual assistants for realistic interactions.
Discourse analysis is a technique in Natural Language Processing (NLP) focused on understanding language beyond individual sentences. Key points:

* Analyzing Context: Examines relationships between sentences to understand meaning in conversations or texts.
* Identifying Structure: Studies how sentences form cohesive and coherent text (e.g., paragraphs, dialogues).
* Detecting Relations: Looks for connections like cause-effect, comparison, or contrast between parts of text.
* Coreference Resolution: Identifies when different expressions refer to the same entity (e.g., "John" and "he").
* Applications: Used in chatbots, sentiment analysis, and summarizing documents.
Fig. 5.4.2: Types of referents which complicate reference resolution
* Inferrables:

* Referents not explicitly mentioned but implied through context or logical reasoning.

« Example: "The picnic was wonderful. The sandwiches were delicious.” ("Sandwiches" are
inferred from “picnic’).

¢ Challenge: Requires advanced reasoning and world knowledge.

e Discontinuous Sets:

¢ Referents formed by multiple, non-contiguous entities in the text.

* Example: "John met Mary at the park. Later, Mike joined them." ("Them" evolves to include John, Mary, and Mike).

e Challenge: Needs tracking of dynamic groupings across discourse.

* Generics:

e Referents applying to an entire category rather than a specific entity.

« Example: “Dogs are loyal animals." ("Dogs" refers to the category, not individual dogs).

e Challenge: Differentiating generalizations from specific instances based on context.


Discourse Reference Resolution

* Discourse reference resolution involves identifying the entities, events, or concepts that pronouns, noun phrases, or other referring expressions point to in a larger context.
* It is essential for understanding the flow of information in conversations, documents, or stories.
* This process connects different parts of a text to create coherent meaning.
* Challenges include resolving ambiguous pronouns, linking distant referents, and understanding implicit or inferred meanings.
* It requires handling anaphora (referring back), cataphora (referring forward), and bridging references (implicit connections).
* Techniques often rely on syntactic parsing, semantic analysis, and machine learning models trained on annotated corpora.
* Successful resolution improves NLP tasks like machine translation, text summarization, and question answering.
Reference Phenomena

« Reference phenomena in NLP involve how language refers to entities, objects, or concepts
within a text or discourse.

« It includes identifying and resolving how nouns, pronouns, and other linguistic elements refer to
the same or different things.

« Common forms of reference include direct reference (e.g., "the book"), pronominal reference
(e.g., "he," “she"), and definite/indefinite reference (e.g., "the dog" vs. “a dog").

« Resolving reference is crucial for maintaining coherence and continuity across sentences in a
text.

« Reference resolution is a key task in many NLP applications such as machine translation,
summarization, and question answering.

¢ Challenges include handling ambiguities, ellipses, and referents that are implied but not
explicitly mentioned.
Anaphora Reference
Anaphora reference occurs when a pronoun or noun phrase refers back to a previously
mentioned entity or concept within the text.

The word or phrase that the pronoun refers to is called the antecedent.

Example: “Tom went to the store. He bought some milk." ("He" refers to "Tom").

Anaphora resolution is the task of determining which antecedent a pronoun or noun phrase
refers to.

This process can be straightforward when the antecedent is close to the referring expression but
can become challenging when antecedents are distant or ambiguous.

It requires contextual understanding and often involves tracking entities across sentences or
even paragraphs.

Anaphora resolution techniques include rule-based methods, statistical models, and neural
network approaches in modern NLP systems.
Hobbs Algorithm in NLP
The Hobbs algorithm resolves anaphora by selecting the closest antecedent that matches the
grammatical and semantic properties of the pronoun.

It uses syntactic rules to assign scores to potential antecedents based on their proximity and
compatibility with the pronoun.

The nearest noun phrase that satisfies these constraints is chosen as the antecedent.

Example:

e Sentence 1: “John is tired. He will go to bed soon."

e The algorithm would select “John” as the antecedent for "He" because it is the closest noun
phrase in the same discourse and syntactically matches.

The algorithm works well for resolving simple, unambiguous references but can face challenges
with more complex or distant anaphora.
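As a rough illustration only: the snippet below is a naive nearest-antecedent heuristic, not Hobbs's actual tree-walking search, and the tiny noun list and gender check are invented for the sketch.

```python
# Naive antecedent picker: choose the nearest preceding candidate noun that
# agrees with the pronoun.  Real Hobbs resolution walks parse trees instead.
CANDIDATE_NOUNS = {"john": "male", "mary": "female", "book": "neutral"}
PRONOUN_GENDER = {"he": "male", "she": "female", "it": "neutral"}

def resolve(tokens, pronoun_index):
    gender = PRONOUN_GENDER[tokens[pronoun_index].lower()]
    for i in range(pronoun_index - 1, -1, -1):      # scan backwards (nearest first)
        word = tokens[i].lower()
        if word in CANDIDATE_NOUNS and CANDIDATE_NOUNS[word] == gender:
            return tokens[i]
    return None

tokens = "John is tired . He will go to bed soon".split()
print(resolve(tokens, tokens.index("He")))   # John
```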
Advantages
* Simple to implement with syntactic parsers.
* Works well for many cases in English due to its focus on sentence structure.

Limitations
* Doesn't account for semantic meaning or context (e.g., world knowledge).
* Assumes correct syntactic parsing, which might fail in complex sentences.
| Parameter | Rule-Based (RBMT) | Statistical (SMT) | Neural (NMT) | Hybrid |
|---|---|---|---|---|
| 1. Approach | Uses linguistic rules and dictionaries. | Relies on probabilities from bilingual corpora. | Learns patterns using neural networks. | Combines features of RBMT, SMT, or NMT. |
| 2. Data Dependency | Minimal; relies on human-defined rules. | Needs large bilingual corpora. | Requires massive datasets for training. | Varies; depends on integrated methods. |
| 3. Quality | High grammatical accuracy, low flexibility. | Good with common phrases, errors in syntax. | Fluent, context-aware translations. | Balances accuracy and fluency. |
| 4. Adaptability | Hard to adapt to new languages/domains. | Moderate; depends on training data. | High; learns directly from data. | High; adaptable by combining strengths. |
| 5. Speed | Slow; rule processing takes time. | Faster once trained, but resource-heavy. | Fast but computationally intensive. | Moderate; depends on underlying models. |
| 6. Example | Systran | Google Translate (pre-2016) | Google Translate (post-2016) | Modern systems combining NMT + RBMT. |
| 7. Strengths | Grammatically precise in rule-covered areas. | Good for domain-specific phrases. | Context-aware, natural translations. | Combines precision, fluency, and context. |
6.1.5 The Advantages of Machine Translation

1. Low Cost: Professional translators charge per page, but we frequently only want a rough understanding of what is being stated. Machine translation is accurate enough and useful in this situation.
2. Speed: When you need to summarise a substantial quantity of information, MT allows you to handle lengthy documents quickly and can translate any text from any field.
3. A professional translator, in contrast, has a specific area of expertise; using a single MT tool, you may translate between many languages.
4. Privacy: You wouldn't hand your personal letters, together with your financial affairs, to an unknown translator; machine translation is therefore useful for things like private emails and financial records.

6.1.6 The Disadvantages of Machine Translation

(1) The level of accuracy can be very low.
(2) Accuracy is also very inconsistent across different languages.
(3) Machines can't translate context.
(4) Mistakes are sometimes costly.
Text Summarization

* Text summarization is the process of condensing a large piece of text into a shorter version
while preserving its essential meaning and key information.

* It aims to make large amounts of text more accessible and easier to understand by highlighting the main points.

* Extractive Summarization:

« Involves selecting and extracting key sentences or phrases directly from the source text to
form a summary.

*® The summary is composed of verbatim sentences or parts of sentences without altering the
original wording.

* Example:

e Original: “John went to the store to buy groceries. He picked up apples, oranges, and
bread. After that, he went home."

e Extractive Summary: “John went to the store to buy groceries. He picked up apples,
oranges, and bread."

*® Challenge: The summary may lack fluency and coherence because it directly extracts
sentences from the original text.
Abstractive Summarization:

e Involves generating a summary by rephrasing or paraphrasing the original text, creating


new sentences that capture the main ideas.

* This approach is more flexible and can result in more coherent and natural summaries,

« Example:

* Original: “John went to the store to buy groceries. He picked up apples, oranges, and
bread. After that, he went home.”

* Abstractive Summary: “John bought some groceries, including apples, oranges, and
bread, before returning home.”

e Challenge: Requires advanced natural language generation techniques to ensure the


summary is both concise and accurate.
« Hybrid Summarization:

* Combines both extractive and abstractive methods, where key sentences are first extracted
and then rephrased or condensed for coherence.

* It aims to leverage the strengths of both approaches.

* Example:

® Original: “John went to the store to buy groceries. He picked up apples, oranges, and
bread. After that, he went home."

® Hybrid Summary: "John bought groceries like apples, oranges, and bread before
heading home.”

e Key Challenges:

e Ensuring that the summary remains faithful to the original meaning while being concise.

* Handling complex or technical texts where important details must be preserved.

« Balancing informativeness with brevity to avoid omitting critical context.

e Applications:

* Text summarization is used in news aggregation, automatic report generation, academic


research, and customer support systems.
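A minimal sketch of the extractive idea: score sentences by word frequency and keep the top ones (the scoring scheme and the helper name are invented for illustration; real systems use far richer features and redundancy handling):

```python
# Tiny frequency-based extractive summarizer: keep the highest-scoring sentences.
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence by the total corpus frequency of its words.
    def score(s):
        return sum(freq[w] for w in re.findall(r"\w+", s.lower()))
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)   # keep original order

text = ("John went to the store to buy groceries. "
        "He picked up apples, oranges, and bread. "
        "After that, he went home.")
print(extractive_summary(text, n_sentences=1))
```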
6.3.2 Types of Sentiment Analysis

1. Fine-grained sentiment analysis: Sentiments are divided into additional categories, typically from highly positive to very negative, to provide a more exact level of polarity. This can be compared to rating opinions on a 5-star scale.
2. Emotion detection: Instead of identifying positivity and negativity, emotion detection recognises particular emotions. Examples could include shock, rage, grief, frustration, and happiness.
3. Intent-based sentiment analysis: Analysis performed with an eye toward intent sees more than just the author's opinion. For instance, a frustrated online comment about changing a battery can motivate customer care to get in touch to address that particular problem.
4. Aspect-based analysis: Aspect-based analysis identifies the precise component that is being mentioned either favourably or unfavourably. For instance, a client may write in a product review that the battery life is too short. The system will then report that the battery life is the main complaint, and not the product as a whole.
Question Answering System
e A question answering (QA) system is an NLP application designed to automatically answer
questions posed by users in natural language.

* It uses techniques like information retrieval, natural language processing, and machine learning to extract or generate answers from a text corpus, database, or knowledge base.

e Challenges in Question Answering:

e Ambiguity:

e Questions can often be ambiguous, and a QA system may struggle to determine the
exact meaning or intent.

e Example: “What is the bank?" could refer to a financial institution, a riverbank, or a

place to store items, making it difficult to determine the correct answer without
additional context.

e Complex Question Understanding:

e Questions that involve multiple pieces of information or require logical reasoning are
challenging for QA systems to process.

e Example: “Who was the president of the United States before the Civil War, and what
was his stance on slavery?" requires understanding both historical context and a multi-
step answer.
* Context Dependency:

A question may rely heavily on context, making it difficult for systems to answer
without understanding the surrounding information.

Example: “What is it made of?" requires knowing what "it" refers to, which may not be
clear in isolation.

« Handling Long and Complex Documents:

Extracting answers from long, unstructured documents or complex paragraphs can be


difficult, as the answer may be spread across multiple parts of the text.

Example: For a question like “What were the causes of World War |?" a QA system may
need to extract and synthesize information from various sources to form a
comprehensive answer.

e Factuality and Reliability:

Ensuring that the system provides factually accurate answers is a major challenge, especially when dealing with unverified or conflicting information from multiple sources.

Example: A QA system may retrieve outdated or incorrect information from web


sources, affecting the quality of the answer.
« Multilingual Support:

e Handling questions in different languages adds complexity, as the system must


account for linguistic differences and cultural contexts.

® Example: A system may struggle with idiomatic expressions or word sense


disambiguation when processing languages other than its training data.

e Answer Generation vs. Retrieval:

e While extractive QA systems simply retrieve information, abstractive systems need to


generate answers, which can introduce errors or inconsistencies.

e Example: Generating answers that are both accurate and fluent is a challenge,
especially when there are multiple ways to phrase the same information.

*® Noisy or Incomplete Input:

® Questions that are poorly phrased, incomplete, or grammatically incorrect can lead to
inaccurate answers or failed queries.

e Example: "Where is the Eiffel Tower" vs. “Eiffel Tower where is" — handling such
variations requires robust natural language understanding.
| Parameter | Information Retrieval (IR) | Information Extraction (IE) |
|---|---|---|
| Definition | Retrieves relevant documents or resources from a large collection based on a query. | Extracts specific, structured information (e.g., entities, relationships) from text. |
| Input | A user query and a collection of documents. | Unstructured or semi-structured text data. |
| Output | Ranked list of documents or resources. | Structured data (e.g., tables, graphs, or JSON with entities and attributes). |
| Goal | To find and rank documents that are most relevant to the user's query. | To identify and extract key information, like names, dates, and relationships. |
| Techniques Used | Search algorithms, TF-IDF, BM25, semantic matching, and deep learning. | Named Entity Recognition (NER), relation extraction, dependency parsing, etc. |
| Use Cases | Search engines, document retrieval, recommendation systems. | Knowledge graph creation, database population, sentiment analysis. |
| Evaluation Metrics | Precision, Recall, F1-score, Mean Reciprocal Rank (MRR), and MAP. | Precision, Recall, F1-score, extraction accuracy, and slot-filling performance. |
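A small sketch of the IR side using TF-IDF and cosine similarity (assumes scikit-learn is installed; the documents and query are toy examples made up for illustration):

```python
# Ranking toy documents against a query with TF-IDF + cosine similarity (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Natural language processing enables machine translation.",
    "Stock prices rose sharply after the announcement.",
    "Tokenization and stemming are common NLP preprocessing steps.",
]
query = ["NLP preprocessing like tokenization"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(docs, scores), key=lambda p: p[1], reverse=True):
    print(f"{score:.2f}  {doc}")   # highest-scoring (most relevant) documents first
```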
