NLP Unit 1
Q.1 What is Natural Language Processing (NLP) ?
Ans.:
• Natural Language Processing (NLP) is a cross-disciplinary field of linguistics, computer science and artificial intelligence. It is concerned with the interactions between digital computing devices and human language, or more precisely, natural language.
• The field of natural language processing deals with designing and programming digital computational devices (particularly computers) to process and analyse large amounts of natural language data.
• Natural languages take different forms, such as writing, speech or signing. They are distinguished from constructed and formal languages such as those used to program computers or to study logic. As a result, natural language data is highly unstructured in nature.
• For example, NLP makes it possible for computers to read text, hear speech, interpret it, measure sentiment and determine which parts are important.
• The spectrum of natural languages is very wide. As per linguistic science, there are thousands of spoken languages in the world. These languages can be grouped together as members of a language family. The three main language families in the world are :
o Indo-European (includes English)
o Sino-Tibetan (includes Chinese)
o Afro-Asiatic (includes Arabic)
• Defining a word is a complicated thing. In many languages, words are delimited in the orthography by whitespace and punctuation.
• There are languages whose word forms need not change much with the changing context. On the other hand, there are languages that are highly sensitive about the choice of word forms according to context.
• For some of the languages, the context does not impact the gender of the noun, while some languages do not have the concept of gender.
• Natural languages show structure (namely grammar) of different kinds and complexity. A sentence consists of more elementary components whose co-occurrence in context refines the purpose they have when used in isolation, and extends it further to meaningful relations between the other components in the sentence.
• As a result, understanding natural language in word blocks is not a viable approach. However, the first level of understanding of a word is very important.
Q.2 Write a note on word morphology in natural language.
OR Discuss why it is important to understand words in natural language.
Ans.:
• Words are the most indicative blocks of a natural sentence. However, they are tricky to define. This is primarily due to ambiguity and contextual meaning of words in sentences. Knowing how to work with words allows the development of syntactic and semantic understanding.
• The process of understanding words in any natural language involves morphology, word structure and its linguistic expression.
Q.4 Discuss tokenization of words in natural language.
Ans.:
• Specific to a particular language, the exact boundaries separating words from morphemes and phrases vary. Here is an example with nouns as valid words in English, used to understand the above concept. Refer Fig. Q.4.1.
• In the above sentence, for reasons of generality, linguists prefer to analyse don't as two syntactic words (do not), or tokens, each of which has its independent role and can be reverted to its normalized form. On the other hand, all other words in this sentence are treated as an independent single token.
• In English, such tokenization and normalization may be applied to a limited set of cases. However, in other languages, these phenomena have to be treated in a less trivial manner.
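• The following is a minimal Python sketch of such tokenization and normalization, assuming a small hand-made contraction table; the table and the tokenize function are illustrative, not part of any standard library.

    import re

    # Hypothetical contraction table mapping surface forms to syntactic words.
    CONTRACTIONS = {
        "don't": ["do", "not"],
        "can't": ["can", "not"],
        "isn't": ["is", "not"],
    }

    def tokenize(sentence):
        """Split a sentence into tokens, expanding known contractions."""
        tokens = []
        for word in re.findall(r"[A-Za-z]+'?[A-Za-z]*", sentence):
            # Normalize case before consulting the contraction table.
            tokens.extend(CONTRACTIONS.get(word.lower(), [word]))
        return tokens

    print(tokenize("We don't have a dog"))  # ['We', 'do', 'not', 'have', 'a', 'dog']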
Q.5 What are clitics ?
OR Discuss clitics in morphology.
Ans.:
• A clitic is a unit that has the syntactic characteristics of a word but is phonologically bound to another word (for example, the English possessive 's).
• Clitics are found in various languages like Latin, Ancient Greek, Chinese, Japanese, etc.
O Allomorphs
• The alternative forms of a morpheme are termed allomorphs.
• Allomorphs are variants of a morpheme that differ in pronunciation but are semantically identical. For example, the English plural marker -(e)s of regular nouns can be pronounced /-s/ (bats), /-z/ (bags) or /-iz/ (bushes), depending on the final sound of the noun's stem.
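• As an illustration, the choice of plural allomorph can be sketched in Python from the final sound of the stem; the phoneme classes below are simplified assumptions, not a full phonology.

    SIBILANTS = {"s", "z", "sh", "ch", "j"}   # simplified sibilant class
    VOICELESS = {"p", "t", "k", "f", "th"}    # simplified voiceless class

    def plural_allomorph(final_sound):
        """Pick the plural allomorph for a stem's final sound (simplified)."""
        if final_sound in SIBILANTS:
            return "/-iz/"   # bushes
        if final_sound in VOICELESS:
            return "/-s/"    # bats
        return "/-z/"        # bags

    for stem, sound in [("bat", "t"), ("bag", "g"), ("bush", "sh")]:
        print(stem, "->", plural_allomorph(sound))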
Q.9 Write a short note on the following terminologies :
Ans.:
O (a) Typology
• Typology (or morphological typology) is a way of classifying the languages in the world. It groups languages according to their common morphological structures. Typology organizes languages on the basis of how those languages form words by combining morphemes.
• The typology that is based on quantitative relations between words, their morphemes, and their features is as follows.
O (b) Isolating, or analytic typology
• These languages include no or relatively few words that have more than one morpheme. Examples are Chinese, Vietnamese, and Thai.
• Analytic languages show a low ratio of morphemes to words, nearly one-to-one.
• Some analytic tendencies are also found in languages like English and Afrikaans.
O (c) Synthetic typology
• Synthetic languages combine more morphemes in one word and are further divided into agglutinative and fusional languages.
• Word order is less important for these languages than it is for analytic languages, since individual words express the grammatical relations that would otherwise be indicated by syntax.
• In addition, there tends to be a high degree of agreement, or cross-reference, between different parts of the sentence.
• Therefore, morphology in synthetic languages is more important than syntax.
• Most Indo-European languages are moderately synthetic.
• For example, in English, plurals are usually formed by adding the suffix -s; certain words use nonconcatenative processes for their plural forms, as
foot ➔ feet
• Many irregular verbs form their past tenses, past participles or both in the same manner :
freeze ➔ froze ➔ frozen
• This specific form of nonconcatenative morphology is known as base modification or ablaut, a form in which part of the root undergoes a phonological change without necessarily adding new phonological material.
• For example, the English stem sing results in four distinct words as
sing ➔ sang ➔ sung ➔ song
Q.10 Why is morphological parsing required ?
Ans.:
• Morphological parsing helps to eliminate or reduce the inconsistency of word forms. It is required to provide higher-level linguistic units whose lexical and morphological properties are explicit and well defined.
• Every natural language inherently has some irregularity and ambiguity. Morphological parsing attempts to remove unnecessary irregularity and control ambiguity.
• Irregularities
o In this context, irregularity means the existence of such forms and structures that are not described appropriately by a prototypical linguistic model.
o Some irregularities can be understood by redesigning the model and improving its rules, but other lexically dependent irregularities often cannot be generalized.
• Ambiguity
o Ambiguity is an indeterminacy in the interpretation of expressions of language.
o Ambiguity can be accidental, can arise from lexemes with multiple senses, or can be systematic (syncretism).
• Morphological modelling also faces the problem of productivity and creativity in language. This gives birth to unconventional but perfectly meaningful new words or new senses in the language.
• Because these newly coined words are not present with their lexical and morphological properties, such words will go completely unparsed by the morphological system. This unknown word problem is particularly severe when the modelling is unable to parse a word that comes from an expected domain of the linguistic data, mostly when special terms or foreign words are involved in the discourse or when dialects are mixed together.
Q.11 Explain morphological irregularities in NLP.
OR Discuss how the morphological irregularities are removed.
Ans.:
• The design principles of the morphological model are very important to control the irregularities in words.
• Morphological parsing is designed for generalization and abstraction of words to make the model simple and yet powerful.
• However, the immediate descriptions given for a word may not be the final ones, due to
o Inadequate accuracy of the description
o Inappropriate complexity of the morphological model
o Need of improved formulations
• Removal of morphological irregularities
o A deeper study of the morphological processes is essential for mastering the whole morphological and phonological system.
o Morphophonemic templates capture morphological processes. It is done by organizing stem patterns and generic affixes.
o These templates are designed without any context-dependent variation of the affixes or ad hoc modification of the stems.
o Very terse merge rules ensure that morphophonemic templates can be converted into exactly the surface forms, namely orthographic and phonological.
o Applying the merge rules is independent of and irrespective of any grammatical parameters or information other than that contained in a template.
o Thus, most morphological irregularities in the morphophonemic templates are successfully removed.
Q.12 Discuss morphological irregularities in any two natural languages.
Ans.:
• Morphological irregularities in Arabic
o Morphophonemic templates can be used for discovering the regularity of Arabic morphology, where uniform structural operations apply to different kinds of stems.
o Some irregularities are bound to particular lexemes or contexts, and cannot be accounted for by general rules.
• Morphological irregularities in Korean
o The language is abundant with morphological alternations that are formalized by precise phonological rules.
Q.13 What is morphological ambiguity ? Discuss at least two examples.
Ans.:
• Morphological ambiguity is the possibility that word forms be understood in multiple ways out of the context.
• Word forms that look the same but have distinct functions or meanings are called homonyms.
• Ambiguity is present in all aspects of morphological processing and language processing at large.
• Morphological parsing cannot complete the disambiguation of words in their context, but it can restrict the valid interpretations of a given word form.
• Morphological ambiguity in Korean
o In Korean, homonyms are one of the most problematic objects in morphological analysis. This is because they prevail all around frequent lexical items.
• Morphological ambiguity in Arabic
o Arabic has rich derivational and inflectional morphology. Because Arabic script usually does not encode short vowels and omits other diacritical marks, its morphological ambiguity is considerably increased. In addition, Arabic orthography collapses certain word forms together.
o The problem of morphological disambiguation of Arabic encompasses
⇒ The resolution of the structural components of words
⇒ Actual morphosyntactic properties
⇒ Tokenization and normalization
⇒ Lemmatization, stemming
⇒ Diacritization
• Morphological ambiguity in Sanskrit
o When inflected syntactic words are combined in an utterance, additional phonological and orthographic changes can take place.
o In Sanskrit, one such euphony rule is known as external sandhi. Inverting sandhi during tokenization is usually nondeterministic, as it can provide multiple solutions.
• In any language, tokenization decisions may impose constraints on the morphosyntactic properties of the forms being reconstructed.
• The morphological phenomenon that some words or word classes show instances of systematic homonymy is called syncretism. In particular, homonymy can occur due to neutralization and uninflectedness of word forms.
Q.14 What is morphological productivity ?
OR Discuss the competence versus performance duality by Noam Chomsky in the context of morphological productivity.
Ans.:
• In a natural language as a system (langue), structural devices like recursion, iteration, or compounding allow to produce an infinite set of concrete linguistic utterances.
• This general potential holds for morphological processes as well and is called morphological productivity.
• In another perspective, natural language can be seen as a collection of utterances (parole) pronounced or written (performance). Hence, for linguistic corpora, a parole and performance data set is practical.
• Such corpora are a finite collection of linguistic data that are studied with empirical methods. It can be used for comparison when linguistic models are developed.
Q.15 Discuss the "80/20 rule" of linguistic word corpus.
OR Write a note on the "80/20 rule" of linguistic word corpus.
Ans.:
• Linguistic corpora are a finite collection of linguistic data that are studied with empirical methods.
• The set of word forms found in the corpus of a language is referred to as its vocabulary.
• The members of this set are word types, whereas every individual instance of a word form is a word token.
• The distribution of these words or other elements of language follows the "80/20 rule," also known as the law of the vital few.
• It says that most of the word tokens in a given corpus can be identified with just a couple of word types in its vocabulary, and words from the rest of the vocabulary occur much less frequently or rarely in the corpus.
• New, unexpected words will always appear in the linguistic data as it is expanded or enlarged.
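• The rule is easy to observe on any corpus : count word-type frequencies and measure how much of the token mass the most frequent types cover. A minimal sketch, assuming corpus.txt is a stand-in for any plain-text corpus :

    from collections import Counter

    tokens = open("corpus.txt").read().lower().split()
    freq = Counter(tokens)

    vocab = len(freq)                              # number of word types
    top = freq.most_common(max(1, vocab // 5))     # top 20 % of the types
    covered = sum(count for _, count in top)

    print(f"{vocab} types, {len(tokens)} tokens")
    print(f"top 20 % of types cover {100 * covered / len(tokens):.1f} % of tokens")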
Q.16 Discuss how creativity and the issue of unknown words meet to enhance the morphological productivity in a natural language.
OR Discuss how the newly coined word google has enhanced the morphological productivity of many natural languages.
Q.17 What is a Domain Specific Language (DSL) ?
Ans.:
• A Domain Specific Language (DSL) is a specialized programming language that is used for a single purpose.
• Various domain-specific languages have been created for achieving intuitive and minimal programming effort.
• Pragmatically, a DSL may be specialized to a particular problem domain, a particular problem representation technique, a particular solution technique, or other aspects of a domain.
• These special-purpose languages usually introduce idiosyncratic notations of programs and are interpreted using some restricted model of computation.
• The motivation for this approach lies in the fact that, historically, computational resources were too limited compared to the requirements and complexity of the tasks being solved.
• Other motivations are theoretical, given the challenge of finding a simple, accurate and yet generalizing model for practical use in the specific domain.
• The design objective of a DSL is to be a pure, intuitive, adequate, complete, reusable and elegant language.
Q.18 What is dictionary lookup ?
OR Discuss dictionary as a morphological model.
Ans.:
• Lookup operations with dictionaries are relatively simple and usually quick. Dictionaries can be implemented, for instance, as lists, binary search trees, tries, hash tables, etc.
• Hence, dictionary lookup is considered as one of the effective morphological models.
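• A minimal sketch of such a dictionary model in Python, here as a plain hash table; the entries are illustrative only.

    # Hypothetical association list : word form -> morphological description.
    LEXICON = {
        "feet":   {"lemma": "foot", "pos": "NOUN", "number": "plural"},
        "sang":   {"lemma": "sing", "pos": "VERB", "tense": "past"},
        "bushes": {"lemma": "bush", "pos": "NOUN", "number": "plural"},
    }

    def analyze(word_form):
        """Dictionary lookup : O(1) on average with a hash table."""
        return LEXICON.get(word_form.lower())   # None for unknown words

    print(analyze("feet"))     # {'lemma': 'foot', 'pos': 'NOUN', 'number': 'plural'}
    print(analyze("googled"))  # None - the unknown-word problem of enumeration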
Q.19 What are the drawbacks of the enumerative morphological model ?
Ans.:
• An enumerative list is a set of associations between word forms and their desired descriptions.
• It is declared by plain enumeration. Hence the coverage of the model is finite and the generative potential of the language is not exploited.
• Development, lookup and verification of the association list is tedious, liable to errors, inefficient and inaccurate unless the data are retrieved automatically from large and reliable linguistic resources.
• Despite all that, an enumerative model is often sufficient for the given purpose, deals easily with exceptions, and can implement even complex morphology.
Q.20 Write a short note on finite-state morphology.
Ans.:
• Finite-state morphological models are the morphological models in which the specifications written by human programmers are directly compiled into finite-state transducers.
• The finite-state morphological models can be used for multiple natural languages.
• The two popular online tools supporting this approach are XFST (Xerox Finite-State Tool) and LexTools.
Q.21 Discuss finite state transducers.
OR Discuss how the finite state transducers can translate the infinite regular language.
Ans.:
• Finite-state transducers are computational devices extending the power of finite-state automata.
• They consist of a finite set of nodes connected by directed edges labeled with pairs of input and output symbols.
• In such a network or graph, nodes are also called states, while edges are called arcs.
• Traversing the network from the set of initial states to the set of final states along the arcs is equivalent to reading the sequences of encountered input symbols and writing the sequences of corresponding output symbols.
• The set of possible sequences accepted by the transducer defines the input language; the set of possible sequences emitted by the transducer defines the output language.
• For example, a finite-state transducer could translate the infinite regular language consisting of the Sanskrit words pita, prapita, praprapita, ... to the matching words in the infinite regular English language defined as father, grand-father, great-grand-father, ...
• In finite-state transducers it is possible to invert the domain and the range of a relation, that is, exchange the input and the output.
• In finite-state computational morphology, it is common to refer to the input word forms as surface strings and to the output descriptions as lexical strings.
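• The pita/father example can be sketched as a toy transducer whose arcs map (state, input symbol) pairs to (next state, output string) pairs; the transliteration and state layout below are simplifying assumptions, not a full implementation.

    ARCS = {
        (0, "pita"): (2, "father"),         # pita -> father
        (0, "pra"):  (1, ""),               # first pra emits nothing yet
        (1, "pra"):  (1, "great-"),         # each further pra adds "great-"
        (1, "pita"): (2, "grand-father"),   # final pita closes the word
    }
    FINAL_STATES = {2}

    def transduce(symbols):
        """Run the transducer; return the output string, or None if rejected."""
        state, output = 0, []
        for sym in symbols:
            if (state, sym) not in ARCS:
                return None
            state, emitted = ARCS[(state, sym)]
            output.append(emitted)
        return "".join(output) if state in FINAL_STATES else None

    print(transduce(["pita"]))                       # father
    print(transduce(["pra", "pita"]))                # grand-father
    print(transduce(["pra", "pra", "pra", "pita"]))  # great-great-grand-father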
Q.22 Write a note on structure in human language.
Ans.:
• In human language, words and sentences do not appear randomly but usually have a structure.
• For example, combinations of words form sentences - meaningful grammatical units, such as statements, requests, and commands.
• Likewise, in written text, sentences form paragraphs, self-contained units of discourse about a particular point or idea.
Q.23 Why is document structure important in natural language processing ?
Ans.:
• In human language or natural language, words and sentences usually have a structure. This can be seen as combinations of words forming sentences - meaningful grammatical units, such as statements, requests, and commands.
• Similarly, in written text, paragraphs are the self-contained units about a point or an idea, which is expressed in the form of a group of sentences. Following are some of the reasons why document structure is important in human languages and therefore for natural language processing.
• When the structure of documents is extracted, it makes the further processing of text easy in NLP. The NLP tasks that depend on the document structure are parsing, machine translation and semantic role labelling in sentences.
• To improve the reliability of Automatic Speech Recognition (ASR) and human readability, it is important to identify the sentence boundary annotation. Document structure helps in this process.
• Document structure helps in breaking apart the input text or speech into topically coherent blocks that provide better organization and indexing of the data.
• Thus, in most speech and language processing applications, extracting the structure of textual and audio documents is a meaningful and necessary pre-step.
Q.24 Write a note on sentence boundary detection.
OR What is sentence boundary detection ?
Ans.:
• Sentence boundary detection is the problem in natural language processing of deciding where sentences begin and end.
• Sentence detection is an important task, which should be performed at the beginning of a text processing pipeline.
• Sentence boundary detection (also called sentence segmentation) deals with automatically segmenting a sequence of word tokens into sentence units.
• Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks.
• In written text in English and some other languages, the beginning of a sentence is usually marked with an uppercase letter, and the end of a sentence is explicitly marked with a period (.), a question mark (?), an exclamation mark or another type of punctuation.
• However, in addition to their role as sentence boundary markers, capitalized initial letters are used to distinguish proper nouns, periods are used in abbreviations and numbers, and other punctuation marks are used inside proper names.
• A character-wise analysis of text allows for a distinction between period characters that are enclosed between two alphanumeric characters, and period characters that are followed by at least one non-alphabetic character, such as a further punctuation sign, a space, tab or new line.
• There are various challenges associated with SBD, for written as well as spoken text and code switching.
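• A minimal rule-based splitter along these lines is sketched below; the abbreviation list is a tiny illustrative sample, and a real system would need a far larger one.

    import re

    ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e."}

    def split_sentences(text):
        """Split on ., ?, ! followed by whitespace and an uppercase letter,
        unless the token carrying the period is a known abbreviation."""
        sentences, start = [], 0
        for match in re.finditer(r"[.?!](?=\s+[A-Z])", text):
            end = match.end()
            last_token = text[start:end].split()[-1].lower()
            if last_token in ABBREVIATIONS:
                continue   # the period belongs to an abbreviation, not a boundary
            sentences.append(text[start:end].strip())
            start = end
        sentences.append(text[start:].strip())
        return sentences

    print(split_sentences("Dr. Smith arrived. He met Prof. Jones at noon."))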
Q.25 Discuss the challenges of sentence boundary detection in written text.
Ans.:
• Ambiguous abbreviations and capitalizations are the most common problems of sentence segmentation in written text.
• Quoted sentences are more complex and problematic. The primary reason for this is that the speaker may have uttered multiple sentences, and sentence boundaries inside the quotes are also marked with punctuation marks.
• As a result of this, an automatic method of sentence boundary detection may result in cutting some sentences incorrectly. In case the preceding sentence is spoken instead of written, prosodic cues usually mark its structure.
• "Spontaneously" written texts, such as Short Message Service (SMS) texts or Instant Messaging (IM) texts, tend to be nongrammatical and have poorly used or missing punctuation, which makes sentence segmentation even more challenging.
• Automatic systems, such as Optical Character Recognition (OCR) or ASR, aim to translate images of handwritten, typewritten, or printed text, or spoken utterances, into machine-editable text.
• When the sentences come from such automatic systems, the finding of sentence boundaries must deal with the errors of these systems as well.
• For example, an OCR system easily confuses periods and commas and can result in meaningless sentences. ASR transcripts typically lack punctuation marks and are usually mono-case.
Q.26 Discuss code switching as a problem in sentence boundary detection.
Ans.:
• Code switching - that is, the use of words, phrases, or sentences from multiple languages by multilingual speakers - is another problem that can affect the characteristics of sentences. For example, when switching to a different language, the writer can either keep the punctuation rules from the first language or resort to the code of the second language (e.g., Spanish uses the inverted question mark to precede questions).
• Code switching also affects technical texts for which the meanings of punctuation signs can be redefined, as in Uniform Resource Locators (URLs), programming languages, and mathematics. We must detect and parse those specific constructs in order to process technical texts adequately.
• Conventional rule-based sentence segmentation systems in well-formed texts rely on patterns to identify potential ends of sentences and lists of abbreviations for disambiguating them.
• Although rules cover most of these cases, they do not address unknown abbreviations, abbreviations at the ends of sentences, or typos in the input text.
• Furthermore, such rules are not robust to text that is not well formed, such as forums, chats, and blogs, or to spoken input that completely lacks typographic cues. Moreover, each language requires a specific set of rules.
• Hence code switching is considered as a problem in sentence boundary detection.
Q.28 How is sentence segmentation as a classification problem more effective than a rule-based approach ?
Ans.:
• Conventional rule-based sentence segmentation systems in well-formed texts rely on patterns to identify potential ends of sentences and lists of abbreviations for disambiguating them.
• Sentence segmentation in text usually uses the punctuation marks as delimiters and aims to categorize them as sentence ending/beginning or not. On the other hand, for speech input, all word boundaries are usually considered as candidate sentence boundaries.
• Although rules cover most of these cases, they do not address unknown abbreviations, abbreviations at the ends of sentences, or typos in the input text.
• Furthermore, such rules are not robust to text that is not well formed, such as forums, chats, and blogs, or to spoken input that completely lacks typographic cues. Moreover, each language requires a specific set of rules.
• To improve on such a rule-based approach, sentence segmentation is stated as a classification problem. Given training data where all sentence boundaries are marked, we can train a classifier to recognize them.
• Similarly, in spoken language, a three-way classification can be made between non-boundaries, statement boundaries and question boundaries.
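• A sketch of this classification formulation with scikit-learn (assuming it is available); the features and the toy training examples are illustrative stand-ins for real annotated data.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def features(tokens, i):
        """Local features around the candidate boundary after tokens[i]."""
        return {
            "left": tokens[i].lower(),
            "right": tokens[i + 1].lower() if i + 1 < len(tokens) else "<end>",
            "left_ends_period": tokens[i].endswith("."),
            "right_capitalized": i + 1 < len(tokens) and tokens[i + 1][0].isupper(),
        }

    # Toy training data : (tokens, candidate index, is_boundary).
    train = [
        (["He", "left.", "She", "stayed."], 1, True),
        (["He", "left.", "She", "stayed."], 0, False),
        (["Dr.", "Smith", "arrived."], 0, False),
    ]

    vec = DictVectorizer()
    X = vec.fit_transform([features(t, i) for t, i, _ in train])
    y = [label for _, _, label in train]

    clf = LogisticRegression().fit(X, y)
    test = vec.transform([features(["It", "ended.", "Then", "rain."], 1)])
    print(clf.predict(test))   # predicted label for the candidate boundary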
Q.31 Discuss the method of classification in sentence or topic segmentation.
OR Explain the classification method used in sentence or topic segmentation.
Ans.:
• For sentence or topic segmentation, the problem is defined as finding the most probable sentence or topic boundaries.
• The natural unit of sentence segmentation is words, and of topic segmentation is sentences, with the assumption that topics typically do not change in the middle of a sentence.
• The words or sentences are then grouped into contiguous stretches belonging to one sentence or topic - that is, the word or sentence boundaries are classified into sentence or topic boundaries and non-boundaries.
• The classification can be done at each potential boundary i (local modelling); then, the aim is to estimate the most probable boundary type, ŷ_i, for each candidate example, x_i :

ŷ_i = argmax_{y_i ∈ Y} P(y_i | x_i)

• Here, the ^ is used to denote estimated categories, and a variable without a ^ is used to show possible categories.
• In local modelling, features can be extracted from the surrounding context of the candidate boundary to model such dependencies. It is also possible to see the candidate boundaries as a sequence and search for the sequence of boundary types, Ŷ = ŷ_1, ..., ŷ_n, that has the maximum probability given the candidate examples, X = x_1, ..., x_n :

Ŷ = argmax_Y P(Y | X)
Q.32 Explain generative and discriminative sequence classification models.
OR Compare between generative and discriminative categorization methods.
Ans.:
O Generative sequence model
a) It estimates the joint distribution of the observations, P(X, Y) (e.g., words, punctuation) and the labels (sentence boundary, topic boundary).
b) It requires specific assumptions (such as backoff to account for unseen events) and has good generalization properties.
O Discriminative sequence model
a) It focuses on features that characterize the differences between the labelings of the examples.
b) Such methods (as described in the following sections) can be used for sentence and topic segmentation in both written and spoken language, with one difference.
Q.33 Explain how generative sequence classification with HMMs can be extended.
Ans.:
• The bigram case can be extended to higher-order n-grams at the cost of an increased complexity.
• For topic segmentation, typically, instead of using two states, n states are used, where n is the number of possible topics.
• However, it is not possible in an HMM to use any information beyond words, such as POS tags of the words or prosodic cues, for speech segmentation.
• Two simple extensions have been proposed : Shriberg et al. suggested using explicit states to emit the boundary tokens, hence incorporating lexical information via combination with other models.
• For topic segmentation, Tur et al. used the same idea and modeled topic-start and topic-final sections, which helped greatly for broadcast news topic segmentation.
• The second extension is inspired from factored language models, which capture not only words but also morphological, syntactic, and other information. Guz et al. proposed using factored hidden event language models (fHELM) for sentence segmentation, using POS tags in addition to words.
Q.34 Explain discriminative local classification methods.
Ans.:
• A number of discriminative classification approaches, such as support vector machines, boosting, maximum entropy, and regression, are based on very different machine learning algorithms.
• While discriminative approaches have been shown to outperform generative methods in many speech and language processing tasks, training typically requires iterative optimization.
• In discriminative local classification, each boundary is processed separately with local and contextual features.
• No global (i.e., sentence or document wide) optimization is performed, unlike in sequence classification.
• For sentence segmentation, supervised learning methods have primarily been applied to newspaper articles.
• Many classifiers have been tried for the task : regression trees, neural networks, a C4.5 classification tree, maximum entropy classifiers, support vector machines (SVMs), and naive Bayes classifiers.
• Mikheev treated the sentence segmentation problem as a subtask of POS tagging by assigning a tag to punctuation similar to other tokens. For tagging he employed a combination of HMM and maximum entropy approaches.
Q.35 Write a note on TextTiling method for topic segmentation.
OR Discuss how TextTiling method is used for topic segmentation.
OR Explain block comparison and vocabulary introduction methods for topic segmentation.
Ans.:
• The popular TextTiling method of Hearst for topic segmentation uses a lexical cohesion metric in a word vector space as an indicator of topic similarity.
• TextTiling can be seen as a local classification method with a single feature of similarity.
• Fig. Q.35.1 depicts a typical graph of similarity with respect to consecutive segmentation units. The document is chopped when the similarity is below some threshold.
[Fig. Q.35.1 : Similarity with respect to consecutive segmentation units]
• Originally, two methods for computing the similarity scores were proposed for TextTiling :
• Block comparison
a. It compares adjacent blocks of text to see how similar they are according to how many words the adjacent blocks have in common.
b. The block size can be variable, not necessarily looking only at the consecutive blocks but instead at a window.
c. Given two blocks, b1 and b2, each having k tokens (sentences or paragraphs), the similarity (or topical cohesion) score is computed by the formula given below, where w_{t,b} is the weight assigned to term t in block b. The weights can be binary or may be computed using other information retrieval-based metrics such as term frequency. A Python sketch of this score is given after this list.

sim(b1, b2) = Σ_t (w_{t,b1} · w_{t,b2}) / √(Σ_t w_{t,b1}² · Σ_t w_{t,b2}²)

• Vocabulary introduction
a. The vocabulary introduction method assigns a score to a token-sequence gap on the basis of how many new words are seen in the interval in which it is the midpoint.
b. Similar to the block comparison formulation, given two consecutive blocks, b1 and b2, of an equal number of words, w, the topical cohesion score is computed with the following formula, where NumNewTerms(b) returns the number of terms in block b seen for the first time in the text :

sim(b1, b2) = (NumNewTerms(b1) + NumNewTerms(b2)) / (2w)

c. This method is extended to exploit latent semantic analysis. Instead of simply looking at all words, researchers worked on the transformed lexical space, which has led to improved results because this approach also captures semantic similarities implicitly.
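• A sketch of the block-comparison score in Python, using raw term-frequency weights (one of the simple weighting choices mentioned above) :

    import math
    from collections import Counter

    def block_similarity(block1, block2):
        """Cosine similarity between two blocks with term-frequency weights."""
        w1, w2 = Counter(block1), Counter(block2)
        dot = sum(w1[t] * w2[t] for t in w1)
        norm = math.sqrt(sum(c * c for c in w1.values()) *
                         sum(c * c for c in w2.values()))
        return dot / norm if norm else 0.0

    left = "the cat sat on the mat".split()
    right = "a cat and a dog sat together".split()
    print(round(block_similarity(left, right), 3))
    # TextTiling places a topic boundary where this score dips below a threshold.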
Q.36 Explain discriminative sequence classification methods.
Ans.:
• In segmentation tasks, the sentence or topic decision for a given example (word, sentence, paragraph) highly depends on the decision for the examples in its vicinity.
• Discriminative sequence classification methods are in general extensions of local discriminative models with additional decoding stages that find the best assignment of labels by looking at neighbouring decisions to label an example.
• Conditional Random Fields (CRFs) are an extension of maximum entropy, SVM-struct is an extension of SVM to handle structured outputs, and maximum margin Markov networks (M3N) are extensions of HMMs.
• The Margin Infused Relaxed Algorithm (MIRA) is an online learning approach that requires loading of one sequence at a time during training.
• CRFs have been successful for many sequence labelling tasks, including sentence segmentation in speech.
• CRFs are a class of log-linear models for labelling structures. CRFs are trained by finding the Λ parameters, normally with a regularization term to avoid overfitting.
Q.37 Explain the hybrid classification approach used for segmentation.
Ans.:
• Nonsequential discriminative classification algorithms typically ignore the context, which is critical for the segmentation task.
• While we may add context as a feature or simply use CRFs, which inherently consider context, these approaches are suboptimal when dealing with real-valued features, such as pause duration or pitch range. Most earlier studies simply tackled this problem by binning the feature space either manually or automatically.
• An alternative is to use a hybrid classification approach, as suggested by Shriberg et al.
• The main idea is to use the posterior probabilities, P_C(y_i | x_i), for each boundary candidate, obtained from the other classifiers, such as boosting or CRF, by simply converting them to state observation likelihoods. This is done by dividing by their priors, following the well-known Bayes rule as follows :

argmax_{y_i} P_C(y_i | x_i) / P(y_i) = argmax_{y_i} P(x_i | y_i)

• Applying the Viterbi algorithm to the HMM then returns the most likely segmentation. To handle dynamic ranges of state transition probabilities and observation likelihoods, a weighting scheme as usually described in the literature can be applied.
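• A minimal sketch of this conversion, assuming posterior probabilities from some classifier and class priors estimated from training data :

    # Hypothetical posteriors P_C(y | x_i) for one boundary candidate and
    # class priors P(y) estimated from training data.
    posteriors = {"non-boundary": 0.70, "statement": 0.25, "question": 0.05}
    priors     = {"non-boundary": 0.85, "statement": 0.12, "question": 0.03}

    # Scaled likelihoods P(x_i | y), usable as HMM observation scores
    # before running the Viterbi algorithm.
    likelihoods = {y: posteriors[y] / priors[y] for y in posteriors}
    print(likelihoods)   # rare classes get boosted relative to their posteriors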
• Zimmerman et al. compared various discriminative local classification methods, namely boosting, maximum entropy, and decision trees, along with their hybrid versions for sentence segmentation of multilingual speech. They concluded that hybrid approaches are always superior.
Q.38 What are the extensions for global modeling for sentence segmentation ?
OR How is global modeling for sentence segmentation carried out using extensions ?
Ans.:
• Most approaches to sentence segmentation have focused on recognizing boundaries rather than sentences in themselves.
• This has occurred because of the quadratic number of sentence hypotheses that must be assessed in comparison to the number of boundaries.
• To tackle this problem, the input is segmented according to likely sentence boundaries established by a local model. Later, a re-ranker is trained on the n-best lists.
• This approach allows leveraging of sentence-level features such as scores from a syntactic parser or global prosodic features.
• Favre et al. proposed to extend this concept to a pruned sentence lattice, which allows combining local scores with sentence-level scores in a more efficient manner.
1.6 Complexity of Approaches
Q.39 Discuss how the complexity of sentence/topic segmentation approaches is evaluated.
Ans.:
• Sentence/topic segmentation approaches can be rated in terms of complexity (time and memory) of their training and prediction algorithms and in terms of their performance on real-world datasets. Some may also require specific pre-processing, such as converting or normalizing continuous features to discrete features.
• Discriminative approaches
a) In terms of complexity, training of discriminative approaches is more complex than training of generative ones because they require multiple passes over the training data to adjust for their feature weights.
• Generative models
b) Generative models such as HELMs can handle multiple orders of magnitude larger training sets and benefit, for instance, from decades of newswire transcripts. But they do not cope well with unseen events.
• Discriminative classifiers
c) They allow for a wider variety of features and perform better on smaller training sets.
d) Predicting with discriminative classifiers is also slower, even though the models are relatively simple (linear or log-linear), because it is dominated by the cost of extracting more features.
• Sequence approaches
e) Compared to local approaches, sequence approaches bring the additional complexity of decoding : finding the best sequence of decisions requires evaluating all possible sequences of decisions.
f) Fortunately, conditional independence assumptions allow the use of dynamic programming to trade time for memory and decode in polynomial time.
g) This complexity is then exponential in the order of the model (number of boundary candidates processed together) and the number of classes (number of boundary states).
• Discriminative sequence classifiers
h) For example, CRFs also need to repeatedly perform inference on the training data, which might become expensive.
1.7 Performance of the Approaches
Q.40 Discuss the performance of sentence segmentation approaches in detail.
OR Write a short note on :
a) Sentence segmentation in speech
b) Sentence segmentation in text
c) Sentence segmentation in speech
Ans.:
O a) Sentence segmentation in speech
• Sentence segmentation in speech is usually evaluated using :
1) Error rate (the ratio of the number of errors to the number of examples)
2) F-measure (the harmonic mean of recall and precision)
where
1) Recall is defined as the ratio of the number of correctly returned sentence boundaries to the number of sentence boundaries in the reference annotations.
2) Precision is the ratio of the number of correctly returned sentence boundaries to the number of all automatically estimated sentence boundaries.
3) The National Institute of Standards and Technology (NIST) error rate is the number of candidates wrongly labeled divided by the number of actual boundaries.
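• These metrics can be computed directly from the sets of reference and automatically estimated boundary positions, as in this sketch :

    def boundary_scores(reference, hypothesis):
        """Precision, recall, F-measure and NIST error rate for boundaries."""
        ref, hyp = set(reference), set(hypothesis)
        correct = len(ref & hyp)
        precision = correct / len(hyp) if hyp else 0.0
        recall = correct / len(ref) if ref else 0.0
        f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
        errors = len(ref ^ hyp)          # misses plus false alarms
        nist = errors / len(ref) if ref else 0.0
        return precision, recall, f1, nist

    # Toy example : boundaries given as token indices.
    print(boundary_scores({5, 12, 20}, {5, 12, 18}))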
O b) Sentence segmentation in text
• For sentence segmentation in text, researchers have reported error rate results on a subset of the Wall Street Journal Corpus of about 27,000 sentences.
• For instance, Mikheev reports that his rule-based system performs at an error rate of 1.41 %.
• The addition of an abbreviation list to this system lowers its error rate to 0.45 %, and combining it with a supervised classifier using POS tag features leads to an error rate of 0.31 %.
• Without requiring handcrafted rules or an abbreviation list, Gillick's SVM-based system obtains even fewer errors, at 0.25 %.
• Even though the error rates presented seem low, sentence segmentation is one of the first processing steps for any NLP task, and each error impacts subsequent steps, especially if the resulting sentences are presented to the user, as for example in extractive summarization.
O c) Sentence segmentation in speech
• For sentence segmentation in speech, Doss et al. report on the Mandarin TDT4 Multilingual Broadcast News Speech Corpus an F1-measure, using the same set of features, of :
o 69.1 % for a MaxEnt classifier
o 72.6 % with AdaBoost
o 72.7 % with SVMs
• A combination of the three classifiers using logistic regression is also proposed.