NLP Unit 1

Structure of Words and Documents

Part I : Finding the Structure of Words


Q.1 Write a short note on natural language processing.
OR Discuss what is NLP.

Ans. :

• Natural Language Processing (NLP) is a cross-disciplinary field of linguistics, computer science, and artificial intelligence. It is concerned with the interactions between digital computing devices and human language or, more precisely, natural language.
• The field of natural language processing deals with designing and programming digital computational devices (particularly computers) to process and analyse large amounts of natural language data.
• Natural languages take different forms, such as writing, speech or signing. They are distinguished from constructed and formal languages such as those used to program computers or to study logic. As a result, natural language data is highly unstructured in nature.
• For example, NLP makes it possible for computers to read text, hear speech, interpret it, measure sentiment and determine which parts are important.

Q.2 Discuss the spectrum of natural languages.


OR Why understanding natural language is challenging.

Ans. :
• The spectrum of natural languages is very wide. As per linguistic science, there are thousands of spoken languages in the world. These languages can be grouped together as members of a language family.
• The three main language families in the world are :
o Indo-European (includes English)
o Sino-Tibetan (includes Chinese)
o Afro-Asiatic (includes Arabic)
• Identifying words is a complicated task. In many languages, words are delimited in the orthography by whitespace and punctuation.
• Some languages have word forms that need not change much with the changing context. On the other hand, there are languages that are highly sensitive about the choice of word forms according to context.
• For some of the languages, the context does not impact the gender of the noun, while some languages do not have the concept of gender.
• Natural languages show structure (namely grammar) of different kinds and complexity. It consists of more elementary components whose co-occurrence in context refines the purpose they have when used in isolation, and extends it further to meaningful relations between other components in the sentence.
• As a result, understanding natural language in word blocks is not a viable approach. However, the first-level understanding of a word is very important.
Q.3 Write a note on word morphology in natural language.
OR Discuss why it is important to understand words in natural language.
Ans. : Words are the most indicative blocks of a natural sentence. However, they are tricky to define. This is primarily due to ambiguity and the contextual meaning of words in sentences. Knowing how to work with words allows the development of syntactic and semantic understanding.
• The process of understanding words in any natural language involves morphology - word structure and its linguistic expression.

o Morphology is the study of the variable forms and functions of words, while syntax is concerned with the arrangement of words into phrases, clauses, and sentences.
o Word structure constraints due to pronunciation are described by phonology, whereas conventions for writing constitute the orthography of a language.
o The meaning of a linguistic expression and its words is explained by semantics and covered by lexicology, especially the evolution of the links among them.
• The techniques of finding the structure of words :
o Identify words of distinct types in a human language.
o Model the word structure and the related concepts.

Q.4 Discuss the structure of words.
Ans. :
• Words are the smallest linguistic units that can form a complete utterance by themselves.
• Depending on the means of communication, words are spelled out via graphemes or realized through phonemes (the units of spoken language).

• Specific to a particular language, the exact boundaries separating words from morphemes and phrases vary. Here is an example with nouns as valid words in English, used to understand the above concept. Refer Fig. Q.4.1.
Noun      Noun + s (plural)      Noun + 's (possessive)      Pronunciation (both)
thrush    thrushes               thrush's                    iz
toy       toys                   toy's                       z
block     blocks                 block's                     s

Fig. Q.4.1 : Example to separate words from morphemes
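A minimal sketch of how the pronunciation column above can be predicted from a noun's final letters. This is a crude spelling-based approximation invented for illustration; a real analysis works on phonemes, not spelling.

```python
# Simplified sketch: predict the written -(e)s form and its pronunciation
# from a noun's final letters. The letter classes below are approximations.
SIBILANTS = ("s", "sh", "ch", "x", "z")  # trigger the /iz/ variant
VOICELESS = ("p", "t", "k", "f")         # trigger the /s/ variant

def plural_and_pronunciation(noun: str):
    if noun.endswith(SIBILANTS):
        return noun + "es", "iz"   # thrush -> thrushes, /iz/
    if noun.endswith(VOICELESS):
        return noun + "s", "s"     # block -> blocks, /s/
    return noun + "s", "z"         # toy -> toys, /z/

for noun in ("thrush", "toy", "block"):
    form, sound = plural_and_pronunciation(noun)
    print(noun, "->", form, "pronounced /%s/" % sound)
```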

Q.5 Write a short note on tokens.


Ans. : Tokens are syntactic words. Let's consider a simple sentence given below.
I don't want to buy this product.

• In the above sentence, for reasons of generality, linguists prefer to analyse don't as two syntactic words (do not), or tokens, each of which has its independent role and can be reverted to its normalized form. On the other hand, all other words in this sentence are treated as independent single tokens.
• In English, such tokenization and normalization may be applied to a limited set of cases. However, in other languages, these phenomena have to be treated in a less trivial manner.
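A minimal sketch of such clitic-aware tokenization and normalization. The contraction table is illustrative, not exhaustive.

```python
import re

# Illustrative contraction table; a real tokenizer covers many more cases.
CONTRACTIONS = {
    "don't": ["do", "not"],
    "i'm":   ["i", "am"],
    "we've": ["we", "have"],
}

def tokenize(sentence: str):
    tokens = []
    for word in re.findall(r"[\w']+|[.,!?]", sentence.lower()):
        # Expand known contractions into their normalized syntactic words.
        tokens.extend(CONTRACTIONS.get(word, [word]))
    return tokens

print(tokenize("I don't want to buy this product."))
# ['i', 'do', 'not', 'want', 'to', 'buy', 'this', 'product', '.']
```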

Q.6 Discuss clitics in morphology.
OR Explain the concept of clitics from a morphology perspective.


Ans. : In morphology and syntax, a clitic is a morpheme that has syntactic characteristics of a word, but depends phonologically on another word or phrase.
• It is syntactically independent but phonologically dependent - always attached to a host.
• It is like an affix, but plays a syntactic role at the phrase level.
• Clitics have the form of affixes, but the distribution of function words.
• The reduced forms of the auxiliary verbs in I'm and we've are clitics.
• Multiple linguistic units are thereby transformed into one compact string of letters, resulting in a changed token (for example, a base word merged with a clitic).
• Clitics are found in various languages like Latin, Ancient Greek, Chinese, Japanese, etc.

• Tokenization, also known as word segmentation, is the fundamental step of morphological analysis and a prerequisite for most language processing applications.

Q.7 Explain the importance of lexemes as a linguistic form.


OR Discuss lexemes.
Ans. :
• In a natural language, a word often denotes one linguistic form in the given context, the concept behind the form, and the set of alternative forms that can express it. Such sets are called lexemes or lexical items. They together form the lexicon of a language.
• Lexemes can be divided by their behaviour into the lexical categories of verbs, nouns, adjectives, conjunctions, particles, or other parts of speech.
• The citation form of a lexeme is also called its lemma.
• The notion of the lexeme is central to morphology, and it is the basis for defining other concepts in morphology. For example, the difference between inflection and derivation can be stated in terms of lexemes :
o Inflectional rules relate a lexeme to its forms.
o When we convert a word into its other forms, such as turning the singular mouse into the plural mice or mouses, we say we inflect the lexeme.
o Derivational rules relate a lexeme to another lexeme.
o When we transform a lexeme into another one that is morphologically related, regardless of its lexical category, we say we derive the lexeme; for instance, the nouns consumer and consumption are derived from the verb to consume.
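A toy sketch of this distinction; the form tables are invented for illustration.

```python
# Toy illustration: inflection maps a lexeme to its word forms,
# derivation maps a lexeme to related lexemes.
INFLECTION = {  # lexeme -> its inflected forms
    "mouse":   ["mouse", "mice"],
    "consume": ["consume", "consumes", "consumed", "consuming"],
}
DERIVATION = {  # lexeme -> morphologically related lexemes
    "consume": ["consumer", "consumption"],
}

print(INFLECTION["consume"])  # forms of one and the same lexeme
print(DERIVATION["consume"])  # new lexemes derived from it
```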
Q.8 Discuss morphemes as the smallest meaningful unit in a language.
OR Explain the difference between morphemes and words in NLP.
Ans. :
• There are different opinions on whether and how to associate the properties of word forms with their structural
components. These components are usually called segments or morphs.
• The morphs that by themselves represent some aspect of the meaning of a word are called morphemes of some function.
• A morpheme is the smallest meaningful unit in a language.
• A morpheme is not identical to a word. The main difference between them is that a morpheme sometimes does not stand alone, but a word, by definition, always stands alone.
• When a morpheme stands by itself, it is considered a root because it has a meaning of its own (such as the morpheme dog). When it depends on another morpheme to express an idea, it is an affix because it has a grammatical function (such as the -s in dogs, which indicates that it is plural).
• Natural languages use different techniques by which morphs and morphemes are combined into word forms. The simplest morphological process concatenates morphs one by one.
• For example, in the word mis-manage-ment-s, all the elements are morphemes, each adding some meaning to the whole word (a naive segmentation sketch follows).
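A naive sketch of concatenative segmentation by affix stripping; the affix inventories are tiny and purely illustrative.

```python
# Naive concatenative segmentation by stripping known prefixes/suffixes.
PREFIXES = ("mis", "un", "re")
SUFFIXES = ("s", "ment", "ing", "ed")

def segment(word: str):
    morphs = []
    changed = True
    while changed:                      # peel prefixes from the left
        changed = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p):
                morphs.append(p)
                word = word[len(p):]
                changed = True
    tail = []
    changed = True
    while changed:                      # peel suffixes from the right
        changed = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s):
                tail.insert(0, s)
                word = word[:-len(s)]
                changed = True
    return morphs + [word] + tail

print(segment("mismanagements"))  # ['mis', 'manage', 'ment', 's']
```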
• For example, in the Korean language, many morphemes change their form depending on the context. Fig. Q.8.1 shows some Korean morphemes, -ess-, -ass-, -yess-, indicating past tense.

[Fig. Q.8.1 originally listed contracted Korean verb forms, such as po-ass- 'have seen' and ha-yess- 'have done', each containing one of the past-tense morphemes.]
Fig. Q.8.1 : Korean morphemes indicating past tense

□ Allomorphs
• The alternative forms of a morpheme are termed allomorphs.
• Allomorphs are variants of a morpheme that differ in pronunciation but are semantically identical. For example, the English plural marker -(e)s of regular nouns can be pronounced /-z/ (bags), /-s/ (blocks), or /-iz/ (bushes), depending on the final sound of the noun.
Q.9 Write a short note on the following terminologies :
(a) Typology (b) Isolating, or analytic typology
(c) Synthetic languages (d) Agglutinative languages
(e) Fusional languages (f) Nonlinear languages

Ans. :

0 (a) Typology

• Typology (or morphological typology) is a way of classifying the languages in the world. It groups languages according to their common morphological structures. Typology organizes languages on the basis of how those languages form words by combining morphemes.
• A typology based on the quantitative relations between words, their morphemes, and their features is as follows.
□ (b) Isolating, or analytic typology

• These languages include no or relatively few words that have more than one morpheme. Examples are Chinese, Vietnamese, and Thai.
• Analytic languages show a low ratio of morphemes to words, nearly one-to-one (see the sketch below).
• Sentences in analytic languages are composed of independent root morphemes.
• Grammatical relations between words are expressed by separate words where they might otherwise be expressed by affixes, which are present to a minimal degree in such languages.
• Some analytic tendencies are also found in languages like English and Afrikaans.
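A rough sketch of this quantitative view; the morpheme segmentations are hand-annotated for illustration.

```python
# Rough sketch: the morpheme-to-word ratio as a crude typological indicator.
sentence = [
    ["mis", "manage", "ment", "s"],  # one word, four morphemes
    ["of"],
    ["fund", "s"],
]
morphemes = sum(len(word) for word in sentence)
ratio = morphemes / len(sentence)
print("morphemes per word: %.2f" % ratio)  # analytic languages are near 1.0
```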
□ (c) Synthetic languages
• Synthetic languages combine more morphemes in one word and are further divided into agglutinative and fusional languages.
• Their bound morphemes may or may not be distinguishable from the root; they may be fused with it or with one another.

• Word order is less important for these languages than it is for analytic languages, since individual words express the grammatical relations that would otherwise be indicated by syntax.
• In addition, there tends to be a high degree of agreement, or cross-reference, between different parts of the sentence.
• Therefore, morphology in synthetic languages is more important than syntax.
• Most Indo-European languages are moderately synthetic.

□ (d) Agglutinative languages
• These languages have morphemes associated with only a single function at a time.
• Agglutinative languages have words containing several morphemes that are always clearly differentiable from one another.
• Each morpheme represents only one grammatical meaning, and the boundaries between those morphemes are easily demarcated.
• The bound morphemes are affixes, and they may be individually identified.
• Agglutinative languages tend to have a high number of morphemes per word, and their morphology is usually highly regular.
• Agglutinative languages include Finnish, Hungarian, Turkish, Mongolian, Korean, Japanese, Indonesian, Tamil, etc.
□ (e) Fusional languages
• These languages are defined by their feature-per-morpheme ratio being higher than in other languages.
• Morphemes in fusional languages are not readily distinguishable from the root or among themselves.
• Several grammatical bits of meaning may be fused into one affix.
• Morphemes may also be expressed by internal phonological changes in the root.
• The Indo-European and Semitic languages are the most typically cited examples of fusional languages.
• Examples of fusional Indo-European languages are : Kashmiri, Sanskrit, Pashto, New Indo-Aryan languages such as Punjabi, Hindustani, Bengali; Greek (classical and modern), Latin, Italian, French, Spanish, Portuguese, Romanian, Irish, German, Faroese, Icelandic, Albanian and all Balto-Slavic languages.
□ Concatenative languages
• These languages link morphs and morphemes one after another.
□ (f) Nonlinear languages
• Nonlinear languages allow structural components to change the consonantal or vocalic templates of words.
• This is also called discontinuous morphology, in which the root is modified by processes that do not simply concatenate new material to it.

• For example, in English, plurals are usually formed by adding the suffix -s; however, certain words use nonconcatenative processes for their plural forms, as in
foot → feet
• Many irregular verbs form their past tenses, past participles or both in the same manner :
freeze → froze → frozen
• This specific form of nonconcatenative morphology is known as base modification or ablaut, a form in which part of the root undergoes a phonological change without necessarily adding new phonological material.
• For example, the vowel alternation in the English stem s-ng results in four distinct words :
sing → sang → sung → song
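Such nonconcatenative forms are typically handled by exception tables rather than by affix rules; a minimal sketch:

```python
# Minimal sketch: irregular (nonconcatenative) plurals stored as exceptions,
# with the regular concatenative -s rule as the fallback.
IRREGULAR_PLURALS = {"foot": "feet", "mouse": "mice"}

def pluralize(noun: str) -> str:
    return IRREGULAR_PLURALS.get(noun, noun + "s")

print(pluralize("foot"))   # feet   (base modification of the root)
print(pluralize("block"))  # blocks (regular concatenation)
```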

1.2 Issues and Challenges


Q.10 Explain the importance of morphological parsing and modelling in NLP.

Ans. :
• Morphological parsing helps to eliminate or reduce the inconsistency of word forms. It is required to provide higher-level linguistic units whose lexical and morphological properties are explicit and well defined.
• Every natural language inherently has some irregularity and ambiguity. Morphological parsing attempts to remove unnecessary irregularity and control ambiguity.
• Irregularities
o In this context irregularity means existence of such forms and structures that are not described
appropriately by a prototypical linguistic model.
o Some irregularities can be understood by redesigning the model and improving its rules, but other lexically
dependent irregularities often cannot be generalized.
• Ambiguity
o Ambiguity is an indeterminacy in the interpretation of expressions of language.
o It covers accidental ambiguity, ambiguity due to lexemes with multiple senses, and syncretism, i.e., systematic ambiguity.
• Morphological modelling also faces the problem of productivity and creativity in language. This gives birth to unconventional but perfectly meaningful new words or new senses in the language.
• Because these newly-coined words are not present in the lexicon with their lexical and morphological properties, such words will go completely unparsed by the morphological system. This unknown word problem is particularly severe when morphological modelling is unable to parse a word that comes from an unexpected domain of the linguistic input, mostly when special terms or foreign words are involved in the discourse or when dialects are mixed together.
Q.11 Explain morphological irregularities in NLP.
OR Discuss how the morphological irregularities are removed.
Ans. :
• The design principles of the morphological model are very important to control the irregularities in words.
• Morphological parsing is designed for generalization and abstraction of words to make the model simple and yet powerful.
• However, the immediate descriptions given for a word may not be the final ones, due to :
o Inadequate accuracy of the description
o Inappropriate complexity of the morphological model
o Need for improved formulations
• Removal of morphological irregularities
o A deeper study of the morphological processes is essential for mastering the whole morphological and phonological system.
o Morphophonemic templates capture morphological processes. This is done by organizing stem patterns and generic affixes.
o These templates are designed without any context-dependent variation of the affixes or ad hoc modification of the stems.
o Very terse merge rules ensure that morphophonemic templates can be converted into exactly the surface forms, namely orthographic and phonological.
o Applying the merge rules is independent of and irrespective of any grammatical parameters or information other than that contained in a template.
o Thus, most morphological irregularities in the morphophonemic templates are successfully removed (a simplified sketch follows).
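A highly simplified sketch of the template-plus-merge-rule idea: a stem pattern and a generic affix are merged into a surface form. The single orthographic rule shown is invented for illustration and is not from the source.

```python
# Highly simplified sketch of a morphophonemic merge rule: drop a stem-final
# 'e' before a vowel-initial affix. Real systems use inventories of such rules.
def merge(stem: str, affix: str) -> str:
    if stem.endswith("e") and affix[0] in "aeiou":
        stem = stem[:-1]
    return stem + affix

print(merge("manage", "ing"))   # managing
print(merge("manage", "ment"))  # management
```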
Q.12 Discuss morphological irregularities in any two natural languages.
Ans. :
• Morphological irregularities in Arabic
o Morphophonemic templates can be used for discovering the regularity of Arabic morphology, where uniform structural operations apply to different kinds of stems.
o Some irregularities are bound to particular lexemes or contexts, and cannot be accounted for by general rules.
• Morphological irregularities in Korean
o Korean irregular verbs provide examples of such irregularities. Korean shows exceptional constraints on the selection of grammatical morphemes.
o The Korean language features lexically dependent stem alternation.

• Morphological irregularities in other natural languages
o It is hard to find irregular inflection in agglutinative languages : there are two irregular verbs in Japanese and one in Finnish.
o These languages are abundant with morphological alternations that are formalized by precise phonological rules.
Q.13 What is morphological ambiguity ? Discuss at least two examples.
Ans. :
• Morphological ambiguity is the possibility that word forms be understood in multiple ways out of the context.
• Word forms that look the same but have distinct functions or meanings are called homonyms.
• Ambiguity is present in all aspects of morphological processing and language processing at large.
• Morphological parsing cannot complete the disambiguation of words in their context, but it can control the valid interpretations of a given word form.
• Morphological ambiguity in Korean
o In Korean, homonyms are one of the most problematic objects in morphological analysis. This is because they prevail among frequent lexical items.
• Morphological ambiguity in Arabic
o Arabic has rich derivational and inflectional morphology. Because Arabic script usually does not encode short vowels and omits other diacritical marks, its morphological ambiguity is considerably increased. In addition, Arabic orthography collapses certain word forms together.
o The problem of morphological disambiguation of Arabic encompasses :
⇒ The resolution of the structural components of words
⇒ Actual morphosyntactic properties
⇒ Tokenization and normalization
⇒ Lemmatization, stemming
⇒ Diacritization
• Morphological ambiguity in Sanskrit
o When inflected syntactic words are combined in an utterance, additional phonological and orthographic changes can take place.
o In Sanskrit, one such euphony rule is known as external sandhi. Inverting sandhi during tokenization is usually nondeterministic, as it can provide multiple solutions (see the sketch below).
• In any language, tokenization decisions may impose constraints on the morphosyntactic properties of the forms being reconstructed.
• The morphological phenomenon that some words or word classes show instances of systematic homonymy is called syncretism. In particular, homonymy can occur due to neutralization and unaffectedness of words.
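A toy sketch of why inverting sandhi is nondeterministic. The single fusion rule is invented for illustration; real Sanskrit sandhi is far richer.

```python
# Toy sketch: suppose a surface "a" at a word junction may come from "a"+"a".
# Inverting that rule yields several candidate tokenizations per string.
def split_candidates(surface: str):
    candidates = [(surface,)]  # reading with no junction at all
    for i, ch in enumerate(surface):
        if ch == "a" and 0 < i < len(surface) - 1:
            # the junction vowel may have fused two underlying a's
            candidates.append((surface[:i] + "a", "a" + surface[i + 1:]))
    return candidates

for cand in split_candidates("tatra"):
    print(" + ".join(cand))   # multiple solutions => nondeterminism
```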
Q.14 What is morphological productivity ?
OR Discuss the morphological productivity.
OR Discuss the competence versus performance duality by Noam Chomsky in the context of morphological productivity.
Ans. :
• In a natural language viewed as a system (langue), structural devices like recursion, iteration, or compounding allow the production of an infinite set of concrete linguistic utterances.
• This general potential holds for morphological processes as well and is called morphological productivity.
• In another perspective, natural language can be seen as a collection of utterances (parole) pronounced or written (performance). Hence, for linguistic corpora, a parole and performance data set is practical.
• Such corpora are a finite collection of linguistic data that are studied with empirical methods. They can be used for comparison when linguistic models are developed.
Q.15 Discuss the "80/20 rule" of a linguistic word corpus.
OR Write a note on the "80/20 rule" of a linguistic word corpus.
Ans. :
• Linguistic corpora are a finite collection of linguistic data that are studied with empirical methods.
• The set of word forms found in the corpus of a language is referred to as its vocabulary.
• The members of this set are word types, whereas every original instance of a word form is a word token.
• The distribution of these words or other elements of language follows the "80/20 rule," also known as the law of the vital few.
• It says that most of the word tokens in a given corpus can be identified with just a couple of word types in its vocabulary, and words from the rest of the vocabulary occur much less often or rarely in the corpus (see the sketch below).
• New, unexpected words will always appear in the linguistic data only when it is expanded or enlarged.
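A minimal sketch of the type/token distinction and the skewed coverage, on a toy corpus:

```python
from collections import Counter

# Sketch: how much of a corpus do its most frequent word types cover?
corpus = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(corpus)             # word type -> token count
tokens = len(corpus)
top = counts.most_common(3)          # the few "vital" types
covered = sum(n for _, n in top)
print("types:", len(counts), "tokens:", tokens)
print("top 3 types cover %.0f%% of all tokens" % (100 * covered / tokens))
```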
Q.16 Discuss how creativity and the issue of unknown words meet to enhance the morphological productivity in a natural language.
OR Discuss how the newly coined word google has enhanced the morphological productivity of many natural languages.
OR Discuss : unexpected words will always appear in the linguistic data only when it is expanded or enlarged.
Ans. :
• The word googol is a dictionary word in English. It is a made-up word denoting the number "one followed by one hundred zeros".
• The name of the company Google is an inadvertent misspelling of googol. However, both of these words have successfully entered the lexicon of English, where they are actively used.
• Today we understand the English verb to google and its derived forms.
• This new word google is adopted by other languages, too. This has triggered their own morphological processes.
• In Czech, one says googlovat, googlit 'to google' or vygooglovat, vygooglit 'to google out', googlování 'googling', and so on.
• In Arabic, the names are transcribed as gūgūl 'googol' and gūgil 'Google'.
• Thus we can observe that unexpected words in a language will always appear in the linguistic data only when it is expanded or enlarged.

1.3 Morphological Models


Q.17 Discuss the motivation of using domain-specific languages.
OR What is a Domain Specific Language (DSL) ?
Ans. :
• A Domain Specific Language (DSL) is a specialized programming language that is used for a single purpose.
• Various domain-specific languages have been created for achieving intuitive and minimal programming effort.
• Pragmatically, a DSL may be specialized to a particular problem domain, a particular problem representation technique, a particular solution technique, or other aspects of a domain.
• These special-purpose languages usually introduce idiosyncratic notations of programs and are interpreted using some restricted model of computation.
• The motivation for this approach lies in the fact that, historically, computational resources were too limited compared to the requirements and complexity of the tasks being solved.
• Other motivations are theoretical, given the difficulty of finding a simple, accurate and yet generalizing model for practical use in the specific domain.
• The design objective of a DSL is to be pure, intuitive, adequate, complete, reusable and elegant.
• Examples of such domain-specific programming languages are HTML, SQL, AWK, GDL, etc.
Q.18 Why is dictionary lookup considered as one of the effective morphological models ?
OR Discuss dictionary as a morphological model.
Ans. :
• A morphological model needs a system in which analysing a word form is reduced to looking it up, in sync with more elaborate models of the language. Dictionaries, databases and lists are examples of such forms.
• A dictionary is understood as a data structure that directly enables obtaining some precomputed results, i.e., word analyses.

• Lookup operations with dictionaries are relatively simple and usually quick. Dictionaries can be implemented, for instance, as lists, binary search trees, tries, hash tables, etc.
• Hence dictionary lookup is considered as one of the effective morphological models.
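A minimal sketch of this idea with a plain hash table; the stored analyses are invented for illustration.

```python
# Minimal sketch: morphological analysis as dictionary (hash table) lookup.
ANALYSES = {
    "dogs":   [("dog", "NOUN", "plural")],
    "blocks": [("block", "NOUN", "plural"), ("block", "VERB", "3sg")],
}

def analyze(word: str):
    # O(1) average-case retrieval of precomputed analyses.
    return ANALYSES.get(word.lower(), [("UNKNOWN", None, None)])

print(analyze("blocks"))
```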
Q.19 What are the drawbacks of an enumerative morphological model ?
Ans. :
• An enumerative list is a set of associations between word forms and their desired descriptions.
• It is declared by plain enumeration. Hence the coverage of the model is finite and the generative potential of the language is not exploited.
• Development, lookup and verification of the association list is tedious, liable to errors, inefficient and inaccurate unless the data are retrieved automatically from large and reliable linguistic resources.
• Despite all that, an enumerative model is often sufficient for the given purpose, deals easily with exceptions, and can implement even complex morphology.
Q.20 Write a short note on finite-state morphology.
Ans. :
• Finite-state morphological models are the morphological models in which the specifications written by human programmers are directly compiled into finite-state transducers.
• The finite-state morphological models can be used for multiple natural languages.
• The two popular tools supporting this approach are XFST (Xerox Finite-State Tool) and LexTools.
Q.21 Discuss finite state transducers.
OR Discuss how finite state transducers can translate an infinite regular language.
Ans. :
• Finite-state transducers are computational devices extending the power of finite-state automata.
• They consist of a finite set of nodes connected by directed edges labeled with pairs of input and output symbols.
• In such a network or graph, nodes are also called states, while edges are called arcs.
• Traversing the network from the set of initial states to the set of final states along the arcs is equivalent to reading the sequences of encountered input symbols and writing the sequences of corresponding output symbols.
• The set of possible sequences accepted by the transducer defines the input language; the set of possible sequences emitted by the transducer defines the output language.
• For example, a finite-state transducer could translate the infinite regular language consisting of the Sanskrit words pita, prapita, praprapita, ... to the matching words in the infinite regular English language defined as father, grand-father, great-grand-father, ...
• In finite-state transducers it is possible to invert the domain and the range of a relation, that is, to exchange the input and the output.
• In finite-state computational morphology, it is common to refer to the input word forms as surface strings and to the output descriptions as lexical strings.
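A minimal sketch of such a transducer, hand-coded for the pita/father example. The state names and the delayed-output trick are implementation choices, not from the source.

```python
# Tiny hand-coded finite-state transducer:
# pita -> father, prapita -> grand-father, praprapita -> great-grand-father.
# Each arc maps (state, input symbol) -> (next state, output string).
ARCS = {
    ("q0", "pra"):  ("qp", ""),             # first pra: output delayed
    ("qp", "pra"):  ("qp", "great-"),       # every further pra adds great-
    ("q0", "pita"): ("qf", "father"),
    ("qp", "pita"): ("qf", "grand-father"), # pending pra resolves to grand-
}
FINAL = {"qf"}

def transduce(symbols):
    state, out = "q0", []
    for sym in symbols:
        if (state, sym) not in ARCS:
            return None  # input not in the transducer's input language
        state, piece = ARCS[(state, sym)]
        out.append(piece)
    return "".join(out) if state in FINAL else None

print(transduce(["pita"]))                # father
print(transduce(["pra", "pita"]))         # grand-father
print(transduce(["pra", "pra", "pita"]))  # great-grand-father
```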

Part II : Finding the Structure of Documents

Q.22 What is the structure of a document ?
OR What is document structure ?
OR Write a short note on document structure.
Ans. :
• In human language, words and sentences do not appear randomly but usually have a structure.
• For example, combinations of words form sentences - meaningful grammatical units, such as statements, requests, and commands.
• Likewise, in written text, sentences form paragraphs - self-contained units of discourse about a particular point or idea.

1.4 Introduction


Q.23 Discuss the importance of document structure in human language.
OR Why is document structure important in NLP ?
Ans. :
• In human language or natural language, words and sentences usually have a structure. Combinations of words form sentences - meaningful grammatical units, such as statements, requests, and commands.
• Similarly, in written text, paragraphs are the self-contained units about a point or an idea, expressed in the form of a group of sentences. Following are some of the reasons why document structure is important in human languages and therefore for natural language processing.
• When the structure of documents is extracted, it makes the further processing of text easy in NLP. The NLP tasks that depend on the document structure are parsing, machine translation and semantic role labelling in sentences.
• To improve the reliability of Automatic Speech Recognition (ASR) and human readability, it is important to identify the sentence boundary annotation. Document structure helps in this process.
• Document structure helps in breaking apart the input text or speech into topically coherent blocks, which provides better organization and indexing of the data.
• Thus, in most speech and language processing applications, extracting the structure of textual and audio documents is a meaningful and necessary pre-step.
Q.24 Write a note on sentence boundary detection.
OR What is sentence boundary detection ?
Ans. :
• Sentence boundary detection is the problem in natural language processing of deciding where sentences begin and end.
• Sentence detection is an important task, which should be performed at the beginning of a text processing pipeline.
• Sentence boundary detection (also called sentence segmentation) deals with automatically segmenting a sequence of word tokens into sentence units.
• Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks.
• In written text in English and some other languages, the beginning of a sentence is usually marked with an uppercase letter, and the end of a sentence is explicitly marked with a period (.), a question mark (?), an exclamation mark or another type of punctuation.
• However, in addition to their role as sentence boundary markers, capitalized initial letters are used to distinguish proper nouns, periods are used in abbreviations and numbers, and other punctuation marks are used inside proper names.
• A character-wise analysis of text allows for a distinction between period characters that are enclosed between two alphanumeric characters, and period characters that are followed by at least one non-alphabetic character, such as a further punctuation sign, a space, tab or new line.
• There are various challenges associated with SBD, for written as well as spoken text and code switching.
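A naive rule-based splitter along these lines; the abbreviation list is illustrative and far from complete.

```python
import re

# Naive rule-based sentence splitter: break after . ? ! followed by whitespace
# and an uppercase letter, unless the period ends a known abbreviation.
ABBREVIATIONS = {"dr.", "mr.", "e.g.", "i.e."}  # illustrative, not exhaustive

def split_sentences(text: str):
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]\s+(?=[A-Z])", text):
        candidate = text[start:m.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep going
        sentences.append(candidate)
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He met Mr. Jones. They left."))
# ['Dr. Smith arrived.', 'He met Mr. Jones.', 'They left.']
```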
Q.25 Discuss the challenges of sentence boundary detection in written text.
Ans. :
• Ambiguous abbreviations and capitalizations are the most common problems of sentence segmentation in written text.
• Quoted sentences are more complex and problematic. The primary reason for this is that the speaker may have uttered multiple sentences, and sentence boundaries inside the quotes are also marked with punctuation marks.
• As a result of this, an automatic method of sentence boundary detection may result in cutting some sentences incorrectly. In case the preceding sentence is spoken instead of written, prosodic cues usually mark its structure.
• "Spontaneously" written texts, such as Short Message Service (SMS) texts or Instant Messaging (IM) texts, tend to be nongrammatical and have poorly used or missing punctuation, which makes sentence segmentation even more challenging.
• Automatic systems, such as Optical Character Recognition (OCR) or ASR, aim to translate images of handwritten, typewritten, or printed text, or spoken utterances, into machine-editable text.
• When the sentences come from such an automatic system, the finding of sentence boundaries must deal with the errors of these systems as well.
• For example, an OCR system easily confuses periods and commas, which can result in meaningless sentences. ASR transcripts typically lack punctuation marks and are usually mono-case.
Q.26 Discuss the challenges of sentence boundary detection in spoken / conversational text.
Ans. :
• For conversational speech or text, or multiparty meetings with ungrammatical sentences and disfluencies, it is often not clear where the boundaries are.
• The problem may be redefined for the conversational domain as the task of dialog act segmentation. This is because dialog acts are better defined for conversational speech using a number of mark-up standards such as Dialog Act Mark-up in Several Layers (DAMSL).
• For example, the sentence I think so but you should also ask him may be a grammatical sentence as a whole, but for the DAMSL and MRDA standards there are two dialog act tags, one affirmation and one suggestion. Such a modification may be needed for conversation analysis, such as speaker role detection or sentiment analysis. This task can be seen as a semantic boundary detection task instead of a syntactic one.
Q.27 What is code switching ? Why is it considered as a problem in sentence boundary detection ?
Ans. :
• Code switching - that is, the use of words, phrases, or sentences from multiple languages by multilingual speakers - is another problem that can affect the characteristics of sentences. For example, when switching to a different language, the writer can either keep the punctuation rules from the first language or resort to the code of the second language (e.g., Spanish uses the inverted question mark to precede questions).
• Code switching also affects technical texts for which the meanings of punctuation signs can be redefined, as in Uniform Resource Locators (URLs), programming languages, and mathematics. We must detect and parse those specific constructs in order to process technical texts adequately.
• Conventional rule-based sentence segmentation systems in well-formed texts rely on patterns to identify potential ends of sentences and lists of abbreviations for disambiguating them.
• Although rules cover most of these cases, they do not address unknown abbreviations, abbreviations at the ends of sentences, or typos in the input text.
• Furthermore, such rules are not robust to text that is not well formed, such as forums, chats, and blogs, or to spoken input that completely lacks typographic cues. Moreover, each language requires a specific set of rules.
• Hence code switching is considered as a problem in sentence boundary detection.
Q.28 How is sentence segmentation as a classification problem more effective than a rule-based approach ?
Ans. :
• Conventional rule-based sentence segmentation systems in well-formed texts rely on patterns to identify potential ends of sentences and lists of abbreviations for disambiguating them.
• Sentence segmentation in text usually uses punctuation marks as delimiters and aims to categorize them as sentence ending/beginning or not. On the other hand, for speech input, all word boundaries are usually considered as candidate sentence boundaries.
• Although rules cover most of these cases, they do not address unknown abbreviations, abbreviations at the ends of sentences, or typos in the input text.
• Furthermore, such rules are not robust to text that is not well formed, such as forums, chats, and blogs, or to spoken input that completely lacks typographic cues. Moreover, each language requires a specific set of rules.
• To improve on such a rule-based approach, sentence segmentation is stated as a classification problem. Given training data where all sentence boundaries are marked, we can train a classifier to recognize them, as sketched below.
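A toy sketch of the classification view. The features and hand-labeled training examples are invented for illustration, and it assumes scikit-learn is available.

```python
# Sketch: sentence boundary detection as binary classification.
# Each '.' is a candidate boundary described by simple contextual features.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer

def features(left: str, right: str):
    return {
        "left_is_short": len(left) <= 2,        # short token before '.'
        "left_capitalized": left[:1].isupper(),
        "right_capitalized": right[:1].isupper(),
    }

# (token before '.', token after '.', is_boundary) toy training examples.
train = [("arrived", "He", True), ("Dr", "Smith", False),
         ("left", "They", True), ("Mr", "Jones", False)]

vec = DictVectorizer()
X = vec.fit_transform([features(l, r) for l, r, _ in train])
y = [label for _, _, label in train]
clf = LogisticRegression().fit(X, y)

test = vec.transform([features("Prof", "Davis")])
print(clf.predict(test))  # the classifier's guess for "Prof. Davis"
```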

Q.29 What is topic boundary segmentation ?
OR How does automatic topic boundary segmentation work ?
Ans. :
• Topic segmentation (sometimes called discourse or text segmentation) is the task of automatically dividing a stream of text or speech into topically homogeneous blocks.
• That is, given a sequence of (written or spoken) words, the aim of topic segmentation is to find the boundaries where topics change.
• Topic segmentation is an important task for various language-understanding applications, such as information extraction and retrieval and text summarization.
• In information retrieval, if long documents can be segmented into shorter, topically coherent segments, then only the segment that is about the user's query could be retrieved.
• For multiparty meetings, the task of topic segmentation is inspired by discourse analysis.
• For official and well-structured meetings, the topics are segmented according to the agenda items, whereas for more casual conversational-style meetings, the boundaries are less clear.
• For conversational speech, the topic boundaries may not be absolute. Hence they are more complex.
• In text, topic boundaries are usually marked with distinct segmentation cues, such as headlines and paragraph breaks. These cues are absent in speech. However, speech provides other cues, such as pause duration and speaker changes.
• Topic segmentation is a nontrivial problem without a very high human agreement because of many natural-language-related issues, and hence requires a good definition of topic categories and their granularities.

1.5 Methods


Q.30 Discuss sentence / topic segmentation as a boundary classification problem.
OR Why is sentence / topic segmentation considered as a boundary classification problem in NLP ?
Ans. :
• Sentence segmentation and topic segmentation have mainly been considered as a boundary classification problem.
• Given a boundary candidate (between two word tokens for sentence segmentation and between two sentences for topic segmentation), the goal is to predict whether or not the candidate is an actual boundary (sentence or topic boundary).
• Formally, let x ∈ X be the vector of features (the observation) associated with a candidate and y ∈ Y be the label predicted for that candidate. The label y can be b for a boundary and b̄ for a non-boundary.
• This results in a classification problem : given a set of training examples {x, y}_train, find a function that will assign the most accurate possible label ŷ to unseen examples x_new.
• Alternatively to the binary classification problem, it is possible to model boundary types using finer-grained categories.

• Gillick suggested that sentence segmentation in text be framed as a three-class problem : sentence boundary with an abbreviation, sentence boundary without an abbreviation, and abbreviation not at a boundary.
• Similarly, in spoken language, a three-way classification can be made between non-boundaries, statement boundaries b_s, and question boundaries b_q.
Q.31 Discuss the method of classification in sentence or topic segmentation.
OR Explain the classification method used in sentence or topic segmentation.
Ans. :
• For sentence or topic segmentation, the problem is defined as finding the most probable sentence or topic boundaries.
• The natural unit of sentence segmentation is words, and that of topic segmentation is sentences, with the assumption that topics typically do not change in the middle of a sentence.
• The words or sentences are then grouped into contiguous stretches belonging to one sentence or topic - that is, the word or sentence boundaries are classified into sentence or topic boundaries and non-boundaries.
• The classification can be done at each potential boundary i (local modelling); then, the aim is to estimate the most probable boundary type, ŷ_i, for each candidate example, x_i :

ŷ_i = argmax_{y_i ∈ Y} P(y_i | x_i)

• Here, the ^ is used to denote estimated categories, and a variable without a ^ is used to show possible categories.
• In local modelling, features can be extracted from the context surrounding the candidate boundary to model such dependencies. It is also possible to see the candidate boundaries as a sequence and search for the sequence of boundary types, Ŷ = ŷ_1, ..., ŷ_n, that has the maximum probability given the candidate examples, X = x_1, ..., x_n :

Ŷ = argmax_Y P(Y | X)

Q.32 Discuss the categorization of methods according to the type of the machine learning algorithm.
OR What are the generative and discriminative categorization methods ?
OR Compare between generative and discriminative categorization methods.
Ans. :
□ Generative sequence models
a) They estimate the joint distribution P(X, Y) of the observations (e.g., words, punctuation) and the labels (sentence boundary, topic boundary).
b) They require specific assumptions (such as backoff to account for unseen events) and have good generalization properties.
□ Discriminative sequence models
a) They focus on features that characterize the differences between the labelings of the examples.
b) Such methods (as described in the following sections) can be used for sentence and topic segmentation in both written and spoken language, with one difference.
both wntten and spoken language, with one differen ce.

c) In text, the category of all boundaries that do not include a potential end-of-sentence delimiter (period, question mark, exclamation mark) is preset to non-sentence or non-topic.
d) A category is estimated for only those word boundaries that include a delimiter, whereas in speech, all boundaries between consecutive tokens are usually considered.
__ _,
Q.33 Explain generative sequence classification methods for sentence and topic segmentation.
Ans. :
• The most commonly used generative sequence classification method for topic and sentence segmentation is the hidden Markov model (HMM).
• The probability is written as follows, using the Bayes rule :

Ŷ = argmax_Y P(Y | X) = argmax_Y P(X | Y) P(Y) / P(X) = argmax_Y P(X | Y) P(Y)

• P(X) in the denominator is dropped because it is fixed for different Y and hence does not change the argument of max.
• The bigram case is modeled by a fully connected m-state Markov model, where m is the number of boundary categories.
• The states emit words (sentences or paragraphs) for sentence (topic) segmentation, and the state sequence that most likely generated the word (sentence) sequence is estimated.
• State transition probabilities, P(y_i | y_{i-1}), and state observation likelihoods, P(x_i | y_i), are estimated using the training data.
• The most probable boundary sequence is obtained by dynamic programming.
• A conceptual hidden Markov model for segmentation has two states : one for segment boundaries and one for others, as sketched below.
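A minimal sketch of such a two-state model with Viterbi decoding; all probabilities are invented for illustration and would normally be estimated from training data.

```python
import math

# Two states: B = boundary after this token, O = no boundary.
STATES = ["B", "O"]
TRANS = {("B", "B"): 0.1, ("B", "O"): 0.9,   # P(y_i | y_{i-1}), toy values
         ("O", "B"): 0.2, ("O", "O"): 0.8}

def emit(token, state):                      # P(x_i | y_i), toy estimate
    if token in (".", "?", "!"):
        return 0.9 if state == "B" else 0.1
    return 0.05 if state == "B" else 0.95

def viterbi(tokens):
    # Dynamic programming over log probabilities.
    path = {s: ([s], math.log(0.5) + math.log(emit(tokens[0], s)))
            for s in STATES}
    for tok in tokens[1:]:
        new = {}
        for s in STATES:
            best = max(STATES, key=lambda p: path[p][1] + math.log(TRANS[(p, s)]))
            seq, score = path[best]
            new[s] = (seq + [s],
                      score + math.log(TRANS[(best, s)]) + math.log(emit(tok, s)))
        path = new
    return max(path.values(), key=lambda v: v[1])[0]

print(viterbi(["i", "agree", ".", "do", "you", "?"]))
# e.g. ['O', 'O', 'B', 'O', 'O', 'B']
```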

• The bigram case can be extended to higher-order n-grams at the cost of an increased complexity.
• For topic segmentation, typically instead of using two states, n states are used, where n is the number of possible topics. However, it is not possible in an HMM to use any information beyond words, such as POS tags of the words or prosodic cues, for speech segmentation.
• Two simple extensions have been proposed : Shriberg et al. suggested using explicit states to emit the boundary tokens, hence incorporating lexical information via combination with other models.
• For topic segmentation, the same idea was used to model topic-start and topic-final sections, which helped greatly for broadcast news topic segmentation. The second extension is inspired from factored language models, which capture not only words but also morphological, syntactic, and other information. Guz et al. proposed using factored hidden event language models (fHELM) for sentence segmentation, using POS tags in addition to words.

Q.34 Discuss discriminative local classification methods.
Ans. :
• A number of discriminative classification approaches, such as support vector machines, boosting, maximum entropy, and regression, are based on very different machine learning algorithms.
• While discriminative approaches have been shown to outperform generative methods in many speech and language processing tasks, training typically requires iterative optimization.
• In discriminative local classification, each boundary is processed separately with local and contextual features.
• No global (i.e., sentence- or document-wide) optimization is performed, unlike in sequence classification.
• For sentence segmentation, supervised learning methods have primarily been applied to newspaper articles.
• Many classifiers have been tried for the task : regression trees, neural networks, C4.5 classification trees, maximum entropy classifiers, support vector machines (SVMs), and naive Bayes classifiers.
• Mikheev treated the sentence segmentation problem as a subtask of POS tagging by assigning a tag to punctuation similar to other tokens. For tagging he employed a combination of HMM and maximum entropy approaches.
Q.35 Write a note on the TextTiling method for topic segmentation.
OR Discuss how the TextTiling method is used for topic segmentation.
OR Explain block comparison and vocabulary introduction methods for topic segmentation.
Ans. :
• The popular TextTiling method of Hearst for topic segmentation uses a lexical cohesion metric in a word vector space as an indicator of topic similarity.
• TextTiling can be seen as a local classification method with a single feature of similarity.
• Fig. Q.35.1 depicts a typical graph of similarity with respect to consecutive segmentation units. The document is chopped when the similarity is below some threshold.
[Figure : similarity scores plotted against consecutive segmentation units; topic boundaries are placed where the similarity dips below the threshold.]
Fig. Q.35.1 : TextTiling example


• Originally, two methods for computing the similarity scores were proposed for TextTiling :
• Block comparison
a. It compares adjacent blocks of text to see how similar they are, according to how many words the adjacent blocks have in common.
b. The block size can be variable, not necessarily looking only at the consecutive blocks but instead at a window.
c. Given two blocks, b1 and b2, each having k tokens (sentences or paragraphs), the similarity (or topical cohesion) score is computed by the formula given below :

sim(b1, b2) = Σ_t w_{t,b1} · w_{t,b2} / √( Σ_t w²_{t,b1} · Σ_t w²_{t,b2} )

where w_{t,b} is the weight assigned to term t in block b. The weights can be binary or may be computed using other information-retrieval-based metrics such as term frequency.
• Vocabulary introduction
a. The vocabulary introduction method assigns a score to a token-sequence gap on the basis of how many new words are seen in the interval in which it is the midpoint.
b. Similar to the block comparison formulation, given two consecutive blocks, b1 and b2, of an equal number of words, w, the topical cohesion score is computed with the following formula, where NumNewTerms(b) returns the number of terms in block b seen for the first time in the text :

score(b1, b2) = ( NumNewTerms(b1) + NumNewTerms(b2) ) / (2 × w)

c. This method is extended to exploit latent semantic analysis. Instead of simply looking at all words, researchers worked on the transformed lexical space, which has led to improved results because this approach also captures semantic similarities implicitly.
Q.36 Explain discriminative sequence classification methods.
Ans. :
• In segmentation tasks, the sentence or topic decision for a given example (word, sentence, paragraph) highly depends on the decisions for the examples in its vicinity.
• Discriminative sequence classification methods are in general extensions of local discriminative models with additional decoding stages that find the best assignment of labels by looking at neighbouring decisions to label an example.
• Conditional Random Fields (CRFs) are an extension of maximum entropy, SVM-struct is an extension of SVM to handle structured outputs, and maximum margin Markov networks (M3N) are extensions of HMMs.
• The Margin Infused Relaxed Algorithm (MIRA) is an online learning approach that requires loading of one sequence at a time during training.
• CRFs have been successful for many sequence labelling tasks, including sentence segmentation in speech.
• CRFs are a class of log-linear models for labelling structures. CRFs are trained by finding the Λ parameters that maximize the conditional likelihood, normally with a regularization term to avoid overfitting.

• Gradient, conjugate gradient, or online methods are used for training.
• Dynamic programming (Viterbi decoding) is used to find the most probable assignment of labels at test time or to compute the Z(·) function.
hes for wor d clas slflc atlo n.
Q.37 Discuss the hybrid approaches for word classification.
Ans. :
• Non-sequential discriminative classification algorithms typically ignore the context, which is critical for the segmentation task.
• While we may add context as a feature or simply use CRFs, which inherently consider context, these approaches are suboptimal when dealing with real-valued features, such as pause duration or pitch range. Most earlier studies simply tackled this problem by binning the feature space either manually or automatically.
• An alternative is to use a hybrid classification approach, as suggested by Shriberg et al.
• The main idea is to use the posterior probabilities, P_C(y_i | x_i), for each boundary candidate, obtained from the other classifiers, such as boosting or CRF, by simply converting them to state observation likelihoods by dividing by their priors, following the well-known Bayes rule :

P(x_i | y_i) ∝ P_C(y_i | x_i) / P(y_i)

• Applying the Viterbi algorithm to the HMM then returns the most likely segmentation. To handle the dynamic ranges of state transition probabilities and observation likelihoods, a weighting scheme as usually described in the literature can be applied.
• Zimmerman et al. compared various discriminative local classification methods, namely boosting, maximum entropy, and decision trees, along with their hybrid versions, for sentence segmentation of multilingual speech. They concluded that hybrid approaches are always superior.
men tatio n ?
Q.38 What are the extensions of global modeling for sentence segmentation ?
OR How is global modeling for sentence segmentation carried out using extensions ?
Ans. :
• Most approaches to sentence segmentation have focused on recognizing boundaries rather than sentences themselves.
• This has occurred because of the quadratic number of sentence hypotheses that must be assessed in comparison to the number of boundaries.
• To tackle this problem, input is segmented according to likely sentence boundaries established by a local model. Later, a syntactic parser or global model is trained as a re-ranker on the n-best lists.
• This approach allows leveraging of sentence-level features such as scores from a parser or prosodic features.
• Favre et al. proposed to extend this concept to a pruned sentence lattice, which allows combining local scores with sentence-level scores in a more efficient manner.

1.6 Complexity of the Approaches

Q.39 Discuss how the complexity of sentence/topic segmentation approaches is evaluated.
Ans. :
• Sentence/topic segmentation approaches can be rated in terms of the complexity (time and memory) of their training and prediction algorithms and in terms of their performance on real-world datasets. Some may also require specific pre-processing, such as converting or normalizing continuous features to discrete features.
□ Discriminative approaches
a) In terms of complexity, training of discriminative approaches is more complex than training of generative ones because they require multiple passes over the training data to adjust for their feature weights.
□ Generative models
b) Generative models such as HELMs can handle multiple orders of magnitude larger training sets and benefit, for instance, from decades of newswire transcripts. But they do not cope well with unseen events.
□ Discriminative classifiers
c) They allow for a wider variety of features and perform better on smaller training sets.
d) Predicting with discriminative classifiers is also slower, even though the models are relatively simple (linear or log-linear), because it is dominated by the cost of extracting more features.
□ Sequence approaches
e) Compared to local approaches, sequence approaches bring the additional complexity of decoding : finding the best sequence of decisions requires evaluating all possible sequences of decisions.
f) Fortunately, conditional independence assumptions allow the use of dynamic programming to trade time for memory and decode in polynomial time.
g) This complexity is then exponential in the order of the model (number of boundary candidates processed together) and depends on the number of classes (number of boundary classes).
□ Discriminative sequence classifiers
h) Discriminative sequence classifiers, for example CRFs, also need to repeatedly perform inference on the training data, which might become expensive.

1.7 Performance of the Approaches
Q.40 Discuss the performance of sentence segmentation approaches in detail.
OR Write a short note on :
a) Sentence segmentation in text
b) Sentence segmentation in speech
Ans. :
□ a) Evaluation metrics
• In the literature, the performance of sentence segmentation is usually assessed using :
1) the error rate (the number of wrongly labeled candidates divided by the number of examples), and
2) the F1-measure (the harmonic mean of recall and precision),

where
1) Recall is defined as the ratio of the number of correctly returned sentence boundaries to the number of sentence boundaries in the reference annotations.
2) Precision is the ratio of the number of correctly returned sentence boundaries to the number of all automatically estimated sentence boundaries.
3) The National Institute of Standards and Technology (NIST) error rate is the number of candidates wrongly labeled divided by the number of actual boundaries.
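A small sketch computing these metrics from sets of boundary positions:

```python
# Sketch: evaluate predicted vs. reference boundaries, each given as a set
# of boundary positions (e.g., token indices).
def evaluate(predicted: set, reference: set):
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # NIST error rate: wrongly labeled candidates over actual boundaries.
    nist = (len(predicted - reference) + len(reference - predicted)) / len(reference)
    return precision, recall, f1, nist

print(evaluate(predicted={3, 7, 12}, reference={3, 7, 10}))
```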
□ b) Sentence segmentation in text
• For sentence segmentation in text, researchers have reported error rate results on a subset of the Wall Street Journal Corpus of about 27,000 sentences.
• For instance, Mikheev reports that his rule-based system performs at an error rate of 1.41 %.
• The addition of an abbreviation list to this system lowers its error rate to 0.45 %, and combining it with a supervised classifier using POS tag features leads to an error rate of 0.31 %.
• Without requiring handcrafted rules or an abbreviation list, Gillick's SVM-based system obtains even fewer errors, at 0.25 %.
• Even though the error rates presented seem low, sentence segmentation is one of the first processing steps for any NLP task, and each error impacts subsequent steps, especially if the resulting sentences are presented to the user, as for example in extractive summarization.
□ c) Sentence segmentation in speech
• For sentence segmentation in speech, Doss et al. report, on the Mandarin TDT4 Multilingual Broadcast News Speech Corpus, F1-measures using the same set of features of :
o 69.1 % for a MaxEnt classifier
o 72.6 % with AdaBoost
o 72.7 % with SVMs
• A combination of the three classifiers using logistic regression is also proposed.
Fill in the Blanks for Mid Term Exams


Q.1 Natural languages are _____ from constructed and formal languages such as those used to program computers or to study logic.
Q.2 The technique of discovering word structure is called _____ parsing.
Q.3 Depending on the means of communication, _____ are spelled out via graphemes or realized through phonemes.
Q.4 Tokens are _____ words.
Q.5 In morphology and syntax, a _____ is a morpheme that has syntactic characteristics of a word, but depends phonologically on another word or phrase.
Q.6 _____, also known as word segmentation, is the fundamental step of morphological analysis and a prerequisite for most language processing applications.
Q.7 The citation form of a lexeme is also called its _____.