End Sem Answer Key 2023

The document discusses various aspects of natural language processing (NLP), including initialization and recurrence relations for distance calculations, lexical and allophonic variations, and applications like machine translation and spell correction. It also covers challenges in information extraction, statistical named entity recognition techniques, and the differences between extractive and abstractive summarization. Additionally, it highlights the pros and cons of chatbots and issues related to scale in NLP models.


Question 1.

Initialization:
D(i,0) = i
D(0,j) = j

Recurrence Relation:
For each i = 1…M
  For each j = 1…N
    D(i,j) = min of:
      D(i-1,j) + 1                      (deletion)
      D(i,j-1) + 1                      (insertion)
      D(i-1,j-1) + 2, if X(i) ≠ Y(j)    (substitution)
      D(i-1,j-1) + 0, if X(i) = Y(j)    (match)

Termination:
D(M,N) is the distance.
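A minimal Python sketch of this dynamic program (the test pair "intention"/"execution" is the standard textbook example, not from this answer key):

def min_edit_distance(X, Y):
    """Minimum edit distance with insert/delete cost 1 and substitution cost 2."""
    M, N = len(X), len(Y)
    D = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(M + 1):                    # initialization: D(i,0) = i
        D[i][0] = i
    for j in range(N + 1):                    # initialization: D(0,j) = j
        D[0][j] = j
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = 0 if X[i - 1] == Y[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution (or match)
    return D[M][N]                            # termination: D(M,N)

print(min_edit_distance("intention", "execution"))  # 8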

 Lexical variation: a difference in what segments are used to represent the word in the lexicon.
 Example: "because" can be pronounced either as monosyllabic "cause" or as bisyllabic "because".
 Allophonic variation: a difference in how the individual segments change their value in different contexts, due to the influence of the surrounding sounds, syllable structure, etc.
 Example: "because" can be pronounced as [b iy k ah z], [b iy k ah zh], [b iy k ah s], or [b iy k aa z].
Question 2.
 Machine Translation:
 P(high winds tonite) > P(large winds tonite)
 Spell Correction
 The office is about fifteen minuets from my house
 P(about fifteen minutes from) > P(about fifteen minuets from)
 Speech Recognition
 P(I saw a van) > P(eyes awe of an)
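A toy bigram model makes these comparisons concrete; the corpus below and the resulting counts are illustrative assumptions, not real data:

from collections import Counter

# Toy corpus -- an illustrative assumption, not real data.
corpus = ("high winds tonite . large trucks tonite . "
          "high winds today . high winds tonite .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_score(phrase):
    """Approximate P(w1..wn) by the product of MLE bigram probabilities
    P(wi | wi-1) = count(wi-1, wi) / count(wi-1); no smoothing."""
    words = phrase.split()
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

print(bigram_score("high winds tonite"))   # ~0.67 -- fluent, seen in corpus
print(bigram_score("large winds tonite"))  # 0.0 -- the bigram "large winds" is unseen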

 Types: the number of distinct words in a corpus, i.e., the size of the vocabulary V.
 Tokens: the total number of running words, N.
 “They picnicked by the pool, then lay back on the grass and looked at the stars.” 16 word tokens and
14 word types
 Switchboard corpus has 2.4 million wordform tokens and approximately 20,000 wordform types.
 Brown corpus contains 61,805 wordform types.
 Brown et al.(1992): 583 million wordform tokens that included 293,181 different wordform types.
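A quick way to verify such counts is to tokenize and count distinct forms; this sketch uses simple regex tokenization (an assumption; punctuation is ignored, matching the count above):

import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

tokens = re.findall(r"\w+", sentence)  # running words; punctuation ignored
types = set(tokens)                    # distinct wordforms, the vocabulary V

print(len(tokens), "word tokens")      # 16 word tokens
print(len(types), "word types")        # 14 word types ("the" occurs 3 times)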

Question 3.
• Entropy: expected surprise (over p):
H(p) = E_p[ log2(1/p_x) ] = −Σ_x p_x log2 p_x
• x log x is convex.
• Σ x log x is convex (a sum of convex functions is convex), so the entropy H(p) = −Σ_x p_x log2 p_x is concave.
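A minimal sketch of the entropy computation (the function name is illustrative):

import math

def entropy(p):
    """H(p) = -sum_x p_x * log2(p_x): the expected surprise, in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit -- a fair coin is maximally surprising
print(entropy([0.99, 0.01]))  # ~0.08 bits -- a near-certain outcome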
Issues of Scale
• Lots of features:
• NLP maxent models can have well over a million features.
• Even storing a single array of parameter values can have a substantial memory cost.
• Lots of sparsity:
• Many features seen in training will never occur again at test time.
• Overfitting is very easy.
• Optimization problems:
• Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.

Question 4.
 It can tell us how the word is pronounced:
 the noun is CONtent and the adjective is conTENT,
 OBject (noun) and obJECT (verb),
 DIScount (noun) and disCOUNT (verb).
 Knowing a word’s part of speech can help tell us which morphological affixes it can take.
 It gives a significant amount of information about the word and its neighbors:
 knowing whether a word is a possessive pronoun or a personal pronoun can tell us what words are likely to occur in its vicinity.
 This is useful in language models for speech recognition.
B. Parts of speech can be divided into two broad super categories:
 Open Class: Growing continuously
 4 Major open classes: nouns, main verbs, adjectives, and adverbs
 Closed class: It has relatively fixed membership.
 Examples of English closed classes: prepositions, determiners, pronouns, conjunctions, auxiliary verbs, and particles.
 Function word: tend to be very short, occur frequently, and play an important role in grammar.
 Example: of, it, and, you, etc.
Question 5.

Yesterday, I bought a Nokia phone and my girlfriend bought a moto phone. We called each other when we got home.
The voice on my phone was not clear. The camera was good. My girlfriend said the sound of her phone was clear. I
wanted a phone with good voice quality. So I was satisfied and returned the phone to BestBuy yesterday.

Challenges
 Contrasts with standard text-based categorization
 Domain dependent
 Sarcasm
 Sometimes people express their negative feelings using positive or intensified positive words in the
text.
 Thwarted expressions
 The majority of the sentences/words contradict the overall sentiment of the text.
Example: "The actors are good, the music is brilliant and appealing. Yet, the movie fails to strike a chord."
 Consolidation of conflicting sentiments

Question 6.
 Dotted rule: We use a dot within the right hand side of a state’s grammar rule to indicate the progress made in
recognizing it.
Operation of Earley parser
 March through the N+1 sets of states in the chart in a left-to-right fashion.
 At each step, one of three operators is applied to a single state, deriving new states from it.
 Predictor: S→.VP, [0, 0] => VP→.Verb, [0, 0] & VP→.Verb NP, [0, 0]
 Scanner: VP→.Verb NP, [0, 0] => VP→Verb.NP, [0,1]
 Completer: NP→Det Nominal., [1,3] & VP→Verb.NP, [0,1] => VP→Verb NP., [0,3]
 This results in the addition of new states to the end of either the current or next set of states in the chart.
 The presence of a state S→ α., [0,N] in the list of states in the last chart entry indicates a successful parse.
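The following is a compact sketch of an Earley recognizer in Python. The grammar matches the rules used in the example states above (S → VP, VP → Verb | Verb NP, NP → Det Nominal); the lexicon and the input sentence "book that flight" are illustrative assumptions:

from collections import namedtuple

# A chart state: dotted rule `lhs -> rhs` with the dot before rhs[dot],
# spanning input positions [start, end].
State = namedtuple("State", "lhs rhs dot start end")

# Toy grammar matching the rules in the example states above.
GRAMMAR = {
    "S":  [("VP",)],
    "VP": [("Verb",), ("Verb", "NP")],
    "NP": [("Det", "Nominal")],
}
# Toy lexicon -- an assumption for illustration.
LEXICON = {"book": "Verb", "that": "Det", "flight": "Nominal"}

def earley_recognize(words):
    chart = [set() for _ in range(len(words) + 1)]
    chart[0].add(State("GAMMA", ("S",), 0, 0, 0))    # dummy start state
    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:
            st = agenda.pop()
            if st.dot < len(st.rhs):
                nxt = st.rhs[st.dot]
                if nxt in GRAMMAR:                   # Predictor
                    for rhs in GRAMMAR[nxt]:
                        new = State(nxt, rhs, 0, i, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
                elif i < len(words) and LEXICON.get(words[i]) == nxt:
                    new = State(st.lhs, st.rhs, st.dot + 1, st.start, i + 1)
                    chart[i + 1].add(new)            # Scanner
            else:                                    # Completer
                for prev in list(chart[st.start]):
                    if prev.dot < len(prev.rhs) and prev.rhs[prev.dot] == st.lhs:
                        new = State(prev.lhs, prev.rhs, prev.dot + 1,
                                    prev.start, st.end)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
    # Success iff GAMMA -> S . , [0, N] is in the last chart entry.
    return any(s.lhs == "GAMMA" and s.dot == 1 for s in chart[len(words)])

print(earley_recognize("book that flight".split()))  # True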

Question 7.
Information Extraction tasks are characterized by two properties:
1. the desired knowledge can be described by a relatively simple and fixed template (frame) with slots that need to be filled in with material from the text;
2. only a small part of the information in the text is relevant for filling in this frame; the rest can be ignored.

Precision is a measure of how much of the information that the system returned is actually correct.
Precision = # of correct answers given by the system / # of answers given by the system

Recall is a measure of how much of the relevant information the system has extracted from the text.
Recall = # of correct answers given by the system / total # of possible correct answers in the text

The F-measure balances recall and precision using a parameter β:

F = (β² + 1)·P·R / (β²·P + R)

When β = 1, precision and recall are given equal weight.
When β > 1, recall is favored.
When β < 1, precision is favored.
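A minimal sketch of these metrics in Python (the counts in the example are invented for illustration):

def f_measure(precision, recall, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

# Invented example: the system gives 8 answers, 6 of them correct,
# and the text contains 10 correct answers in total.
precision = 6 / 8   # 0.75
recall = 6 / 10     # 0.60
print(f_measure(precision, recall))            # beta = 1: ~0.667
print(f_measure(precision, recall, beta=2.0))  # beta > 1 favors recall: 0.625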

Question 8.

1. Statistical NER techniques: Sequence models: HMMs, CMMs/MEMMs, CRFs


2. Hybrid Approach
3. Dictionary (Gazetteers) Look-up Approach
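A minimal sketch of the gazetteer look-up approach (the gazetteer entries, the longest-match strategy, and the example sentence are illustrative assumptions):

# Toy gazetteer -- illustrative entries, not from the answer key.
GAZETTEER = {
    "new delhi": "LOCATION",
    "infosys": "ORGANIZATION",
    "sachin tendulkar": "PERSON",
}

def gazetteer_ner(text, max_len=3):
    tokens = text.lower().split()
    entities = []
    i = 0
    while i < len(tokens):
        # Greedily try the longest span first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in GAZETTEER:
                entities.append((span, GAZETTEER[span]))
                i += n
                break
        else:
            i += 1   # no gazetteer entry starts here
    return entities

print(gazetteer_ner("Sachin Tendulkar visited Infosys in New Delhi"))
# [('sachin tendulkar', 'PERSON'), ('infosys', 'ORGANIZATION'), ('new delhi', 'LOCATION')]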

The following are the major challenges encountered in Indian languages:

 Agglutination
 Ambiguity
 between proper and common nouns
 between named entities
 Lack of capitalization

Question 9.
The classic search model
Inverted index
• We need variable-size postings lists
– On disk, a contiguous run of postings is normal and best
– In memory, can use linked lists or variable length arrays
• Some tradeoffs in size/ease of insertion

Positional indexes
• In the postings, store, for each term, the position(s) in which tokens of it appear:
<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Example:
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
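A minimal Python sketch of building such a positional index (the doc IDs and texts are illustrative):

from collections import defaultdict

def build_positional_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

docs = {1: "To be or not to be", 2: "Be yourself"}
index = build_positional_index(docs)
print(dict(index["be"]))  # {1: [1, 5], 2: [0]}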

Question 10.
 Extractive summaries are created by reusing portions (words, sentences, etc.) of the input text verbatim.
 For example, search engines typically generate extractive summaries from webpages.
 Most of the summarization research today is on extractive summarization.

 In abstractive summarization, information from the source text is re-phrased.


 Human beings generally write abstractive summaries (except when they do their assignments).
 Abstractive summarization has not reached a mature stage because allied problems such as semantic
representation, inference and natural language generation are relatively harder.

Chatbots: pro and con


 Pros:
 Fun
 Good for narrow, scriptable applications
 Cons:
 They don't really understand
 Rule-based chatbots are expensive and brittle
 IR-based chatbots can only mirror training data
 The case of Microsoft Tay
 (or, Garbage-in, Garbage-out)
 The future: combining chatbots with frame-based agents
