End Sem Answer Key 2023
Question 1.
Initialization:
D(i,0) = i
D(0,j) = j
Recurrence Relation:
For each i = 1…M
For each j = 1…N
             D(i-1,j) + 1                        (deletion)
D(i,j) = min D(i,j-1) + 1                        (insertion)
             D(i-1,j-1) + 2   if X(i) ≠ Y(j)     (substitution)
             D(i-1,j-1) + 0   if X(i) = Y(j)     (match)
Termination:
D(M,N) is the edit distance between X (of length M) and Y (of length N).
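A minimal Python sketch of this dynamic program, assuming the cost scheme above (insertion and deletion cost 1, substitution costs 2):

```python
def min_edit_distance(X, Y):
    """Minimum edit distance with insert/delete cost 1, substitute cost 2."""
    M, N = len(X), len(Y)
    D = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(M + 1):
        D[i][0] = i                               # initialization: D(i,0) = i
    for j in range(N + 1):
        D[0][j] = j                               # initialization: D(0,j) = j
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            # X[i-1] is X(i) in the 1-indexed notation above
            sub = 0 if X[i - 1] == Y[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[M][N]

print(min_edit_distance("intention", "execution"))  # -> 8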
Lexical variation: a difference in which segments are used to represent the word in the lexicon.
Example: because can be pronounced either as monosyllabic 'cause or as bisyllabic because.
Allophonic variation: a difference in how the individual segments change their value in different contexts, due to the influence of the surrounding sounds, syllable structure, etc.
Example: because can be pronounced as [b iy k ah z], [b iy k ah zh], [b iy k ah s], or [b iy k aa z].
Question 2.
Machine Translation:
P(high winds tonite) > P(large winds tonite)
Spell Correction:
The office is about fifteen minuets from my house
P(about fifteen minutes from) > P(about fifteen minuets from)
Speech Recognition:
P(I saw a van) > P(eyes awe of an)
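All three comparisons come from scoring word sequences with a language model. As a sketch, here is a maximum-likelihood bigram model over a tiny invented corpus (the sentences and counts are purely illustrative, and sentence-boundary markers are ignored):

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = [
    "high winds tonite".split(),
    "high winds expected".split(),
    "large crowds tonite".split(),
]
bigrams = Counter(b for sent in corpus for b in zip(sent, sent[1:]))
unigrams = Counter(w for sent in corpus for w in sent)

def bigram_prob(sentence):
    """Chain-rule score with MLE bigram estimates P(w2|w1) = C(w1 w2) / C(w1)."""
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

print(bigram_prob("high winds tonite"))   # 0.5
print(bigram_prob("large winds tonite"))  # 0.0: "large winds" was never seen
```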
Types: the number of distinct words in a corpus, i.e. the size of the vocabulary V.
Tokens: the total number of running words, N.
“They picnicked by the pool, then lay back on the grass and looked at the stars.” This sentence has 16 word tokens and 14 word types (not counting punctuation).
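These counts can be reproduced with a few lines of Python, assuming punctuation is stripped and case is folded:

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")
tokens = re.findall(r"[a-z]+", sentence.lower())  # fold case, drop punctuation
types = set(tokens)
print(len(tokens), "word tokens,", len(types), "word types")
# -> 16 word tokens, 14 word types  ("the" occurs three times)
```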
The Switchboard corpus has 2.4 million wordform tokens and approximately 20,000 wordform types.
The Brown corpus (about a million wordform tokens) contains 61,805 wordform types.
Brown et al. (1992): 583 million wordform tokens, including 293,181 different wordform types.
Question 3.
• Entropy: expected surprise (over p):
H(p) = E_p[ log2 (1/p_x) ] = −Σ_x p_x log2 p_x
• x log x is convex
• Σ_x x log x is convex (a sum of convex functions is convex), so its negation, the entropy −Σ_x p_x log2 p_x, is concave.
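A small numeric check of the definition (terms with p_x = 0 are conventionally treated as contributing 0):

```python
import math

def entropy(p):
    """H(p) = -sum_x p_x log2 p_x, in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin is maximally surprising
print(entropy([0.9, 0.1]))  # ~0.47 bits: a biased coin surprises us less
print(entropy([1.0]))       # 0.0: a certain outcome carries no surprise
```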
Issues of Scale
• Lots of features:
• NLP maxent models can have well over a million features.
• Even storing a single array of parameter values can have a substantial memory cost.
• Lots of sparsity:
• Many features seen in training will never occur again at test time.
• Overfitting is very easy.
• Optimization problems:
• Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.
Question 4.
A. It can tell us how the word is pronounced:
The noun is CONtent and the adjective is conTENT.
OBject (noun) vs. obJECT (verb),
DIScount (noun) vs. disCOUNT (verb).
Knowing a word’s part of speech can help tell us which morphological affixes it can take.
It gives a significant amount of information about the word and its neighbors.
Knowing whether a word is a possessive pronoun or a personal pronoun can tell us what words are
likely to occur in its vicinity.
It is also useful in language models for speech recognition.
B. Parts of speech can be divided into two broad supercategories:
Open class: grows continuously as new words are coined.
4 major open classes: nouns, main verbs, adjectives, and adverbs.
Closed class: has relatively fixed membership.
Examples of English closed classes: prepositions, determiners, pronouns, conjunctions, auxiliary verbs, and particles.
Function words: tend to be very short, occur frequently, and play an important role in grammar.
Examples: of, it, and, you, etc.
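As an illustration of such POS ambiguity, a hypothetical check with NLTK's off-the-shelf tagger (this assumes the nltk package and its punkt and averaged_perceptron_tagger resources are installed; the exact tags depend on the trained model):

```python
import nltk  # assumes 'punkt' and 'averaged_perceptron_tagger' are downloaded

sentence = "I object to that object."
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# A Penn Treebank tagger should tag the first "object" as a verb (VB*)
# and the second as a noun (NN), disambiguating them by context.
```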
Question 5.
Yesterday, I bought a Nokia phone and my girlfriend bought a moto phone. We called each other when we got home.
The voice on my phone was not clear. The camera was good. My girlfriend said the sound of her phone was clear. I
wanted a phone with good voice quality. So I was satisfied and returned the phone to BestBuy yesterday.
Challenges
Contrasts with standard text-based categorization
Domain dependent
Sarcasm
Sometimes people express their negative feelings using positive or intensified positive words in the
text.
Thwarted expressions
The sentences/words that contradict the overall sentiment of the text are in the majority, as the sketch after this list illustrates.
Example: “The actors are good, the music is brilliant and appealing.
Yet, the movie fails to strike a chord.”
Consolidation of conflicting sentiments (e.g., in the review above, the camera is praised while the voice quality is criticized, and these must be combined into an overall judgment).
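To see why thwarted expressions are hard, consider a deliberately naive lexicon-based scorer (the word lists here are invented for illustration):

```python
POSITIVE = {"good", "brilliant", "appealing", "clear", "satisfied"}
NEGATIVE = {"fails", "not"}

def lexicon_polarity(text):
    """Count positive minus negative words: a deliberately naive baseline."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

review = ("The actors are good, the music is brilliant and appealing. "
          "Yet, the movie fails to strike a chord.")
print(lexicon_polarity(review))  # -> 2: the lexicon says positive,
                                 #    but the review is actually negative
```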
Question 6.
Dotted rule: we use a dot within the right-hand side of a state's grammar rule to indicate the progress made in recognizing it.
Operation of Earley parser
March through the N+1 sets of states in the chart in a left-to-right fashion.
At each step, one of three operators is applied to a single state, deriving new states from it.
Predictor: S→.VP, [0, 0] => VP→.Verb, [0, 0] & VP→.Verb NP, [0, 0]
Scanner: VP→.Verb NP, [0, 0] => VP→Verb.NP, [0,1]
Completer: NP→Det Nominal., [1,3] & VP→Verb.NP, [0,1] => VP→Verb NP., [0,3]
This results in the addition of new states to the end of either the current or next set of states in the chart.
The presence of a state S → α., [0,N] in the list of states in the last chart entry indicates a successful parse.
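A compact Earley recognizer sketch in Python implementing the three operators above. The grammar and lexicon are a toy fragment chosen for illustration (no ε-rules), and a dummy start state GAMMA → .S plays the role of S above:

```python
from collections import namedtuple

# A state is a dotted rule lhs -> rhs[:dot] . rhs[dot:] spanning [start, end].
State = namedtuple("State", "lhs rhs dot start end")

GRAMMAR = {                          # toy grammar; terminals are POS tags
    "S": [["NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"]],
    "Nominal": [["Noun"]],
    "VP": [["Verb"], ["Verb", "NP"]],
}
LEXICON = {"book": "Verb", "that": "Det", "flight": "Noun"}

def earley_recognize(words):
    n = len(words)
    chart = [[] for _ in range(n + 1)]       # N+1 sets of states

    def enqueue(state, k):
        if state not in chart[k]:
            chart[k].append(state)

    enqueue(State("GAMMA", ("S",), 0, 0, 0), 0)   # dummy start state
    for k in range(n + 1):                        # march left to right
        i = 0
        while i < len(chart[k]):                  # chart[k] grows as we go
            st = chart[k][i]
            i += 1
            if st.dot < len(st.rhs):
                nxt = st.rhs[st.dot]
                if nxt in GRAMMAR:                # PREDICTOR
                    for rhs in GRAMMAR[nxt]:
                        enqueue(State(nxt, tuple(rhs), 0, k, k), k)
                elif k < n and LEXICON.get(words[k]) == nxt:
                    # SCANNER: the next word has the expected part of speech
                    enqueue(State(st.lhs, st.rhs, st.dot + 1, st.start, k + 1),
                            k + 1)
            else:                                 # COMPLETER: st is finished
                for old in chart[st.start]:
                    if old.dot < len(old.rhs) and old.rhs[old.dot] == st.lhs:
                        enqueue(State(old.lhs, old.rhs, old.dot + 1,
                                      old.start, st.end), k)
    # GAMMA -> S . spanning [0, n] in the last chart entry means success
    return any(s.lhs == "GAMMA" and s.dot == 1 for s in chart[n])

print(earley_recognize("book that flight".split()))  # -> True
```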
Question 7.
Information Extraction tasks are characterized by two properties:
1. the desired knowledge can be described by a relatively simple and fixed template (frame) with slots that need to be filled in with material from the text;
2. only a small part of the information in the text is relevant for filling in this frame; the rest can be ignored.
Precision is a measure of how much of the information that the system returned is actually correct.
Precision = # of correct answers given by the system / # of answers given by the system
Recall is a measure of how much of the relevant information the system has extracted from the text.
Recall = # of correct answers given by the system / total # of possible correct answers in the text
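A small sketch of the two measures, modeling answers as sets of (slot, filler) pairs; the example data below is invented:

```python
def precision_recall(system_answers, gold_answers):
    """Precision and recall over sets of (slot, filler) pairs."""
    correct = len(system_answers & gold_answers)
    precision = correct / len(system_answers) if system_answers else 0.0
    recall = correct / len(gold_answers) if gold_answers else 0.0
    return precision, recall

# Invented example: the system returns 4 slot fillers, 3 of them correct,
# while the text supports 5 correct fillers in total.
system = {("victim", "Jones"), ("city", "Boston"),
          ("date", "Monday"), ("weapon", "knife")}
gold = {("victim", "Jones"), ("city", "Boston"), ("date", "Monday"),
        ("weapon", "gun"), ("perpetrator", "Smith")}
print(precision_recall(system, gold))  # -> (0.75, 0.6)
```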
Question 8.
Question 9.
The classic search model
Inverted index
• We need variable-size postings lists
– On disk, a continuous run of postings is normal and best
– In memory, can use linked lists or variable length arrays
• Some tradeoffs in size/ease of insertion
Positional indexes
• In the postings, store, for each term, the position(s) in which tokens of it appear:
<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Example:
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
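A minimal in-memory sketch of building such a positional index (real systems use compressed on-disk postings, per the tradeoffs above):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions...]}, the structure sketched above."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

docs = {1: "to be or not to be", 2: "be quick or be dead"}
index = build_positional_index(docs)
# number of docs containing "be", then its per-document position lists
print(len(index["be"]), dict(index["be"]))  # -> 2 {1: [1, 5], 2: [0, 3]}
```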
Question 10.
Extractive summaries are created by reusing portions (words, sentences, etc.) of the input text verbatim.
For example, search engines typically generate extractive summaries from webpages.
Most of the summarization research today is on extractive summarization.
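A naive frequency-based extractive summarizer sketch (one of many possible scoring schemes; the key property is that sentences are returned verbatim, which is what makes it extractive):

```python
import re
from collections import Counter

def extractive_summary(text, n=1):
    """Return the n sentences with the highest average word frequency."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = lambda s: re.findall(r"[a-z]+", s.lower())
    freqs = Counter(w for s in sentences for w in words(s))
    # Average (not total) frequency, to avoid favoring long sentences.
    ranked = sorted(sentences,
                    key=lambda s: sum(freqs[w] for w in words(s))
                                  / max(len(words(s)), 1),
                    reverse=True)
    return " ".join(ranked[:n])

doc = ("The pool was warm. They picnicked by the pool. "
       "Then they lay back on the grass.")
print(extractive_summary(doc))  # -> "They picnicked by the pool."
```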