Quest NLP
2. Which of the following techniques can be used for the purpose of keyword normalization, the
process of converting a keyword into its meaningful base form?
A. Lemmatization B. Levenshtein distance C. Morphing D. Stemming
Answer: (a)
Lemmatization is the process of mapping an inflected or derived word to its base form (root
word). The base form is the meaningful stem.
Stemming is a process similar to lemmatization, but the base form it produces need not be a
meaningful word.
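The difference can be sketched in a few lines of Python. This is a toy illustration only: the suffix list, the lemma table and the function names crude_stem and toy_lemmatize are hypothetical, and a real system would use a proper stemmer and a dictionary-backed lemmatizer instead.

```python
# Toy illustration: both tables below are hypothetical and deliberately tiny.
SUFFIXES = ("ies", "ing", "ed", "es", "s")
LEMMAS = {"studies": "study", "studying": "study", "ran": "run"}

def crude_stem(word):
    """Chop off the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table of known base forms (unknown words pass through)."""
    return LEMMAS.get(word, word)

for w in ["studies", "studying", "ran"]:
    print(w, "stem:", crude_stem(w), "lemma:", toy_lemmatize(w))
```

Here "studies" is stemmed to "stud", which is not a word, while the lookup returns the meaningful base form "study".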
A.
Choose the correct one. (1 mark each)
1. In which of the following areas can NLP be useful?
A. Automatic text summarization B. Automatic question answering systems
C. Information retrieval D. All
2. Choose the area where NLP cannot be useful.
A. Automatic Text Summarization B. Automatic question answering systems
C. Information retrieval D. X-Ray analysis
3. Which of the following is a field related to NLP?
A. Building robot B. Economics C. Linguistics D. all
4. Which of the following is not a field related to Natural Language Processing?
A. Computer Science B. AI C. Linguistics D. Economics
5. What is the significance of the caret ^ in a regular expression?
A. If [ab^cd] means “a or b ^ c and d”.
B. If [^A-Z] means all uppercase nothing negated.
C. If caret is first symbol after the open square brace "[" then resulting pattern is negated.
D. If [^a-b] means all lowercase nothing negated.
6. What is the meaning of morphology?
A. The study of word format B. The study of sentence format
C. The study of syntax of sentence D. The study of semantics of sentence.
7. N-grams are defined as the combination of N keywords together. How many bi-grams can be
generated from the given sentence: “Education is the most powerful weapon which you can
use to change the world”.
A. 14 B. 13 C. 12 D. 11
8. What is the number of Trigrams in a normalized sentence of length of N words?
A. N B. N-1 C. N-2 D. N-3
9. Which Python library is used to implement natural language processing?
A. NLTK B. Scrapy C. Matplotlib D. Pydot
10. Parts-of-Speech tagging determines ___________
A. part-of-speech for each symbol only generated dynamically as per meaning of the
sentence
B. part-of-speech for each word dynamically as per sentence structure
C. all stems for a specific word given as input
D. all lemmas for a specific word given as input
11. Which is one of the supercategories of parts of speech?
A. Sub class B. Open class C. Join class D. Empty class
12. Which of the following belongs to the open class group?
A. Verb B. Prepositions C. Determiners D. Conjunctions
13. Which is the type of morphology that changes the word category and affects the meaning?
A. Inflectional B. Derivational C. Cliticization D. Rational
14. Choose from the following the area where NLP is not useful.
A. Automatic Text Summarization B. Automatic Q&A Systems
C. Partially Observable systems D. Information Retrieval
15. N-Gram language models cannot be used for -------.
A. Spelling Correction B. Predicting the completion of a sentence
C. Removing semantic ambiguity D. Speech Recognition
16. Which of the following is the type of the verbs 'walk', 'talk', 'print'?
A. Regular verb B. Irregular verb C. Complex verb D. Normal verb
17. Which ratio is used to estimate an N-gram probability?
A. Frequency B. relative frequency C. cumulative frequency D. both A & C
18. In an HMM, observation likelihoods measure the likelihood of ________.
A. a POS tag given a word B. a POS tag given the preceding tag
C. a word given a POS tag D. a POS tag given two preceding tags
19. Which of the following will be POS Tagger output when the input sentence is "They refuse
to permit"
A. [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB')]
B. [('They', 'NN'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB')]
C. [('They', 'PRP'), ('refuse', 'NN'), ('to', 'TO'), ('permit', 'VB')]
D. [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'PRP'), ('permit', 'VB')]
20. Which algorithm is commonly used for text classification in NLP?
A. Decision Trees B. K-Means clustering C. Naïve Bayes D. SVM
Chapter (2) RE
Note:
----------
//
[] to specify a disjunction of characters
- to specify a range
[^ ] any single character except the characters listed after ^
? zero or one of the previous character
* zero or more
+ one or more
. wildcard, any single character except a carriage return
-----------
/s/ and /S/ are the same. True or False
How do you write a regular expression that specifies any single digit?
- /[1234567890]/ (or) /[0-9]/
For price, /[0-9][0-9]*/ (or) /[0-9]+/
What is the regular expression
for the strings like aaaa or ababab or bbbb?
- /[ab]*/
for the strings ‘rain’, ‘ran’, ‘run’?
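- /r.n/ matches ‘ran’ and ‘run’ but not ‘rain’, because . stands for exactly one character.
/r..?n/ matches all three.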
What is the meaning of /[^a]/?
- Any single character except a.
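The answers above can be checked with Python's re module. This is a minimal sketch: the helper name full and the test strings are chosen here for illustration, and re.fullmatch is used so that the whole string has to match the pattern.

```python
import re

def full(pattern, text):
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Each entry: (pattern, strings that should match, strings that should not)
checks = [
    (r"[0-9]", ["7"], ["a", "42"]),                # any single digit
    (r"[0-9]+", ["5", "1999"], ["", "12a"]),       # one or more digits
    (r"[ab]*", ["aaaa", "ababab", "bbbb"], ["abc"]),
    (r"r.n", ["ran", "run"], ["rain", "rn"]),      # . is exactly one character
    (r"[^a]", ["b", "A"], ["a"]),                  # anything except lowercase a
    (r"s", ["s"], ["S"]),                          # patterns are case sensitive
]

for pattern, good, bad in checks:
    assert all(full(pattern, s) for s in good), pattern
    assert not any(full(pattern, s) for s in bad), pattern
print("all checks passed")
```

Note that /[ab]*/ also matches the empty string, since * allows zero occurrences.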
FSA
Describe regular language.
- Regular language is a kind of formal language. Regular expressions, finite state automata
and regular grammar can be used to describe regular languages.
Discuss automaton and its components.
- An automaton is used for modeling a regular expression. It is also called a finite automaton,
finite-state automaton or FSA. It can be represented as a directed graph: a finite set of
vertices (nodes) and a set of directed links between pairs of vertices called arcs. It can also
be represented with a state-transition table. The components of an automaton are:
1. A finite set of N states
2. Start state = state 0
3. Final state = accepting state represented by double circle.
4. Non-final state = reject = fail state = sink state
5. Transition between states
Describe the algorithms for recognizing a string using a state-transition table and briefly explain
them.
1. D-RECOGNIZE for deterministic recognizer
- A deterministic algorithm is one that has no choice points; the algorithm always knows
what to do for any input.
-
2. NFSA
What are the solutions to the problem of choice points in an NFSA?
1. Backup – at a choice point, put a marker to record where we were in the input and what
state the automaton was in. Another path can then be tried if the choice turns out to be wrong.
2. Look-ahead – look ahead to decide which path to take
3. Parallelism – look at every alternative path in parallel
What is a formal language?
- A formal language is a set of strings, each string composed of symbols from a finite
symbol-set called an alphabet. Eg./ mathematical formulas, chemical notation and programming
languages
- Formal languages are not the same as natural languages. Natural languages are the kind
of languages that real people speak.
- A formal language can be used to model part of a natural language.
- Generative grammar is used in linguistics to mean a grammar of a formal language.
Construct the state-transition table for the following automaton and state the type of FSA. (or)
Draw the finite-state automaton for the following state-transition table and state the type of
FSA. (10 marks)
(a) NFSA
Answer:
State   Input 0   Input 1
q0      q0, q1    q0, q2
q1      q3        null
q2      null      q3
q3      null      null
(b) D-FSA
Answer:
State   Input 0   Input 1
q0      q1        q2
q1      q3        q2
q2      q1        q4
q3      q3        q2
q4      q1        q4
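A table-driven deterministic recognizer in the spirit of D-RECOGNIZE can be sketched as follows, using the D-FSA table above. The accepting states are an assumption: the table alone does not say which states are final, so q3 and q4 are taken as final here (this makes the machine accept strings ending in 00 or 11). The function name d_recognize is illustrative.

```python
# Transition table copied from the D-FSA answer: state -> {symbol: next state}.
TABLE = {
    "q0": {"0": "q1", "1": "q2"},
    "q1": {"0": "q3", "1": "q2"},
    "q2": {"0": "q1", "1": "q4"},
    "q3": {"0": "q3", "1": "q2"},
    "q4": {"0": "q1", "1": "q4"},
}
START = "q0"
ACCEPTING = {"q3", "q4"}  # assumption: final states are not given in the table

def d_recognize(tape, table=TABLE, start=START, accepting=ACCEPTING):
    """Follow exactly one transition per input symbol; there are no choice points."""
    state = start
    for symbol in tape:
        next_state = table[state].get(symbol)
        if next_state is None:      # empty table cell: no legal move, reject
            return False
        state = next_state
    return state in accepting       # accept only if the input ends in a final state

for tape in ["100", "11", "0110", ""]:
    print(repr(tape), d_recognize(tape))
```

Because every cell holds at most one next state, the loop never has to remember alternatives or back up.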
The order in which an NFSA chooses the next state to explore on the agenda defines its search
strategy. The depth-first search or LIFO strategy corresponds to the agenda-as-stack; the
breadth-first search or FIFO strategy corresponds to the agenda-as-queue.
Evaluate the ordering strategies of NFSA to explore the possible paths through a machine.
- Figure 2.20
- Figure 2.21
- The first one is an ordering strategy where the states that are considered next are the most
recently created ones. The agenda is implemented by a stack, which is commonly referred
to as depth-first search or the Last In First Out (LIFO) strategy. It has one major pitfall: under
certain circumstances it can enter an infinite loop.
- The second way to order the states in the search space is to consider states in the order in
which they are created. The agenda is implemented by a queue, which is commonly
referred to as breadth-first search or the First In First Out (FIFO) strategy. Its pitfall is that
the search may never terminate if the state-space is infinite.
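The two agenda orderings can be sketched with a single non-deterministic recognizer that treats the agenda either as a stack or as a queue. The sketch below uses the NFSA table from answer (a); taking q3 as the only accepting state is an assumption, and the name nd_recognize and the strategy parameter are illustrative.

```python
from collections import deque

# NFSA table from answer (a): state -> {symbol: set of next states}.
# A missing entry plays the role of "null".
NFSA = {
    "q0": {"0": {"q0", "q1"}, "1": {"q0", "q2"}},
    "q1": {"0": {"q3"}},
    "q2": {"1": {"q3"}},
    "q3": {},
}
ACCEPTING = {"q3"}  # assumption: final states are not given in the table

def nd_recognize(tape, strategy="dfs", table=NFSA, start="q0", accepting=ACCEPTING):
    """Explore search states (machine state, tape position) kept on an agenda."""
    agenda = deque([(start, 0)])
    while agenda:
        # Stack (LIFO) gives depth-first search; queue (FIFO) gives breadth-first.
        state, i = agenda.pop() if strategy == "dfs" else agenda.popleft()
        if i == len(tape):
            if state in accepting:
                return True
            continue                 # dead end: try the next alternative
        for nxt in sorted(table[state].get(tape[i], ())):
            agenda.append((nxt, i + 1))
    return False                     # agenda exhausted: reject

print(nd_recognize("100", "dfs"), nd_recognize("100", "bfs"))
```

This machine has no epsilon transitions, so every step consumes input and both strategies terminate; the pitfalls described above arise in machines that can keep generating new search states.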
Chapter (3)
What are affixes? Which affixes are in the word “unbelievably”?
- Affixes are morphemes that add “additional” meanings of various kinds to a word. They are
divided into prefixes, suffixes, infixes and circumfixes.
- “unbelievably” has three affixes (un-, -able, and -ly) attached to the stem “believe”.
Discuss the ways to combine morphemes to create words that are common and play important
roles in speech and language processing.
1. Inflection – combination of a word stem with a grammatical morpheme, resulting in a word
of the same class as the original stem and filling some syntactic function like agreement.
Eg./ adding the morpheme -s to mark the plural on nouns and -ed to mark the past tense on
verbs.
2. Derivation – combination of a word stem with a grammatical morpheme, usually
resulting in a word of a different class, often with a meaning hard to predict exactly. Eg./
the verb “computerize” becomes the noun “computerization” by adding -ation
3. Compounding – combination of multiple word stems together. Eg./ doghouse
4. Cliticization – combination of a word stem with a clitic (a reduced form). Eg./ -'ve in I’ve
Chapter (4) N-gram
Discuss N-gram model and the area of usage. (8 marks)
An N-gram model is a probabilistic model of word prediction that predicts the next
word from the previous N−1 words. Such statistical models of word sequences are also called
language models or LMs.
N-grams are used to identify words in noisy, ambiguous input, as in speech recognition and
handwriting recognition.
They are also essential in statistical machine translation, spelling correction and augmentative
communication systems that help the disabled.
They are also important in NLP tasks like part-of-speech tagging, natural language generation
and word similarity, as well as in applications from authorship identification and sentiment
extraction to predictive text input systems for cell phones.
What is an utterance? What kinds of disfluencies are there in the following sentences? Briefly
explain each. (5 marks)
“I do uh main- mainly business data processing”
"So, I was, um, thinking about switching careers."
"We, uh, we need to finish the report by, like, tomorrow."
"I mean, I guess, uh, we could try a different approach?"
"I— I think we should, uh, wait before making a decision."
An utterance is the spoken correlate of a sentence. “uh” is a filler (filled pause), a sound used
to pause briefly while speaking. “main-” is a fragment, a word that is broken off in the
middle.
Write out all the non-zero trigram probabilities from the following mini-corpus of three
sentences.
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Answer:
P(am | <s> I) = 1/2 = 0.5
P(Sam | I am) = 1/2 = 0.5
P(</s> | am Sam) = 1/1 = 1
P(I | <s> Sam) = 1/1 = 1
P(do | <s> I) = 1/2 = 0.5
etc.
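The remaining probabilities can be listed with a short script. This is a minimal sketch of the maximum-likelihood estimate, count(w1 w2 w3) / count(w1 w2), on the mini-corpus above; the function name p and the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

trigrams, contexts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        trigrams[(a, b, c)] += 1
        contexts[(a, b)] += 1    # two-word history, counted where a word follows it

def p(word, w1, w2):
    """MLE trigram probability P(word | w1 w2); 0 if the history was never seen."""
    if contexts[(w1, w2)] == 0:
        return Fraction(0)
    return Fraction(trigrams[(w1, w2, word)], contexts[(w1, w2)])

# Print every non-zero trigram probability in the corpus.
for (a, b, c) in sorted(trigrams):
    print(f"P({c} | {a} {b}) = {p(c, a, b)}")
```

Using Fraction keeps the results exact, so they can be compared directly with the hand-computed values.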
Given a set of unigram and bigram probabilities, what is the probability of the following
sequence ‘do Sam I like’ according to the bigram language model? P(do|<s>) = 2/11, P(do|Sam) =
1/11, P(Sam|<s>) = 4/11, P(Sam|do) = 1/8, P(I|Sam) = 4/11, P(Sam|I) = 2/9, P(I|do) = 2/8,
P(I|like) = 2/7, P(like|I) = 3/11, P(do) = 3/8, P(Sam) = 2/11, P(I) = 4/11, P(like) = 5/11
Answer:
P(do|<s>) * P(Sam|do) * P(I|Sam) * P(like|I) = 2/11 * 1/8 * 4/11 * 3/11 = 3/1331 ≈ 0.0023
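The product can be verified with exact fractions. The sketch below assumes that the probability written with an empty history in the question is conditioned on the sentence-start symbol <s>, i.e. P(do|<s>) = 2/11; the name sequence_probability is illustrative, and only the four bigrams the sequence needs are listed.

```python
from fractions import Fraction
from math import prod

# Bigram probabilities P(second | first) taken from the question.
# Assumption: the empty history in the question is the start symbol <s>.
bigram = {
    ("<s>", "do"): Fraction(2, 11),
    ("do", "Sam"): Fraction(1, 8),
    ("Sam", "I"): Fraction(4, 11),
    ("I", "like"): Fraction(3, 11),
}

def sequence_probability(words):
    """Multiply the bigram probabilities along the sequence, starting from <s>."""
    tokens = ["<s>"] + list(words)
    return prod(bigram[pair] for pair in zip(tokens, tokens[1:]))

result = sequence_probability("do Sam I like".split())
print(result, float(result))
```

The unigram probabilities given in the question are not needed, because a bigram model conditions every word on the word before it.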
How is the given sentence represented using Bigram model? “I want to eat Indian food”
Answer: {(I, want), (want, to), (to, eat), (eat, Indian), (Indian, food)}
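The same representation can be produced for any N with a sliding window over the tokens. This is a minimal sketch without sentence-boundary padding; the function name ngrams is chosen here for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I want to eat Indian food".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(len(tokens), len(bigrams), len(trigrams))
```

Without padding, a sentence of N words yields N-1 bigrams and N-2 trigrams, so the six-word sentence above gives five bigrams and four trigrams.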