Chapter 6-NLP
6.1. Introduction
Human language is filled with ambiguities that make it incredibly difficult to write software that
accurately determines the intended meaning of text or voice data. Examples include:
Homonyms – words that are spelled and pronounced the same but have different meanings, such as
'bank' (the side of a river) and 'bank' (a financial institution).
Homophones – two or more words that share the same pronunciation but have different spellings or
meanings, such as to, too and two, or hear and here.
Sarcasm - a remark used to say the opposite of what is true, with the purpose of amusing, or of
hurting someone by making them feel foolish. When someone shows you a beautiful picture from
their vacation and you respond, "What an awful place, I hope to go there someday", your response is
sarcastic.
Idioms – phrases whose meaning, taken as a whole, cannot be deduced from the meanings of the
individual words. Example: when we say "We are on the same page", we are not talking about the
pages of a book.
Metaphors – figures of speech that describe an object or action in a way that is not literally true but
helps explain an idea or make a comparison. Example: "Love is a battlefield."
Grammar and usage exceptions, and variations in sentence structure
These are just a few of the irregularities of human language that take humans years to learn, but that
programmers must teach natural language-driven applications to recognize and understand accurately
from the start, if those applications are going to be useful.
Natural language processing (NLP) is a branch of artificial intelligence that helps computers
understand, interpret and manipulate human language. NLP draws from many disciplines, including
computer science and computational linguistics, in its pursuit to fill the gap between human
communication and computer understanding. NLP drives computer programs that translate text from
one language to another, respond to spoken commands, and summarize large volumes of text
rapidly—even in real time. It is applied in voice-operated GPS systems, digital assistants, speech-to-text
dictation software, customer service chatbots, and other consumer conveniences. It also plays a
growing role in enterprise solutions that help streamline business operations, increase employee
productivity, and simplify mission-critical business processes.
Natural Language Processing breaks down human text and voice data in ways that help the computer
make sense of what it's ingesting. Some of these tasks include the following:
Speech recognition, also called speech-to-text, is the task of reliably converting voice data into
text data. Speech recognition is required for any application that follows voice commands or
answers spoken questions. What makes speech recognition especially challenging is the way
people talk—quickly, slurring words together, with varying emphasis and intonation, in
different accents, and often using incorrect grammar.
Part-of-speech tagging, also called grammatical tagging, is the process of determining the part
of speech of a particular word or piece of text based on its use and context. Part-of-speech tagging
identifies 'make' as a verb in 'I can make a paper plane' and as a noun in 'What make of car do
you own?' (see the code sketch after this list).
Word sense disambiguation is the selection of the meaning of a word with multiple
meanings through a process of semantic analysis that determines which meaning fits best in the
given context. For example, word sense disambiguation helps distinguish the meaning of the verb
'make' in 'make the grade' (achieve) vs. 'make a bet' (place).
Named entity recognition, or NER, identifies words or phrases as useful entities. NER
identifies 'Kentucky' as a location and 'Fred' as a person's name.
Co-reference resolution is the task of identifying if and when two words refer to the same
entity. The most common example is determining the person or object to which a certain
pronoun refers (e.g., ‘she’ = ‘Mary’), but it can also involve identifying a metaphor or an idiom
in the text (e.g., an instance in which 'bear' isn't an animal but a large hairy person).
Sentiment analysis, also known as opinion mining, attempts to extract subjective qualities
(attitudes, emotions, sarcasm, confusion, suspicion) from text. It is implemented through a
combination of NLP (Natural Language Processing) and statistics: values (positive, negative, or
neutral) are assigned to the text, and the mood of the context (happy, sad, angry, etc.) is
identified.
Natural language generation is sometimes described as the opposite of speech recognition or
speech-to-text; it's the task of putting structured information into human language.
Machine translation is used to translate text or speech from one natural language into another.
Example: Google Translate.
Spelling correction – word processors such as Microsoft Word and PowerPoint use NLP techniques
to detect and correct spelling errors.
Chatbots - implementing a chatbot is one of the important applications of NLP. Chatbots are used by
many companies to provide chat-based customer service.
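To make two of these tasks concrete, here is a minimal sketch of part-of-speech tagging and named
entity recognition using the NLTK library. It assumes NLTK is installed and the listed resources have
been downloaded; the exact tags returned depend on NLTK's bundled models.

    import nltk

    # One-time downloads of the resources used below (an assumption about the local setup):
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    # nltk.download("maxent_ne_chunker"); nltk.download("words")

    sentence = "Fred moved to Kentucky to make furniture."

    tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
    tagged = nltk.pos_tag(tokens)           # part-of-speech tagging, e.g. ('make', 'VB')
    entities = nltk.ne_chunk(tagged)        # named entity recognition, e.g. Fred -> PERSON

    print(tagged)
    print(entities)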
Natural languages, such as English or Spanish, cannot be characterized as a definitive set of sentences.
Everyone agrees that “Not to be invited is sad” is a sentence of English, but people disagree on the
grammaticality of “To be not invited is sad.” Therefore, it is more fruitful to define a natural language
model as a probability distribution over sentences rather than a definitive set. That is, rather than
asking if a string of words is or is not a member of the set defining the language, we instead ask for P(S
= words )—what is the probability that a random sentence would be words.
Natural languages are also ambiguous. “He saw her duck” can mean either that he saw a duck
belonging to her, or that he saw her move to evade something. Thus, again, we cannot speak of a
single meaning for a sentence, but rather of a probability distribution over possible meanings. Finally,
natural languages are difficult to deal with because they are very large, and constantly changing. Thus,
our language models are, at best, an approximation.
Consider two sentences: "There was heavy rain" vs. "There was heavy flood". From experience, we
know that the former sentence sounds better. An N-gram model will tell us that "heavy rain" occurs
much more often than "heavy flood" in the training corpus. Thus, the first sentence is more probable
and will be selected by the model.
A model that relies only on how often a word occurs, without looking at previous words, is
called a unigram model. If a model considers only the previous word to predict the current word, it is
called a bigram model. If the two previous words are considered, it is a trigram model.
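A minimal sketch of this idea in Python, using a made-up toy corpus (the corpus and the resulting
counts are illustrative assumptions, not data from the text):

    from collections import Counter

    # Toy corpus (illustrative); a real model would be trained on a large text collection.
    tokens = ("there was heavy rain . heavy rain fell all night . "
              "the heavy flood receded").split()

    bigram_counts = Counter(zip(tokens, tokens[1:]))

    print(bigram_counts[("heavy", "rain")])    # 2
    print(bigram_counts[("heavy", "flood")])   # 1
    # "heavy rain" occurs more often than "heavy flood", so a bigram model
    # assigns the sentence "There was heavy rain" the higher probability.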
Conditional probability:
P(B|A) = P(A, B) / P(A)
This can be re-arranged as
P(A, B) = P(A) x P(B|A)
When we have more variables, the chain rule gives
P(A, B, C, D) = P(A) x P(B|A) x P(C|A, B) x P(D|A, B, C)
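Applying the same chain rule to a whole sentence w1 w2 ... wn, and then approximating each
conditional by keeping only the single previous word, gives the bigram model used in the examples
below:

    P(w1 w2 ... wn) = P(w1) x P(w2|w1) x P(w3|w1 w2) x ... x P(wn|w1 ... wn-1)
                    ≈ P(w1) x P(w2|w1) x P(w3|w2) x ... x P(wn|wn-1)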
Example: What is the most probable next word predicted by the model for the following word
sequences? (Source: YouTube, "Varsha's engineering" by Dr. Varsha Patil.)
Given a small corpus that includes the sentence <s> I am henry </s>, the word frequencies are:

Word       Frequency
<s>        7
</s>       7
I          6
am         2
henry      5
like       5
college    3
do         4

1. <s> Do ?   (using the bi-gram model)
Comparing P(next word | Do) for every word in the vocabulary, I is the most probable word that
comes after Do.

2. I like ?   (using the tri-gram model, i.e. Wi-2 = I, Wi-1 = like)

Next word              Probability of next word
P(</s> | I like)       0/3
P(I | I like)          0/3
P(am | I like)         0/3
P(henry | I like)      1/3
P(like | I like)       0/3
P(college | I like)    2/3
P(do | I like)         0/3

Therefore college is the most probable word that comes after "I like".
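The counts above can be reproduced programmatically. The sketch below uses a hypothetical corpus
chosen to be consistent with the frequency table and with the bigram and trigram counts used in this
example; the exact sentences of the original corpus are not shown in the text, so they are an
assumption.

    from collections import Counter

    # Hypothetical corpus (an assumption consistent with the counts in the example above).
    corpus = [
        "<s> I am henry </s>",
        "<s> I like college </s>",
        "<s> do henry like college </s>",
        "<s> henry I am </s>",
        "<s> do I like henry </s>",
        "<s> do I like college </s>",
        "<s> I do like henry </s>",
    ]

    sentences = [s.split() for s in corpus]
    unigrams = Counter(w for toks in sentences for w in toks)
    bigrams = Counter(pair for toks in sentences for pair in zip(toks, toks[1:]))
    trigrams = Counter(tri for toks in sentences for tri in zip(toks, toks[1:], toks[2:]))

    def bigram_next(prev):
        # P(w | prev) = count(prev, w) / count(prev)
        return {w: bigrams[(prev, w)] / unigrams[prev] for w in unigrams}

    def trigram_next(w1, w2):
        # P(w | w1 w2) = count(w1, w2, w) / count(w1, w2)
        return {w: trigrams[(w1, w2, w)] / bigrams[(w1, w2)] for w in unigrams}

    after_do = bigram_next("do")          # "Do" is treated as lower-case "do" in this sketch
    after_i_like = trigram_next("I", "like")

    print(max(after_do, key=after_do.get))          # I
    print(max(after_i_like, key=after_i_like.get))  # college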
3. Compare the probabilities of "I like college" and "Do I like henry" using the bi-gram model
"I like college"
P(I|<s>) x P(like|I) x P(college|like) x P(</s>|college)
= 3/7 x 3/6 x 3/5 x 3/3
= 0.428 x 0.5 x 0.6 x 1
≈ 0.128
"Do I like henry"
P(Do|<s>) x P(I|Do) x P(like|I) x P(henry|like) x P(</s>|henry)
= 3/7 x 2/4 x 3/6 x 2/5 x 3/5
= 0.428 x 0.5 x 0.5 x 0.4 x 0.6
≈ 0.026
So "I like college" is the more probable of the two sentences.
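The same comparison can be written directly from the bigram probabilities listed above (the values
are taken from the worked example):

    import math

    # Bigram probabilities from the example above.
    p_i_like_college = [3/7, 3/6, 3/5, 3/3]        # P(I|<s>), P(like|I), P(college|like), P(</s>|college)
    p_do_i_like_henry = [3/7, 2/4, 3/6, 2/5, 3/5]  # P(Do|<s>), P(I|Do), P(like|I), P(henry|like), P(</s>|henry)

    print(math.prod(p_i_like_college))    # ≈ 0.129 (0.128 above comes from rounded factors)
    print(math.prod(p_do_i_like_henry))   # ≈ 0.026, so "I like college" is more probable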
4. Find the probability of "like college"
P(like|<s>) x P(college|like) x P(</s>|college)
= 0/7 x 3/5 x 3/3
= 0 (the result is zero)
The bigram "like college" itself is frequent in the corpus, yet the sentence probability collapses to
zero because a single factor, P(like|<s>), has a zero count. Summing logarithms of the probabilities,
for example log(3/5) + log(3/3) ≈ -0.22 for the non-zero factors, is useful for avoiding numerical
underflow when many small probabilities are multiplied, but it cannot rescue a zero count, because
log(0) is undefined; zero counts have to be handled by smoothing.
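A small sketch of this point in Python (base-10 logs are used to match the -0.22 value above):

    import math

    # Bigram probabilities for "<s> like college </s>", taken from the example above.
    probs = [0/7, 3/5, 3/3]   # P(like|<s>), P(college|like), P(</s>|college)

    # Summing log probabilities avoids numerical underflow when many factors are multiplied,
    # but every factor must be strictly positive: math.log10(0) raises a ValueError.
    nonzero_log_sum = sum(math.log10(p) for p in probs if p > 0)   # ≈ -0.22
    sentence_prob = math.prod(probs)                               # still 0 because of the unseen bigram

    print(nonzero_log_sum, sentence_prob)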
Smoothing solves the zero-count problem, but there are also techniques that help us better estimate
the probabilities of unseen n-gram sequences. Suppose we want the trigram probability of a word
sequence that never occurs. We can estimate it using the bigram probability; if that is also
unavailable, we fall back to the unigram probability. This technique is called backoff, and one
popular variant is Katz backoff.
Interpolation is another technique in which we can estimate an n-gram probability based on a linear
combination of all lower-order probabilities. For instance, a 4-gram probability can be estimated using
a combination of trigram, bigram and unigram probabilities. The weights in which these are combined
can also be estimated by reserving some part of the corpus for this purpose.
While backoff considers each lower order one at a time, interpolation considers all the lower order
probabilities together.
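A minimal sketch of simple linear interpolation for bigram probabilities, assuming fixed illustrative
weights (in practice the weights are estimated on held-out data, and Katz backoff additionally uses
discounted counts):

    from collections import Counter

    def interpolated_bigram_prob(w_prev, w, unigrams, bigrams, total_tokens,
                                 weights=(0.7, 0.3)):
        # Mix the bigram estimate with the unigram estimate; the weights here are
        # illustrative assumptions, not tuned values.
        lam_bi, lam_uni = weights
        p_bigram = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
        p_unigram = unigrams[w] / total_tokens
        return lam_bi * p_bigram + lam_uni * p_unigram

    # Toy counts (illustrative only).
    unigrams = Counter({"heavy": 10, "rain": 6, "flood": 1})
    bigrams = Counter({("heavy", "rain"): 5})
    total_tokens = sum(unigrams.values())

    print(interpolated_bigram_prob("heavy", "rain", unigrams, bigrams, total_tokens))
    print(interpolated_bigram_prob("heavy", "flood", unigrams, bigrams, total_tokens))  # non-zero even though the bigram is unseen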
We can also use Laplace (add-one) smoothing, which adds 1 to every count, including the counts
that are already non-zero.
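A sketch of add-one smoothing for bigram probabilities, using counts from the worked example
(V is the vocabulary size, which is 8 there):

    from collections import Counter

    def laplace_bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
        # Add 1 to every bigram count; the denominator grows by V so the
        # smoothed probabilities still sum to 1 over the vocabulary.
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

    unigrams = Counter({"<s>": 7, "like": 5, "college": 3})
    bigrams = Counter({("like", "college"): 3})
    V = 8  # <s>, </s>, I, am, henry, like, college, do

    print(laplace_bigram_prob("<s>", "like", unigrams, bigrams, V))      # (0 + 1) / (7 + 8), no longer zero
    print(laplace_bigram_prob("like", "college", unigrams, bigrams, V))  # (3 + 1) / (5 + 8)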