
Chapter 6- Natural Language Processing

6.1. Introduction
Human language is filled with ambiguities that make it incredibly difficult to write software that
accurately determines the intended meaning of text or voice data. Examples include:
 Homonyms – words that share the same spelling and pronunciation but have different meanings, such as
bank (a financial institution) and bank (the edge of a river).
 Homophones – are two or more words that share the same pronunciation, but which have
different spellings or meanings such as hear and here.
 Sarcasm – a remark used to say the opposite of what is true, with the purpose of amusing or hurting
someone by making them feel foolish. If someone shows you a beautiful picture from their vacation
and you respond, “What an awful place, I hope to go there someday,” your response is sarcastic.
 Idioms – phrases whose meaning, taken as a whole, cannot be deduced from the meanings of the
individual words. Example: when we say “We are on the same page,” we are not talking about book pages.
 Metaphors – figures of speech that describe an object or action in a way that isn’t literally
true but that helps explain an idea or make a comparison. Example: Love is a battlefield.
 Grammar and usage exceptions, and variations in sentence structure
These are just a few of the irregularities of human language that take humans years to learn, but that
programmers must teach natural language-driven applications to recognize and understand accurately
from the start, if those applications are going to be useful.
Natural language processing (NLP) is a branch of artificial intelligence that helps computers
understand, interpret and manipulate human language. NLP draws from many disciplines, including
computer science and computational linguistics, in its pursuit to fill the gap between human
communication and computer understanding. NLP drives computer programs that translate text from
one language to another, respond to spoken commands, and summarize large volumes of text
rapidly—even in real time. It is applied in voice-operated GPS systems, digital assistants, speech-to-text
dictation software, customer service chatbots, and other consumer conveniences. It also plays a
growing role in enterprise solutions that help streamline business operations, increase employee
productivity, and simplify mission-critical business processes.

Natural Language Processing breaks down human text and voice data in ways that help the computer
make sense of what it's ingesting. Some of these tasks include the following (a short code sketch follows this list):
 Speech recognition, also called speech-to-text, is the task of reliably converting voice data into
text data. Speech recognition is required for any application that follows voice commands or
answers spoken questions. What makes speech recognition especially challenging is the way
people talk—quickly, slurring words together, with varying emphasis and intonation, in
different accents, and often using incorrect grammar.

 Part of speech tagging, also called grammatical tagging, is the process of determining the part
of speech of a particular word or piece of text based on its use and context. Part of speech
tagging identifies ‘make’ as a verb in ‘I can make a paper plane,’ and as a noun in ‘what make of car do
you own?’

 Word sense disambiguation is the selection of the meaning of a word with multiple
meanings through a process of semantic analysis that determines which meaning makes the most
sense in the given context. For example, word sense disambiguation helps distinguish the
meaning of the verb 'make' in ‘make the grade’ (achieve) vs. ‘make a bet’ (place).

 Named entity recognition, or NER, identifies words or phrases as useful entities. NER
identifies ‘Kentucky’ as a location or ‘Fred’ as a man's name.

 Co-reference resolution is the task of identifying if and when two words refer to the same
entity. The most common example is determining the person or object to which a certain
pronoun refers (e.g., ‘she’ = ‘Mary’), but it can also involve identifying a metaphor or an idiom
in the text (e.g., an instance in which 'bear' isn't an animal but a large hairy person).

 Sentiment analysis, also known as opinion mining, attempts to extract subjective qualities—
attitudes, emotions, sarcasm, confusion, suspicion—from text. It is implemented
through a combination of NLP (Natural Language Processing) and statistics: values are assigned
to the text (positive, negative, or neutral) and the mood of the context is identified (happy, sad,
angry, etc.).
 Natural language generation is sometimes described as the opposite of speech recognition or
speech-to-text; it's the task of putting structured information into human language.
 Machine translation is used to translate text or speech from one natural
language to another natural language. Example: Google Translate.
 Spelling correction – Microsoft provides spelling correction in word processing software such as
MS Word and PowerPoint.
 Chatbot – implementing chatbots is one of the important applications of NLP. Chatbots are used by
many companies to provide chat-based customer service.
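
As a small illustration of tokenization and part-of-speech tagging from the list above, here is a minimal
sketch using the NLTK library. It is only a sketch: it assumes NLTK is installed and that the tokenizer and
tagger resources have been downloaded (resource names can vary between NLTK versions).

    import nltk

    # One-time downloads of the tokenizer and tagger models
    # (resource names can differ between NLTK versions).
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    sentence = "I can make a paper plane, but what make of car do you own?"

    # Tokenization: split the raw string into word tokens.
    tokens = nltk.word_tokenize(sentence)

    # Part-of-speech tagging: label each token, e.g. VB for a verb, NN for a noun.
    print(nltk.pos_tag(tokens))
    # Ideally the first 'make' is tagged as a verb and the second as a noun,
    # the distinction described in the part-of-speech bullet above.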

6.2. Language models


Formal languages, such as the programming languages Java or Python, have precisely defined language
models. A language can be defined as a set of strings; “print(2 + 2)” is a legal program in the
language Python, whereas “2)+(2 print” is not. Since there are an infinite number of legal
programs, they cannot be enumerated; instead they are specified by a set of rules called a grammar.
Formal languages also have rules that define the meaning or semantics of a program; for example, the
rules say that the “meaning” of “2 + 2” is 4, and the meaning of “1/0” is that an error is signaled.

Natural languages, such as English or Spanish, cannot be characterized as a definitive set of sentences.
Everyone agrees that “Not to be invited is sad” is a sentence of English, but people disagree on the
grammaticality of “To be not invited is sad.” Therefore, it is more fruitful to define a natural language
model as a probability distribution over sentences rather than a definitive set. That is, rather than
asking if a string of words is or is not a member of the set defining the language, we instead ask for P(S
= words )—what is the probability that a random sentence would be words.

Natural languages are also ambiguous. “He saw her duck” can mean either that he saw a duck
belonging to her, or that he saw her move to evade something. Thus, again, we cannot speak of a
single meaning for a sentence, but rather of a probability distribution over possible meanings. Finally,
natural languages are difficult to deal with because they are very large, and constantly changing. Thus,
our language models are, at best, an approximation.

N-gram models


Given a sequence of N-1 words, an N-gram model predicts the most probable word that might follow
this sequence. It's a probabilistic model that's trained on a corpus of text. In linguistics and NLP, corpus
(literally Latin for “body”) refers to a collection of texts. Such collections may consist of texts in a
single language or may span multiple languages.
An N-gram model is built by counting how often word sequences occur in corpus text and then
estimating the probabilities. Since a simple N-gram model has limitations, improvements are often
made via smoothing, interpolation and back-off. A sequence of written symbols of length n is called an
n-gram (from the Greek root for writing or letters), with special case “unigram” for 1-gram, “bigram”
for 2-gram, and “trigram” for 3-gram.

Consider two sentences: "There was heavy rain" vs. "There was heavy flood". From experience, we
know that the former sentence sounds better. An N-gram model will tell us that "heavy rain" occurs
much more often than "heavy flood" in the training corpus. Thus, the first sentence is more probable
and will be selected by the model.

A model that simply relies on how often a word occurs without looking at previous words is
called unigram. If a model considers only the previous word to predict the current word, then it's
called bigram. If two previous words are considered, then it's a trigram model.
Conditional probability: P(B|A) = P(A, B) / P(A).
This can be re-arranged as P(A, B) = P(A) P(B|A).
When we have more variables, the chain rule gives
P(w1, w2, ..., wn) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wn|w1, ..., wn-1).
An N-gram model approximates each factor using only the previous N-1 words; for a bigram model,
P(wi|w1, ..., wi-1) ≈ P(wi|wi-1), which is estimated from corpus counts as count(wi-1, wi) / count(wi-1).
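
A minimal Python counting sketch of this estimate (an illustration only; the tiny two-sentence corpus, the
helper name bigram_prob and the padding with <s> and </s> markers are assumptions made for the example):

    from collections import Counter

    # Tiny toy corpus; each sentence is padded with <s> and </s> markers.
    corpus = [
        "<s> I am henry </s>",
        "<s> I like college </s>",
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        unigram_counts.update(words)                  # count(wi-1)
        bigram_counts.update(zip(words, words[1:]))   # count(wi-1, wi)

    def bigram_prob(prev_word, word):
        # Maximum-likelihood estimate P(word | prev_word) = count(prev, word) / count(prev).
        if unigram_counts[prev_word] == 0:
            return 0.0
        return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

    print(bigram_prob("I", "like"))   # 1/2 = 0.5 in this toy corpus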

Example: What is the most probable next word predicted by the model for the following word
sequences? (Source: YouTube, Varsha’s engineering by Dr. Varsha Patil.)

Given corpus:
<s> I am henry </s>
<s> I like college </s>
<s> Do henry like college </s>
<s> henry I am </s>
<s> Do I like henry </s>
<s> Do I like college </s>
<s> I do like henry </s>

Word frequencies in the corpus:
Word       Frequency
<s>        7
</s>       7
I          6
am         2
henry      5
like       5
college    3
do         4

1. <s> Do ? Using Bi-gram
Here wi is the next word and wi-1 is "do" (count(do) = 4).
Next word            Probability of next word
P(</s>|do)           0/4
P(I|do)              2/4
P(am|do)             0/4
P(henry|do)          1/4
P(like|do)           1/4
P(college|do)        0/4
P(do|do)             0/4
Therefore "I" is the most probable word that comes after "Do".
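
The table above can be reproduced with a short Python sketch (a minimal illustration rather than a full
N-gram toolkit; it lower-cases the tokens so that "Do" and "do" are counted as the same word, matching
the frequency table):

    from collections import Counter

    corpus = [
        "<s> I am henry </s>",
        "<s> I like college </s>",
        "<s> Do henry like college </s>",
        "<s> henry I am </s>",
        "<s> Do I like henry </s>",
        "<s> Do I like college </s>",
        "<s> I do like henry </s>",
    ]

    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    context = "do"
    # P(w | do) = count(do, w) / count(do) for every word w in the vocabulary.
    next_word_probs = {w: bigrams[(context, w)] / unigrams[context] for w in unigrams}
    print(max(next_word_probs, key=next_word_probs.get))   # prints 'i' (2/4), matching the table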

2. <s> I like Henry ? Using Bi-gram
Here wi-1 is "henry" (count(henry) = 5).
Next word               Probability of next word
P(</s>|henry)           3/5
P(I|henry)              1/5
P(am|henry)             0/5
P(henry|henry)          0/5
P(like|henry)           1/5
P(college|henry)        0/5
P(do|henry)             0/5
Therefore </s> is the most probable word that comes after "Henry".
3. <s> Do I like ? Using Tri-gram
Here wi-2 = I, wi-1 = like (count(I like) = 3).
Next word               Probability of next word
P(</s>|I like)          0/3
P(I|I like)             0/3
P(am|I like)            0/3
P(henry|I like)         1/3
P(like|I like)          0/3
P(college|I like)       2/3
P(do|I like)            0/3
Therefore "college" is the most probable word that comes after "I like".

4. <s> Do I like College ? Using Four-gram
Here wi-3 = I, wi-2 = like, wi-1 = college (count(I like college) = 2).
Next word                      Probability of next word
P(</s>|I like college)         2/2
P(I|I like college)            0/2
P(am|I like college)           0/2
P(henry|I like college)        0/2
P(like|I like college)         0/2
P(college|I like college)      0/2
P(do|I like college)           0/2
Therefore </s> is the most probable word that comes after "I like college".

5. Compare the probabilities of “I like college” and “Do I like henry” using Bi-gram
“I like college”
 P(I|<s>) x P(like|I) x P(college|like) x P(</s>|college)
 = 3/7 x 3/6 x 3/5 x 3/3
 = 0.4286 x 0.5 x 0.6 x 1
 ≈ 0.129
“Do I like henry”
 P(Do|<s>) x P(I|Do) x P(like|I) x P(henry|like) x P(</s>|henry)
 = 3/7 x 2/4 x 3/6 x 2/5 x 3/5
 = 0.4286 x 0.5 x 0.5 x 0.4 x 0.6
 ≈ 0.026
Since 0.129 > 0.026, “I like college” is the more probable sentence under the bi-gram model.
6. Find the probability of “like college”
 P(like|<s>) x P(college|like) x P(</s>|college)
 = 0/7 x 3/5 x 3/3
 = 0 (the result is zero)
The bigram “like college” itself is frequent in the corpus, yet the sentence probability is zero because the
single factor P(like|<s>) is zero. In practice, long products of small probabilities are computed as sums of
logarithms to avoid numerical underflow, e.g. log(3/5) + log(3/3) = -0.221 + 0 = -0.221; but logarithms
cannot repair a zero count, since log(0) is undefined (it tends to negative infinity). Zero counts are instead
handled by smoothing, discussed next.

Smoothing addresses the zero-count problem, and there are further techniques that help us better estimate
the probabilities of unseen n-gram sequences. Suppose we want the trigram probability of a certain
word sequence that never occurs. We can estimate it using the bigram probability. If that is also
not available, we use the unigram probability. This technique is called backoff. One such technique that's
popular is called Katz Backoff.
Interpolation is another technique in which we can estimate an n-gram probability based on a linear
combination of all lower-order probabilities. For instance, a 4-gram probability can be estimated using
a combination of trigram, bigram and unigram probabilities. The weights in which these are combined
can also be estimated by reserving some part of the corpus for this purpose.
While backoff considers each lower order one at a time, interpolation considers all the lower order
probabilities together.
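As a plain sketch of the linear combination for a trigram (the λ weights are assumed to be non-negative and
to sum to 1; they are tuned on held-out data):
P_interp(wi | wi-2, wi-1) = λ3 P(wi | wi-2, wi-1) + λ2 P(wi | wi-1) + λ1 P(wi), with λ1 + λ2 + λ3 = 1.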
We can also use Laplace smoothing (add-one smoothing), which adds 1 to every count, including the counts
that are already non-zero, and adds the vocabulary size V to each denominator so that the probabilities still sum to 1.
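A quick worked sketch of add-one smoothing on the corpus above (assuming a vocabulary of the 8 symbols
<s>, </s>, I, am, henry, like, college and do, so V = 8):
P_Laplace(wi | wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V)
P_Laplace(like | <s>) = (0 + 1) / (7 + 8) = 1/15 ≈ 0.067
so the unseen bigram from example 6 no longer drives the whole sentence probability to zero.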

What are some limitations of N-gram models?


A model trained on the works of Shakespeare will not give good predictions when applied to another
genre. We need to therefore ensure that the training corpus looks similar to the test corpus. There's
also the problem of Out of Vocabulary (OOV) words. These are words that appear during testing but
not in training. One way to solve this is to start with a fixed vocabulary and convert OOV words in
training to the UNK pseudo-word.
In one study, when applied to sentiment analysis, a bigram model outperformed a unigram model but
the number of features doubled. Thus, scaling N-gram models to larger datasets or moving to a higher
N needs good feature selection techniques.
N-gram models poorly capture longer-distance context. It's been shown that after 6-grams,
performance gains are limited. Other language models, such as cache LMs, topic-based LMs and latent
semantic indexing, do better.

What software tools are available to do N-gram modelling?


 R has a few useful packages including ngram, tm, tau and RWeka. Package tidytext has
functions to do N-gram analysis.
 In Python, NLTK has the function nltk.util.ngrams() (see the sketch after this list). A more comprehensive module is nltk.lm.
Outside NLTK, the ngram package can compute n-gram string similarity.
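
A minimal usage sketch of nltk.util.ngrams, assuming NLTK is installed (this only extracts the n-grams; it
does not estimate probabilities):

    from nltk.util import ngrams

    tokens = "<s> Do I like college </s>".split()

    # Bigrams and trigrams as tuples of adjacent tokens.
    print(list(ngrams(tokens, 2)))
    # [('<s>', 'Do'), ('Do', 'I'), ('I', 'like'), ('like', 'college'), ('college', '</s>')]
    print(list(ngrams(tokens, 3)))
    # [('<s>', 'Do', 'I'), ('Do', 'I', 'like'), ('I', 'like', 'college'), ('like', 'college', '</s>')]
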
Exercise -What is the most probable next word predicted by the model for the following word
sequences?
1. <s> Sam . . .
2. <s> Sam I do . . .
3. <s> Sam I am Sam . . .
4. <s> do I like . . .
Corpus data
<s> I am Sam </s>
<s> Sam I am </s>
<s> Sam I like </s>
<s> Sam I do like </s>
<s> do I like Sam </s>
