NLP Notes
1. Sentiment analysis
Sentiment analysis, also referred to as opinion mining, is an approach to natural
language processing (NLP) that identifies the emotional tone behind a body of text.
This is a popular way for organizations to determine and categorize opinions about a
product, service or idea.
Sentiment analysis systems help organizations gather insights into real-time customer
sentiment, customer experience and brand reputation.
Generally, these tools use text analytics to analyze online sources such as emails, blog
posts, online reviews, news articles, survey responses, case studies, web chats, tweets,
forums and comments.
Sentiment analysis uses machine learning models to perform text analysis of human
language. The metrics used are designed to detect whether the overall sentiment of a
piece of text is positive, negative or neutral.
2. Machine Translation
Machine translation, sometimes referred to by the abbreviation MT, is a sub-field
of computational linguistics that investigates the use of software to translate text or
speech from one language to another.
On a basic level, MT performs mechanical substitution of words in one language for
words in another, but that alone rarely produces a good translation because
recognition of whole phrases and their closest counterparts in the target language is
needed.
Not all words in one language have equivalent words in another language, and many
words have more than one meaning.
Solving this problem with corpus statistical and neural techniques is a rapidly growing
field that is leading to better translations, handling differences in linguistic typology,
translation of idioms, and the isolation of anomalies.
Corpus: A collection of written texts, especially the entire works of a particular
author.
3. Text Extraction
There are a number of natural language processing techniques that can be
used to extract information from text or unstructured data.
These techniques can be used to extract information such as entity names,
locations, quantities, and more.
With the help of natural language processing, computers can make sense
of the vast amount of unstructured text data that is generated every day,
and humans can reap the benefits of having this information readily
available.
Industries such as healthcare, finance, and e-commerce are already using
natural language processing techniques to extract information and
improve business processes.
As machine learning technology continues to develop, we will see more and more
information extraction use cases covered.
4. Text Classification
5. Speech Recognition
Speech recognition is an interdisciplinary subfield of computer
science and computational linguistics that develops methodologies and technologies
that enable the recognition and translation of spoken language into text by computers.
It is also known as automatic speech recognition (ASR), computer speech
recognition or speech to text (STT).
It incorporates knowledge and research in the computer
science, linguistics and computer engineering fields. The reverse process is speech
synthesis.
6. Chatbots
You’ll come across chatbots on business websites or messengers that give pre-scripted
replies to your questions. As the entire process is automated, bots can provide quick
assistance 24/7 without human intervention.
7. Email Filter
One of the most fundamental and essential applications of NLP online is email
filtering. It began with spam filters, which identified specific words or phrases that
indicate a spam message. But, like early NLP adaptations, filtering has been
improved.
Gmail's email categorization is one of the more common, newer implementations of
NLP. Based on the contents of emails, the algorithm determines whether they belong
in one of three categories (primary, social, or promotions).
This keeps the inbox manageable for Gmail users, surfacing the critical, relevant
emails you want to see and reply to quickly.
8. Search Autocorrect and Autocomplete
When you type 2-3 letters into Google to search for anything, it displays a list of
probable search keywords. Alternatively, if you search for anything with mistakes, it
corrects them for you while still returning relevant results. Isn't it incredible?
Everyone uses Google's search autocorrect and autocomplete on a regular basis but seldom
gives them any thought. It's a fantastic illustration of how natural language processing is
touching millions of people across the world, including you and me.
Both search autocomplete and autocorrect make it much easier to locate accurate
results.
3. Components of NLP
There are two components of NLP: Natural Language Understanding (NLU) and
Natural Language Generation (NLG).
Natural Language Understanding (NLU) involves transforming human language into a
machine-readable format. It helps the machine to understand and analyze human language
by extracting elements such as keywords, emotions, relations, and semantics from large
amounts of text.
Natural Language Generation (NLG) acts as a translator that converts the
computerized data into a natural language representation.
It mainly involves text planning, sentence planning, and text realization.
NLU is harder than NLG.
4. Steps in NLP
There are five general steps:
1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Lexical Analysis:
The first phase of NLP is the Lexical Analysis.
This phase scans the source text as a stream of characters and converts it into
meaningful lexemes.
It divides the whole text into paragraphs, sentences, and words.
Lexeme: A lexeme is a basic unit of meaning. In linguistics, the abstract unit of
morphological analysis that corresponds to a set of forms taken by a single word is
called lexeme.
The way in which a lexeme is used in a sentence is determined by its grammatical
category.
Semantic Analysis
Semantic analysis is concerned with the meaning representation.
It mainly focuses on the literal meaning of words, phrases, and sentences.
The semantic analyzer disregards sentences such as “hot ice-cream”.
Another example: “Manhattan calls out to Dave” passes syntactic analysis because it is
a grammatically correct sentence. However, it fails semantic analysis: because
Manhattan is a place and can't literally call out to people, the sentence's literal meaning
doesn't make sense.
Discourse Integration
Discourse Integration depends upon the sentences that precede it and also
invokes the meaning of the sentences that follow it.
For instance, if one sentence reads, “Manhattan speaks to all its people,” and the
following sentence reads, “It calls out to Dave,” discourse integration checks the first
sentence for context to understand that “It” in the latter sentence refers to Manhattan.
Pragmatic Analysis
During pragmatic analysis, what was said is re-interpreted based on what it actually meant.
It involves deriving those aspects of language which require real world knowledge.
For instance, a pragmatic analysis can uncover the intended meaning of “Manhattan
speaks to all its people.” Methods like neural networks assess the context to
understand that the sentence isn’t literal, and most people won’t interpret it as such. A
pragmatic analysis deduces that this sentence is a metaphor for how people
emotionally connect with a place.
Tokens:
Suppose, for a moment, that words in English are delimited only by
whitespace and punctuation (marks such as the full stop, comma, and brackets).
Example: Will you read the newspaper? Will you read it? I won’t read it.
If we confront our assumption with insights from syntax, we notice two problematic
words here: newspaper and won’t.
Lexemes
By the term word, we often denote not just the one linguistic form in the given
context but also the concept behind the form and the set of alternative forms that can
express it.
Such sets are called lexemes or lexical items, and they constitute the lexicon of a
language.
Lexemes can be divided by their behaviour into the lexical categories of verbs, nouns,
adjectives, conjunctions or other parts of speech.
The citation form of a lexeme, by which it is commonly identified, is also called its
lemma.
When we convert a word into its other forms, such as turning the singular mouse into
the plural mice or mouses, we say we inflect the lexeme.
When we transform a lexeme into another one that is morphologically related,
regardless of its lexical category, we say we derive the lexeme: for instance, the nouns
receiver and reception are derived from the verb receive.
Example: Did you see him? I didn’t see him. I didn’t see anyone
Example presents the problem of tokenization of didn’t and the investigation of the
internal structure of anyone.
The difficulty with the definition of what counts as a word need not pose a problem
for the syntactic description if we understand no one as two closely connected tokens
treated as one fixed element.
Morphemes
Words can be decomposed into smaller meaningful components; these concrete components
are usually called segments or morphs, while the abstract units of meaning they realize are
called morphemes.
Morphology
Morphology is the domain of linguistics that analyses the internal structure of words.
Morphological analysis – exploring the structure of words
Words are built up of minimal meaningful elements called morphemes:
played = play-ed
cats = cat-s
unfriendly = un-friend-ly
Morphological Typology
Morphological typology is a way of classifying the languages of the world that groups
languages according to their common morphological structures.
The field organizes languages on the basis of how those languages form words by
combining morphemes.
Morphological typology classifies languages into two broad classes: synthetic
languages and analytic languages.
The synthetic class is then further subclassified into agglutinative languages and
fusional languages.
Analytic languages contain very little inflection, instead relying on features like word
order and auxiliary words to convey meaning.
Synthetic languages, ones that are not analytic, are divided into two categories:
agglutinative and fusional languages.
Agglutinative languages rely primarily on discrete particles (prefixes, suffixes, and
infixes) for inflection, e.g., inter+national = international, international+ize =
internationalize.
Fusional languages, in contrast, "fuse" inflectional categories together, often allowing one
word ending to contain several categories, such that the original root can be difficult
to extract (e.g., anybody, newspaper).
• NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.
• A lot of the data that you could be analyzing is unstructured data and contains human-
readable text.
• Before you can analyze that data programmatically, you first need to preprocess it.
• Now we are going to see the kinds of text preprocessing tasks you can do with NLTK so
that you’ll be ready to apply them in future projects.
1. Tokenizing
By tokenizing, you can conveniently split up text by word or by sentence.
This will allow you to work with smaller pieces of text that are still relatively
coherent and meaningful even outside of the context of the rest of the text.
It’s your first step in turning unstructured data into structured data, which is easier
to analyze.
When you’re analyzing text, you’ll be tokenizing by word and tokenizing by
sentence.
Tokenizing by word
• Words are like the atoms of natural language. They’re the smallest unit of meaning
that still makes sense on its own.
• Tokenizing your text by word allows you to identify words that come up particularly
often.
• For example, if you were analyzing a group of job ads, then you might find that the
word “Python” comes up often.
• That could suggest high demand for Python knowledge, but you’d need to look deeper
to know more.
Tokenizing by sentence
• When you tokenize by sentence, you can analyze how those words relate to one
another and see more context.
• Are there a lot of negative words around the word “Python” because the hiring
manager doesn’t like Python?
• Are there more terms from the domain of herpetology than the domain of software
development, suggesting that you may be dealing with an entirely different kind
of python than you were expecting?
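The snippet below is a minimal sketch of both kinds of tokenization with NLTK. The example string is an assumption (a passage widely used in NLTK tutorials), chosen so that sent_tokenize yields sentences like the partial output that follows.

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

# Example passage assumed for illustration
example_string = ("Muad'Dib learned rapidly because his first training was in how to learn. "
                  "And the first lesson of all was the basic trust that he could learn. "
                  "It's shocking to find how many people do not believe they can learn, "
                  "and how many more believe learning to be difficult.")

print(sent_tokenize(example_string))  # split into sentences
print(word_tokenize(example_string))  # split into word and punctuation tokens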
Output of sent_tokenize (partial):
['And the first lesson of all was the basic trust that he could learn.',
"It's shocking to find how many people do not believe they can learn,\n and how
many more believe learning to be difficult."]
Note:
import nltk
nltk.download('punkt')
3. Stemming
Stemming is a text processing task in which you reduce words to their root, which
is the core part of a word.
For example, the words “helping” and “helper” share the root “help.”
Stemming allows you to zero in on the basic meaning of a word rather than all the
details of how it’s being used.
NLTK has more than one stemmer, but we’ll be using the Porter stemmer.
Python program for Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

string_for_stemming = ("The crew of the USS Discovery discovered many "
                       "discoveries. Discovering is what explorers do.")

# Split the text into word and punctuation tokens
words = word_tokenize(string_for_stemming)
print(words)

# Reduce each token to its Porter stem
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
Output
• ['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many', 'discoveries', '.',
'Discovering', 'is', 'what', 'explorers', 'do', '.']
• ['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is',
'what', 'explor', 'do', '.']
'Discovery' -> 'discoveri'
'discovered' -> 'discov'
'discoveries' -> 'discoveri'
'Discovering' -> 'discov'
4. Tagging Parts of Speech
• Some sources also include the category articles (like “a” or “the”) in the list of parts
of speech, but other sources consider them to be adjectives. NLTK uses the
word determiner to refer to articles.
5. Lemmatizing
• Like stemming, lemmatizing reduces words to their core meaning, but it will give
you a complete English word that makes sense on its own instead of just a fragment of
a word like 'discoveri'.
• A lemma is a word that represents a whole group of words, and that group of words is
called a lexeme.
• For example, if you were to look up the word “blending” in a dictionary, then you’d
need to look at the entry for “blend,” but you would find “blending” listed in that
entry.
• In this example, “blend” is the lemma, and “blending” is part of the lexeme. So when
you lemmatize a word, you are reducing it to its lemma.
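A minimal sketch of lemmatizing with NLTK's WordNetLemmatizer (the notes do not name a specific lemmatizer, so this choice is an assumption):

import nltk
nltk.download('wordnet')          # WordNet data used by the lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("scarves"))            # 'scarf'
print(lemmatizer.lemmatize("blending", pos="v"))  # 'blend' when treated as a verb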
6. Chunking
Chunking allows you to identify phrases.
A phrase is a word or group of words that works as a single unit to perform a
grammatical function. Noun phrases are built around a noun.
Here are some examples:
“A planet”
“A tilting planet”
“A swiftly tilting planet”
Chunking makes use of POS tags to group words and apply chunk tags to those
groups. Chunks don’t overlap, so one instance of a word can be in only one chunk
at a time.
After getting a list of tuples of all the words in the quote along with their POS
tags, you are ready to chunk. In order to chunk, you first need to define a chunk grammar,
as in the sketch below.
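Here is a sketch of chunking with NLTK's RegexpParser. The NP rule below (an optional determiner, any number of adjectives, then a noun) is an assumption that is consistent with the chunked output shown below.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize

quote = "It's a dangerous business, Frodo, going out your door."
words_quote = word_tokenize(quote)
print(words_quote)

tags = nltk.pos_tag(words_quote)
print(tags)

# NP chunk rule: optional determiner (DT), any adjectives (JJ), then a noun (NN)
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tags)
print(tree)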
Output:
• ['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']
• [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'), ('business', 'NN'), (',', ','),
('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'),
('.', '.')]
(S
  It/PRP
  's/VBZ
  (NP a/DT dangerous/JJ business/NN)
  ,/,
  Frodo/NNP
  ,/,
  going/VBG
  out/RP
  your/PRP$
  (NP door/NN)
  ./.)
Tree Representation
7. Chinking
• Chinking is used together with chunking, but while chunking is used to include a
pattern, chinking is used to exclude a pattern.
Python program to perform chinking
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize

quote = "It's a dangerous business, Frodo, going out your door."
words_quote = word_tokenize(quote)
print(words_quote)

tags = nltk.pos_tag(words_quote)
print(tags)

# Chunk grammar with a chink rule: put everything into a chunk ({<.*>+}),
# then exclude (chink) any adjective (}<JJ>{)
grammar = """
Chunk: {<.*>+}
       }<JJ>{"""
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tags)
print(tree)
Output:
• ['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']
• [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'), ('business', 'NN'), (',', ','),
('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'),
('.', '.')]
(S
  (Chunk It/PRP 's/VBZ a/DT)
  dangerous/JJ
  (Chunk business/NN ,/, Frodo/NNP ,/, going/VBG out/RP your/PRP$ door/NN ./.))
Tree Representation
8. Named Entity Recognition (NER)
Output
['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']
(S
It/PRP
's/VBZ
a/DT
dangerous/JJ
business/NN
,/,
(PERSON Frodo/NNP)
,/,
going/VBG
out/RP
your/PRP$
door/NN
./.)
Note: If the chunker is run with the binary=True option, it simply marks each chunk as a
named entity (NE) without giving its specific type (such as PERSON).
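A sketch of named entity chunking with nltk.ne_chunk, which can produce a tree like the one above (Frodo labelled PERSON); the binary=True variant corresponds to the note about labelling a chunk only as NE without its type. The exact downloads listed are assumptions based on NLTK's standard resources.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk.tokenize import word_tokenize

quote = "It's a dangerous business, Frodo, going out your door."
tags = nltk.pos_tag(word_tokenize(quote))

print(nltk.ne_chunk(tags))                # labels entity types, e.g. (PERSON Frodo/NNP)
print(nltk.ne_chunk(tags, binary=True))   # labels chunks only as NE, without the type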
Parsing in NLP is the process of determining the syntactic structure of a text by analysing
its constituent words based on an underlying grammar.
Example Grammar:
Then, the outcome of the parsing process would be a parse tree, where sentence is the root,
intermediate nodes such as noun_phrase, verb_phrase etc. have children - hence they are
called non-terminals and finally, the leaves of the tree ‘Tom’, ‘ate’, ‘an’, ‘apple’ are called
terminals.
Parse Tree:
• A sentence is parsed by relating each word to other words in the sentence which depend
on it.
• The syntactic parsing of a sentence consists of finding the correct syntactic structure of
that sentence in the given formalism/grammar.
• Dependency grammar (DG) and phrase structure grammar (PSG) are two such
formalisms.
• PSG breaks sentence into constituents (phrases), which are then broken into smaller
constituents.
• PSG describes phrase and clause structure, for example NP, PP, VP, etc.
• DG: syntactic structure consists of lexical items, linked by binary asymmetric relations
called dependencies.
• Interested in grammatical relations between individual words.
• DG does not propose a recursive structure but rather a network of relations between words.
• These relations can also have labels.
• A treebank can be defined as a linguistically annotated corpus that includes some kind
of syntactic analysis over and above part-of-speech tagging.
Constituency tree vs Dependency tree
• Dependency structures explicitly represent
- Head-dependent relations (directed arcs)
- Functional categories (arc labels)
- Possibly some structural categories (POS)
• Phrase structure explicitly represent
- Phrases (non-terminal nodes)
- Structural categories (non-terminal labels)
- Possibly some functional categories (grammatical functions)
Syntax:
• In NLP, the syntactic analysis of natural language input can vary from being very low-
level, such as simply tagging each word in the sentence with a part of speech (POS), to
very high-level, such as full parsing.
• In syntactic parsing, ambiguity is a particularly difficult problem because the most
plausible analysis has to be chosen from an exponentially large number of alternative
analyses.
• From tagging to full parsing, algorithms that can handle such ambiguity have to be
carefully chosen.
• Here we explore the syntactic analysis methods from tagging to full parsing and the use
of supervised machine learning to deal with ambiguity.
2.1 Parsing Natural Language
• In a text-to-speech application, input sentences are to be converted to a spoken output
that should sound like it was spoken by a native speaker of the language.
• Example: He wanted to go for a drive in the country.
• There is a natural pause between the words drive and in in this sentence that reflects an
underlying hidden structure to the sentence.
• Parsing can provide a structural description that identifies such a break in the intonation.
• A simpler case: The cat who lives dangerously had nine lives.
• In this case, a text-to-speech system needs to know that the first instance of the word
lives is a verb and the second instance is a noun before it can begin to produce the
natural intonation for this sentence.
• This is an instance of the part-of-speech (POS) tagging problem where each word in
the sentence is assigned a most likely part of speech.
• Another motivation for parsing comes from the natural language task of summarization,
in which several documents about the same topic should be condensed down to a small
digest of information.
• Such a summary may be in response to a question that is answered in the set of
documents.
• In this case, a useful subtask is to compress an individual sentence so that only the
relevant portions of the sentence are included in the summary.
• For example: Beyond the basic level, the operations of the three products vary widely.
The operations of the products vary.
• The elegant way to approach this task is to first parse the sentence to find the various
constituents: where we recursively partition the words in the sentence into individual
phrases such as a verb phrase or a noun phrase.
➢ Dependency analysis is typically favoured for languages such as Czech and Turkish
that have relatively free word order.
➢ Phrase structure analysis is often used to provide additional information about long-
distance dependencies and is mostly used for languages like English and French.
➢ NLP is the capability of computer software to understand natural language.
➢ There are a variety of languages in the world.
➢ Each language has its own structure, called its grammar, which has a certain set of
rules that determines what is allowed and what is not allowed.
➢ Languages also differ in their basic word order: English is S V O (e.g., I eat mango),
while other languages use S O V or O S V.
➢ Grammar is defined as the rules for forming well-structured sentences.
➢ Formally, a grammar can be written as G = (VN, VT, P, S), where the start symbol S
belongs to VN, the set of non-terminal symbols.
➢ Different Types of Grammar in NLP
1. Context-Free Grammar (CFG)
2. Constituency Grammar (CG) or Phrase Structure Grammar
3. Dependency Grammar (DG)
• The dependency tree analyses, where each word depends on exactly one parent,
either another word or a dummy root symbol.
• By convention, in dependency tree 0 index is used to indicate the root symbol
and the directed arcs are drawn from the head word to the dependent word.
• The figure shows a dependency tree for a Czech sentence taken from the Prague
Dependency Treebank.
▪ Each node in the graph is a word, its part of speech, and the position of the word in the
sentence.
• For example, [fakulte, N3, 7] is the seventh word in the sentence, with POS tag N3.
▪ The node [#, ZSB,0] is the root node of the dependency tree.
▪ There are many variations of dependency syntactic analysis, but the basic textual format
for a dependency tree can be written in the following form.
▪ Where each dependent word specifies its head word in the sentence, and exactly one
word is dependent on the root of the sentence.
• NNP: proper noun, singular; VBZ: verb, third person singular present; ADJP: adjective
phrase; RB: adverb; JJ: adjective
• The same sentence gets the following dependency tree analysis: some of the
information from the bracketing labels from the phrase structure analysis gets mapped
onto the labelled arcs of the dependency analysis.
• To explain some details of phrase structure analysis, we use the Penn Treebank, a project
that annotated about 40,000 sentences from the Wall Street Journal with phrase structure trees.
➢ The SBARQ label marks wh-questions, i.e., those that contain a gap and therefore
require a trace.
➢ Wh-moved noun phrases are labelled WHNP and put inside SBARQ. They bear an
identity index that matches the reference index on the *T* in the position of the gap.
➢ However, questions that are missing both subject and auxiliary are labelled SQ.
➢ NP-SBJ marks noun phrases that are subjects.
➢ *T* marks traces of wh-movement; the empty trace has an index (here it is 1) and is
associated with the WHNP constituent with the same index.
Parsing Algorithms
• Given an input sentence, a parser produces an output analysis of that sentence.
• Treebank parsers do not need to have an explicit grammar, but to make the discussion of
parsing algorithms simpler, we use a CFG.
• Consider a simple CFG G that can be used to derive strings such as a and b or c from the
start symbol N.
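The grammar itself is not reproduced in these notes, so the rules below are one plausible reconstruction of such a CFG; the sketch parses the string with NLTK's chart parser and prints both trees, since the string is ambiguous.

import nltk

# Assumed reconstruction of the simple grammar G with start symbol N
grammar = nltk.CFG.fromstring("""
N -> N 'and' N | N 'or' N
N -> 'a' | 'b' | 'c'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("a and b or c".split()):
    print(tree)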
Here we discuss the modelling aspects of parsing: how to design features and ways to resolve
ambiguity in parsing.
Probabilistic context-free grammar
• Ex: John bought a shirt with pockets
• Here we want to provide a model that matches the intuition that the second tree above
is preferred over the first.
• The parses can be thought of as ambiguous (leftmost to rightmost) derivation of the
following CFG:
➢ From these rule probabilities, the only deciding factor for choosing between the two
parses for John bought a shirt with pockets is the pair of rules NP -> NP PP and VP -> VP
PP. The probability for NP -> NP PP is set higher in the preceding PCFG.
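The following sketch illustrates the idea with NLTK. The rules and probabilities are invented for illustration (they are not the notes' actual PCFG); the Viterbi parser returns the most probable parse, and the attachment choice follows from the relative probabilities of NP -> NP PP and VP -> VP PP.

import nltk
from nltk.parse import ViterbiParser

# Illustrative (assumed) rule probabilities; each left-hand side sums to 1
grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> 'John' [0.2] | Det N [0.3] | NP PP [0.3] | N [0.2]
Det -> 'a' [1.0]
N -> 'shirt' [0.6] | 'pockets' [0.4]
VP -> V NP [0.7] | VP PP [0.3]
V -> 'bought' [1.0]
PP -> P NP [1.0]
P -> 'with' [1.0]
""")

parser = ViterbiParser(grammar)
for tree in parser.parse("John bought a shirt with pockets".split()):
    print(tree, tree.prob())   # the most probable parse and its probability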
➢ The rule probabilities can be derived from a treebank. Consider a treebank with three
trees t1, t2, t3.
• If we assume that tree t1 occurred 10 times in the treebank, t2 occurred 20 times and
t3 occurred 50 times, then the PCFG we obtain from this treebank is:
• For the input a a a there are two parses using the above PCFG. Their probabilities are
P1 = 0.125 × 0.334 × 0.285 ≈ 0.0119 and P2 = 0.25 × 0.667 × 0.714 ≈ 0.119.
• The parse tree p2 is therefore the most likely tree for that input.
Generative models
• To find the most plausible parse tree, the parser has to choose between the possible
derivations each of which can be represented as a sequence of decisions.
• Let each derivation D = d1,d2,…..,dn, which is the sequence of decisions used to
build the parse tree.
• Then for input sentence x, the output parse tree y is defined by the sequence of
steps in the derivation.
• The probability for each derivation can be decomposed over its decisions:
P(D) = P(d1) P(d2|d1) … P(dn|d1, …, dn-1).
• The value of Φ(x,y)·w is the score of (x,y). The higher the score, the more likely it is
that y is the output for x.
• The function GEN(x) generates the set of possible outputs y for a given x.
• Having Φ(x,y)·w and GEN(x) specified, we would like to choose the highest-scoring
candidate y* from GEN(x) as the most likely output: y* = argmax over y in GEN(x) of Φ(x,y)·w.
We will formalize this intuition by introducing models that assign a probability to each possible
next word. The same models will also serve to assign a probability to an entire sentence. Such
a model, for example, could predict that the following sequence has a much higher probability
of appearing in a text:
2. Why would you want to predict upcoming words, or assign probabilities to sentences?
Probabilities are essential in any task in which we have to identify words in noisy, ambiguous
input, like speech recognition. For a speech recognizer to realize that you said
I will be back soonish and not I will be bassoon dish, it helps to know
that back soonish is a much more probable sequence than bassoon dish.
3. For writing tools like spelling correction or grammatical error correction, we need to
find and correct errors in writing like Their are two midterms, in which There was mistyped
as Their, or Everything has improve, in which improve should have been improved. The
phrase There are will be much more probable than Their are, and has improved than has
improve, allowing us to help users by detecting and correcting these errors.
4. Probabilities are also important in machine translation. Suppose we are translating a
Chinese source sentence:
他 向 记者 介绍了 主要 内容
As part of the process we might have built the following set of potential rough
English translations:
he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement
5. Probabilities are also important for augmentative and alternative communication AAC
systems. People often use such AAC devices if they are physically unable to speak or sign but
can instead use eye gaze or other specific movements to select words from a menu to be
spoken by the system. Word prediction can be used to suggest likely words for the menu.
Language Models: Models that assign probabilities to sequences of words are called language
models or LMs. The simplest model that assigns probabilities to sentences and sequences of
words is the n-gram. An n-gram is a sequence of n words: a 2-gram (which we’ll call bigram)
is a two-word sequence of words like “please turn”, “turn your”, or ”your homework”, and a
3-gram (a trigram) is a three-word sequence of words like “please turn your”, or “turn your
homework”.
We’ll see how to use n-gram models to estimate the probability of the last word of an n-gram
given the previous words, and also to assign probabilities to entire sequences. The n-gram
models are much simpler than state-of-the-art neural language models based on RNNs and
transformers.
N-Grams
Let's begin with the task of computing P(w|h), the probability of a word w given some history h.
Suppose the history h is “its water is so transparent that” and we want to know the probability
that the next word is the: P(the | its water is so transparent that).
One way to estimate this probability is from relative frequency counts: take a very large corpus,
count the number of times we see its water is so transparent that, and count the number of times
this is followed by the. This would be answering the question “Out of the times we saw the
history h, how many times was it followed by the word w?”, as follows:
P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
With a large enough corpus, such as the web, we can compute these counts and estimate the
probability. While this method of estimating probabilities directly from counts works fine in
many cases, it turns out that even the web isn’t big enough to give us good estimates in most
cases. This is because language is creative; new sentences are created all the time, and we won’t
always be able to count entire sentences. Even simple extensions of the example sentence may
have counts of zero on the web (such as “Walden Pond’s water is so transparent that the”; well,
used to have counts of zero). Similarly, if we wanted to know the joint probability of an entire
sequence of words like its water is so transparent, we could do it by asking “out of all possible
sequences of five words, how many of them are its water is so transparent?” We would have
to get the count of its water is so transparent and divide by the sum of the counts of all possible
five word sequences. That seems rather a lot to estimate!
For this reason, we’ll need to introduce more clever ways of estimating the probability of a
word w given a history h, or the probability of an entire word sequence W. Now, how can we
compute probabilities of entire sequences like P(w1, w2, …, wn)? One thing we can do is
decompose this probability using the chain rule of probability:
P(X1, …, Xn) = P(X1) P(X2|X1) P(X3|X1:2) … P(Xn|X1:n-1)
Applying the chain rule to words, we get
P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) … P(wn|w1:n-1)
The chain rule shows the link between computing the joint probability of a sequence and
computing the conditional probability of a word given previous words. But using the chain rule
doesn’t really seem to help us! We don’t know any way to compute the exact probability of a
word given a long sequence of preceding words, P(wn|w1:n-1).
The intuition of the n-gram model is that instead of computing the probability of a word given
its entire history, we can approximate the history by just the last few words. The bigram model
approximates the probability of a word given all the previous words P(wn|w1:n-1) by using only
the conditional probability of the preceding word P(wn|wn-1). In other words, instead of
computing the probability P(the | Walden Pond's water is so transparent that), we approximate
it with the probability P(the | that).
The assumption that the probability of a word depends only on the previous word is called
a Markov assumption. Markov models are the class of probabilistic models that assume
we can predict the probability of some future unit without looking too far into the past. We can
generalize the bigram (which looks one word into the past) to the trigram (which looks two
words into the past) and thus to the n-gram (which looks n-1 words into the past).
Let’s see a general equation for this n-gram approximation to the conditional probability of the
next word in a sequence. We’ll use N here to mean the n-gram size, so N = 2 means bigrams
and N = 3 means trigrams. Then we approximate the probability of a word given its entire
context as follows:
P(wn | w1:n-1) ≈ P(wn | wn-N+1:n-1)
Given the bigram assumption for the probability of an individual word, we can compute the
probability of a complete word sequence:
P(w1:n) ≈ ∏k=1..n P(wk | wk-1)
For example, to compute a particular bigram probability of a word wn given a previous word
wn-1, we’ll compute the count of the bigram C(wn-1 wn) and normalize by the sum of all the
bigrams that share the same first word wn-1:
P(wn | wn-1) = C(wn-1 wn) / Σw C(wn-1 w) = C(wn-1 wn) / C(wn-1)
Let’s work through an example using a mini-corpus of three sentences. We’ll first need to
augment each sentence with a special symbol <s> at the beginning of the sentence, to give us
the bigram context of the first word. We’ll also need a special end-symbol. </s>
The above equation estimates the n-gram probability by dividing the observed frequency of a
particular sequence by the observed frequency of a prefix. This ratio is called a relative
frequency. We said above that this use of frequencies as a way to estimate probabilities is an
example of maximum likelihood estimation or MLE. In MLE, the resulting parameter set
maximizes the likelihood of the training set T given the model M (i.e., P(T|M)). For example,
suppose the word Chinese occurs 400 times in a corpus of a million words like the Brown
corpus. What is the probability that a random word selected from some other text of, say, a
million words will be the word Chinese? The MLE of its probability is 400/1000000 or .0004.
Now .0004 is not the best possible estimate of the probability of Chinese occurring in all
situations; it might turn out that in some other corpus or context Chinese is a very unlikely
word. But it is the probability that makes it most likely that Chinese will occur 400 times in a
million-word corpus.
Let’s move on to some examples from a slightly larger corpus than our 14-word example above.
We’ll use data from the now-defunct Berkeley Restaurant Project, a dialogue system from the
last century that answered questions about a database of restaurants in Berkeley, California.
Here are some text normalized sample user queries (a sample of 9332 sentences is on the
website):
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Figure below shows the bigram counts from a piece of a bigram grammar from the Berkeley
Restaurant Project. Note that the majority of the values are zero. In fact, we have chosen the
sample words to cohere with each other; a matrix selected from a random set of eight words
would be even more sparse.
Figure below shows the bigram probabilities after normalization (dividing each cell in the
counts figure above by the appropriate unigram count for its row, taken from the following set
of unigram probabilities):
Now we can compute the probability of sentences like I want English food or I want Chinese
food by simply multiplying the appropriate bigram probabilities together, as follows:
P(<s> I want English food </s>) = P(I|<s>) P(want|I) P(English|want) P(food|English) P(</s>|food)
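The sketch below shows how such bigram probabilities can be estimated and multiplied in practice. The tiny corpus is invented for illustration (it is not the Berkeley Restaurant data), and sentences are assumed to be padded with <s> and </s>.

from collections import Counter

corpus = [
    "<s> i want english food </s>",
    "<s> i want chinese food </s>",
    "<s> tell me about chez panisse </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # MLE estimate: P(word | prev) = C(prev word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    tokens = sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob("<s> i want english food </s>"))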
Some practical issues: Although for pedagogical purposes we have only described bigram
models, in practice it's more common to use trigram models, which condition on the
previous two words rather than the previous word, or 4-gram or even 5-gram models, when
there is sufficient training data. Note that for these larger n-grams, we'll need to assume extra
contexts to the left and right of the sentence end.
For example, to compute trigram probabilities at the very beginning of the sentence, we use
two pseudo-words for the first trigram (i.e., P(I | <s><s>)).
We always represent and compute language model probabilities in log format as log
probabilities. Since probabilities are (by definition) less than or equal to 1, the more
probabilities we multiply together, the smaller the product becomes. Multiplying enough
n-grams together would result in numerical underflow. By using log probabilities instead of raw
probabilities, we get numbers that are not as small. Adding in log space is equivalent to
multiplying in linear space: p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4).
For an intrinsic evaluation of a language model we need a test set. As with many of the
statistical models in our field, the probabilities of an n-gram model come from the corpus it is
trained on, the training set or training corpus. We can then measure the quality of an n-gram
model by its performance on some unseen data called the test set or test corpus. So if we are
given a corpus of text and want to compare two different n-gram models, we divide the data
into training and test sets, train the parameters of both models on the training set, and then
compare how well the two trained models fit the test set. But what does it mean to “fit the test
set”? The answer is simple: whichever model assigns a higher probability to the test set
(meaning it more accurately predicts the test set) is the better model.
Perplexity
In practice we don’t use raw probability as our metric for evaluating language models, but a
variant called perplexity. The perplexity (sometimes called PPL for short) of a language model
on a test set is the inverse probability of the test set, normalized by the number of words. For a
test set W = w1w2…wN:
perplexity(W) = P(w1w2…wN)^(-1/N) = (1 / P(w1w2…wN))^(1/N)
The perplexity of a test set W depends on which language model we use. Here’s the perplexity
of W with a unigram language model (just the geometric mean of the inverse unigram probabilities):
perplexity(W) = (∏i=1..N 1/P(wi))^(1/N)
The perplexity of W computed with a bigram language model is still a geometric mean, but
now of the inverse bigram probabilities:
perplexity(W) = (∏i=1..N 1/P(wi|wi-1))^(1/N)
Minimizing perplexity is equivalent to maximizing the test set probability according to the
language model.
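A small sketch of bigram perplexity, assuming a bigram_prob function like the one in the earlier bigram example (and that no probability is zero); log space is used to avoid numerical underflow.

import math

def perplexity(test_tokens, bigram_prob):
    # PPL(W) = (product over i of 1 / P(w_i | w_{i-1})) ** (1/N)
    log_prob = 0.0
    n = 0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(bigram_prob(prev, word))
        n += 1
    return math.exp(-log_prob / n)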
Given a text W, different language models will have different perplexities. Because of this,
perplexity can be used to compare different n-gram models. Let’s look at an example, in which
we trained unigram, bigram, and trigram grammars on 38 million words (including start-of-
sentence tokens) from the Wall Street Journal, using a 19,979 word vocabulary. We then
computed the perplexity of each of these models on a test set of 1.5 million words, using the
perplexity equation above for unigrams, the corresponding one for bigrams, and the analogous
equation for trigrams. On this test set, the perplexities were approximately 962 for the unigram
grammar, 170 for the bigram grammar, and 109 for the trigram grammar.
As we see above, the more information the n-gram gives us about the word sequence, the higher
the probability the n-gram will assign to the string.
One way to visualize what an n-gram model knows is to sample sentences from it. To sample
from a unigram model, imagine all the words of the language covering a probability line
between 0 and 1, each word occupying an interval proportional to its frequency. We choose a
random value between 0 and 1, find that point on the probability line, and print the word whose
interval includes this chosen value. We continue choosing random numbers and generating
words until we randomly generate the sentence-final token </s>.
We can use the same technique to generate bigrams by first generating a random bigram that
starts with <s> (according to its bigram probability). Let’s say the second word of that bigram
is w. We next choose a random bigram starting with w (again, drawn according to its bigram
probability), and so on.
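A sketch of this sampling procedure, assuming the bigrams Counter from the earlier example: starting from <s>, the next word is repeatedly drawn according to its bigram probability until </s> is generated.

import random

def generate(bigrams, max_len=20):
    word = "<s>"
    output = []
    for _ in range(max_len):
        # candidate continuations of the current word, weighted by their counts
        candidates = {w2: c for (w1, w2), c in bigrams.items() if w1 == word}
        word = random.choices(list(candidates), weights=list(candidates.values()))[0]
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)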
Generalization and Zeros
The n-gram model, like many statistical models, is dependent on the training corpus. One
implication of this is that the probabilities often encode specific facts about a given training
corpus. Another implication is that n-grams do a better and better job of modelling the training
corpus as we increase the value of N. We can use the sampling method from the prior section
to visualize both of these facts! To give an intuition for the increasing power of higher-order
n-grams, Figure below shows random sentences generated from unigram, bigram, trigram, and
4-gram models trained on Shakespeare’s works.
Figure 3.
The longer the context on which we train the model, the more coherent the sentences. In the
unigram sentences, there is no coherent relation between words or any sentence-final
punctuation. The bigram sentences have some local word-to-word coherence (especially if we
consider that punctuation counts as a word). The trigram and 4-gram sentences are beginning
to look a lot like Shakespeare. Indeed, a careful investigation of the 4-gram sentences shows
that they look a little too much like Shakespeare. The words It cannot be but so are directly
from King John. From Shakespeare (N = 884,647 tokens, V = 29,066 types), our n-gram
probability matrices are ridiculously sparse. There are V^2 = 844 million possible bigrams
alone, and the number of possible 4-grams is V^4 ≈ 7 × 10^17. Thus, once the generator has
chosen the first 4-gram (It cannot be but), there are only five possible continuations (that, I,
he, thou, and so); indeed, for many 4-grams, there is only one continuation.
To get an idea of the dependence of a grammar on its training set, let’s look at an n-gram
grammar trained on a completely different corpus: the Wall Street Journal (WSJ) newspaper.
Shakespeare and the Wall Street Journal are both English, so we might expect some overlap
between our n-grams for the two genres. Figure below shows sentences generated by unigram,
bigram, and trigram grammars trained on 40 million words from WSJ.
Compare these examples to the pseudo-Shakespeare in above figure. While they both model
“English-like sentences”, there is clearly no overlap in generated sentences, and little overlap
even in small phrases. Statistical models are likely to be pretty useless as predictors if the
training sets and the test sets are as different as Shakespeare and WSJ.
How should we deal with this problem when we build n-gram models? One step is to be sure
to use a training corpus that has a similar genre to whatever task we are trying to accomplish.
To build a language model for translating legal documents, we need a training corpus of legal
documents. To build a language model for a question-answering system, we need a training
corpus of questions. It is equally important to get training data in the appropriate dialect or
variety, especially when processing social media posts or spoken transcripts.
Matching genres and dialects is still not sufficient. Our models may still be subject to the
problem of sparsity. For any n-gram that occurred a sufficient number of times, we might have
a good estimate of its probability. But because any corpus is limited, some perfectly acceptable
English word sequences are bound to be missing from it. That is, we’ll have many cases of
putative “zero probability n-grams” that should really have some non-zero probability.
Consider the words that follow the bigram denied the in the WSJ Treebank3 corpus, together
with their counts:
denied the allegations: 5
denied the speculation: 2
denied the rumors: 1
denied the report: 1
But suppose our test set has phrases like denied the offer and denied the loan. Our model will
incorrectly estimate that P(offer | denied the) is 0.
These zeros—things that don’t ever occur in the training set but do occur in the test set—are a
problem for two reasons. First, their presence means we are underestimating the probability of
all sorts of words that might occur, which will hurt the performance of any application we want
to run on this data. Second, if the probability of any word in the test set is 0, the entire
probability of the test set is 0. By definition, perplexity is based on the inverse probability of
the test set. Thus, if some words have zero probability, we can’t compute perplexity at all, since
we can’t divide by 0! There are two solutions, depending on the kind of zero. For words whose
n-gram probability is zero because they occur in a novel test set context, like the example of
denied the offer above, we’ll introduce algorithms called smoothing or discounting.
Smoothing algorithms shave off a bit of probability mass from some more frequent events and
give it to these unseen events. But first, let’s talk about an even more insidious form of zero:
words that the model has never seen before at all (in any context): unknown words!
Unknown Words
What do we do about words we have never seen before? Perhaps the word Jurafsky simply
did not occur in our training set, but pops up in the test set! We can choose to disallow this
situation from occurring, by stipulating that we already know all the words that can occur. In
such a closed vocabulary system the test set can only contain words from this known lexicon,
and there will be no unknown words.
In most real situations, however, we have to deal with words we haven’t seen before, which
we’ll call unknown words, or out of vocabulary (OOV) words. The percentage of OOV words
that appear in the test set is called the OOV rate. One way to create an open vocabulary system
is to model these potential unknown words in the test set by adding a pseudo-word called
<UNK>.
There are two common ways to train the probabilities of the unknown word model <UNK>.
The first one is to turn the problem back into a closed vocabulary one by choosing a fixed
vocabulary in advance:
1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to the unknown
word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular word in the
training set.
The second alternative, in situations where we don’t have a prior vocabulary in advance, is to
create such a vocabulary implicitly, replacing words in the training data by <UNK> based on
their frequency. For example, we can replace by <UNK> all words that occur fewer than n
times in the training set, where n is some small number, or equivalently select a vocabulary
size V in advance (say 50,000) and choose the top V words by frequency and replace the rest
by <UNK>. In either case we then proceed to train the language model as before, treating
<UNK> like a regular word.
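A sketch of the second strategy: build the vocabulary from the training data itself and replace rare words with <UNK> before training (the min_count threshold is an arbitrary choice for illustration).

from collections import Counter

def replace_rare_words(sentences, min_count=2):
    # words seen fewer than min_count times are mapped to <UNK>
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [[w if w in vocab else "<UNK>" for w in sent] for sent in sentences]

train = [["i", "want", "english", "food"], ["i", "want", "chinese", "food"]]
print(replace_rare_words(train))
# [['i', 'want', '<UNK>', 'food'], ['i', 'want', '<UNK>', 'food']]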
Smoothing
What do we do with words that are in our vocabulary (they are not unknown words) but appear
in a test set in an unseen context (for example they appear after a word they never appeared
after in training)? To keep a language model from assigning zero probability to these unseen
events, we’ll have to shave off a bit of probability mass from some more frequent events and
give it to the events we’ve never seen. This modification is called smoothing or discounting.
Now we’ll see a variety of ways to do smoothing: Laplace (add-one) smoothing, add-k
smoothing, stupid backoff, and Kneser-Ney smoothing.
Laplace Smoothing
The simplest way to do smoothing is to add one to all the n-gram counts, before we normalize
them into probabilities. All the counts that used to be zero will now have a count of 1, the
counts of 1 will be 2, and so on. This algorithm is called Laplace smoothing. Laplace smoothing
does not perform well enough to be used in modern n-gram models, but it usefully
introduces many of the concepts that we see in other smoothing algorithms, gives a useful
baseline, and is also a practical smoothing algorithm for other tasks like text classification.
Let’s start with the application of Laplace smoothing to unigram probabilities. Recall that the
unsmoothed maximum likelihood estimate of the unigram probability of the word wi is its
count ci normalized by the total number of word tokens N:
P(wi) = ci / N
Laplace smoothing merely adds one to each count (hence its alternate name add-one
smoothing). Since there are V words in the vocabulary and each one was incremented, we also
need to adjust the denominator to take into account the extra V observations:
P_Laplace(wi) = (ci + 1) / (N + V)
Let’s smooth our Berkeley Restaurant Project bigrams. Figure below shows the add-one
smoothed counts for the bigrams in Berkeley Restaurant Project.
Recall that normal bigram probabilities are computed by normalizing each row of counts by
the unigram count:
P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
For add-one smoothed bigram counts, we need to augment the unigram count by the number
of total word types in the vocabulary V:
P_Laplace(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
Thus, each of the unigram counts given in the previous section will need to be augmented by
V =1446. The result is the smoothed bigram probabilities in Figure below.
The sharp change in counts and probabilities occurs because too much probability mass is
moved to all the zeros.
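A one-function sketch of add-one smoothed bigram estimation, assuming the unigrams and bigrams Counters from the earlier example:

def laplace_bigram_prob(prev, word, unigrams, bigrams):
    # P_Laplace(word | prev) = (C(prev word) + 1) / (C(prev) + V)
    V = len(unigrams)   # vocabulary size
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)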
Add-k smoothing
One alternative to add-one smoothing is to move a bit less of the probability mass from the
seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (0.5?
0.05? 0.01?). This algorithm is therefore called add-k smoothing:
P_Add-k(wn | wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + kV)
Add-k smoothing requires that we have a method for choosing k; this can be done, for example,
by optimizing on a devset. Although add-k is useful for some tasks (including text
classification), it turns out that it still doesn’t work well for language modelling, generating
counts with poor variances and often inappropriate discounts.
Backoff and Interpolation
The discounting we have been discussing so far can help solve the problem of zero frequency
n-grams. But there is an additional source of knowledge we can draw on. If we are trying to
compute P(wn|wn-2wn-1) but we have no examples of a particular trigram wn-2wn-1wn, we can
instead estimate its probability by using the bigram probability P(wn|wn-1). Similarly, if we
don’t have counts to compute P(wn|wn-1), we can look to the unigram P(wn). In other words,
sometimes using less context is a good thing, helping to generalize more for contexts that the
model hasn’t learned much about. There are two ways to use this n-gram “hierarchy”. In
backoff, we use the trigram if the evidence is sufficient, otherwise we use the bigram, otherwise
the unigram. In other words, we only “back off” to a lower-order n-gram if we have zero
evidence for a higher-order n-gram.
By contrast, in interpolation, we always mix the probability estimates from all the n-gram
estimators, weighting and combining the trigram, bigram, and unigram counts. In simple linear
interpolation, we combine different order n-grams by linearly interpolating them. Thus, we
estimate the trigram probability P(wn|wn-2wn-1) by mixing together the unigram, bigram, and
trigram probabilities, each weighted by a λ:
P̂(wn | wn-2wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2wn-1), where the λs sum to 1.
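A sketch of simple linear interpolation, assuming unigram, bigram, and trigram probability functions are available and that the lambda weights (fixed here for illustration) sum to 1; in practice they are tuned on a held-out set.

def interpolated_prob(w3, w2, w1, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    # P_hat(w3 | w1 w2) = l1*P(w3) + l2*P(w3|w2) + l3*P(w3|w1 w2)
    l1, l2, l3 = lambdas
    return l1 * p_uni(w3) + l2 * p_bi(w3, w2) + l3 * p_tri(w3, w2, w1)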
UNIT - IV
Semantic Parsing
1. Introduction
• In other words, the reusability of the representation across domains is very limited.
• The problem with the second approach is that it is extremely difficult to construct a
general-purpose ontology and create symbols that are shallow enough to be learnable
but detailed enough to be useful for all possible applications.
• Ontology means
1. The branch of metaphysics dealing with the nature of being.
2. a set of concepts and categories in a subject area or domain that shows their properties
and the relations between them.
"what's new about our ontology is that it is created automatically from large datasets"
2. Semantic Interpretation
➢ Resolve the ambiguities of words in context. In the sentence The bill is large but need
not be paid, the theory should be able to disambiguate the monetary meaning of bill.
➢ Identify meaningless but syntactically well-formed sentence: Colorless green ideas
sleep furiously.
➢ Identify syntactically or transformationally unrelated paraphrases of a concept having
the same semantic content.
➢ In any given language, the same word type is used in different contexts and with
different morphological variants to represent different entities or concepts in the
world.
➢ For example, we use the word nail to represent a part of the human anatomy
and also to represent the generally metallic object used to secure other objects.
➢ Once we have the word-sense, entities and events identified, another level of semantics
structure comes into play: identifying the participants of the entities in these events.
➢ Resolving the argument structure of a predicate in the sentence is where we identify which
entities play what part in which event.
➢ A word which functions as the verb is called a predicate and words which function as the
nouns are called arguments. Here are some other predicates and arguments:
3. System Paradigms
Semantic interpretation systems can be categorized along three dimensions: 1. System
Architectures 2. Scope 3. Coverage.
1. System Architectures
a. Knowledge based: These systems use a predefined set of rules or a knowledge base to
obtain a solution to a new problem.
b. Unsupervised: These systems tend to require minimal human intervention to be
functional by using existing resources that can be bootstrapped for a particular
application or problem domain.
c. Supervised: these systems involve the manual annotation of some phenomena
that appear in a sufficient quantity of data so that machine learning algorithms can
be applied.
d. Semi-Supervised: manual annotation is usually very expensive and does not yield
enough data to completely capture a phenomenon. In such instances, researchers
can automatically expand the data set on which their models are trained, either
by employing machine-generated output directly or by bootstrapping off an
existing model by having humans correct its output.
2. Scope:
➢ Domain Dependent: These systems are specific to certain domains, such as
air travel reservations or simulated football coaching.
➢ Domain Independent: These systems are general enough that the techniques can be
applicable to multiple domains with little or no change.
3. Coverage
a. Shallow: These systems tend to produce an intermediate representation that can
then be converted to one that a machine can base its action on.
b. Deep: These systems usually create a terminal representation that is directly consumed by
a machine or application.
4. Word Sense
➢ Word Sense Disambiguation is an important method of NLP by which the meaning
of a word is determined, which is used in a particular context.
➢ In a compositional approach to semantics, where the meaning of the whole is
composed from the meanings of its parts, the smallest parts under consideration in textual
discourse are typically the words themselves: either tokens as they appear in the text
or their lemmatized forms.
➢ Word sense has been examined and studied for a very long time.
➢ Attempts to solve this problem range from rule based and knowledge based to
completely unsupervised, supervised, and semi-supervised learning methods.
➢ Very early systems were predominantly rule based or knowledge based and used
dictionary definitions of senses of words.
➢ Unsupervised word sense induction or disambiguation techniques try to induce the
senses of a word as it appears in various corpora.
➢ These systems perform either a hard or soft clustering of words and tend to allow the
tuning of these clusters to suit a particular application.
➢ Most recent supervised approaches to word sense disambiguation work at an
application-independent level of sense granularity (including fine distinctions), although
the output of supervised approaches can still be amenable to generating a ranking of senses.
Resources:
➢ As with any language understanding task, the availability of resources is a key
factor in the disambiguation of word senses in corpora.
➢ Early work on word sense disambiguation used machine-readable dictionaries or
thesauri as knowledge sources.
➢ Two prominent sources were the Longman dictionary of contemporary English
(LDOCE) and Roget’s Thesaurus.
➢ The biggest sense-annotated corpus, OntoNotes, was released through the Linguistic Data
Consortium (LDC).
➢ The Chinese sense-annotated resource is HowNet.
Systems:
Researchers have explored various system architectures to address the sense disambiguation
problem.
We can classify these systems into four main categories: i. rule based or knowledge based,
ii. supervised, iii. unsupervised, and iv. semi-supervised.
Rule Based:
➢ The first-generation of word sense disambiguation systems was primarily based on
dictionary sense definitions.
➢ Much of this information is historical and cannot readily be translated and made
available for building systems today, but some of the techniques and algorithms are still
applicable.
Supervised:
• The simplest form of word sense disambiguation system is the supervised approach,
which tends to transfer all the complexity to the machine learning machinery. Although it
still requires hand annotation, it tends to be superior to unsupervised approaches and
generally performs best.
"Word sense disambiguation" (WSD) is a natural language processing (NLP) task that involves
determining the correct sense or meaning of a word within a given context. Many words in
natural language have multiple meanings or senses, and WSD aims to choose the most
appropriate sense for a word in a specific sentence or context.
Supervised learning with Support Vector Machines (SVM) is one approach to solving the WSD
problem. Here's how it works:
1. Data Collection: To train an SVM for WSD, you need a labeled dataset where each word
is tagged with its correct sense in various contexts. This dataset is typically created by
human annotators who assign senses to words in sentences.
2. Feature Extraction: For each word in the dataset, you need to extract relevant features
from its context. These features could include the words surrounding the target word,
part-of-speech tags, syntactic information, and more. These features serve as the input to
the SVM.
3. Training: Once you have the labeled dataset and extracted features, you can train an
SVM classifier. The goal is to teach the SVM to learn patterns in the features that are
indicative of specific word senses.
4. Testing/Predicting: After training, you can use the SVM to predict the sense of an
ambiguous word in a new, unseen sentence. The SVM considers the context features and
assigns the word the most likely sense based on what it learned during training.
5. Evaluation: To assess the performance of your WSD system, you can use various
evaluation metrics, such as accuracy, precision, recall, and F1-score. These metrics help
you measure how well your SVM-based WSD system is performing in disambiguating
word senses.
SVMs are popular for WSD because they are effective at handling high-dimensional feature
spaces and can learn complex decision boundaries. However, the success of the SVM-based WSD
system heavily depends on the quality of the labeled dataset and the choice of features used for
training.
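To make the pipeline above concrete, here is a minimal sketch in Python that trains a linear SVM to disambiguate a single ambiguous word ("bank") using bag-of-words features over its sentence context. The toy sentences, the two sense labels and the feature choice are illustrative assumptions, not a standard dataset or a prescribed feature set.

# Minimal sketch: SVM-based WSD for one ambiguous word ("bank").
# The toy sentences and sense labels below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_sentences = [
    "he deposited money in the bank",          # FINANCE sense
    "the bank approved her loan application",  # FINANCE sense
    "they had a picnic on the river bank",     # RIVER sense
    "fish swam close to the muddy bank",       # RIVER sense
]
train_senses = ["FINANCE", "FINANCE", "RIVER", "RIVER"]

# Bag-of-words over the sentence stands in for the richer context features
# (neighbouring words, POS tags, syntactic relations) described above.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(train_sentences, train_senses)

test_sentences = ["she opened an account at the bank",
                  "the boat drifted toward the bank of the river"]
print(model.predict(test_sentences))  # sense predictions (unreliable on such tiny data)

In practice the training corpus would be a sense-annotated resource such as SemCor or OntoNotes, and the feature extraction step would include the context window, part-of-speech tags and syntactic relations rather than a plain bag of words.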
The identification of the head word is important in syntax because it helps determine the
grammatical structure of a phrase or sentence. For feature selection in NLP tasks like parsing or
word sense disambiguation, knowing the head word and its relationships with other words in a
sentence can be valuable information. Syntactic relations often involve the relationship between a
head word and its dependents or modifiers, and these relations can be used as features in
various natural language processing applications.
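As a quick illustration of head words as features, the sketch below uses spaCy's dependency parse to print each token's syntactic head and dependency relation. It assumes spaCy is installed and the en_core_web_sm model has been downloaded.

# Sketch: inspecting head words and dependency relations with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ram deposited the money in the bank")

for token in doc:
    # token.head is the syntactic head; token.dep_ is the relation label.
    print(f"{token.text:10s} --{token.dep_:>6s}--> {token.head.text}")

Head-dependent pairs of this kind can then be used directly as syntactic features for parsing-related tasks or word sense disambiguation.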
Unsupervised:
A well-known unsupervised approach is HyperLex, a graph-based word sense induction
algorithm: it builds a co-occurrence graph of the words that appear in the contexts of a
target word and treats highly connected "hub" words in that graph as representatives of
distinct senses.
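The sketch below illustrates only the core idea (a co-occurrence graph plus greedy hub selection); the actual HyperLex algorithm adds frequency and weight thresholds and a minimum-spanning-tree step that are omitted here, and the toy contexts are illustrative assumptions.

# Simplified sketch of graph-based sense induction (HyperLex-like idea only).
import networkx as nx
from itertools import combinations

contexts = [                        # toy context windows around the target "bank"
    ["money", "loan", "account"],
    ["loan", "account", "interest"],
    ["river", "water", "shore"],
    ["water", "shore", "fish"],
]

G = nx.Graph()
for ctx in contexts:
    for u, v in combinations(set(ctx), 2):
        weight = G.get_edge_data(u, v, {"weight": 0})["weight"]
        G.add_edge(u, v, weight=weight + 1)   # count co-occurrences

hubs = []
H = G.copy()
while H.number_of_nodes() > 0 and len(hubs) < 3:
    hub = max(H.degree, key=lambda pair: pair[1])[0]   # highest-degree node
    if H.degree[hub] == 0:
        break
    hubs.append(hub)
    # The hub and its neighbourhood are taken to represent one induced sense.
    H.remove_nodes_from(list(H.neighbors(hub)) + [hub])

print("Induced sense hubs:", hubs)  # typically one hub per topical cluster here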
Semi Supervised:
Semi-supervised learning is a machine learning paradigm that combines both labeled and
unlabeled data to improve model performance. In the context of word sense disambiguation,
it lets a system start from a small set of sense-labeled examples and exploit a much larger
pool of unannotated text.
Self-training is a popular semi-supervised learning approach that can be adapted for WSD. In
self-training for WSD, you start with a small set of labeled examples and a larger set of
unlabeled examples. The process involves iterative steps:
1. Initialization: Begin with a small labeled dataset where each example consists of a
sentence containing an ambiguous word and its corresponding sense label.
2. Feature Extraction: Extract relevant features from the labeled examples, which
typically include information about the target word, its context words, part-of-speech
tags, syntactic relations, and more.
3. Model Training: Train a WSD model using the labeled data. This can be a
supervised machine learning model like Support Vector Machines (SVM), Naive
Bayes, or a neural network-based model.
4. Prediction: Use the trained model to predict word senses for the unlabeled data.
Apply the model to the sentences containing the ambiguous word from the unlabeled
dataset to assign senses to those instances.
5. Confidence Threshold: Introduce a confidence threshold or some criteria to filter the
predictions. For instance, you can choose to keep only the predictions where the
model is highly confident.
6. Adding Labeled Data: Add the confidently predicted examples to the labeled
dataset, marking them as newly labeled instances.
7. Iteration: Repeat steps 2-6 for a fixed number of iterations or until convergence.
8. Final Model: Train a final model using the combined labeled data (original labeled
dataset plus the newly labeled instances) to create a more robust WSD model.
Advantages:
• It leverages a larger pool of unlabeled data, which can be especially beneficial when
labeled data is scarce.
• It allows the model to learn from its own predictions and iteratively improve.
• Self-training is a flexible approach and can be used with various machine learning
models.
Challenges:
• Labeling errors: The initial labeled dataset should be of high quality because errors
can accumulate during self-training iterations.
Semi-supervised learning with self-training can be effective for WSD, but it's essential to
carefully design the process, monitor model performance, and apply filtering criteria to
ensure the quality of the added labeled instances.
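A minimal self-training loop for WSD might look like the sketch below. It uses logistic regression rather than an SVM simply because it exposes class probabilities for the confidence filter in step 5; the toy data, the 0.9 threshold and the iteration count are illustrative assumptions.

# Sketch of self-training for WSD: grow the labeled set from confident
# predictions on unlabeled contexts. All data below is a toy illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

labeled = ["he deposited money in the bank",
           "they sat on the river bank"]
labels = ["FINANCE", "RIVER"]
unlabeled = ["the bank raised its interest rate",
             "the bank of the stream was muddy",
             "she walked into town"]

vec = CountVectorizer()
vec.fit(labeled + unlabeled)                 # shared feature space (step 2)

THRESHOLD, ITERATIONS = 0.9, 3               # illustrative values
for _ in range(ITERATIONS):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.transform(labeled), labels)              # step 3: train
    if not unlabeled:
        break
    probs = clf.predict_proba(vec.transform(unlabeled))  # step 4: predict
    still_unlabeled = []
    for sentence, p in zip(unlabeled, probs):
        if p.max() >= THRESHOLD:                         # step 5: confidence filter
            labeled.append(sentence)                     # step 6: add as labeled
            labels.append(clf.classes_[int(np.argmax(p))])
        else:
            still_unlabeled.append(sentence)
    unlabeled = still_unlabeled                          # step 7: iterate

print(len(labeled), "labeled examples after self-training")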
Definition of Discourse: Discourse is the coherent structure of language above the level of
sentences or clauses. A discourse is a coherent structured group of sentences.
What makes a passage coherent? A practical answer: It has meaningful
connections between its utterances.
Cohesion
Relations between words in two units (sentences, paragraphs) “glue” them together.
Example: Before winter I built a chimney, and shingled the sides of my house… I have thus
a tight shingled and plastered house.
Discourse Processing:
One of the major problems in NLP is discourse processing − building theories and models of
how utterances stick together to form coherent discourse. In practice, language consists of
collocated, structured and coherent groups of sentences rather than isolated, unrelated
sentences, much as a film consists of connected scenes rather than unrelated shots. These
coherent groups of sentences are referred to as discourse.
Concept of Coherence
Coherence and discourse structure are interconnected in many ways. Coherence, as a
property of good text, is used to evaluate the output quality of a natural language generation
system. The question that arises here is: what does it mean for a text to be coherent? Suppose
we collected one sentence from every page of a newspaper − would that be a discourse? Of
course not, because those sentences do not exhibit coherence. A coherent discourse must
possess the following property −
A discourse is coherent if it has meaningful connections between its utterances. This
property is captured by coherence relations; for example, some sort of explanation must be
available to justify the connection between two utterances.
Discourse structure
An important question regarding discourse is what kind of structure a discourse must have.
The answer depends on the segmentation we apply to the discourse. Discourse segmentation
may be defined as determining the types of structures in a large discourse. It is quite difficult
to implement, but it is very important for applications such as information retrieval, text
summarization and information extraction.
Unsupervised discourse segmentation does not rely on any hand-labeled segment boundaries.
Supervised discourse segmentation, on the other hand, needs boundary-labeled training data,
which can be relatively easy to acquire from text in which such boundaries (for example,
paragraph breaks) are already marked. In supervised discourse segmentation, discourse
markers or cue words play an important role: a discourse marker or cue word is a word or
phrase that functions to signal discourse structure. These discourse markers are often
domain-specific.
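As one concrete example, unsupervised segmentation is often done by measuring lexical cohesion between adjacent units and placing a boundary where cohesion drops (the idea behind TextTiling). The sketch below is a heavily simplified version of that idea: the sentences, the threshold and the sentence-level granularity are all illustrative choices, not the full algorithm.

# Simplified lexical-cohesion segmenter (TextTiling-style idea, not the full algorithm).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The bank approved the loan application.",
    "The loan had a very low interest rate.",
    "Customers praised the bank for the low rate.",
    "The river flooded the village in spring.",
    "The flood water covered the river banks.",
    "Villagers watched the water rise.",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(sentences)

THRESHOLD = 0.1                       # illustrative; real systems use depth scores
boundaries = []
for i in range(len(sentences) - 1):
    similarity = cosine_similarity(X[i], X[i + 1])[0, 0]  # cohesion of neighbours
    if similarity < THRESHOLD:
        boundaries.append(i + 1)      # segment boundary before sentence i+1

print("Boundaries before sentences:", boundaries)  # the banking/flood topic shift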
Text Coherence
Lexical repetition is one way to find structure in a discourse, but on its own it does not
satisfy the requirement of coherent discourse. To achieve coherent discourse, we must focus
on coherence relations, which define the possible connections between utterances in a
discourse. Hobbs proposed relations of the following kind −
We take two terms, S0 and S1, to represent the meanings of the two related sentences −
Result
It infers that the state asserted by term S0 could cause the state asserted by S1. For example,
two statements show the relationship result: Ram was caught in the fire. His skin burned.
Explanation
It infers that the state asserted by S1 could cause the state asserted by S0. For example, two
statements show the relationship − Ram fought with Shyam’s friend. He was drunk.
Parallel
It infers p(a1, a2, …) from the assertion of S0 and p(b1, b2, …) from the assertion of S1,
where ai and bi are similar for all i. For example, two statements showing the parallel
relation: Ram wanted a car. Shyam wanted money.
Elaboration
It infers the same proposition P from both assertions, S0 and S1. For example, two
statements showing the relation elaboration: Ram is from Chandigarh. He has lived in that city all his life.
Occasion
It holds when a change of state can be inferred from the assertion of S0, whose final state
can be inferred from S1, or vice versa. For example, the two statements show the relation
occasion: Ram picked up the book. He gave it to Shyam.
The coherence of an entire discourse can also be captured by a hierarchical structure built
from coherence relations, in which individual relations combine smaller discourse segments
into larger ones.
Reference Resolution
Interpreting the sentences of a discourse is another important task, and to achieve it we need
to know who or what entity is being talked about. Here, the interpretation of references is the
key element. A reference may be defined as a linguistic expression used to denote an entity or
individual. For example, in the passage "Ram, the manager of ABC bank, saw his friend Shyam
at a shop. He went to meet him", the linguistic expressions Ram, his and he are references.
On the same note, reference resolution may be defined as the task of determining which
entities are referred to by which linguistic expressions.
Let us now see the different types of referring expressions. The five types of referring
expressions are described below −
Indefinite Noun Phrases
This kind of reference introduces entities that are new to the hearer into the discourse
context. For example, in the sentence Ram had gone around one day to bring him some food,
the phrase some food is an indefinite reference.
Definite Noun Phrases
In contrast, this kind of reference refers to entities that are not new but are already
identifiable to the hearer in the discourse context. For example, in the sentence I used to read
The Times of India, The Times of India is a definite reference.
Pronouns
A pronoun is a form of definite reference. For example, in Ram laughed as loud as he could,
the word he is a pronominal referring expression.
Demonstratives
Demonstratives point to entities and behave differently from simple definite references. For
example, this and that are demonstrative pronouns.
Names
It is the simplest type of referring expression. It can be the name of a person, an organization
or a location. For example, in the examples above, Ram is a name referring expression.
Coreference Resolution
It is the task of finding the referring expressions in a text that refer to the same entity, in
other words, the task of finding coreferring expressions. A set of coreferring expressions is
called a coreference chain. For example, in the passage given earlier, Ram, the manager of
ABC bank, his and he all refer to the same entity and together form a coreference chain.
In English, the main problem for coreference resolution is the pronoun it, because it has
many uses. It can refer to an entity much like he and she, but it also has non-referential uses
that do not pick out any specific thing, as in It's raining or It is good that you came.
Unlike coreference resolution, pronominal anaphora resolution may be defined as the task of
finding the antecedent of a single pronoun. For example, in the passage given earlier,
resolving the pronoun his means finding the word Ram, because Ram is its antecedent.
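The deliberately naive sketch below resolves a third-person pronoun to the nearest preceding capitalized word, just to make the notion of an antecedent concrete; real anaphora resolvers use syntactic constraints, agreement features and learned models rather than this recency heuristic, and the example sentence is taken from the pronoun discussion above.

# Toy recency heuristic for pronominal anaphora resolution (illustration only).
PRONOUNS = {"he", "she", "him", "her", "his", "hers"}

def resolve_pronouns(sentence):
    tokens = sentence.replace(".", " ").split()
    antecedents = {}
    for i, token in enumerate(tokens):
        if token.lower() in PRONOUNS:
            # Walk backwards and take the nearest capitalized token as antecedent.
            for previous in reversed(tokens[:i]):
                if previous[0].isupper():
                    antecedents[token] = previous
                    break
    return antecedents

print(resolve_pronouns("Ram laughed as loud as he could"))  # {'he': 'Ram'}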