Unit - 5
Discourse Analysis is the task of extracting meaning from a corpus or text. Discourse Analysis is
very important in Natural Language Processing and helps train NLP models better.
Pre-requisites
Before learning about the Discourse in NLP, let us first learn some basics about the NLP
itself.
NLP stands for Natural Language Processing. In NLP, we perform analysis and
synthesis of the input, and the trained NLP model then predicts the necessary output.
NLP is closely tied to technologies like Artificial Intelligence and Deep Learning.
In basic terms, we can say that NLP is nothing but a computer program's ability to
process and understand the provided human language.
Introduction
One of the primary challenges in Artificial Intelligence is the processing of natural
language data by computers; Natural Language Processing remains a genuinely difficult
problem in AI. And among the major problems within Natural Language Processing is the
processing of discourse.
Since the real problem is the processing of discourse in NLP, we need to work on it so that
our models can be trained well. This enables better processing of natural language data by
computers, and hence the Artificial Intelligence system can predict the desired result.
Now, a question that comes to mind is: what is discourse in NLP? In simple terms,
discourse in NLP is nothing but coherent groups of sentences. When we deal with
Natural Language Processing, the provided language consists of structured, collective,
and consistent groups of sentences, which are termed discourse. These relationships
between sentences make training an NLP model easier and bring its predictions closer to
the actual results.
Let us now learn about the concept of coherence in the next section.
Concept of Coherence
Coherence, in terms of discourse in NLP, means that the utterances make sense together,
i.e., that they have meaningful connections and correlations. Coherence is closely
connected with discourse structure (discussed in the next section). Properties of good
text, such as coherence, are used to evaluate the output quality of natural language
generation systems.
What are coherent discourse texts? If we read a paragraph from a newspaper, we can
see that the entire paragraph is interrelated; hence we can say the discourse is coherent.
But if we merely string newspaper headlines together consecutively, the result is not a
discourse; it is just a group of sentences that is non-coherent.
Let us now learn about the two major properties of coherence: the coherence relation
between utterances and the coherence relation between entities.
When we say that a discourse is coherent, it simply means that the discourse has
some sort of meaningful connection. A coherence relation between utterances tells us that
some kind of connection is present between the utterances.
If there is some kind of relationship between the entities, we can also say that the
discourse is coherent. Coherence between entities is known as entity-based coherence.
Discourse Structure
So far, we have discussed discourse and coherence, but not the structure that discourse in
NLP must have. The structure of a discourse depends on the type of segmentation applied
to it.
What is discourse segmentation? When we determine the types of structures for a
large discourse, we term this segmentation. Segmentation is difficult to implement,
but it is very necessary, as discourse segmentation is used in fields like:
Information Retrieval,
Text summarization,
Information Extraction, etc.
Unsupervised Discourse Segmentation
Here, the algorithm takes the help of cohesion: it classifies related pieces of text and
ties them together using linguistic devices. In simpler terms, unsupervised discourse
segmentation means classifying and grouping similar texts with the help of coherent
discourse in NLP.
Unsupervised discourse segmentation can also be performed with the help of lexical
cohesion. Lexical cohesion indicates relationships among similar units, for example,
synonyms.
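One well-known unsupervised approach built on lexical cohesion is the TextTiling algorithm. Below is a minimal sketch using NLTK's implementation (an illustration, assuming NLTK is installed and the 'brown' and 'stopwords' data have been downloaded; the number of segments found depends on the input):

from nltk.corpus import brown
from nltk.tokenize import TextTilingTokenizer

# TextTiling places boundaries where the vocabulary of adjacent text
# blocks stops overlapping, i.e., where lexical cohesion drops.
tt = TextTilingTokenizer()

# A few paragraphs of raw text (TextTiling expects paragraph breaks).
text = brown.raw()[:4000]

segments = tt.tokenize(text)
print(len(segments), "topical segments found")
print(segments[0][:200])  # beginning of the first segment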
Supervised Discourse Segmentation
In the previous kind of segmentation, there was no labeled segment boundary separating
the discourse segments. In supervised discourse segmentation, by contrast, we deal with a
training data set having labeled boundaries. To differentiate or structure the discourse
segments, we make use of cue words or discourse markers, which work to signal the
discourse structure. As there can be varied domains of discourse in NLP, the cue words or
discourse markers are domain-specific.
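As a toy illustration, here is a sketch of boundary detection with cue words (the cue-word list and sentences are made-up assumptions; a real system would learn domain-specific markers from labeled data):

import re

# Hypothetical cue words that often open a new discourse segment.
CUE_WORDS = ("however", "moreover", "first", "finally", "in conclusion")
CUE_PATTERN = re.compile(r"^(%s)\b" % "|".join(CUE_WORDS), re.IGNORECASE)

def segment_on_cues(sentences):
    # Start a new segment whenever a sentence opens with a cue word.
    segments, current = [], []
    for sent in sentences:
        if current and CUE_PATTERN.match(sent.strip()):
            segments.append(current)
            current = []
        current.append(sent)
    if current:
        segments.append(current)
    return segments

doc = [
    "Discourse segmentation splits a text into topical units.",
    "It is useful for summarization and retrieval.",
    "However, finding boundaries automatically is hard.",
    "Finally, supervised methods rely on labeled boundaries.",
]
for seg in segment_on_cues(doc):
    print(seg)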
Text Coherence
As we have previously discussed, coherent discourse in NLP requires finding coherence
relations among the discourse text. To find structure in a discourse we can use lexical
repetition, but lexical repetition alone cannot satisfy the conditions of coherent
discourse. To capture such discourse relations, Hobbs proposed a set of coherence
relations, described below; in each, S0 denotes the first statement and S1 the second.
Result
We can infer that the state asserted by the first statement, S0, causes the state asserted by
the second statement, S1.
For example: Rahul is late. He will be punished.
Here, the first statement, S0 (Rahul is late), is the cause of the second statement, S1 (He
will be punished).
Explanation
This is the converse of Result: the state asserted by the second statement, S1, is the cause
of the state asserted by the first statement, S0. For example: Rahul fought with his friend.
He was drunk. Here, S1 (He was drunk) explains S0.
Parallel
By parallel, we mean that the assertion from statement S0, p(a1, a2, ...), and the assertion
from statement S1, p(b1, b2, ...), share the same predicate p, with ai and bi similar for all
values of i.
In simpler terms, the sentences are parallel. For example: He wants food. She wants
money. The two statements are parallel, as there is a sense of wanting in both.
Elaboration
Elaboration means that the same proposition P can be inferred from both assertions, S0
and S1; the second statement elaborates on or restates the first. For example: Rahul is
from Delhi. He grew up in India's capital.
Occasion
An occasion takes place when a change of state can be inferred from the first assertion,
S0, whose final state can be inferred from S1, or vice versa. Let us take an example to
understand the occasion relation better: Rahul took the money. He gave it to Rohan.
In the previous section, we discussed how text coherence arises. Let us now try to build
a hierarchical discourse structure from a group of statements. We generally create a
hierarchical structure among the coherence relations to cover the entire discourse.
S1: Rahul went to the bank to deposit money.
S2: He then went to Rohan's shop.
S3: He wanted a phone.
S4: He did not have a phone.
S5: He also wanted to buy a laptop from Rohan's shop.
Now the entire discourse can be represented using a hierarchical discourse structure, sketched below.
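One plausible hierarchy (an illustrative assumption; other analyses are possible) is:

Occasion
  S1: Rahul went to the bank to deposit money.
  Explanation
    S2: He then went to Rohan's shop.
    Parallel
      Explanation
        S3: He wanted a phone.
        S4: He did not have a phone.
      S5: He also wanted to buy a laptop from Rohan's shop.

Here S3 explains why S2 happened, S4 explains S3, and S5 is parallel to the phone-related statements.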
Reference Resolution
Extracting the meaning or interpretation of the sentences of a discourse is one of the
most important tasks in natural language processing, and to do so, we first need to know
what or who the entity under discussion is. Reference resolution means working out
which entity is being talked about.
By reference, we mean a linguistic expression used to denote an individual or an entity.
For example, look at the sentences below.
Rahul went to the market. He bought a gift for his sister.
In the above sentences, Rahul, He, and his are references. So, we can simply define
reference resolution as the task of determining which entities are being referred to by
which linguistic expressions.
Let us now look at the various terminologies used in the reference resolution.
Referring expression:
The NLP expression that performs the reference is termed a referring expression. For
example, in the passage above, Rahul, He, and his are referring expressions.
Referent:
Referent is the entity we have referred to. For example, in the above passage, Rahul is
the referent.
Co-refer:
As the name suggests, co-refer is the term used when two or more expressions
refer to the same entity. For example, Rahul and He are used for the same
entity, i.e., Rahul.
Antecedent:
The term that licenses the use of another term (the anaphor) is called the antecedent. For
example, in the above passage, Rahul is the antecedent of the reference He.
Anaphora & Anaphoric:
Anaphora is the use of an expression to refer back to an entity previously introduced
in the discourse; such a referring expression is termed anaphoric.
Discourse model:
It is the model containing the overall representation of the entities that have been
referred to in the discourse text, along with the relationships among them.
As we have previously discussed, the NLP expression that performs the reference is termed a
referring expression. We have mainly five types of referring expressions in Natural Language
Processing. Let us discuss them one by one.
1. Indefinite Noun Phrases
An indefinite noun reference represents an entity that is new to the hearer in the
discourse context. To understand the indefinite noun phrase, let us take an example.
For example:
In the sentence Rahul is doing some work., the phrase some work is an indefinite noun
phrase.
2. Definite Noun Phrases
A definite noun reference represents an entity that is not new to the hearer in the
discourse context; the hearer can easily identify it. To understand the definite noun
phrase, let us take an example.
For example:
In the sentence Rahul loves reading the Times of India., the Times of India is a definite
noun phrase.
3. Pronouns
Pronouns are a form of definite reference (they work just as we learned in English
grammar).
For example:
In the sentence Rahul learned as much as he could., he is a pronoun referring to the noun
Rahul.
4. Demonstratives
Demonstratives also refer to nouns, but they behave differently from simple pronouns.
For example: this, that, these, and those are demonstratives.
5. Names
Names can be names of persons, locations, organizations, etc., and are the simplest form
of referring expression.
For example, in the examples above, Rahul is a name used as a referring expression.
To resolve references, we use two resolution tasks. Let us discuss them one by one.
1. Co-reference Resolution
In co-reference resolution, the main aim is to find the referring expressions in the
provided text that refer to the same entity. In a discourse in NLP, co-refer is the term used
when two or more expressions refer to the same entity.
For example, Rahul and He are used for the same entity, i.e., Rahul.
Co-reference resolution can thus be described as finding the relevant co-referring
expressions in the provided discourse text. Let us take an example for more clarity.
For example: Rahul went to the farm. He cooked food. In this example, Rahul and He are
the referring expressions.
There are certain constraints on co-reference resolution. Let us look at them.
English has many pronouns. Pronouns like he and she can usually be resolved easily. But
when the pronoun is it, resolution can be tricky, and if there is a set of co-referring
expressions, it becomes even more complex. In simpler terms, when the pronoun it is
used, the exact determination of the referred noun is hard.
2. Pronominal Anaphora Resolution
In pronominal anaphora resolution, the aim is to find the antecedent of a single pronoun.
For example, in the passage Rahul went to the farm. He cooked food., Rahul is the
antecedent of the reference He.
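To make these constraints concrete, here is a toy sketch of pronoun resolution using gender agreement plus a recency preference (the gender lexicon and candidate lists are made-up assumptions, not a real resolver):

# Toy resolver: filter candidate antecedents by gender agreement,
# then prefer the most recent surviving mention.
GENDER = {"Rahul": "male", "Rohan": "male", "farm": "neuter", "food": "neuter"}
PRONOUN_GENDER = {"he": "male", "him": "male", "she": "female",
                  "her": "female", "it": "neuter"}

def resolve_pronoun(pronoun, candidates):
    # candidates are entity mentions ordered from oldest to most recent
    wanted = PRONOUN_GENDER[pronoun.lower()]
    matches = [c for c in candidates if GENDER.get(c) == wanted]
    return matches[-1] if matches else None  # recency preference

# "Rahul went to the farm. He cooked food."
print(resolve_pronoun("He", ["Rahul", "farm"]))          # -> Rahul
# With "it" there may be several neuter candidates, which is exactly
# why resolving "it" is trickier.
print(resolve_pronoun("it", ["Rahul", "farm", "food"]))  # -> food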
Conclusion
Discourse in NLP is nothing but coherent groups of sentences. When we are dealing
with Natural Language Processing, the provided language consists of structured,
collective, and consistent groups of sentences, which are termed discourse in NLP.
Discourse Analysis is very important in Natural Language Processing and helps train
NLP models better.
Coherence in terms of discourse in NLP means making sense of the utterances,
i.e., making meaningful connections and correlations. Properties of good text, such as
coherence, are used to evaluate the quality of the output generated by a natural language
generation system.
The extraction of the meaning or interpretation of the sentences of discourse is one of
the most important tasks in natural language processing, and to do so, we first need to
know what or who is the entity that we are talking about.
Indefinite noun reference is a kind of reference that represents the entity that is new
to the discourse context's hearer.
Definite noun reference is a kind of reference that represents the entity that is not
new to the discourse context's hearer. The discourse context's hearer can easily
identify the definite noun reference.
In co-reference resolution, the main aim is to find the referring expressions in the
provided text that refer to the same entity. In pronominal anaphora resolution, the aim is
to find the antecedent of a single pronoun.
Hobbs Algorithm
The Hobbs algorithm is one of the techniques used for pronoun resolution. But what is
pronoun resolution? Let's understand this with an example:
Jack and Jill went up the hill to fetch a pail of water. Jack fell down and broke his crown,
and Jill came tumbling after.
Now, the question is: to whom does the pronoun his refer? To answer this, we as humans
can easily tell that his refers to Jack, and not to Jill, the hill, or the crown.
But is this task as easy for computers?
Coreference occurs when two or more expressions in a text refer to the same person or
object, and the task of locating all expressions that are coreferential with any of the
entities identified in the text is known as coreference resolution. Pronouns and other
referring expressions must therefore be resolved in order to infer the correct
understanding of the text.
To perform this task, computers take the help of different techniques, one of which is the
Hobbs algorithm.
The Hobbs algorithm is one of several approaches to pronoun resolution. The algorithm
is based mainly on the syntactic parse trees of the sentences. To make the idea clearer,
let's consider the previous example of Jack and Jill and understand how we humans
resolve the pronoun his.
But then why we didn’t even thought of crown as a possible solution? Maybe because the
noun ‘crown’ came after the pronoun ‘his’. This is the first assumption in the Hobbs
algorithm, where the search for the referent is always restricted to the left of the target and
We also know that his may not refer to Jill, because Jill is a girl. Generally, animate
entities are referred to by male pronouns (he, his) or female pronouns (she, her), while
inanimate objects take the neuter gender (it). This property is known as gender
agreement, and it eliminates the possibilities of Jill, the hill, and the water.
Finally, pronouns usually reach back only a few sentences, and entities closer to the
referring expression matter more than those further away. This leaves us with the only
possible solution, i.e., Jack.
Now, after understanding how humans process text and resolve pronouns, let's see how
we can embed this intelligence (using the Hobbs algorithm) in machines, which lack
common sense. The input to the Hobbs algorithm is the pronoun to be resolved, together
with the syntactic parses of the sentences containing it. For the walkthrough below,
consider two sentences such as Jack is an engineer. Jill likes him., for which we have the
two syntactic parse trees (figure omitted).
The algorithm starts with the target pronoun and walks up the parse tree to the root node
S. For each noun phrase or S node that it finds, it does a breadth-first, left-to-right search
of the node's children to the left of the target. So, in our example, the algorithm starts in
the parse tree of sentence 2 and climbs up to the root node S2. It then does a breadth-first
search to find a noun phrase (NP), and the first noun phrase it finds is the one for the
noun Jill.
But it does not accept that branch, because of the syntactic constraint of binding theory.
Binding theory states that a reflexive can refer to the subject of the most immediate
clause in which it appears, whereas a nonreflexive cannot corefer with this subject. Words
such as himself, herself, and itself are reflexives. For example, in the sentence John
bought himself a new car, himself refers to John; but in John bought him a new car, the
pronoun him does not refer to John, since one possible interpretation of the sentence is
that him is someone to whom John is gifting a car.
So, according to the binding theory constraint, him in our example will not refer to Jill.
And because of the gender agreement constraint, Jill would not be a valid antecedent
even if that branch were explored.
Hence, the algorithm now starts the search in the syntax tree of the previous sentence.
For each noun phrase it finds, it does a breadth-first, left-to-right search of the node's
children. This ordering reflects the grammatical rule more commonly known as the
Hobbs distance property, which states that entities in subject position are more likely to
be the antecedent. Hence, the subject Jack in the sentence Jack is an engineer. is explored
before the object engineer, and finally Jack is the resolved referent for the pronoun him.
This is how the Hobbs algorithm can aid the process of pronoun resolution, one of the fundamental problems in natural language processing.
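The full algorithm has nine steps; the following is a heavily simplified Python sketch, not the complete Hobbs algorithm, that captures two of its key ideas: breadth-first, left-to-right search for NP nodes, and searching sentences in order of recency. The hand-written parse trees and the stubbed agreement check are illustrative assumptions (it assumes NLTK is installed):

from nltk import Tree

def np_candidates(tree):
    # Breadth-first, left-to-right traversal yielding NP nodes.
    queue = [tree]
    while queue:
        node = queue.pop(0)
        if isinstance(node, Tree):
            if node.label() == "NP":
                yield " ".join(node.leaves())
            queue.extend(node)  # enqueue children left to right

def acceptable(candidate):
    # Stub for the gender-agreement and binding-theory checks: 'Jill' is
    # ruled out (gender + binding), and 'him' is the target pronoun itself.
    return candidate not in {"Jill", "him"}

def resolve(parses):
    # Walk sentences from most recent to oldest and propose the first NP
    # that survives the checks; BFS reaches subjects before objects.
    for tree in reversed(parses):
        for cand in np_candidates(tree):
            if acceptable(cand):
                return cand
    return None

s1 = Tree.fromstring("(S (NP (NNP Jack)) (VP (VBZ is) (NP (DT an) (NN engineer))))")
s2 = Tree.fromstring("(S (NP (NNP Jill)) (VP (VBZ likes) (NP (PRP him))))")
print(resolve([s1, s2]))  # -> Jack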
Stemming
Stemming is a method in text processing that eliminates prefixes and suffixes from words,
transforming them into their fundamental or root form. The main objective of stemming is
to streamline and standardize words, enhancing the effectiveness of natural language
processing tasks. This section explores the stemming technique and how to perform
stemming in Python.
What is Stemming in NLP?
Simplifying words to their most basic form is called stemming, and it is made easier by
stemmers or stemming algorithms. For example, “chocolates” becomes “chocolate” and
“retrieval” becomes “retrieve.” This is crucial for natural language processing pipelines,
which work with tokenized words acquired in the first stage of dissecting a document
into its constituent words.
Stemming in natural language processing reduces words to their base or root form, aiding in
text normalization for easier processing. This technique is crucial in tasks like text
classification, information retrieval, and text summarization. While beneficial, stemming
has drawbacks, including potential impacts on text readability and occasional inaccuracies
in determining the correct root form of a word.
Why is Stemming important?
It is important to note that stemming is different from lemmatization. Lemmatization also
reduces a word to its base form but, unlike stemming, it takes into account the context of
the word and produces a valid word, whereas stemming may produce a non-word as the
root form.
Note: Make sure to first go through the concept of tokenization.
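To see the difference concretely, here is a small comparison using NLTK (assuming NLTK with the 'wordnet' data downloaded; the outputs in the comment are typical but may vary by version):

from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()

for word in ["studies", "running", "agreed"]:
    # Stemming chops suffixes; lemmatization maps to a dictionary form.
    print(word, "->", ps.stem(word), "|", wnl.lemmatize(word, pos="v"))
# Typical output: studies -> studi | study; running -> run | run;
# agreed -> agree | agree. Note "studi" is not a real word.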
Some more examples of stemming for the root word "like" include:
->"likes"
->"liked"
->"likely"
->"liking"
Porter’s Stemmer
Porter's stemmer is one of the most popular stemming methods, proposed in 1980. It is
based on the idea that the suffixes in the English language are made up of combinations
of smaller and simpler suffixes. This stemmer is known for its speed and simplicity. The
main applications of the Porter stemmer include data mining and information retrieval.
However, it applies only to English words. Also, a group of word variants is mapped onto
the same stem, and the output stem is not necessarily a meaningful word. The algorithm
is fairly lengthy and is one of the oldest stemmers.
Example: the rule EED -> EE means "if the word has at least one vowel and consonant
plus an EED ending, change the ending to EE", so 'agreed' becomes 'agree'.
Advantage: It produces the best output compared to other stemmers and has a lower
error rate.
Limitation: The morphological variants it produces are not always real words.
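A minimal example of Porter stemming in Python with NLTK (assuming NLTK is installed; outputs such as "chocol" below illustrate the limitation that stems need not be real words):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["likes", "liked", "liking", "agreed", "chocolates", "retrieval"]:
    print(w, "->", ps.stem(w))
# Typical output: like, like, like, agree, chocol, retriev -- the last
# two show that the output stem is not necessarily a meaningful word.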
PropBank
The propositional level of analysis is layered on top of the parse trees and identifies
predicate constituents and their arguments in OntoNotes. This level of analysis is
supplied by PropBank, which is described below:
Robust syntactic parsers, made possible by new statistical techniques (Ratnaparkhi, 1997;
Collins, 1999; Collins, 2000; Bangalore and Joshi, 1999; Charniak, 2000) and by the
availability of large, hand-annotated training corpora (Marcus, Santorini, and Marcinkiewicz,
1993; Abeille, 2003), have had a major impact on the field of natural language processing in
recent years. However, the syntactic analyses produced by these parsers are a long way
from representing the full meaning of the sentence. As a simple example, in the sentences
John broke the window and The window broke, a syntactic analysis represents the
window as the verb's direct object in the first sentence and as its subject in the second,
but it does not indicate that the window plays the same underlying role, the thing broken,
in both.
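NLTK ships a small sample of PropBank that can be used to browse these predicate-argument annotations; a sketch (assuming nltk.download('propbank') and nltk.download('treebank'); the exact instances shown depend on the sample):

from nltk.corpus import propbank

# Each instance records a predicate in a Treebank sentence together with
# the tree locations and labels (ARG0, ARG1, ...) of its arguments.
inst = propbank.instances()[0]
print(inst.roleset)            # predicate sense id, e.g. something like 'join.01'
for argloc, argid in inst.arguments:
    print(argid, argloc)

# A roleset documents what each numbered argument means for a predicate.
rs = propbank.roleset("turn.01")
for role in rs.findall("roles/role"):
    print(role.attrib["n"], role.attrib["descr"])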
FrameNet
The FrameNet corpus is a lexical database of English that is both human- and machine-
readable, based on annotating examples of how words are used in actual texts. FrameNet is
based on a theory of meaning called Frame Semantics, deriving from the work of Charles J.
Fillmore and colleagues. The basic idea is straightforward: that the meanings of most words
can best be understood on the basis of a semantic frame: a description of a type of event,
relation, or entity and the participants in it. For example, the concept of cooking typically
involves a person doing the cooking (Cook), the food that is to be cooked (Food), something
to hold the food while cooking (Container) and a source of heat (Heating_instrument). In the
FrameNet project, this is represented as a frame called Apply_heat, and the Cook, Food,
Heating_instrument and Container are called frame elements (FEs). Words that evoke this
frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat
frame. The job of FrameNet is to define the frames and to annotate sentences to show how
the FEs fit syntactically around the word that evokes the frame.
Frames
A Frame is a script-like conceptual structure that describes a particular type of situation,
object, or event along with the participants and props that are needed for that Frame. For
example, the “Apply_heat” frame describes a common situation involving a Cook, some
Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil,
brown, simmer, steam, etc.
We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called
“lexical units” (LUs).
FrameNet includes relations between Frames. Several types of relations are defined, of which
the most important are:
Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE
in the parent is bound to a corresponding FE in the child. An example is the “Revenge” frame
which inherits from the “Rewards_and_punishments” frame.
Using: The child frame presupposes the parent frame as background, e.g. the “Speed” frame
“uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be bound to
child FEs.
Subframe: The child frame is a subevent of a complex event represented by the parent, e.g.
the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”, and
“Sentencing”.
Perspective_on: The child frame provides a particular perspective on an un-perspectivized
parent frame. A pair of examples consists of the “Hiring” and “Get_a_job” frames, which
perspectivize the “Employment_start” frame from the Employer’s and the Employee’s point
of view, respectively.
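These frames and relations can be browsed programmatically; a sketch using NLTK's FrameNet interface (assuming nltk.download('framenet_v17'); outputs depend on the FrameNet release):

from nltk.corpus import framenet as fn

frame = fn.frame("Apply_heat")
print(frame.name)
print(sorted(frame.FE.keys()))           # frame elements such as Cook, Food, ...
print(sorted(frame.lexUnit.keys())[:5])  # lexical units such as 'bake.v', 'boil.v'

# Frame-to-frame relations (Inheritance, Using, Subframe, ...).
for rel in frame.frameRelations:
    print(rel)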
British National Corpus (BNC)
The British National Corpus (BNC) is a 100-million-word collection of samples of written
and spoken language from a wide range of sources, designed to represent a wide cross-
section of British English, both spoken and written, from the later part of the 20th century.
The latest edition is the BNC XML Edition, released in 2007.
The written part of the BNC (90%) includes, for example, extracts from regional and
national newspapers, specialist periodicals and journals for all ages and interests, academic
books and popular fiction, published and unpublished letters and memoranda, school and
university essays, among many other kinds of text. The spoken part (10%) consists of
orthographic transcriptions of unscripted informal conversations (recorded by volunteers
selected from different age, region and social classes in a demographically balanced way) and
spoken language collected in different contexts, ranging from formal business or government
meetings to radio shows and phone-ins.
The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to
represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of
other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification,
contextual and bibliographic information is also included with each text in the form of a TEI-
conformant header.
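The BNC itself is distributed under license and is not bundled with NLTK, but NLTK provides a reader for its XML format. A sketch, assuming a local copy of the corpus (the path below is a placeholder):

from nltk.corpus.reader.bnc import BNCCorpusReader

# Point the reader at a local copy of the BNC XML Edition.
bnc = BNCCorpusReader(root="/path/to/BNC/Texts", fileids=r"[A-K]/\w*/\w*\.xml")

# Plain words, and words with their CLAWS C5 part-of-speech tags.
print(bnc.words()[:10])
print(bnc.tagged_words(c5=True)[:10])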
Work on building the corpus began in 1991, and was completed in 1994. No new texts have
been added after the completion of the project but the corpus was slightly revised prior to the
release of the second edition BNC World (2001) and the third edition BNC XML
Edition (2007). Since the completion of the project, two sub-corpora with material from the
BNC have been released separately: the BNC Sampler (a general collection of one million
written words, one million spoken) and the BNC Baby (four one-million word samples from
four different genres).
Full technical documentation covering all aspects of the BNC, including its design, markup,
and contents, is provided by the Reference Guide for the British National Corpus (XML
Edition). For earlier versions of the Reference Guide and other documentation, see the BNC
Archive page.
Monolingual: It deals with modern British English, not other languages used in Britain.
However, non-British English and foreign-language words do occur in the corpus.
Synchronic: It covers British English of the late twentieth century, rather than the historical
development which produced it.
General: It includes many different styles and varieties, and is not limited to any particular
subject field, genre or register. In particular, it contains examples of both spoken and written
language.
Sample: For written sources, samples of 45,000 words are taken from various parts of single-
author texts. Shorter texts up to a maximum of 45,000 words, or multi-author texts such as
magazines and newspapers, are included in full. Sampling allows for a wider coverage of
texts within the 100 million limit, and avoids over-representing idiosyncratic texts.