Unit - 5

Discourse in NLP is nothing but coherent groups of sentences. When we are dealing with Natural Language Processing, the provided language consists of structured, collective, and consistent groups of sentences, which are termed discourse in NLP. The relationships between words and sentences make the NLP model easier to train and its predictions more reliable.

Discourse Analysis is extracting the meaning out of the corpus or text. Discourse Analysis is very important in Natural Language Processing and helps train the NLP model better.

Pre-requisites

Before learning about Discourse in NLP, let us first learn some basics about NLP itself.

 NLP stands for Natural Language Processing. In NLP, we perform the analysis and
synthesis of the input, and the trained NLP model then predicts the necessary output.
 NLP is the backbone of technologies like Artificial Intelligence and Deep Learning.
In basic terms, we can say that NLP is nothing but the computer program's ability to
process and understand the provided human language.

Introduction

One of the primary challenges in the world of Artificial Intelligence is the processing of natural language data by computers; Natural Language Processing is quite a difficult problem in the field of AI. And if we are talking about the major problem within Natural Language Processing, then we are talking about the processing of discourse.

So the real problem is the processing of discourse in NLP, and hence we need to work on it so that our model can be trained well, which will help computers process natural language data better and allow the Artificial Intelligence to predict the desired result.

Now a question that comes to mind is: what is discourse in NLP? Well, in simple terms, we can say that discourse in NLP is nothing but coherent groups of sentences. When we are dealing with Natural Language Processing, the provided language consists of structured, collective, and consistent groups of sentences, which are termed discourse in NLP. The relationships between words and sentences make the NLP model easier to train and its predictions more reliable.

Discourse Analysis is extracting the meaning out of the corpus or text. Discourse Analysis is
very important in Natural language Processing and helps train the NLP model better.

Let us now learn about the concept of coherence in the next section.

Concept of Coherence
Coherence in terms of Discourse in NLP means making sense of the utterances, i.e., making meaningful connections and correlations. There is a strong connection between coherence and the discourse structure (discussed in a later section). We use properties of good text, such as coherence, to evaluate the quality of the output generated by a natural language generation system.

What are coherent discourse texts? Well, if we read a paragraph from a newspaper, we can see that the entire paragraph is interrelated; hence we can say that the discourse is coherent. But if we merely string newspaper headlines together one after another, the result is not a discourse; it is just a group of sentences that is non-coherent.

Let us now learn about the two major properties of coherence, i.e., the coherence relation between utterances and the coherence relation between entities.

Coherence Relation between Utterances

When we say that a discourse is coherent, it simply means that the discourse has some sort of meaningful connection. The coherence relation tells us that some sort of connection is present between the utterances.

Relationship between Entities

If there is some kind of relationship between the entities, then we can also say that the
discourse in NLP is coherent. So, the coherence between the entities is known as entity-
based coherence.

Discourse Structure

So far, we have discussed discourse and coherence, but we have not discussed the structure of discourse in NLP. Let us now look at the structure that a discourse must have. The structure of a discourse depends on the type of segmentation applied to it.

What is discourse segmentation? Well, when we determine the types of structures for a large discourse, we term it segmentation. Segmentation is difficult to implement, but it is very necessary, as discourse segmentation is used in fields like:

 Information Retrieval,
 Text summarization,
 Information Extraction, etc.

Algorithms for Discourse Segmentation

We have different algorithms for unsupervised and supervised discourse segmentation. Let us now learn about the various algorithms used for discourse segmentation in this section.

Unsupervised Discourse Segmentation

The class of unsupervised segmentation is also termed linear segmentation. Let us take an example to understand this kind of discourse segmentation better. Suppose we have a text, and the task is to segment it into multi-paragraph units, where a single unit represents a passage of the text.

Now the algorithm will take the help of cohesion (which we have discussed above): it will classify the related texts and tie them together using some linguistic devices. In simpler terms, unsupervised discourse segmentation means the classification and grouping of similar texts with the help of coherent discourse in NLP.

Unsupervised discourse segmentation can also be performed with the help of lexical cohesion. Lexical cohesion indicates relationships among similar units, for example, synonyms.
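As a rough sketch of this idea (in the spirit of linear algorithms such as TextTiling), the Python snippet below measures lexical overlap between adjacent sentences and hypothesizes a segment boundary where the overlap drops. The sentences and the 0.1 threshold are invented for the example; this is an illustration, not a production segmenter.

def lexical_overlap(a, b):
    # Jaccard overlap between the word sets of two sentences.
    wa = set(a.lower().replace(".", "").split())
    wb = set(b.lower().replace(".", "").split())
    return len(wa & wb) / len(wa | wb)

sentences = [
    "Rahul went to the bank to deposit money.",
    "The bank clerk counted the money for Rahul.",
    "Cricket is a popular sport in India.",
    "Cricket matches in India draw huge crowds.",
]

# Place a boundary wherever adjacent-sentence overlap falls below a threshold.
for i in range(len(sentences) - 1):
    overlap = lexical_overlap(sentences[i], sentences[i + 1])
    marker = "  <-- segment boundary" if overlap < 0.1 else ""
    print(f"sentences {i}-{i + 1}: overlap = {overlap:.2f}{marker}")

Here the overlap between the bank sentences and the cricket sentences is zero, so a boundary is proposed between sentences 1 and 2, mirroring how lexical cohesion ties the units of each topic together.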

Supervised Discourse Segmentation

In unsupervised segmentation, there was no labeled segment boundary separating the discourse segments. In supervised discourse segmentation, by contrast, we deal with a training data set having labeled boundaries. To differentiate or structure the discourse segments, we make use of cue words or discourse markers, which work to signal the discourse structure. As there can be varied domains of discourse in NLP, the cue words or discourse markers are domain-specific.
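As a toy illustration of how cue words can signal segment boundaries, the sketch below flags sentences that open with a small hand-picked set of discourse markers; the cue list and sentences are invented for the example, and a real supervised segmenter would instead learn such cues from boundary-labeled training data.

CUE_WORDS = {"however", "finally", "meanwhile", "in conclusion"}

def starts_new_segment(sentence):
    # Flag a boundary if the sentence opens with a known discourse marker.
    s = sentence.lower().lstrip()
    return any(s.startswith(cue) for cue in CUE_WORDS)

document = [
    "Rahul went to the bank to deposit money.",
    "He waited in a long queue.",
    "Meanwhile, Rohan opened his shop.",
]
for sentence in document:
    print(starts_new_segment(sentence), "-", sentence)
# False / False / True: the third sentence likely starts a new segment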

Text Coherence

As we have previously discussed, coherent discourse in NLP aims to find the coherence relations among the discourse text. To find structure in discourse, we can use lexical repetition, but lexical repetition alone cannot satisfy the conditions of coherent discourse. To capture such discourse relations, Hobbs proposed a set of coherence relations, described below.

Suppose we have two related sentences, namely S0 and S1.

Result

We can say that the first statement, S0, causes (or could cause) the state or event asserted by the second statement, S1. For example: Rahul is late. He will be punished.

In the above example, the first statement, S0 (Rahul is late), has caused the second statement, S1 (He will be punished).

Explanation

This is the reverse of result: the second statement, S1, gives the cause of the first statement, S0. For example: Rahul fought with his friend. He was drunk. Here S1 (he was drunk) explains the cause of S0 (the fight).

Parallel

By the term parallel, we mean that the assertion from statement S0, p(a1, a2, ...), and the assertion from statement S1, p(b1, b2, ...), are such that ai and bi are similar for all values of i.
In simpler terms, it shows us that the sentences are parallel. For example: He wants food. She wants money. Both statements are parallel, as there is a sense of wanting in both sentences.

Elaboration

Elaboration means that the same proposition P can be inferred from both assertions S0 and S1. For example: Rahul is from Delhi. Rohan is from Mumbai.

Occasion

Occasion takes place when a change of state can be inferred from the first assertion, S0, whose final state can be inferred from the second statement, S1, or vice versa. For example: Rahul took the money. He gave it to Rohan.

Building Hierarchical Discourse Structure

In the previous section, we discussed how text coherence arises. Let us now try to build a hierarchical discourse structure with the help of a group of statements. We generally create a hierarchical structure among the coherence relations to cover the entire discourse in NLP.

Let us consider the following phrases and number them serially.

 S1: Rahul went to the bank to deposit money.
 S2: He then went to Rohan's shop.
 S3: He wanted a phone.
 S4: He did not have a phone.
 S5: He also wanted to buy a laptop from Rohan's shop.

The coherence relations among these sentences can then be composed into a hierarchical discourse structure that represents the entire discourse.
Reference Resolution

Extracting the meaning or interpretation of the sentences of a discourse is one of the most important tasks in natural language processing, and to do so, we first need to know what or who is the entity being talked about. Reference resolution means determining the entity that is being referred to.

By the term reference, we mean the linguistic expression that is used to denote an individual
or an entity. For example, look at the below sentences.

 Rahul went to the farm.
 He cooked food.
 His farm was very big.

In the above sentences, Rahul, He, and His are references. So, we can simply define reference resolution as the task of determining which entities are being referred to by the linguistic expressions.

Let us now look at the various terminologies used in the reference resolution.

Terminology Used in Reference Resolution

 Referring expression:
The NLP expression that performs the reference is termed a referring expression. For example, the expressions in the passage above (Rahul, He, His) are referring expressions.
 Referent:
Referent is the entity we have referred to. For example, in the above passage, Rahul is
the referent.
 Co-refer:
As the name suggests, co-refer is the term used when two or more expressions refer to the same entity. For example, Rahul and He are used for the same entity, i.e., Rahul.
 Antecedent:
The term that licenses the use of another term is called the antecedent. For example, in the above passage, Rahul is the antecedent of the reference He.
 Anaphora & Anaphoric:
Reference to an entity that has been previously introduced into the discourse is called anaphora, and the referring expression used is said to be anaphoric.
 Discourse model:
The discourse model contains an overall representation of the entities that have been referred to in the discourse text, along with the relationships in which they participate.
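As a toy illustration of a discourse model, the sketch below hand-builds the mapping from the referring expressions in Rahul went to the farm. He cooked food. His farm was very big. to the entities they pick out; the entity ids e1 and e2 are invented for the example.

discourse_model = {
    "entities": {"e1": "Rahul", "e2": "the farm"},
    "mentions": [            # (referring expression, entity id), in order
        ("Rahul", "e1"),     # a name introduces entity e1
        ("the farm", "e2"),
        ("He", "e1"),        # pronoun co-refers with e1 (antecedent: Rahul)
        ("His", "e1"),       # possessive pronoun also co-refers with e1
    ],
}

# Co-referring expressions are those mapped to the same entity id:
print([m for m, e in discourse_model["mentions"] if e == "e1"])
# ['Rahul', 'He', 'His']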

Types of Referring Expressions

As we have previously discussed, the NLP expression that performs the reference is termed a
referring expression. We have mainly five types of referring expressions in Natural Language
Processing. Let us discuss them one by one.

1. Indefinite Noun Phrases

An indefinite noun phrase is a kind of reference that introduces an entity that is new to the hearer in the discourse context. To understand the indefinite noun phrase, let us take an example.

For example:
In the sentence Rahul is doing some work., the phrase some work is an indefinite noun phrase.

2. Definite Noun Phrases

A definite noun phrase is a kind of reference that represents an entity that is not new to the hearer in the discourse context; the hearer can easily identify the referent. To understand the definite noun phrase, let us take an example.

For example:
In the sentence Rahul loves reading the Times of India., the Times of India is a definite noun phrase.

3. Pronouns

Pronouns are a form of definite reference (they work the same way as we have learned in English grammar).
For example:
In the sentence Rahul learned as much as he could., he is the pronoun referring to the noun Rahul.

4. Demonstratives

Demonstratives also point to nouns, but they behave differently from simple pronouns.

For example, that, this, these, and those are some examples of demonstratives.

5. Names

Names can be names of persons, locations, organizations, etc., and they are the simplest form of referring expression.

For example, in the above examples, Rahul is a name used as a referring expression.

Reference Resolution Tasks

To resolve references, we can use two resolution tasks. Let us discuss them one by one.

1. Co-reference Resolution

In Co-reference Resolution, the main aim is to find the referring expressions in the provided text that refer to the same entity. In a discourse, co-refer is the term used when two or more expressions refer to the same entity.

For example, Rahul and He are used for the same entity, i.e., Rahul.

Co-reference Resolution can thus be described as finding the relevant co-referring expressions in the provided discourse text. Let us take an example for more clarity.

For example: Rahul went to the farm. He cooked food. In this example, Rahul and He are the referring expressions.

There are some constraints on Co-reference Resolution. Let us learn about them.

Constraints on Co-reference Resolution:

In the English language, we have many pronouns. If we are using the pronouns he and she, then we can usually resolve them easily. But if we are using the pronoun it, the resolution can be tricky, and if we have a set of co-referring expressions, it becomes even more complex to resolve. In simpler terms, when the pronoun it is used, the exact determination of the referred noun is complex.

2. Pronominal Anaphora Resolution


By the term Pronominal Anaphora Resolution, we aim to find the antecedent of the current single pronoun.

For example, in the passage Rahul went to the farm. He cooked food., Rahul is the antecedent of the reference He.
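As a toy sketch of this task, the snippet below resolves a pronoun using only recency and gender agreement (two properties also exploited by the Hobbs algorithm discussed below); the gender lexicon and examples are invented for illustration, and real resolvers use much richer evidence.

GENDER = {"rahul": "male", "rohan": "male", "sita": "female"}   # assumed lexicon
PRONOUN_GENDER = {"he": "male", "him": "male", "his": "male",
                  "she": "female", "her": "female"}

def resolve(pronoun, preceding_entities):
    # Scan from the most recent mention backwards (recency), keeping only
    # candidates whose gender agrees with the pronoun (gender agreement).
    wanted = PRONOUN_GENDER[pronoun.lower()]
    for entity in reversed(preceding_entities):
        if GENDER.get(entity.lower()) == wanted:
            return entity
    return None

print(resolve("He", ["Rahul"]))           # Rahul
print(resolve("her", ["Rahul", "Sita"]))  # Sita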

Conclusion

 Discourse in NLP is nothing but coherent groups of sentences. When we are dealing
with Natural Language Processing, the provided language consists of structured,
collective, and consistent groups of sentences, which are termed discourse in NLP.
 Discourse Analysis is very important in Natural language Processing and helps train
the NLP model better.
 Coherence in terms of Discourse in NLP means making sense of the utterances or making meaningful connections and correlations. We use properties of good text, such as coherence, to evaluate the quality of the output generated by a natural language generation system.
 The extraction of the meaning or interpretation of the sentences of discourse is one of
the most important tasks in natural language processing, and to do so, we first need to
know what or who is the entity that we are talking about.
 Indefinite noun reference is a kind of reference that represents the entity that is new
to the discourse context's hearer.
 Definite noun reference is a kind of reference that represents the entity that is not
new to the discourse context's hearer. The discourse context's hearer can easily
identify the definite noun reference.
 In the Co-reference Resolution, the main aim is to find the referring expression from
the provided text that refers to the same entity. By the terms Pronominal Anaphora
Resolution, we are aiming to find the antecedent for the current single pronoun.

The Hobbs algorithm is one of the techniques used for pronoun resolution. But what is pronoun resolution? Let us understand this with an example: Jack and Jill went up the hill to fetch a pail of water. Jack fell down and broke his crown.

Now, the question is: to whom does the pronoun 'his' refer? As humans, we can easily tell that 'his' refers to Jack and not to Jill, the hill, or the crown. But is this task as easy for computers? The answer is no, because computers lack common sense.

The task of locating all expressions that are coreferential with any of the entities identified in the text is known as coreference resolution; it arises when two or more expressions in the text refer to the same person or object. Pronouns and other referring expressions must therefore be resolved in order to infer the correct understanding of the text. To perform this task, computers take the help of different techniques, one of which is the Hobbs algorithm.

The Hobbs algorithm is one of several approaches to pronoun resolution. It is based mainly on the syntactic parse trees of the sentences. To make the idea clearer, let us consider the previous example of Jack and Jill and understand how we humans resolve the pronoun 'his'.


As shown, the possible candidates for resolving the pronoun 'his' are Jack, Jill, hill, water, and crown. But why did we not even consider crown as a possible solution? Because the noun 'crown' comes after the pronoun 'his'. This is the first assumption in the Hobbs algorithm: the search for the referent is always restricted to the left of the target, and hence crown is eliminated.

Then can Jill, water, or hill be the possible referents? We know that 'his' cannot refer to Jill, because Jill is a girl. Generally, animate objects are referred to by male pronouns like he and his or female pronouns like she and her, while inanimate objects take the neutral gender, like it. This property is known as gender agreement, and it eliminates the possibilities of Jill, hill, and water.

Pronouns can only reach a few sentences back, and entities closer to the referring expression are more important than those further away, which finally leaves us with the only possible solution, i.e., Jack. This property is known as the recency property.

Now that we have seen how humans process text and resolve pronouns, let us see how we can embed this intelligence (using the Hobbs algorithm) in machines, which lack common sense, to perform the task of pronoun resolution. Consider two sentences:

Sentence 1 (S1): Jack is an engineer.
Sentence 2 (S2): Jill likes him.


The algorithm makes use of syntactic constraints when resolving pronouns. The input to the Hobbs algorithm is the pronoun to be resolved, together with the syntactic parses of the sentences up to and including the current sentence. So here, we have the syntactic parse trees of the two sentences.

The algorithm starts with the target pronoun and walks up the parse tree to the root node S. For each NP or S node that it finds, it does a breadth-first, left-to-right search of the node's children to the left of the target. So in our example, the algorithm starts with the parse tree of sentence 2 and climbs up to the root node S2. Then it does a breadth-first search to find a noun phrase (NP). Here the algorithm finds its first noun phrase at the noun 'Jill'.
But it does not explore that branch, because of the syntactic constraint of binding theory. Binding theory states that a reflexive can refer to the subject of the most immediate clause in which it appears, whereas a non-reflexive cannot corefer with this subject. Words such as himself, herself, and themselves are known as reflexives.

Let us understand this with an example.

 John bought himself a new car.

Here, himself refers to John. Whereas if the sentence is

 John bought him a new car.

then the pronoun him does not refer to John, since one possible interpretation of the sentence is that him is someone to whom John is gifting a car.
So according to the binding theory constraint, 'him' in our example will not refer to Jill. Also, because of the gender agreement constraint, even if the branch were explored, Jill would not be an acceptable referent for the pronoun 'him'.

Hence the algorithm now starts the search in the syntax tree of the previous sentence. For each noun phrase that it finds, it does a breadth-first, left-to-right search of the node's children. This follows the grammatical rule more commonly known as the Hobbs distance property, which states that entities in subject position are more likely substitutes for the pronoun than entities in object position. Hence the subject Jack in the sentence Jack is an engineer is explored before the object engineer, and finally Jack is the resolved referent for the pronoun him.
This is how the Hobbs algorithm can aid the process of pronoun resolution, which is one of the crucial subtasks of natural language understanding and natural language generation.


The steps of the Hobbs algorithm are as follows:
1. Begin at the noun phrase (NP) node immediately dominating the pronoun.
2. Go up the tree to the first NP or sentence (S) node encountered. Call this node X, and call the path used to reach it p.
3. Traverse all branches below node X to the left of path p in a left-to-right, breadth-first fashion. Propose as the antecedent any NP node that is encountered which has an NP or S node between it and X.
4. If node X is the highest S node in the sentence, traverse the surface parse trees of previous sentences in the text in order of recency, the most recent first; each tree is traversed in a left-to-right, breadth-first manner, and when an NP node is encountered, it is proposed as antecedent. If X is not the highest S node in the sentence, continue to step 5.
5. From node X, go up the tree to the first NP or S node encountered. Call this new node X, and call the path traversed to reach it p.
6. If X is an NP node and if the path p to X did not pass through the Nominal node that X immediately dominates, propose X as the antecedent.
7. Traverse all branches below node X to the left of path p in a left-to-right, breadth-first manner. Propose any NP node encountered as the antecedent.
8. If X is an S node, traverse all branches of node X to the right of path p in a left-to-right, breadth-first manner, but do not go below any NP or S node encountered. Propose any NP node encountered as the antecedent.
9. Go to step 4.
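To make steps 1-4 concrete, below is a minimal Python sketch (assuming NLTK is installed) of the breadth-first, left-to-right search over hand-written parse trees for the Jack and Jill example sentences. It illustrates the traversal only, not the full algorithm: steps 5-9 and the binding-theory and gender-agreement checks are omitted.

from nltk import Tree

def bfs_left_to_right(node):
    # Yield tree nodes in breadth-first, left-to-right order.
    queue = [node]
    while queue:
        current = queue.pop(0)
        yield current
        if isinstance(current, Tree):
            queue.extend(current)  # children, left to right

def propose_antecedents(previous_trees):
    # Step 4: traverse previous sentences' trees, most recent first,
    # proposing every NP node encountered.
    for tree in reversed(previous_trees):
        for node in bfs_left_to_right(tree):
            if isinstance(node, Tree) and node.label() == "NP":
                yield " ".join(node.leaves())

# Hand-written parses for: "Jack is an engineer." / "Jill likes him."
s1 = Tree.fromstring("(S (NP (NNP Jack)) (VP (VBZ is) (NP (DT an) (NN engineer))))")
s2 = Tree.fromstring("(S (NP (NNP Jill)) (VP (VBZ likes) (NP (PRP him))))")

# Suppose the search inside S2 failed (Jill is ruled out by binding theory
# and gender agreement), so we fall back to the previous sentence:
print(list(propose_antecedents([s1])))  # ['Jack', 'an engineer'] -- subject first

Note how the breadth-first, left-to-right order proposes the subject Jack before the object engineer, which is exactly the Hobbs distance property in action.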

Stemming is a method in text processing that eliminates prefixes and suffixes from words, transforming them into their fundamental or root form. The main objective of stemming is to streamline and standardize words, enhancing the effectiveness of natural language processing tasks. This section explores the stemming technique and how to perform stemming in Python.
What is Stemming in NLP?
Simplifying words to their most basic form is called stemming, and it is made easier by stemmers or stemming algorithms. For example, "chocolates" becomes "chocolate" and "retrieval" becomes "retrieve". This is crucial for natural language processing pipelines, which use tokenized words acquired from the first stage of dissecting a document into its constituent words.
Stemming in natural language processing reduces words to their base or root form, aiding in
text normalization for easier processing. This technique is crucial in tasks like text
classification, information retrieval, and text summarization. While beneficial, stemming
has drawbacks, including potential impacts on text readability and occasional inaccuracies
in determining the correct root form of a word.
Why is Stemming important?
It is important to note that stemming is different from lemmatization. Lemmatization is also the process of reducing a word to its base form, but unlike stemming it takes into account the context of the word, and it produces a valid word, whereas stemming may produce a non-word as the root form.
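As a quick sketch of the contrast (assuming NLTK and its wordnet data are available; the word list and part-of-speech tags are chosen just for illustration):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # needed once for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# (word, WordNet POS) pairs: 'n' = noun, 'v' = verb, 'a' = adjective
for word, pos in [("studies", "n"), ("running", "v"), ("better", "a")]:
    print(word,
          "-> stem:", stemmer.stem(word),                    # may be a non-word, e.g. 'studi'
          "| lemma:", lemmatizer.lemmatize(word, pos=pos))   # a valid word, e.g. 'study'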
Note: You should first go through the concepts of tokenization.
Some more examples of words that stem to the root word "like" include:
 "likes"
 "liked"
 "likely"
 "liking"
Porter’s Stemmer
It is one of the most popular stemming methods, proposed in 1980. It is based on the idea that the suffixes in the English language are made up of combinations of smaller and simpler suffixes. This stemmer is known for its speed and simplicity. The main applications of the Porter Stemmer include data mining and information retrieval. However, it is limited to English words. Also, a group of words may be mapped onto the same stem, and the output stem is not necessarily a meaningful word. The algorithm is fairly lengthy and is one of the oldest stemmers.
Example: EED -> EE means "if the word has at least one vowel and consonant followed by the EED ending, change the ending to EE", so 'agreed' becomes 'agree'.

from nltk.stem import PorterStemmer

# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()

# Example words for stemming
words = ["running", "jumps", "happily", "running", "happily"]

# Apply stemming to each word
stemmed_words = [porter_stemmer.stem(word) for word in words]

# Print the results
print("Original words:", words)
print("Stemmed words:", stemmed_words)

Output:
Original words: ['running', 'jumps', 'happily', 'running', 'happily']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']

 Advantage: It produces the best output compared to other stemmers and has a lower error rate.
 Limitation: The morphological variants produced are not always real words.

PropBank
The propositional level of analysis is layered on top of the parse trees and identifies predicate constituents and their arguments in OntoNotes. This level of analysis is supplied by PropBank, which is described below.
Robust syntactic parsers, made possible by new statistical techniques (Ratnaparkhi, 1997;
Collins, 1999; Collins, 2000; Bangalore and Joshi, 1999; Charniak, 2000) and by the
availability of large, hand-annotated training corpora (Marcus, Santorini, and Marcinkiewicz,
1993; Abeille, 2003), have had a major impact on the field of natural language processing in
recent years. However, the syntactic analyses produced by these parsers are a long way from
representing the full meaning of the sentence. As a simple example, in the sentences:

 John broke the window.
 The window broke.
A syntactic analysis will represent the window as the verb's direct object in the first sentence
and its subject in the second, but does not indicate that it plays the same underlying semantic
role in both cases. Note that both sentences are in the active voice, and that this alternation
between transitive and intransitive uses of the verb does not always occur, for example, in the
sentences:

 The sergeant played taps.
 The sergeant played.
The subject has the same semantic role in both uses. The same verb can also undergo
syntactic alternation, as in:

 Taps played quietly in the background.
and even in transitive uses, the role of the verb's direct object can differ:

 The sergeant played taps.
 The sergeant played a beat-up old bugle.
Alternation in the syntactic realization of semantic arguments is widespread, affecting most
English verbs in some way, and the patterns exhibited by specific verbs vary widely (Levin,
1993). The syntactic annotation of the Penn Treebank makes it possible to identify the
subjects and objects of verbs in sentences such as the above examples. While the Treebank
provides semantic function tags such as temporal and locative for certain constituents
(generally syntactic adjuncts), it does not distinguish the different roles played by a verb's
grammatical subject or object in the above examples. Because the same verb used with the
same syntactic subcategorization can assign different semantic roles, roles cannot be
deterministically added to the Treebank by an automatic conversion process with 100%
accuracy. Our semantic role annotation process begins with a rule-based automatic tagger, the
output of which is then hand-corrected (see Section 4 for details).
The Proposition Bank aims to provide a broad-coverage hand annotated corpus of such
phenomena, enabling the development of better domain-independent language understanding
systems, and the quantitative study of how and why these syntactic alternations take place.
We define a set of underlying semantic roles for each verb, and annotate each occurrence in
the text of the original Penn Treebank.
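To make this concrete, a PropBank-style labeling of the earlier examples might look as follows (assuming the usual Arg0/Arg1 numbering for break, with Arg0 the breaker and Arg1 the thing broken):

[Arg0 John] broke [Arg1 the window].
[Arg1 The window] broke.

The window carries the same Arg1 label in both sentences, even though it is the syntactic object in the first and the subject in the second.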

FrameNet

The FrameNet corpus is a lexical database of English that is both human- and machine-
readable, based on annotating examples of how words are used in actual texts. FrameNet is
based on a theory of meaning called Frame Semantics, deriving from the work of Charles J.
Fillmore and colleagues. The basic idea is straightforward: that the meanings of most words
can best be understood on the basis of a semantic frame: a description of a type of event,
relation, or entity and the participants in it. For example, the concept of cooking typically
involves a person doing the cooking (Cook), the food that is to be cooked (Food), something
to hold the food while cooking (Container) and a source of heat (Heating_instrument). In the
FrameNet project, this is represented as a frame called Apply_heat, and the Cook, Food,
Heating_instrument and Container are called frame elements (FEs). Words that evoke this
frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat
frame. The job of FrameNet is to define the frames and to annotate sentences to show how
the FEs fit syntactically around the word that evokes the frame.
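NLTK exposes a programmatic interface to FrameNet. As a minimal sketch (assuming the framenet_v17 data has been downloaded), one can look up Apply_heat and list some of its frame elements and lexical units:

import nltk
from nltk.corpus import framenet as fn

nltk.download("framenet_v17", quiet=True)  # FrameNet data, needed once

frame = fn.frame("Apply_heat")           # look up the frame by name
print(frame.name)                        # Apply_heat
print(sorted(frame.FE.keys())[:4])       # frame elements, e.g. Container, Cook, Food
print(sorted(frame.lexUnit.keys())[:4])  # lexical units, e.g. 'bake.v', 'boil.v'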

Frames
A Frame is a script-like conceptual structure that describes a particular type of situation,
object, or event along with the participants and props that are needed for that Frame. For
example, the “Apply_heat” frame describes a common situation involving a Cook, some
Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil,
brown, simmer, steam, etc.
We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called
“lexical units” (LUs).
FrameNet includes relations between Frames. Several types of relations are defined, of which
the most important are:
 Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE
in the parent is bound to a corresponding FE in the child. An example is the “Revenge” frame
which inherits from the “Rewards_and_punishments” frame.
 Using: The child frame presupposes the parent frame as background, e.g. the “Speed” frame “uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be bound to child FEs.
 Subframe: The child frame is a subevent of a complex event represented by the parent, e.g.
the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”, and
“Sentencing”.
 Perspective_on: The child frame provides a particular perspective on an un-perspectivized
parent frame. A pair of examples consists of the “Hiring” and “Get_a_job” frames, which
perspectivize the “Employment_start” frame from the Employer’s and the Employee’s point
of view, respectively.

What is the BNC?

The British National Corpus (BNC) is a 100 million word collection of samples of written
and spoken language from a wide range of sources, designed to represent a wide cross-
section of British English from the later part of the 20th century, both spoken and written.
The latest edition is the BNC XML Edition, released in 2007.

The written part of the BNC (90%) includes, for example, extracts from regional and
national newspapers, specialist periodicals and journals for all ages and interests, academic
books and popular fiction, published and unpublished letters and memoranda, school and
university essays, among many other kinds of text. The spoken part (10%) consists of
orthographic transcriptions of unscripted informal conversations (recorded by volunteers
selected from different age, region and social classes in a demographically balanced way) and
spoken language collected in different contexts, ranging from formal business or government
meetings to radio shows and phone-ins.

The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to
represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of
other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification,
contextual and bibliographic information is also included with each text in the form of a TEI-
conformant header.
Work on building the corpus began in 1991, and was completed in 1994. No new texts have
been added after the completion of the project but the corpus was slightly revised prior to the
release of the second edition BNC World (2001) and the third edition BNC XML
Edition (2007). Since the completion of the project, two sub-corpora with material from the
BNC have been released separately: the BNC Sampler (a general collection of one million
written words, one million spoken) and the BNC Baby (four one-million word samples from
four different genres).

Full technical documentation covering all aspects of the BNC, including its design, markup, and contents, is provided by the Reference Guide for the British National Corpus (XML Edition). For earlier versions of the Reference Guide and other documentation, see the BNC Archive page.

What sort of corpus is the BNC?

Monolingual: It deals with modern British English, not other languages used in Britain.
However non-British English and foreign language words do occur in the corpus.

Synchronic: It covers British English of the late twentieth century, rather than the historical
development which produced it.

General: It includes many different styles and varieties, and is not limited to any particular
subject field, genre or register. In particular, it contains examples of both spoken and written
language.

Sample: For written sources, samples of 45,000 words are taken from various parts of single-
author texts. Shorter texts up to a maximum of 45,000 words, or multi-author texts such as
magazines and newspapers, are included in full. Sampling allows for a wider coverage of
texts within the 100 million limit, and avoids over-representing idiosyncratic texts.
