Definition and Features of a Corpus
2.1 Introduction
Since the middle of the last century, we have observed a remarkable change
in the traditional approach to language study. Empirical linguistic research
and application are gradually taking the place of intuition-based research
and application in almost all the renowned linguistic research centres across
the world. The causes behind such an alteration in the approach to linguistic
research are not easy to define, because the advent of computer technology and
its use in linguistics are not the sole reasons that have diverted a large number
of hard-core traditional linguists from intuitive research to empirical analysis.
There are other reasons behind this change, which leave a lasting impact on
the new generation of scholars. To trace the root of the change in the mentality
of scholars as well as in the research scenario, we have to consider the
following three factors:
Besides, there are some other factors that have also played an important role in
attracting a large number of scholars into the field of corpus linguistics. These
Due to these factors, linguists are no longer willing to depend only on the sets
However, before we start exploring this new domain to learn about its nature
and functional modalities, we need to turn our attention towards its definition
variety, users and its actual usage in various domains of human interaction.
Information from all these domains (and from some other sectors) related
structure of a language.
In this regard, the most striking thing is that for a long time, we had
various types from language without much trouble. Due to the limitation in
language use.
Modern corpus linguistics surpasses traditional linguistics in this
directly from the fields of actual language use. It then analyses the databases,
and application. It expands the horizon of linguistics for the direct benefit of
A language corpus has the ability to shed light on many unknown aspects
of a natural language. We can learn in detail about these aspects from two
basic sources. One is spoken text, and the other is written text. Although we
know that each form is characteristically different from the other, there are
many things that we obtain in equal proportion from both forms. Moreover,
mutual interdependency between the two, we must admit that each form has
certain unique features, which cannot be mixed up with the features of the
other. Finer distinctive features observed in spoken and written forms should,
study of a language.
morphemes, which are used to form words and a finite set of grammatical
rules, which are used to generate sentences. But the fact is that grammatical
1972: 17). With profound knowledge, if we look into a piece of text, we easily
Furthermore, there are many other properties that are hidden under the
surface of the spoken and written text. For instance, the meaning of words,
time, place, agent, fact and content, information of pragmatics and discourse,
reality of linguistic events, etc., are always embedded within a piece of text.
Information about these properties can never be retrieved just by looking
at the surface form; it becomes available only when we explore deep into the
text, examine it critically and analyse it with close reference to its context
of occurrence. This leads us to argue that with the
help of a corpus, we can explore deep into the content of a language. On the
and deceptive unless these are substantiated with the evidence gathered from
define the stylistic nuances employed by the author to establish the author's
community. A corpus will contain not only samples of spoken text but also
perspective in view, let us discuss the general definition, form, and features
of a corpus.
To achieve its goal, it faithfully collects samples of text from various fields
explanations are logical and intuitive evidences and arguments are valid with
samples. It has been reported (Francis 1992: 17) that in the sixth century,
the emperor Justinian formed the Corpus Juris Civilis, which is nothing more
than 'a compilation of early Roman laws and legal principles, illustrated by
cases, and combined with explanation of new laws and future legislation to
28 Corpus Linguistics
be put into effect' (World Book 10.168). However, closer to the sense in which
the term corpus is now used is the Latin Corpus Glossary of the eighth
century, which assembled 'hard Latin words arranged in alphabetical order
analysis' (Crystal 1997). However, Kennedy (1998: 3) does not agree with
to be used for linguistic analysis' (Francis 1982: 7). Although the definition
stated above tries to encompass the socio-linguistic components induced
within a language, it miserably fails to divert attention to the linguistic
into careful consideration, we can explain the term, from the features it denotes,
in the following way:
C : Compatible with computers
O : Operational in research and application
R : Representative of the source language
P : Processable by both man and machine
U : Unlimited in the amount of language data
S : Systematic in formation and text representation
users.
• It should enable language users to use language data in multiple tasks,
starting from simple linguistic description and analysis to statistical
annotated form.
• The linguistic and extralinguistic information of text samples should
and validation.
Unless defined otherwise, let us consider that a corpus should possess all
the properties mentioned above. Exception may be noted in a historical
used in a language.
composed and developed, how text is used, in which context it is used, and
how it serves the purpose of users.
we can simplify the task to some extent if we redefine the entire concept
corpus, keeping in view the work we are supposed to do with it. For instance,
definitely try to design a corpus that contains a large amount of data collected
from text samples of people related to the underworld so that the target
towards a particular type of language text, it gives us much needed relief from
the rigor of observing all corpus generation issues, conditions and principles
that the principles and conditions should vary depending on the purpose of
compile a large corpus of written text of any type with samples obtained from
various Web sites. Such work may not be as expensive as we assume and
Japanese, Chinese and some other languages, it is hardly available for most of
the Indian languages, including Hindi, Bengali, Tamil, Telugu, Oriya, Urdu,
Punjabi and others.[5]
32 Corpus Linguistics
A corpus can be, and indeed is, of many types (see Chapter 3). However, a
values), which might vary for some other types. That means, a corpus, which
does not possess one or more default characteristics of a general corpus, should
be identified as a 'special corpus'—the title of which will specify its normal
pattern of deviation from the frame of a general corpus. Before we discuss
2.3.1 Quantity
The question that arises when we decide to generate a corpus is: how big
will the corpus be? That is, how many words will there be in the corpus?
sensible to prescribe any fixed parameter for such a question. But in simple
terms, we can say that the bigger the corpus, the better its authenticity
The default value of 'quantity' signifies that it should be large with regard
quite rapidly, and therefore, it is not sensible to recommend any set of figures.
Furthermore, the recent advent of the 'monitor corpus' (see Chapter 3) remarkably
affects the concept of size, which refers to the calculation of the 'rate of
refer to the sum of the total linguistic components included in it. Thus, the
non-influential one.
Most often, contrary to widely used languages such as English and French,
are not able to provide written text samples belonging to diverse fields and
disciplines, which influential languages can easily supply.
number of words is not at all a faithful clue to check this feature of a corpus.
this problem, we argue for collecting texts from various written and spoken
sources. The advantage of this method lies in its way of gradual increment
of the number of sentence types, which automatically will ensure the normal
technology of the time. That means the issue of the quantity of words should
of time when the corpus is developed. When the actual work of electronic
corpus generation started in the second half of the last century, computer
task. In those early years of electronic corpus generation, the Brown Corpus, which
contained just one million words, was considered a standard one[7] because, at
that particular time, a collection of one million words in electronic form was
In the 1970s and 1980s, when the computer went through stages of
gradually replaced by large corpora of various sizes. Within a few years, some
corpora were developed that contained more than twenty million words.
contained more than twenty million words. In the mid-1990s, the number of
words in the Bank of English reached two hundred million, and it is still open
for further increment.
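The word counts quoted above can be made concrete with a small sketch. It assumes, purely for illustration, that a token is any run of Roman letters, digits or apostrophes; real corpus projects use far more careful tokenisation, and this pattern would not suit most Indian scripts. The hypothetical function corpus_size reports both the number of tokens (running words) and the number of types (distinct words) in a set of text samples.

```python
import re

# Illustrative assumption: a token is a run of Roman letters, digits or
# apostrophes. This is not a standard definition of a word.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")


def corpus_size(samples):
    """Return (number of tokens, number of types) for a list of text samples."""
    tokens = []
    for text in samples:
        tokens.extend(TOKEN_PATTERN.findall(text.lower()))
    return len(tokens), len(set(tokens))


# Two invented samples standing in for a corpus.
samples = ["The corpus is large.", "The corpus grows every day."]
n_tokens, n_types = corpus_size(samples)
print(n_tokens, n_types)  # 9 7
```

The gap between the two figures is one reason why a raw word count alone says little about how varied a corpus is.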
On the other hand, linguists who are working with a corpus also realize
that a collection of one million words is not at all a reliable resource to make
any faithful observation on any aspect of a natural language. They ask for a
corpus of at least one hundred million words to validate their arguments and
hypotheses. In the mid-1990s, the horizon was further expanded. In the new
millennium, we are not even satisfied with a corpus containing a hundred
million words. For instance, the British National Corpus has reached the stage of
400 million words within the last few years, yet it shows no sign of stopping.
It continues to grow with daily doses of data coming from various sources
and fields.
There are, however, a few loopholes in the labyrinth of quantity. We
observe that the collection of data for languages that enjoy the facilities
of electronic devices is much bigger than that for languages that do not have
such facilities. That means techno-savvy languages, such as English, German,
French, Italian and Spanish, have better scope for corpus generation than
non-techno-savvy languages because, for specific socioeconomic,
politico-cultural and commercial-scientific reasons, they enjoy
both global patronage and technical support. Therefore, the availability of texts
in electronic form is much higher in these languages than in others. Also, the
Roman script, used for most of these languages, contributes to a great extent
to their global expansion.
On the contrary, languages that do not have large resources in electronic
form, for the reasons stated above, have very little scope for easy generation
of a corpus. Even if we set aside the languages of the backward and
underdeveloped communities, we can easily find that resources in electronic
form available in Indian languages, such as Hindi, Bengali, Telugu and Tamil,
are not even one-tenth of the resources available for languages such as
English, Spanish, German and French, although the number of speakers of
Indian languages is not less than that of Western languages. The grim truth
is that the facilities of the electronic medium are not yet as accessible to
languages of underdeveloped countries as they are to languages of
advanced countries. Therefore, it is not surprising that the number
of electronic corpora in Indian languages is small when compared to
that in the languages of advanced countries.
The above argument, however, does not work in the case of corpora of
spoken texts. For both advanced and non-advanced languages, in reality, only
a small and marginal fraction of the whole amount of spoken interaction
is included within a speech corpus. And, most strikingly, for spoken texts
of advanced and non-advanced languages alike, collection and processing of
speech databases involve an equal amount of complexity and technical
sophistication. Even then, we find that tools and techniques for spoken text
collection and processing are more easily available for advanced languages than
2.3.2 Quality
The default value for 'quality' relates to authenticity. That means all text
materials should be collected from genuine communications of people doing
their normal businesses. The role of corpus collectors is confined within the
area of acquiring data for the purpose of corpus generation, which, in return,
will protect the interest of people who will make statements about the way
2.3.3 Representativeness
A corpus should include text samples from a broad range of materials in order
which sectors and fields are we supposed to collect the data? Should we include
only texts from family interactions, or should we include data from speech
events that occur in courts and police stations, offices and clubs, schools and
colleges, playgrounds and cinema halls, shopping malls and market places,
roads and pubs, etc.? The answer is already embedded within the question. We
need to collect text from all possible sources of spoken interaction, irrespective
of their place, time and situation of occurrence, and from all types of people,
irrespective of their sex, age, class, caste, education or profession. Only then a
novels, and stories, but also from informative prose texts such as natural
science, social science, medical science, engineering, technology, commerce,
banking, earth science, advertisements, posters, newspapers, personal letters,
government notices, diaries and similar sources. To be truly representative
termed a corpus. Although the condition for designing a valid sample size of a
corpus is yet to be finalized, people who are seriously concerned with corpus
generation will not attempt to gather a large collection of citations of words to
claim it as a corpus.[9]
In the long run, the question of size becomes irrelevant in the context of
representativeness. A large corpus does not necessarily imply representation
the greater the number of individual samples, the greater is the reliability of
the analysis of linguistic variables (Kennedy 1998: 68). Brown Corpus, LOB
Corpus and Survey of English Usage are designed in such a way that they
LOB Corpus, and Survey of English Usage, however, shows how these corpora
are less enriched with respect to the number of words and less diversified in
structure and variety of contents. This helps us to settle empirically the issues
are also discussed with reference to some empirical issues (Summers 1991). It
is argued that even a corpus of one hundred million words is too small when
compared with the total amount of texts from which a corpus is sampled.
text type influence linguistic analysis at subsequent stages because the original
purpose of the text plays a vital role in drawing up inferences. Thus, on the
document or text types as its main organising principle' (Summers 1991: 5).
In our argument, the most sensible and pragmatic approach can be the one
that combines all these approaches in a systematic way and where we can
have data from a broad range of sources and text types with due emphasis on
linguistic research and analysis. In reply to this question, we argue that since
corpus collection to reflect on these inherent factors. For instance, the language
use. The goal of a corpus will be lost if it fails to project into all the primary
aspects of a language. Therefore, we consider proportional representation of
text samples as one of the basic features of a general corpus.
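Proportional representation can be illustrated with a small worked sketch. Suppose, purely as an assumption, that we have a rough estimate of how much text each category contributes to the language; the figures below are invented. The hypothetical function proportional_quota divides a fixed number of samples among the categories in the same proportion, using the largest-remainder rule so that the quotas add up exactly to the requested total.

```python
def proportional_quota(category_sizes, total_samples):
    """Allocate total_samples across categories in proportion to their sizes.

    Uses the largest-remainder rule so the quotas sum exactly to total_samples.
    """
    grand_total = sum(category_sizes.values())
    exact = {c: total_samples * n / grand_total for c, n in category_sizes.items()}
    quota = {c: int(share) for c, share in exact.items()}
    leftover = total_samples - sum(quota.values())
    # Give the leftover samples to the categories with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1
    return quota


# Invented estimates of how much text each category produces (arbitrary units).
sizes = {"literature": 500, "science": 300, "newspapers": 200}
print(proportional_quota(sizes, 100))
# {'literature': 50, 'science': 30, 'newspapers': 20}
```

Other allocation rules are equally possible; the point is only that the share of each text type in the corpus follows its share in actual use.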
2.3.4 Simplicity
This feature signifies that a corpus should contain text samples in simple
and plain form so that target users can have easy access to the texts without
stumbling upon any additional linguistic information marked up within
the texts. There are a few corpora in which text samples are tagged with
the Standard Generalized Mark-up Language (SGML) (ISO 8879: 1986) format
in which all mark-ups are carefully used to not impose any additional
the corpus.
Since the default value of simplicity is 'plain text', the users expect an
separated from the text itself. Nowadays, many texts are available in SGML
format, which, in the future, may be available in the Text Encoding Initiative
(TEI) format. In such corpora, all words, phrases and sentences are marked
ups have been carefully designed and tagged so that these do not add any
additional linguistic information to the texts.
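To illustrate what it means for mark-up to remain separable from the text, consider the minimal sketch below. It assumes a simplified angle-bracket tag syntax with invented element names (s, w, pos); it is not a full SGML or TEI parser. The hypothetical function split_markup returns the plain text and the tags as two separate things, so that a user who wants only the plain text is not forced to read through the annotation.

```python
import re

# Simplifying assumption: a tag is anything between a pair of angle brackets.
TAG = re.compile(r"<[^<>]+>")


def split_markup(marked_up):
    """Return (plain_text, tags) for a string with angle-bracket tags."""
    tags = TAG.findall(marked_up)
    plain = TAG.sub("", marked_up)
    # Collapse whitespace runs that may be left behind by removed tags.
    plain = " ".join(plain.split())
    return plain, tags


# Invented example; the element and attribute names are hypothetical.
sample = "<s><w pos=NN>Language</w> <w pos=VB>changes</w>.</s>"
plain, tags = split_markup(sample)
print(plain)      # Language changes.
print(len(tags))  # 6
```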
the textual features. It varies from analyst to analyst and from purpose
to purpose.
A simple 'plain text' policy is usually not opposed to this type of encoding,
nor does it oppose the use of the same mark-up conventions. However, we argue
that there should be clear-cut guidelines for the purpose of clarity of the text
used to annotate only the surface features of texts. Otherwise, the encoding
system used to encode texts will create problems in analyses and
big corpora will remain easy to manage if they are full of various annotations,
because retrieval times are already becoming critical.
There are definitely specific reasons behind the practice of using mark-up
techniques on a corpus. In some specific works of language technology,
a corpus built with marked-up texts becomes more useful for systematic
processing and analysis of texts, which results in the development of robust
systems and sophisticated tools for language processing. Marked-up corpora
also become a highly useful resource for various kinds of sociolinguistic research,
2.3.5 Equality
The term equality of text samples is, to a certain extent, related to the feature
samples of each text type should possess an equal number of tokens collected
from various sources. For instance, if each sample of spoken text contains
five thousand words, each sample of written text should also contain five
thousand words.
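The equal-size requirement can be sketched in a few lines. The example assumes that each sample is already a list of tokens and that samples shorter than the target size are discarded rather than padded; both are illustrative choices, and the function name equalise_samples is hypothetical. A target of ten tokens stands in for the five thousand mentioned above.

```python
def equalise_samples(samples, size):
    """Trim every sample (a list of tokens) to exactly `size` tokens.

    Samples with fewer than `size` tokens are dropped, not padded.
    """
    return [tokens[:size] for tokens in samples if len(tokens) >= size]


# Invented samples, split on whitespace for simplicity.
spoken = "well you know it was a very long day at the market".split()
written = "The committee met on Monday to review the annual budget report".split()
short = "Too short".split()

equal = equalise_samples([spoken, written, short], size=10)
print([len(sample) for sample in equal])  # [10, 10]
```

Cutting a sample at a fixed token count usually produces a fragment rather than a complete text, which is the cost of this kind of equality.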
which each text sample had more or less the same amount of data with respect
to the number of tokens. This norm was supported by the general argument
that text samples used in a corpus should be of equal size. However, there are
The sampling techniques used for Brown Corpus are often referred to as a
Wellington Corpus of New Zealand English, Kolhapur Corpus of Indian English, and
Freiburg LOB Corpus. Also, some small-sized corpora are developed following
the same ratio of textual equality although the amount of data is increased in
a proportionate manner.
this model any more in their works of corpus compilation. People now
follow more robust methods based on various statistical as well as linguistic
2.3.6 Retrievability
The work of corpus generation does not end with the compilation of language
data within a corpus. It also involves formatting the text in a suitable form so
that the data becomes easily retrievable by end users. That means the data
stored in a corpus should be made an easy resource for the new generation of
This actually redirects our attention towards the techniques and tools
used for preserving language data in electronic format. Present technology
This, however, will not serve the goal of corpus linguistics because the
utility of a corpus is not confined to computer-trained people only. Because
a corpus is made with the language of all, it is meant for use by all. Starting
from the computer experts, it is open for linguists, social scientists, language
and common people. The goal of a corpus will be accomplished only when
people from every walk of life are able to access the corpus and
electronic corpus. But they need to use language corpora for addressing their
needs. Therefore, a corpus must be stored in an easy and simple format so that
corpus handling and management. Even naive people, who have never
acquired formal computer training, can compile a corpus, arrange data as they
like, use databases according to their choices and classify and analyse data
according to their needs. Due to such a wider scope for application by the
the global scenario of language research and use never imagined before.
2.3.7 Verifiability
This feature implies that the text samples collected from various sources of
investigation. Until and unless a corpus is free and open for all kinds of
of language use.
easily qualifies to win the trust of the users of the language or the language
variety. The users, after verifying the data stored in a corpus, must certify
categorically that what has been exhibited in the corpus is actually a faithful
the language used in newspapers in its fullest form. The corpus will thus
that are true to the language. Also, these works require language data that is
If a corpus is not reliable, then resources made from the database of the
This leads us to argue that a corpus, whatever form or type it may have,
should be open to any kind of verification and assessment. In fact, this quality
will make a corpus trustworthy to language experts because they will be able
corpus linguistics, we are in a position where we easily verify each and every
observation with the database of real-life use.
2.3.8 Augmentation
A living language is bound to change with time. Such change is one of the basic
proofs of the life and vitality of a language. If a language stops changing with
time, we can consider it to be obsolete or dead.
language throbbing with life, must have an ability for ceaseless growth and
means a corpus should continue to grow with time, registering the linguistic
variations observed across time within a living language. Although most of
the present-day corpora are synchronic in nature, efforts are made to make
them diachronic so that they are able to grow in parallel with the change of
time scale, may achieve the status of a diachronic corpus. Over the years, it will
attain a chronological dimension and offer greater scope for diachronic studies
of the language and its properties, capturing the subtle changes that occur
in life and society and are reflected in language. Such a feature has several indirect
With the power of regular augmentation, a corpus will become larger in size
and quantity, wider in coverage and multidimensional in content to reflect on
how language changes its form and texture through the stream of regular
because they find in it a scope to study the changes in life and culture of
people across ages.
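A corpus that grows over time can be sketched as follows. The class GrowingCorpus is hypothetical and makes two simplifying assumptions: every sample is stamped with the year in which it was added, and words are separated by whitespace. With these in place, we can ask how often a word occurs in each year and so observe change across time.

```python
from collections import Counter, defaultdict


class GrowingCorpus:
    """A toy corpus whose samples are grouped by the year they were added."""

    def __init__(self):
        self.samples_by_year = defaultdict(list)

    def add(self, year, text):
        """Augment the corpus with a new sample tagged with its year."""
        self.samples_by_year[year].append(text)

    def frequency_by_year(self, word):
        """Return {year: number of occurrences of word}, in chronological order."""
        result = {}
        for year in sorted(self.samples_by_year):
            counts = Counter()
            for text in self.samples_by_year[year]:
                counts.update(text.lower().split())
            result[year] = counts[word]
        return result


# Invented samples, for illustration only.
corpus = GrowingCorpus()
corpus.add(1995, "send a letter")
corpus.add(2005, "send an email")
corpus.add(2005, "email me")
print(corpus.frequency_by_year("email"))  # {1995: 0, 2005: 2}
```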
corpus linguists. They are never reluctant to work on compiling data from
the sources of language use marked with new tags of time. With this goal
in view, both Bank of English and Bank of Swedish go on adding new language
data from English and Swedish, respectively. For the past two decades, both
the corpora have been in the process of continuous growth with accumulation of new
examples from new sources. Similar efforts are also initiated for the corpora of
2.3.9 Documentation
stylistic analyses and legal enquiries, etc., which also ask for verification of
information of the resource documents from which the data is collected and
etc. In case of written text samples, this is mostly concerned with referential
information of physical texts (for example, the name of a book or newspaper;
of pages; type of text; sex, profession, age and social status of authors; etc.)
On the other hand, in case of spoken texts, this is concerned with the names
argue that extralinguistic information should be tagged within the text itself
to deal with the process of documentation because it may hamper the normal
should be stored in a separate database or file. The file should be tagged with
the corpus in such a way that anybody who wants to access this information
can easily collect it from the tagged file. This will not only keep the text in the
corpus intact in its form and texture but also make the work of corpus access,
processing and information retrieval simpler and more straightforward.
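The arrangement of keeping documentation apart from the text can be sketched as follows. The sample identifiers, field names and values are all invented for illustration, and a real project would define its own scheme. The text of each sample stays untouched in one store, the extralinguistic information sits in another, and the two are linked only through a shared identifier.

```python
import json

# Plain text of each sample, keyed by a hypothetical sample identifier.
corpus = {
    "W001": "Rain delayed the start of the match.",
    "S001": "so we went there in the morning",
}

# Extralinguistic information held apart from the text, linked by the same ids.
# Field names and values are invented for illustration.
metadata = {
    "W001": {"mode": "written", "source": "newspaper", "year": 2001},
    "S001": {"mode": "spoken", "setting": "market", "year": 2002},
}


def lookup(sample_id):
    """Return the text of a sample together with its documentation record."""
    return corpus[sample_id], metadata[sample_id]


# The documentation can be serialised to its own file, leaving the text as it is.
metadata_as_json = json.dumps(metadata, indent=2, sort_keys=True)

text, info = lookup("S001")
print(info["mode"])  # spoken
```

Because the text store contains nothing but text, a user who needs only the language data never has to strip the documentation out first.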
that will contain all references related to the documents. For the purpose of
easy management, access and processing of the corpus, this will allow quick
separation of the plain text from the tags used in annotation. A suitable model
2.3.10 Management
Corpus data management is a highly tedious task. There are always some
errors in the text to be corrected, some modifications to be made, and some
improvements to be incorporated. At the initial stage of corpus generation, it
involves a systematic arrangement of text files according to various text types
by which the searching of information becomes faster and easier. Generally,
the utility of a corpus database is enhanced by an intelligent arrangement
technology is not advanced to perform all these works with full satisfaction,
absolute and non-changeable. These are identified after considering the types
type. These are open for future verification and modification if typological
classifications of corpora are taken into consideration. In that case, some of the
are not addressed here. Also, generation of a new type of corpus may ask
modification and recasting to fit into the new format of language corpora.
Endnotes
[1] The Latin term corpus ('body') has two direct descendants in English: corpse,
which came via the Old French cors, and corps, which came via the modern French
corps in the eighteenth century. The former entered English in the thirteenth
century as cors, and during the fourteenth century, it had its original Latin p
reinserted. At first it simply meant 'body', but by the end of the fourteenth
century, the sense 'dead body' became firmly established. However, the original
Latin term corpus itself was acquired in English in the fourteenth century
(Ayto 1990: 138).
[2] Because of the question of sampling techniques used for generating a corpus,
Sinclair (1991) prefers to use the non-committal word 'pieces' and not 'texts'. If
samples are of the same size, then they are not texts. Most of them will be fragments
of texts, arbitrarily detached from their content sources. Words such
as collection and archive are sometimes used to refer to sets of language texts. However, they
differ from a corpus in the sense that they do not need to be selected or ordered.
Moreover, selection and ordering of input texts do not need to be on the same
lines as proposed for designing a language corpus. These are, therefore, quite
unlike a language corpus. The term text is also referred to in relation to a corpus
because it contains a collection of language data. It simply points to the extracts
used for both spoken and written communications.
[3] Broadly similar definitions of a corpus are provided by Aarts (1991), Johansson
(1991), Leech (1991), Kennedy (1998), Stubbs (1996), Biber, Conrad, and Reppen
(1998) and others. Most of these definitions, however, fail to elaborate the
inherent texture of the concept in detail.
[4] Technically, the size of a corpus implies the total sum of its components (i.e.,
words, phrases, clauses, sentences, etc.). For instance, texts from the field of
natural science should carry weight equal to that of texts from literature and mass media.