Definition and Features of A Corpus

A corpus is a large collection of naturally occurring texts or spoken data that represents a language or language variety. It allows linguists to analyze authentic examples of language use rather than relying on intuition. Key features of a corpus include: - It contains samples of both spoken and written language use. - It aims to include examples from all domains and contexts of a language to give a balanced representation. - Analyzing a corpus allows researchers to make data-driven observations about a language and verify linguistic claims based on real language evidence rather than isolated citations or examples.


2 Definition and Features of a Corpus

2.1 Introduction

From the middle of the last century, we have observed a remarkable change
in the traditional approach to language study. Empirical linguistic research
and application are gradually taking the place of intuition-based language
research and application in almost all the renowned linguistic research
centres across the world. The causes behind such an alteration in the approach
to linguistic research are not easy to define, because the advent of computer
technology and its use in linguistics is not the sole reason why a large number
of hard-core traditional linguists have moved from intuitive research to
empirical analysis. There are other reasons behind this change, which leave a
lasting impact on the new generation of scholars. To trace the root of this
change in the mentality of scholars as well as in the research scenario, we
have to consider the following three factors:

• Limitation of traditional theories, observations, and principles in
defining features of the 'language in use'
• Introduction of language corpora as the most authentic evidence of
'real-life language use'
• All-pervasive use of corpora in various linguistic activities, including
language description, processing, analysis, and application

Besides, there are some other factors that have also played an important role
in attracting a large number of scholars into the field of corpus linguistics.
These factors are summarized below:

• Language corpus enables scholars to observe a natural language in the
light of its actual use in normal regular life.
• It provides ample evidence to analyse language with a degree of
authenticity which is lacking in traditional language studies.
• It helps scholars to reach a conclusive position on any aspect of a
language in an inductive manner in which the final judgement is built
on the observation of numerous individual examples.

Due to these factors, linguists are no longer willing to depend only on sets
of citations and examples assembled intuitively for analysis and description
of a language. On the contrary, they are far more interested in analysing
large amounts of real-life language data, which are open to verification and
validation of any kind. This has been instrumental in bringing about a global
change in the approach towards language study, identified as empirical
linguistics. However, before we start exploring this new domain to learn about
its nature and functional modalities, we need to turn our attention towards
its definition and characteristic features. This will enable us to understand
the field in a comprehensive way.

Whenever we make an attempt to analyse a language scientifically,

we try to understand its form and structure, characteristic features, usage

variety, users and its actual usage in various domains of human interaction.
Information from all these domains (and from some other sectors) related

to a language in a social context cannot be directly obtained just by looking

at its form. With direct support from various linguistic components,

information and evidence, we have to delve deeper beyond the apparent

structure of a language.
In this regard, the most striking thing is that for a long time, we had

no well-defined technique that we could access to acquire information of

various types from language without much trouble. Due to the limitation in

faithful representation of linguistic information, most often we had to rely on


secondary sources. Although linguistic information acquired from secondary

sources was considered reliable, there was no method by which we could

authenticate the information after verifying it with the touchstone of real-life

language use.
Modern corpus linguistics surpasses traditional linguistics in this

particular threshold. What is assumed to be the weakest area of traditional

linguistics is the most powerful part of corpus linguistics. In practice, it does

not depend on second-hand resources or indirect evidence for description,

analysis and application of a language. With the help of computer technology,


it collects, scientifically, a large set of text samples in the form of a corpus

directly from the fields of actual language use. It then analyses the databases,

following some well-defined principles and methods normally used in

mathematics and statistics to explore the nature and function of a language.


In subsequent stages, it systematically uses linguistic information and

examples obtained from corpora in various works of applied linguistics and


language technology. For this reason, corpus linguistics is a far more
enriched discipline, which opens up avenues for new linguistic research
and application. It expands the horizon of linguistics for the direct benefit of

the whole linguistic community. In essence, corpus linguistics brings out a

language from the cloister of a traditional theoretical frame to give it a new

dimension for its revival and rejuvenation.

A language corpus has the ability to shed light on many unknown aspects
of a natural language. We can learn in detail about these aspects from two

basic sources. One is spoken text, and the other is written text. Although we

know that each form is characteristically different from the other, there are

many things that we obtain in equal proportion from both forms. Moreover,

information obtained from one form becomes complementary to the other in

case of general description and analysis of a natural language. Despite such

mutual interdependency between the two, we must admit that each form has

certain unique features, which cannot be mixed up with the features of the

other. Finer distinctive features observed in spoken and written forms should,

therefore, be kept separate from each other while we initiate corpus-based

study of a language.

Each language has a set of distinctive phonemes and a set of distinctive

orthographic symbols to represent these phonemes in written form. These

symbols include a set of characters that are linguistically known as letters

or graphemes, diacritics, punctuation marks, etc. These characters are

usually used at the time of writing to represent a language. Furthermore,

these orthographic symbols are strung together in a very systematic way to

generate words, phrases and sentences, which carry a message or information

embedded within the surface structure.

Besides these elementary building blocks, a language also has sets of

morphemes, which are used to form words and a finite set of grammatical

rules, which are used to generate sentences. But the fact is that grammatical

rules are never explicit in the surface structure of sentences (Winograd

1972: 17). With profound knowledge, if we look into a piece of text, we easily

find these inherent properties of a language.

Furthermore, there are many other properties that are hidden under the
surface of spoken and written text. For instance, the meaning of words,
sense variation of words, variation of context of use, means of referring to
things by way of using words, the process of referring to time by way of
constructions of various types, hidden intentions and motives of speakers and
writers, reciprocal interaction of participants within a speech event, the
internal fabric of social relations between participants of a linguistic event,
reference to time, place, agent, fact and content, information of pragmatics
and discourse, reality of linguistic events, etc., are always embedded within
a piece of text. Information about these properties cannot be retrieved just
by looking at the surface form; it can be obtained only if we examine the text
critically and analyse it with close reference to the context of occurrence.
This leads us to argue that with the help of a corpus, we can explore deep
into the content of a language. Conversely, studies of all these properties of
a language are bound to be skewed and deceptive unless they are substantiated
with evidence gathered from a corpus of real language texts.

Another striking power of a corpus lies in its ability to reflect faithfully
the stylistic patterns of individual writers. With close reference to texts
composed by an author, we can systematically and easily define the stylistic
nuances employed by the author to establish the author's argument or
proposition. Thus, a language corpus becomes a source for reflecting on
intralinguistic and extralinguistic features of a language.

We, therefore, argue that any scientific study and evaluation of a

language should be based on a corpus collected from texts used by a language

community. A corpus will contain not only samples of spoken text but also

samples of written text, in equal proportion, if possible. To give a balanced


and representative structure to a corpus, samples should be compiled from

all domains of language use as far as practically feasible. Keeping this

perspective in view, let us discuss the general definition, form, and features

of a corpus.
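
The idea of drawing spoken and written samples in equal proportion, spread across domains of use, can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function name `balanced_sample`, the `medium` and `domain` fields, and the toy pool of texts are all hypothetical and are not taken from any corpus-building standard.

```python
import random
from collections import defaultdict

def balanced_sample(texts, per_group, seed=0):
    """Draw the same number of texts from every (medium, domain) group.

    `texts` is a list of dicts with the hypothetical keys 'medium',
    'domain' and 'text'. A group holding fewer than `per_group` texts
    contributes everything it has.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for t in texts:
        groups[(t["medium"], t["domain"])].append(t)

    sample = []
    for key in sorted(groups):
        members = groups[key]
        k = min(per_group, len(members))
        sample.extend(rng.sample(members, k))
    return sample

# A toy pool in which written news heavily outnumbers everything else.
pool = (
    [{"medium": "written", "domain": "news", "text": f"news {i}"} for i in range(50)]
    + [{"medium": "written", "domain": "fiction", "text": f"fiction {i}"} for i in range(5)]
    + [{"medium": "spoken", "domain": "conversation", "text": f"talk {i}"} for i in range(8)]
    + [{"medium": "spoken", "domain": "lecture", "text": f"lecture {i}"} for i in range(3)]
)

sample = balanced_sample(pool, per_group=3)
written = sum(1 for t in sample if t["medium"] == "written")
spoken = sum(1 for t in sample if t["medium"] == "spoken")
```

Although the pool is dominated by written news, each group contributes the same number of texts, so the spoken and written halves of the sample come out equal. A real project would balance by word count rather than by number of texts; that refinement is left out here for brevity.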

2.2 What is a Corpus?

The corpus-based language analysis and description is not altogether a new
branch of linguistics. In simple terms, it is a new approach to language study.
It supplies samples and linguistic information for all branches of linguistics.
To achieve its goal, it faithfully collects samples of text from various fields
of language use in a scientific and systematic way. A corpus is a statistically
sampled language database for the purpose of investigation, description,
application and analysis relevant to all branches of linguistics. Owing to its
large structure, varied composition, huge information, confirmed referential
authenticity, wide representation, easy usability, and simple verifiability, a
corpus has become an indispensable resource in all branches of linguistics. In
any area of linguistics, scholars can easily refer to a corpus to verify whether
earlier propositions and examples are real, whether pre-proposed definitions
and explanations are logical, and whether intuitive evidence and arguments are
valid with respect to the proofs of actual usage.


Etymologically, corpus is derived from the Latin corpus meaning 'body'.[1]
Although the term is loosely applied to various non-linguistic collections of
data and samples in other branches of human knowledge, in linguistics and
language-related disciplines (such as philosophy, psychology, etc.), it occupies
an esteemed status with an orientation towards a large collection of language
samples. It has been noted (Francis 1992: 17) that in the sixth century,
the emperor Justinian formed the Corpus Juris Civilis, which is nothing more
than 'a compilation of early Roman laws and legal principles, illustrated by
cases, and combined with explanation of new laws and future legislation to
be put into effect' (World Book 10.168). However, closer to the sense in which
the term corpus is now used is the Latin Corpus Glossary of the eighth
century, which assembled 'hard Latin words arranged in alphabetical order
and followed by easier Latin synonyms or equivalent in Anglo-Saxon' (Starnes
and Noyes 1946: 197).
In corpus linguistics, corpus holds a special connotative sense.

According to Crystal (1995), it refers to 'a large collection of linguistic data,


either written texts or a transcription of recorded speech, which can be

used as a starting point of linguistic description or as a means of verifying

hypotheses about a language'. In a different way, it refers to 'a body of


language texts both in written and spoken form. It represents varieties of
a language used at each and every field of human interaction. Preserved
in machine readable form it enables all kinds of linguistic description and

analysis' (Crystal 1997). However, Kennedy (1998: 3) does not agree with

this definition because according to him, such a one-dimensional definition


may fail to represent the contrasts and varieties involved in the process of

corpus generation. Therefore, in the present context of linguistics, corpus


should be used in the sense of 'a large collection of texts assumed to be
representative of a given language, dialect, or other subset of a language,

to be used for linguistic analysis' (Francis 1982: 7). Although the definition
stated above tries to encompass the socio-linguistic components induced
within a language, it fails to draw attention to the linguistic
criteria considered necessary for designing a corpus.

This need is addressed in the definition in which it is argued that a


corpus is a collection of 'pieces'[2] of language that are selected and ordered
according to some explicit linguistic criteria in order to be used as samples

of the language (Sinclair 1996: 3). It usually refers to a large collection of


naturally occurring language texts presented in machine-readable form

accumulated in a scientific manner to characterize a particular variety or


use of language (Sinclair 1991: 172). It is methodically designed to contain
many millions of words compiled from different text types across various

linguistic domains to encompass the diversity a language usually exhibits


through its multifaceted use. It may refer to any text in written or spoken

form. A corpus, which contains constituent 'pieces' of language that are


documented as to their origin and provenance, is encoded in a standard and

homogenous way for open-ended retrieval tasks.[3]
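
The requirement that each 'piece' of language be documented as to its origin and encoded uniformly for open-ended retrieval can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Sample` and `Corpus` classes, the metadata fields, and the source identifiers are hypothetical and do not reproduce any existing corpus encoding scheme.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    """One 'piece' of language plus a record of where it came from."""
    text: str
    medium: str   # "written" or "spoken"
    domain: str   # e.g. "news", "conversation"
    source: str   # provenance: publication, speaker ID, etc.
    year: int

class Corpus:
    """A uniformly encoded collection that supports simple retrieval."""

    def __init__(self):
        self._samples = []

    def add(self, sample):
        self._samples.append(sample)

    def select(self, **criteria):
        """Return samples whose metadata match every field=value pair."""
        return [s for s in self._samples
                if all(getattr(s, field) == value
                       for field, value in criteria.items())]

    def concordance(self, word, window=3):
        """Return (source, left context, hit, right context) for each match."""
        hits = []
        target = word.lower()
        for s in self._samples:
            toks = s.text.split()
            for i, tok in enumerate(toks):
                # Ignore surrounding punctuation and case when matching.
                if tok.strip(".,;:!?\"'").lower() == target:
                    left = " ".join(toks[max(0, i - window):i])
                    right = " ".join(toks[i + 1:i + 1 + window])
                    hits.append((s.source, left, tok, right))
        return hits

corpus = Corpus()
corpus.add(Sample("The rain stopped before the match began.",
                  "written", "news", "hypothetical-daily-001", 1998))
corpus.add(Sample("well the rain it never stops here",
                  "spoken", "conversation", "hypothetical-speaker-07", 1999))

spoken = corpus.select(medium="spoken")
rain_hits = corpus.concordance("rain")
```

Because every sample carries its own provenance, each concordance line can be traced back to its source text, which is what makes later verification possible.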


Some other scholars (McEnery and Wilson 1996: 215), on the contrary,
prefer to classify a corpus by a finer scheme of classification characterized by
its inherent features. According to them, a corpus can refer to:

• (loosely) any body of text
• (most commonly) a body of machine-readable text
• (more strictly) a finite collection of machine-readable texts, which are
sampled to be maximally representative of a language or a language
variety

However, the definition of a corpus formulated by Hunston (2002: 2) is
slightly different from the others. According to Hunston, 'Linguists have
always used the word "corpus" to describe a collection of naturally occurring
examples of language, consisting of anything from a few sentences to a set
of written texts or tape recordings, which have been collected for linguistic
study. More recently, the word has been reserved for collections of texts (or
parts of texts) that are stored and accessed electronically. Because computers
can hold and process large amounts of information, electronic corpora are
usually larger than the small, paper-based collections previously used to
study aspects of language.'

Two important issues, which are relevant in corpus designing and
compilation, come out from the above deliberations:

• Composition of a corpus, and

• Usage potential of a corpus

It is not a big problem to collect samples of texts of a language. But a mere


collection of samples does not stand for a corpus unless it is marked with
some specific properties. A corpus needs data from each and every domain of
language use without any prejudice and restriction. Theoretically, it has to be
infinite in form and content. At the same time, it has to reflect faithfully on the

varieties normally observed in regular use of language. In essence, it has to be


a reliable replica in which all types of language use are truly manifested.
It is already stated that a corpus contains a large collection of representative

samples obtained from texts covering wide varieties of language use


from numerous domains of interaction. Therefore, a corpus is Capable Of
Representing Potentially Unlimited Selections of text. Taking all these factors

into careful consideration, we can explain the term, from the features it denotes,
in the following way:

C : Compatible with computers
O : Operational in research and application
R : Representative of the source language
P : Processable by both man and machine
U : Unlimited in the amount of language data
S : Systematic in formation and text representation

When we try to develop and design a general corpus, we need to keep in


mind that it is designed for a faithful study of linguistic properties present
in language. Thus, a systematically compiled corpus, although small in size,
must contain the following features:

• It should faithfully represent both common and special linguistic
features of a language from where it is designed and developed.
• It should be large in size[4] to encompass samples of text from various
disciplines. That means, directional varieties of language use noted in
various disciplines and domains should have representation in it.
• It should be a true replica of physical texts normally found in spoken
and printed forms of a language.
• It should faithfully preserve various forms of words, punctuation
marks, spelling variations and other orthographic symbols used in
the source text. Otherwise, the actual image of language or a language
variety will be distorted.
• It should represent all linguistic usage varieties in a proportional
manner to give a general impression about the language.
• Text samples used in a corpus should be authentic and referential for
future verification.
• A corpus should be made in such a way that it becomes available in
machine-readable form for quick access and reference by common
users.
• It should enable language users to use language data in multiple tasks,
starting from simple linguistic description and analysis to statistical
analysis, language processing, translation, etc.
• Text samples should be preserved in either annotated or non-annotated
form.
• The linguistic and extralinguistic information of text samples should
be preserved in a reliable and systematic way along with the texts in
machine so that information is ready for access for future reference
and validation.

Unless defined otherwise, let us consider that a corpus should possess all
the properties mentioned above. Exception may be noted in a historical

corpus, which, due to its diachronic form and composition, is neither


unlimited nor synchronic (See Chapter 3). Such a corpus is not a serious
concern for us because a historical corpus is mostly confined within a

specific peripheral zone having marginal importance in the whole gamut

of empirical language research.


In sum, a language corpus is an empirical standard that acts as a benchmark
for validation of usage of linguistic properties available in a language. In

general, if we analyse a corpus, we get information of the following types


about a language:

• It provides information about all properties and components (for
example, sounds, phonemes, morphemes, words, stems, bases,
lemmas, compounds, phrases, idiomatic expressions, set phrases,
reduplication, sentences, proverbs, etc.) used in a language.
• It supplies grammatical and functional information (for example,
forms, compositions, patterns of using affixes and inflection markers,
patterns of constituent structure, contexts of use, usage patterns, etc.)
of words, phrases, sentences, idiomatical expressions, etc., found
in a language.
• It provides usage-based information (for example, stylistic,
metaphorical, allegorical, idiomatic, figurative, proverbial, etc.) of
segments, morphemes, words, compounds, phrases and sentences
used in a language.
• It supplies clues to the extralinguistic world by way of providing
information related to time, place and agent of a language event; the
social and cultural background of a linguistic discourse; life and living
of the target speech community, discourse and pragmatics; and the
world knowledge at large.

The information of the extralinguistic world obtained from a corpus is

analysed simultaneously with intralinguistic information collected from

linguistic elements of a language to understand how a piece of text is

composed and developed, how text is used, in which context it is used, and
how it serves the purpose of users.

It is clearly understandable that designing and developing a corpus
following all these prerequisite conditions is really a tough task. However,
we can simplify the task to some extent if we redefine the entire concept
of corpus generation based on object-oriented and work-specific needs.
Because not all types of corpus need to follow the same set of design and
composition principles, we have enough liberty to design a corpus, keeping
in view the work we are supposed to do with it. For instance, if we are
interested in knowing about the language of the underworld, we will
definitely try to design a corpus that contains a large amount of data collected
from text samples of people related to the underworld so that the target
world is represented and reflected properly. Although such a corpus (see
Chapter 3) is highly user specific, object oriented and deliberately tilted
towards a particular type of language text, it gives us much needed relief from
the rigor of observing all corpus generation issues, conditions and principles
strictly. Moreover, we would hardly be worried if a corpus of this type fails
to represent the basic general aspects of a language. The main proposition is
that the principles and conditions should vary depending on the purpose of
a corpus, and there is nothing in this to blame or criticize.


With the help of the modern computer, it is not difficult to develop a

large and multi-dimensional corpus, although it may be expensive and time

consuming. If we have a computer with an Internet connection, we can easily

compile a large corpus of written text of any type with samples obtained from

various Web sites. Such work may not be as expensive as we assume and

also may not be as time consuming as we propose. Although such a facility


is readily available for English, German, Spanish, French, Dutch, Italian,

Japanese, Chinese and some other languages, it is hardly available for most of

the Indian languages, including Hindi, Bengali, Tamil, Telugu, Oriya, Urdu,
Punjabi and others.[5]

2.3 Features of a Corpus

A corpus can be, and indeed it is, of many types (See Chapter 3). However, a

general corpus is assumed to have specific characteristic features (as default

values), which might vary for some other types. That means, a corpus, which
does not possess one or more default characteristics of a general corpus, should
be identified as a 'special corpus'—the title of which will specify its normal
pattern of deviation from the frame of a general corpus. Before we discuss

the features attributed to a special corpus, we should concentrate on the


general features attributed to a general corpus. By all means, a general corpus,
if not defined otherwise, should possess the following features: quantity,
quality, representativeness, equality, simplicity, retrievability, verifiability,
augmentation, documentation, and management.

2.3.1 Quantity

The question that arises when we set out to generate a corpus is: how big
will the corpus be? That means, how many words will there be in a corpus?

The answer is not as simple as it appears, because it is neither possible nor

sensible to prescribe any fixed parameter for such a question. But in simple

terms, we can say that the bigger the corpus, the better its authenticity

and reliability. In essence, the number of words included in a corpus will


determine its largeness. Because the primary goal of a corpus-building project

is to include as many words as possible, we are not in a position to restrict a


corpus designer with any fixed mark for word limit.

The default value of 'quantity' signifies that it should be large with regard

to the number of words and sentences included in it. A corpus is assumed


to contain a large number of words and sentences, because the basic point

of assembling a corpus is to gather data from a variety of sources in large

quantity. The present technology enables us to increase the size of a corpus

quite rapidly, and therefore, it is not sensible to recommend any set of figures.
Furthermore, the recent advent of 'monitor corpus' (See Chapter 3) affects

remarkably the concept of size, which refers to the calculation of the 'rate of

flow' of words rather than the 'total amount'.


If we still consider 'quantity' of a corpus in terms of its size, it will

refer to the sum of the total linguistic components included in it. Thus, the

question of quantity or size is best addressed with reference to components.

Size or quantity reflects indirectly on the simplicity or complexities involved

in the process of acquiring text materials. This is again loosely related to

the issue of availability of materials of a language for general access, which


again reflects on the relative importance of an influential language over a

non-influential one.

Most often, in contrast to widely used languages such as English and French,
materials of less influential languages are difficult to procure because less
influential languages have comparatively less circulation than widely used
ones.[6] In case of Indian languages, it is noted that some socio-economically
influential languages, such as Hindi, Urdu, Bengali, Tamil, Telugu, Malayalam,
Kannada, Punjabi, Marathi, etc., easily provide a large amount of text materials,
which are hardly found in less influential Indian languages such as Mundari,
Santali, Sadri, etc. Obviously, the less circulated and less influential languages
are not able to provide written text samples belonging to diverse fields and
disciplines, which influential languages can easily supply.

There is a specific 'quicksand' in the concept of quantity also. The
number of words is not at all a faithful clue to check this feature of a corpus.
In practice, we can easily collect a large number of words from a variety
of lexical resources (dictionary, thesaurus, wordbook, etc.) and claim the total
collection as a corpus. However, this cannot be a corpus because the collection
fails to represent the basic texture of use of a language. Therefore, to overcome
this problem, we argue for collecting texts from various written and spoken
sources. The advantage of this method lies in its gradual increment
of the number of sentence types, which will automatically ensure the normal
growth and variety of words in a corpus.
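
The difference between a bare word list and running text can be made concrete by counting tokens (running words) and types (distinct words). The following Python sketch is only illustrative: the helper names `tokens` and `profile`, the crude tokenisation rule, and the two toy inputs are assumptions introduced here.

```python
import re
from collections import Counter

def tokens(text):
    """Split text into lower-case word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())

def profile(text):
    """Count running words (tokens), distinct words (types) and top words."""
    toks = tokens(text)
    freq = Counter(toks)
    return {
        "tokens": len(toks),
        "types": len(freq),
        "top": freq.most_common(3),
    }

# A list of headwords, as might be copied from a dictionary.
word_list = "apple banana cherry dog elephant fig grape house"

# A short stretch of running text.
running_text = "The dog saw the house. The dog ran to the house and the dog slept."

list_profile = profile(word_list)
text_profile = profile(running_text)
```

In the word list every word occurs exactly once, so the counts carry no information about frequency or context of use. In the running text the same number of types is spread over almost twice as many tokens, and the repetition of words such as 'the' and 'dog' reflects the texture of actual use that a mere word collection cannot show.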

Moreover, the issue of quantity should be envisaged with regard to the

technology of the time. That means the issue of the quantity of words should

be measured with respect to availability of technology at that particular point

of time when the corpus is developed. When the actual work of electronic
corpus generation started in the second half of the last century, computer
technology was not as advanced and robust as it is today. Therefore,
collection of even a modest number of words in a computer was really a tough
task. In those early years of electronic corpus generation, the Brown Corpus,
which contained just one million words, was considered a standard one[7]
because, at that particular time, a collection of one million words in
electronic form was unthinkable for most linguists.

In the 1970s and 1980s, when the computer went through stages of
metamorphosis to be gifted with unprecedented power of storage and
processing, the moderate number of words was revised upwards by
an order of magnitude. As a result, by the mid-1970s, small corpora were
gradually replaced by large corpora of various sizes. Within a few years, some
corpora were developed that contained more than twenty million words.
When the Birmingham Collection of English Text (BCET) was compiled in 1985, it
contained more than twenty million words. In the middle of the last decade of
the last century, the number of words in the Bank of English reached two hundred
million, and it is still open for further increment.

On the other hand, linguists who are working with a corpus also realize

that a collection of one million words is not at all a reliable resource to make

any faithful observation on any aspect of a natural language. They ask for a

corpus of at least one hundred million words to validate their arguments and

hypotheses. In the mid-1990s, the horizon was further expanded. In the new
millennium, we are not even satisfied with a corpus containing a hundred

million words. For instance, the Bank of English has grown to more than
400 million words within the last few years. Yet, it shows no intention to stop.
It still continues to grow with daily doses of data coming from various sources
and fields.
There are, however, a few loopholes in the labyrinth of quantity. We
observe that a collection of data from those languages that enjoy facilities
of electronic devices is much bigger than those languages that do not have
such facilities. That means techno-savvy languages, such as English, German,
French, Italian, Spanish, etc., have better scope for generation of a corpus than
non-techno-savvy languages because techno-savvy languages, due to specific
socioeconomic, politico-cultural, and commercial-scientific reasons, enjoy
both global patronage and technical support. Therefore, availability of texts
in electronic form in these languages is much higher than in others. Also, the
Roman script, used for most of these languages, contributes to a great extent
to their global expansion.
On the contrary, languages that do not have large resources in electronic
form, for the reasons stated above, have very little scope for easy generation
of a corpus. Even if we set aside the languages of the backward and
underdeveloped communities, we can easily find that resources in electronic
form available in Indian languages, such as Hindi, Bengali, Telugu, Tamil,
etc., are not even one-tenth of the resources available for languages such as
English, Spanish, German, French, etc., although the number of speakers of
Indian languages is not less than that of Western languages. The grim truth
is that the facilities of the electronic medium are not yet as accessible to
languages of underdeveloped countries as they are to languages of
advanced countries. Therefore, it is not surprising that the number
of electronic corpora in Indian languages is smaller than the number of
those in the languages of advanced countries.
The above argument, however, does not work in the case of corpora of
spoken texts. For both advanced and non-advanced languages, in reality, only
a small and marginal fraction of the whole amount of spoken interaction
is included within a speech corpus. And most strikingly, for spoken texts
of advanced and non-advanced languages, collection and processing of
speech databases involve an equal amount of complexity and technical
sophistication. Even then, we find that tools and techniques for spoken text
collection and processing are more easily available for advanced languages than
for backward languages. Here also, the actual socio-economic condition of the
related speech communities plays an important role; we probably cannot
ignore this cruel truth.

2.3.2 Quality

The default value for 'quality' relates to authenticity. That means all text
materials should be collected from genuine communications of people going about
their normal business. The role of corpus collectors is confined to
acquiring data for the purpose of corpus generation, which, in return,
will protect the interests of people who will make statements about the way
language is used in communication. Corpus collectors have no liberty to alter,
modify or distort the actual image of the language they are collecting. Also,
they have no right to add information from their personal observations on the
ground that the data is not large or suitable enough to represent the language
for which it is made. The basic point is that corpus collectors will collect data
faithfully, following the predefined principles proposed for the task. If they
try to interpolate in any way within the body of the text, they will not only
damage the actual picture of the text but also weigh heavily upon subsequent
analysis of data. This will affect the overall projection of the language or,
worse, may yield wrong observations about the language in question.
Strategic alienation on the part of corpus collectors will restrain them from
including language texts obtained from experimental conditions or artificial
circumstances. However, it is difficult to draw a line of distinction between
authentic and artificial texts. For instance, consider the data collected from
recordings of conversations broadcast on radio or television. Apparently, texts
found in these sources have nothing of abnormal or artificial quality because
these are broadcast as they are conversed and recorded. But the truth is, these
databases are quite far removed from actual reality. In most cases, these
conversations are chiselled and processed in the studio before being delivered
for broadcasting or telecasting to the target audience. Therefore, we have an
objection to considering these texts as normal and spontaneous ones because
most of the qualities of impromptu conversations are lost in these databases.
In the context of a general corpus, such data are of secondary importance as
they lose most of the interactional properties normally observed in casual and
informal talks. However, such data have special importance and functional
relevance in a 'special corpus' that includes samples from artificial and
built-up situations.
Furthermore, in some extreme situations, some television shows may
try to deliberately put the participants in an artificial condition to elicit
odd responses. On the other hand, casual conversations are presented as
impromptu for the sake of a catchy presentation, but these are in fact
rehearsed by the participants before their talks are circulated. It is therefore
required that expert linguists seriously intervene in such situations; otherwise,
the data of special interactions will be allowed into a general
corpus. However, for special works, these may be tagged as 'experimental
corpora', which, like special corpora, have specific functional relevance in
linguistic discussions.[8]

2.3.3 Representativeness

A corpus should include text samples from a broad range of materials in order
to attain proper representativeness. It should be balanced across all disciplines
and subject fields to represent the maximum number of linguistic features
found in a language. Besides, it should be authentic in its representation of the
text variety for which it is developed, because future analysis and investigation
will ask for verification and authentication of information from a corpus
representing the language. For example, if we want to develop a Hindi corpus
that is properly representative of the language, we should keep in mind
that we need to collect data from both written and spoken sources in equal
proportion so that the corpus becomes a true replica of the language. This is
the first condition of representativeness.


Further complications, however, will arise in subsequent stages. For
instance, when developing a speech corpus for Hindi, the question is, from
which sectors and fields are we supposed to collect the data? Should we include
only texts from family interactions, or should we include data from speech
events that occur in courts and police stations, offices and clubs, schools and
colleges, playgrounds and cinema halls, shopping malls and market places,
roads and pubs, etc.? The answer is already embedded within the question. We
need to collect text from all possible sources of spoken interaction, irrespective
of their place, time and situation of occurrence, and from all types of people,
irrespective of their sex, age, class, caste, education or profession. Only then can a
speech corpus be representative in the true sense of the term.


An almost similar argument stands for a corpus of written texts. Samples
should not be collected only from one or two sources. They should be maximally
representative with regard to demographic variables. A written corpus should
contain samples of text not only from imaginative writings, such as fiction,
novels, and stories, but also from informative prose texts such as natural
science, social science, medical science, engineering, technology, commerce,
banking, earth science, advertisements, posters, newspapers, personal letters,
government notices, diaries and similar sources. To be truly representative
of a language, text samples should be collected in equal proportion from all
sources, irrespective of text types, genres and time variations. Citations of
individual instances of word use as well as collections of terms cannot be
termed a corpus. Although the condition for designing a valid sample size of a
corpus is yet to be finalized, people who are seriously concerned with corpus
generation will not attempt to gather a large collection of citations of words and
claim it as a corpus.[9]

In the long run, the question of size becomes irrelevant in the context of
representativeness. A large corpus does not necessarily imply representation

of a language or a language variety any better than a small but properly


balanced corpus. A simple, large collection of any text samples is not
necessarily a corpus from which we can make any generalization. According
to scholars (Leech 1991), we call a corpus 'representative' only when the

findings based on its analysis can be generalized to the language as a whole


or to a specified part of it. Therefore, rather than focusing on quantity of

data, it is always better to emphasize quality of data. Here quality refers to


the variety of data, which is represented proportionately from all possible

domains of language use.


Experts argue that the overall size of a corpus needs to be set against the
diversity of sources for achieving representativeness. Within any text type,
the greater the number of individual samples, the greater is the reliability of
the analysis of linguistic variables (Kennedy 1998: 68). The Brown Corpus, LOB
Corpus and Survey of English Usage are designed in such a way that they
are maximally and truly representative of the target language included
in them. A simple comparison of the British National Corpus with the Brown
Corpus, LOB Corpus, and Survey of English Usage, however, shows that the latter
three are smaller with respect to the number of words and less diversified in
structure and variety of contents. This helps us to settle empirically the issues
related to size and representativeness of a corpus.

The issues related to balance and representativeness of a corpus
are also discussed with reference to some empirical findings (Summers 1991). It
is argued that even a corpus of one hundred million words is too small when
compared with the total amount of texts from which a corpus is sampled.
Summers shows how the differences in content and language of a particular
text type influence linguistic analysis at subsequent stages because the original
purpose of the text plays a vital role in drawing up inferences. Thus, on the
basis of empirical observations, Summers argues for adopting a sampling
approach by way of 'using the notion of a broad range of objectively defined
document or text types as its main organising principle' (Summers 1991: 5).
To achieve the goal of representativeness, she outlines a number of possible
principles for the selection of written samples as listed below:

• The elitist's approach, which is based on literary or academic merit or
'influentialness' of the texts

• Random selection of text samples

• Currency, or the extent to which texts are read

• Subjective judgement of the 'typicalness' of the texts


• Availability of text samples in the archives

• Demographic sampling of reading habits of people

• Empirical adjustments of the text selection procedure to meet the

linguistic specifications of a corpus

• Purpose of the investigators at the time of corpus building

In our view, the most sensible and pragmatic approach is the one
that combines all these approaches in a systematic way, so that we
have data from a broad range of sources and text types with due emphasis on
'currency', 'influentialness', and 'typicalness'.

Finally, questions may arise regarding the validity and usefulness of
proper representativeness of a corpus in the context of its application in
linguistic research and analysis. In reply, we argue that since
language differs with respect to the topic of context, discourse of deliberation
and variation of social settings, there should definitely be some measures in
corpus collection to reflect these inherent factors. For instance, the language
we find in mass media is characteristically different from the language we
often encounter in medical bulletins.
This implies that language is bound to vary due to a variation in situations
(Halliday and Hassan 1985), interactants (Holmes 1995), places (McMahon
1994), topics (Hymes 1974), and other similar sociolinguistic variables (Eggins
1994). Hence, if we want to derive a universal picture of a language, there is no
alternative but to obtain samples from all possible sources of language
use. The goal of a corpus will be lost if it fails to reflect all the primary
aspects of a language. Therefore, we consider proportional representation of
text samples as one of the basic features of a general corpus.
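
Proportional representation can be illustrated with a small sketch. The code below is only one possible reading of the idea, not a method taken from any corpus project: the function name, the input format and the target shares are all assumptions made for illustration. It draws a fixed number of texts so that each domain contributes according to a pre-decided share.

```python
import random
from collections import Counter, defaultdict

def proportional_sample(texts, shares, total, seed=0):
    """Pick about `total` texts so each domain matches its target share.

    texts  : list of (domain, text) pairs (hypothetical input format)
    shares : dict mapping domain -> target fraction (assumed to sum to 1.0)
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for domain, text in texts:
        by_domain[domain].append(text)

    sample = []
    for domain, share in shares.items():
        quota = round(total * share)
        pool = by_domain.get(domain, [])
        # Never ask for more texts than the domain actually holds.
        picked = rng.sample(pool, min(quota, len(pool)))
        sample.extend((domain, text) for text in picked)
    return sample

# Invented toy data: ten placeholder texts in each of three domains.
texts = [(d, f"{d}-{i}") for d in ("news", "fiction", "science") for i in range(10)]
shares = {"news": 0.5, "fiction": 0.3, "science": 0.2}  # illustrative shares only
sample = proportional_sample(texts, shares, total=10)
counts = Counter(domain for domain, _ in sample)
```

The shares are a design decision that has to come from the corpus designers; the sketch merely enforces whatever proportions are supplied.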

2.3.4 Simplicity

This feature signifies that a corpus should contain text samples in simple
and plain form so that target users can have easy access to the texts without
stumbling upon any additional linguistic information marked up within
the texts. There are a few corpora in which text samples are tagged in
the Standard Generalized Mark-up Language (SGML) (ISO 8879: 1986) format,
in which all mark-ups are carefully used so as not to impose any additional
information on the texts. The role of the mark-up system in relation to
text representation is to preserve, in linear encoding, some features that
would otherwise be lost. The system is perceived as helpful in the sense that its
presence usually does not disturb easy retrieval of original text samples from
the corpus.
Since the default value of simplicity is 'plain text', the users expect an
unbroken string of characters without any added information. If there is
anything to be marked up within a text, it should be clearly identified and
separated from the text itself. Nowadays, many texts are available in SGML
format, which, in the future, may be available in the Text Encoding Initiative
(TEI) format. In such corpora, all words, phrases and sentences are marked
up with grammatical, lexical and syntactic information. For example, the British
National Corpus and the LOB Corpus are marked up in this way, with
mark-ups carefully designed and tagged so that they do not impose any
additional linguistic information on the texts.

The basic role of a mark-up process, in relation to text representation,
is to preserve some additional features that are useful for different types
of linguistic work. Although these are perceived as helpful, their presence
must be recorded separately so that the original text is easily retrievable.
The conventions for mark-up are extendable to various annotations that
add information needed for rigorous linguistic analysis of texts. Such
information is actually related to the organization and interpretation of
the textual features. It varies from analyst to analyst and from purpose
to purpose.
A simple 'plain text' policy is usually not opposed to this type of encoding,
nor does it oppose the use of the same mark-up conventions. However, we argue
that there should be clear-cut guidelines for the purpose of clarity of the text
so that it becomes easy to distinguish between plain text and encoded
text. Encoding systems that are used to annotate only the surface features of
texts should be kept distinct. Otherwise, the encoding
system used to encode texts will create problems in the analysis and
interpretation of the original texts.[10]


More difficult is the question related to an annotated corpus. It is proposed
that this term may be used for any text corpus that includes codes that
record extralinguistic information of various types such as analytical marks,
provenance, etc. Again, it should be categorically stated that annotations
should be separable from plain text in a simple and agreed fashion. A set

of conventions for removing, restoring and manipulating annotations is


necessary, especially in the context of the next few years when we hope to see
a large growth of corpora tagged with annotations. It is naive to expect that

big corpora will remain easy to manage if they are full of various annotations,
because retrieval times are already becoming critical.
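
The requirement that annotations be removable and restorable can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not a convention used by any existing corpus: it assumes SGML-like tags written between angle brackets, assumes that the text itself contains no literal '<' characters or entities, and uses hypothetical function names and a toy tag set.

```python
import re

TAG = re.compile(r"<[^>]+>")  # assumed tag shape: anything between < and >

def split_annotations(marked_up):
    """Return (plain_text, annotations); each annotation is (offset, tag)."""
    plain_parts = []
    annotations = []
    pos = 0      # read position in the marked-up string
    offset = 0   # length of plain text produced so far
    for match in TAG.finditer(marked_up):
        chunk = marked_up[pos:match.start()]
        plain_parts.append(chunk)
        offset += len(chunk)
        annotations.append((offset, match.group()))
        pos = match.end()
    plain_parts.append(marked_up[pos:])
    return "".join(plain_parts), annotations

def restore_annotations(plain, annotations):
    """Re-insert the tags at their recorded offsets in the plain text."""
    out = []
    pos = 0
    for offset, tag in annotations:
        out.append(plain[pos:offset])
        out.append(tag)
        pos = offset
    out.append(plain[pos:])
    return "".join(out)

marked = "<w NN>corpus</w> <w VBZ>grows</w>"  # invented example with toy tags
plain, tags = split_annotations(marked)
```

Keeping the tags as a list of offsets means that the plain text can be searched on its own, while the annotated form can still be rebuilt whenever an analyst needs it.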
There are definitely specific reasons behind the practice of using mark-up
techniques on a corpus. In some specific works of language technology,
a corpus built with marked-up texts becomes more useful for systematic
processing and analysis of texts, which results in the development of robust
systems and sophisticated tools for language processing. Marked-up corpora
also become a highly useful resource for sociolinguistic research,
dictionary compilation, grammar writing, and language teaching.

2.3.5 Equality

The term equality of text samples is, to a certain extent, related to the feature
of 'representativeness'. In a general sense, from a quantitative point of view,
each text sample should have an equal number of words. This means
samples of each text type should possess an equal number of tokens collected
from various sources. For instance, if each sample of spoken text contains
five thousand words, each sample of written text should also contain five
thousand words.
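
As a rough illustration of the equal-size norm, the sketch below cuts every text to the same number of tokens and discards texts that are too short. The function name is hypothetical, and splitting on white space is a simplifying assumption; a real project would need a tokenizer suited to the language and the medium.

```python
def equal_size_samples(texts, size):
    """Cut each text to exactly `size` tokens; skip texts that are too short."""
    samples = []
    for text in texts:
        tokens = text.split()  # naive white-space tokens (an assumption)
        if len(tokens) >= size:
            samples.append(tokens[:size])
    return samples

# Invented examples standing in for one spoken and one written text.
spoken = "well you know I mean it was kind of late when we got there"
written = "The committee approved the proposal after a brief discussion."
samples = equal_size_samples([spoken, written, "too short"], size=5)
```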

This was the norm followed in the formation of the Survey of English Usage, in
which each text sample had more or less the same amount of data with respect
to the number of tokens. This norm was supported by the general argument
that text samples used in a corpus should be of equal size. However, there are
specific hidden constraints behind such a proposition that cannot be avoided
easily at the time of corpus generation:

• The variety in spoken text is more than in written text in any

living language. Therefore, spoken text asks for greater or larger

representation than written texts.

• Because collection of data from written sources is a much easier task

as compared to collection of data from spoken sources, written texts


may have greater representation in the corpus.
• Parity in the amount of tokens is a highly deceptive condition

because tokens never occur in equal proportion in each text type.


• An equal amount of text cannot be collected from everywhere in a

uniform manner because the size of samples will vary proportionately

depending on the needs of subsequent application and use.

The sampling techniques used for Brown Corpus are often referred to as a

standard model in the context of maintaining a balance in case of quality of


samples at the time of generating a corpus. This model of equality in data
collection is faithfully adopted in LOB corpus, Australian Corpus of English,

Wellington Corpus of New Zealand English, Kolhapur Corpus of Indian English, and
Freiburg LOB Corpus. Also some small-sized corpora are developed following

the same ratio of textual equality although the amount of data is increased in
a proportionate manner.

At present, however, the situation has changed considerably due to
advances in computer technology. People are no longer much interested in
following this model in their works of corpus compilation. They now
follow more robust methods based on various statistical as well as linguistic
models and principles to make corpora balanced, multidimensional and
representative by including texts with a varied amount of samples
gathered from various sources. In fact, newly compiled corpora hardly
follow the model used for the Brown Corpus or the LOB Corpus.

2.3.6 Retrievability

The work of corpus generation does not end with the compilation of language
data within a corpus. It also involves formatting the text in a suitable form so

that the data becomes easily retrievable by end users. That means the data
stored in a corpus should be made an easy resource for the new generation of

users. Anybody interested in the database should be able to extract relevant


information from a corpus.

This actually redirects our attention towards the techniques and tools
used for preserving language data in electronic format. Present technology

has made it possible for us to generate a corpus in a personal computer and


preserve it in such a way that future users will be able to retrieve and access
the data as and when required. The advantage, however, goes directly to those

people who are trained to handle language databases in a computer.

This, however, will not serve the goal of corpus linguistics because the
utility of a corpus is not confined to computer-trained people only. Because

a corpus is made with the language of all, it is meant for use by all. Starting
from the computer experts, it is open for linguists, social scientists, language

experts, teachers, students, researchers, historians, advertisers, technologists

and common people. The goal of a corpus will be accomplished only when

people coming from every walk of life will be able to access the corpus and

use information from it to address their linguistic and non-linguistic needs.


In reality, many of these people are not trained in handling a computer or

electronic corpus. But they need to use language corpora for addressing their

needs. Therefore, a corpus must be stored in an easy and simple format so that

common people can use it.

Modern computer technology, however, has simplified the process of
corpus handling and management. Even people who have never
acquired formal computer training can compile a corpus, arrange data as they
like, use databases according to their choices and classify and analyse data
according to their needs. Due to such a wide scope for application by
people across age, education or profession, a corpus attains a unique status in
the global scenario of language research and use, one never imagined before.

2.3.7 Verifiability

This feature implies that the text samples collected from various sources of

language use must be open to empirical verification. They should be reliable

and verifiable in the context of representing a language under scrutiny and

investigation. Unless a corpus is free and open to all kinds of
empirical analysis and verification, its importance is reduced to zero. Sample
texts that are collected and compiled in a corpus to represent a language or
a language variety should honestly register and reflect the actual patterns
of language use.

To address this need, a corpus has to be made in such a way that it
easily wins the trust of the users of the language or the language
variety. The users, after verifying the data stored in a corpus, must be able to
certify categorically that what is exhibited in the corpus is actually a faithful
reflection of the language they use. For instance, if we develop a corpus
of the language used in Bengali newspapers, we must ensure
that the data preserved in the corpus properly reflects
the language used in newspapers in its fullest form. The corpus will thus
attest its authenticity and validity in synchronic and diachronic studies on
the language of newspapers.

Also, language data collected and compiled in a corpus needs to be
verifiable and authentic for some practical reasons related to applied
linguistics. Various works of applied linguistics (grammar book writing,
dictionary compilation, preparation of language teaching materials and
textbooks, writing of reference books, etc.) demand language databases
that are true to the language. Also, these works require language data that is
verifiable in case of future debates regarding their validity and authenticity.
If a corpus is not reliable, then resources made from the database of the
corpus will also lose their reliability and authenticity.

This leads us to argue that a corpus, whatever form or type it may have,
should be open to any kind of verification and assessment. In fact, this quality
will make a corpus trustworthy to language experts because they will be able
to access it for empirical investigation either to verify earlier propositions
or to refute prior observations made by others. This particular feature puts
corpus linguistics steps ahead of intuitive linguistics. Although we hardly
get an opportunity to verify a hypothesis made in intuitive linguistics, in
corpus linguistics we are in a position to verify each and every
observation easily against the database of real-life use.

2.3.8 Augmentation

A living language is bound to change with time. This is one of the basic proofs
of its life and vitality. If a language stops changing with
time, we can consider it to be obsolete or dead.

A corpus, which, in principle, aims at catching the features of a

language throbbing with life, must have an ability for ceaseless growth and

improvement. It must have facilities for augmentation with new data to


capture the changes reflected in the form and content of the language. This

means a corpus should continue to grow with time, registering the linguistic
variations observed across time within a living language. Although most of

the present-day corpora are synchronic in nature, efforts are made to make
them diachronic so that they are able to grow in parallel with the change of

time and language.

Any synchronic corpus, by way of regular augmentation of data across the
time scale, may achieve the status of a diachronic corpus. Over the years, it will
attain a chronological dimension and offer greater scope for diachronic studies
of the language and its properties, capturing subtle changes that occur
in life and society and are reflected in language. Such a feature has several indirect
effects on the works of both mainstream linguistics and language technology.
With the power of regular augmentation, a corpus will become larger in size,
wider in coverage and multidimensional in content to reflect
the colourful spectrum of life and language.

The referential importance of a diachronic corpus in the study of

chronological change of language is immense. Such a corpus faithfully shows

how language changes its form and texture through the stream of regular

usages across time. Besides throwing light on the changes of language


properties, it also reflects on life, society and culture that flow on as an

ongoing perennial stream under the surface of language use by a speech

community. A corpus thus becomes valuable and authentic to social scientists

because they find in it a scope to study the changes in life and culture of
people across ages.

The feature of 'augmentability' thus becomes an important weapon for
corpus linguists. They are never reluctant to work on compiling data from
the sources of language use marked with new tags of time. With this goal
in view, both the Bank of English and the Bank of Swedish go on adding new
language data from English and Swedish, respectively. For the past two decades, both
the corpora have been growing continuously, accumulating new
examples from new sources. Similar efforts have also been initiated for the corpora of
German, Spanish, French, Dutch, Italian and other languages.

2.3.9 Documentation

This feature entails that documentary information about the components stored in a
corpus should be kept separate from the components themselves. In general, it is
necessary to preserve detailed information about the sources from which language
samples are collected. This is a practical requirement for corpus designers,
who must deal with problems related to verification of source texts, validation of
examples and resolution of copyright problems. Besides, there are other
linguistic and extralinguistic issues related to sociolinguistic investigations,
stylistic analyses, legal enquiries, etc., which also ask for verification of
information about the resource documents from which the data is collected and
included in the corpus.

Corpus designers are, therefore, asked to document meticulously all
types of extralinguistic information related to the types of text, sources of text,
etc. In the case of written text samples, this is mostly concerned with referential
information of physical texts (for example, the name of a book or newspaper;
names of topics; year of first publication; year of second edition; number
of pages; type of text; sex, profession, age and social status of authors; etc.).
On the other hand, in the case of spoken texts, this is concerned with the names
of speakers; situations of speech events; dates and times of speech events;
number of participants in speech events; age, sex, profession and social status
of participants; manner of involvement of participants, etc.


There are, however, some controversies regarding the process of
documentation of extralinguistic information in a corpus. Some experts
argue that extralinguistic information should be tagged within the text itself
so that users can retrieve both linguistic and extralinguistic information
together without much trouble. Others argue that this is not a sensible way
to deal with the process of documentation because it may hamper the normal
process of text processing, data access and information retrieval. Moreover,
not every user needs to access this information when dealing
with the texts of a corpus. Therefore, since extralinguistic information is not a
primary component of text, it should not be included in the body of a corpus.
Rather, it should be stored in a separate database or file. The file should be
linked to the corpus in such a way that anybody who wants to access this
information can easily collect it from the linked file. This will not only keep the
text in the corpus intact in its form and texture but also make the work of corpus
access, processing and information retrieval simpler and more straightforward.

In essence, proper documentation entails that the corpus designers keep
all information of documentation of text samples in a separate place from
the text itself for future reference. If required, there should be a 'header file'
that will contain all references related to the documents. For the purpose of

easy management, access and processing of the corpus, this will allow quick

separation of the plain text from the tags used in annotation. A suitable model

is the TEI system, which includes a simple minimal header containing a

reference to the documentation. For management of the corpus, this allows


effective separation of the plain text from the annotation with only a small

amount of programming effort. The robustness of real-time search procedures

is not hampered in this process.
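
The idea of a separate header file may be sketched as follows. This is not the TEI header format; it is a simplified illustration in which the field names, the identifiers and all the values are invented, and JSON is chosen only because it is easy to read. The plain texts and their documentation are held in two separate stores that share nothing but the sample identifier.

```python
import json

# Plain text samples, keyed by a sample identifier (all values are invented).
texts = {
    "w001": "The monsoon arrived early this year.",
    "s001": "so when are you coming home then",
}

# Documentation lives apart from the text, under the same identifiers.
headers = {
    "w001": {"medium": "written", "source": "newspaper", "year": 2001},
    "s001": {"medium": "spoken", "participants": 2, "setting": "family"},
}

def save_headers(headers):
    """Serialise the header records so they can be stored in their own file."""
    return json.dumps(headers, indent=2, sort_keys=True)

def load_headers(serialised):
    """Read the header records back from their serialised form."""
    return json.loads(serialised)

def lookup(sample_id, texts, headers):
    """Join a plain text with its documentation only when a user asks for it."""
    return {"text": texts[sample_id], "header": headers.get(sample_id, {})}

header_file = save_headers(headers)
record = lookup("w001", texts, load_headers(header_file))
```

Users who need only the language data never touch the header store, while those who need provenance can join the two on demand.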

2.3.10 Management

Corpus data management is a highly tedious task. There are always some
errors in the text to be corrected, some modifications to be made, and some
improvements to be incorporated. At the initial stage of corpus generation, it
involves a systematic arrangement of text files according to various text types
by which the searching of information becomes faster and easier. Generally,
the utility of a corpus database is enhanced by an intelligent arrangement

of text files in an electronic archive. The task of information retrieval from a


corpus also requires utmost care and sincerity on the part of corpus designers
so that required files and necessary data are easily available to target users.
Also, systematic arrangement of data makes interdisciplinary research and
application work more effective and fruitful.

After a corpus is developed and stored, the designers need necessary
schemes for its maintenance, standardization, augmentation and upgrading.
Maintenance is needed so that the data is not corrupted by virus infection or
damaged by some external effects. Standardization is needed so that a corpus
becomes comparable with other corpora developed in and across languages
or language types. Augmentation is required to enlarge existing databases
with new examples and text samples obtained from new sources. Finally,
upgrading is needed so that the existing data is converted properly for use
in new systems and techniques. Since computer technology is changing very
fast, a corpus database needs to be continuously upgraded to be on a
par with new systems and software. Otherwise, the whole effort will be ruined
forever. In general, the process of upgrading a corpus database involves the
following issues:

• Preservation of data from a computer hard disk to a floppy disk, from
a floppy disk to a compact disk, and from a compact disk to the next
available storage facility
• Migration of a corpus database from a Disk Operating System
(DOS) environment to a Windows environment or to a new
environment available
• Conversion of language texts from Indian Standard Code for Information
Interchange (ISCII) to American Standard Code for Information Interchange
(ASCII), from ASCII to Unicode, from Unicode to some other more
user-friendly coding system, etc.
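
The conversion step listed above can be sketched as follows. This is a
minimal illustration, assuming that a codec for the source encoding is
available to Python; the function name convert_to_utf8 is hypothetical, and
the usage example takes latin-1 purely as a stand-in for a legacy encoding
(an ISCII source would need a dedicated converter).

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding):
    """Re-encode one text file as UTF-8 and return the number of characters."""
    raw = Path(src).read_bytes()
    # errors="strict" raises UnicodeDecodeError on bytes that are invalid in
    # source_encoding, rather than silently dropping or replacing characters.
    text = raw.decode(source_encoding, errors="strict")
    Path(dst).write_bytes(text.encode("utf-8"))
    return len(text)
```

For example, convert_to_utf8("old.txt", "new.txt", "latin-1") would write a
UTF-8 copy of old.txt to new.txt. Writing to a separate destination file
keeps the original untouched, so a failed or wrong conversion can be repeated.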

In essence, adaptation to new hardware and software technology has to be
treated with the utmost importance. Although present computer technology is
not yet advanced enough to perform all these tasks satisfactorily, we expect
that software technology will improve to a large extent to address all these
requirements.

The features discussed above should not be regarded as absolute or
unchangeable. They are identified after considering the types of corpora
developed so far (see Chapter 3) in various languages of the world.
Therefore, these features are more of a general nature than of any specific
type. They are open to future verification and modification if typological
classifications of corpora are taken into consideration. In that case, some
of the features discussed above will be redesigned to address the uniqueness
of form and content of particular corpora.

Future scholars may identify some corpus-specific features that are not
addressed here. Also, generation of a new type of corpus may call for
identification of a feature that is not discussed here. Innovations in
technology and application of new principles may result in the formation and
design of a corpus of a new type. In that case, the features stated here are
open to modification and recasting to fit the new format of language corpora.

Endnotes

[1] The Latin term corpus ('body') has two direct descendants in English: corpse,
which came via the Old French cors, and corps, which came via the modern French
corps in the eighteenth century. The former entered English in the thirteenth
century as cors, and during the fourteenth century, it had its original Latin p
reinserted. At first it simply meant 'body', but by the end of the fourteenth
century, the sense 'dead body' became firmly established. However, the original
Latin term corpus itself was acquired in English in the fourteenth century
(Ayto 1990: 138).
[2] Because of the question of sampling techniques used for generating a corpus,
Sinclair (1991) prefers to use the non-committal word 'pieces' and not 'texts'. If
samples are all of the same size, then they are not texts. Most of them will be
fragments of texts, arbitrarily detached from their content sources. Words such
as collection and archive also refer to sets of language texts. However, they
differ from a corpus in that they do not need to be selected or ordered.
Moreover, where selection and ordering of input texts do take place, they need
not follow the lines proposed for designing a language corpus. These are,
therefore, quite unlike a language corpus. The term text is also used in relation
to a corpus because a corpus contains a collection of language data. It simply
points to the extracts used for both spoken and written communications.
[3] Very similar definitions of a corpus are provided by Aarts (1991), Johansson
(1991), Leech (1991), Kennedy (1998), Stubbs (1996), Biber, Conrad, and Reppen
(1998) and others. Most of these definitions, however, fail to elaborate the
inherent texture of the concept in detail.
[4] Technically, the size of a corpus implies the total sum of its components (i.e.,
words, phrases, clauses, sentences, etc.). For instance, texts from the field of
natural science should carry the same weight as those from literature, mass media,
engineering and social science. Thus, balanced representation of texts may be
obtained from all disciplines and domains in a proportionate manner. However,
in practice, the total number of tokens included in a corpus determines its size.
The number of words may be fixed for some corpora, while it may continue to
increase regularly for others.
[5] However, the entire situation is changing rapidly, which makes us quite
optimistic about collecting written text samples from Indian languages in their
own scripts from the virtual world very soon.
[6] Such an intricate picture of the relative circulation relationship of languages,
however, does not hold any relevance in the case of a speech corpus. Here, text
materials of the most influential and pervasive languages occupy the same status
as those of less influential languages in informal and impromptu conversations.
Moreover, speech events of both types of languages are not normally recorded
in full detail.
[7] The one million words collected for construction of the Brown Corpus were
distributed across 15 genre categories. The corpus as a whole contained 500
samples, and each sample contained nearly 2000 words. All the samples of text
were obtained from written and published sources of various types.
[8] An experimental corpus is a kind of special corpus, which is assembled to study
the finer details of spoken language in specific interactions. Such a corpus is
small in size and is produced by way of asking informants to read out strange
messages in anechoic chambers.
[9] However, a collection of citations of words may be used as a valuable resource
for designing dictionaries and word books. This has been a long, traditional
practice in lexicography.
[10] In the case of spoken transcription, this distinction has to be made carefully because
orthographic transcribers may add analytic notations in the text. These should
be conventional and familiar in form and representation so that people can
treat them as sophisticated mark-ups to distinguish them from intonation
annotation or grammatical tagging.
