An introduction to corpus linguistics

Colm Smyth*

Abstract
In any language, it can often be difficult to ascertain which word should be used in a given
context. In the past, people have mostly made these vocabulary choices by using their intuition.
Nowadays, we can use a corpus to help in this decision-making process. This paper gives an
introduction to the area of corpus linguistics and its methodologies. A brief look is also taken at the
implications for language teachers.

Keywords: corpus, collocation, word-form, mutual information, T-score.

*Lecturer, Department of English Language, School of Science and Technology for Future Life
Bulletin of Tokyo Denki University, Arts and Sciences No.14 2016, pp. 105-109

Introduction

In any language it can often be unclear which word or phrase should be used in a given situation or, indeed, what the exact meaning of a given word is. Moreover, words which at first appear to have similar meanings and usages may actually be used in slightly different ways. They may have a pattern of usage that is unique to them. In the past, these kinds of situations were generally judged by intuition, because it simply was not feasible to do a meaningful study of this kind manually. However, given the advent of technology and corpus linguistics, it is now possible to study and analyse these patterns of usage.

In this paper, a look will be taken at the area of corpus linguistics. Firstly, a brief outline of what corpus linguistics is will be given. There will then be a description of some of the methodologies behind corpus research, with an emphasis placed on the word-based approach. An outline will then be given of collocation and of the measurements used to strengthen assumptions made from the collocations. Next, there will be a discussion of patterning, usage and phraseology in text. Finally, there will be a brief discussion of implications for the language teacher.

Corpus Linguistics

Corpus linguistics is a relatively new field of linguistic research. It involves collecting data (spoken, written, or both) and collating it into one or more text files. These text files are then searchable and the resulting data can be further studied for the purpose of linguistic research. Kennedy (1998:1) describes a corpus as ‘a body of written text or transcribed speech which can serve as a basis for linguistic research.’ An important point to remember, as pointed out by Hunston and Laviosa (2000), is that any information found from research done on a corpus is only applicable to the data studied. It cannot necessarily be applied to the language as a whole. They also point out that any results [...] corpus itself and that, when it comes to a corpus, the bigger it is the better. When measuring the size of a corpus, we are interested in the total word count. Aijmer and Altenberg (2001) describe corpus linguistics as ‘…the study of language on the basis of text corpora.’

A corpus, while having the potential to be limitless in size, is created for the explicit purpose of research and can be tailored to the study of one particular area, for example tabloid or broadsheet journalism, novels, radio broadcasts etc. This applicability to the area of study is, according to Leech (2001:9), what makes a corpus different from a large archive of random data.

Methodologies

There are two main methodologies used for the study of corpora. These are, according to Hunston and Laviosa (2001), category based and word-form based. A look will now be taken at both of these methodologies.

Category based

This approach to corpus data analysis, according to Hunston and Laviosa (2001), necessitates putting every word in the corpus into a particular category, such as verb, adjective, noun, conjunction etc., before any work can be carried out on the corpus. This can be done automatically by software known as a tagger. Hunston and Laviosa (2001) also point out that this process is not 100% foolproof and there may be some slight errors in the tagging of some words. This necessitates manual intervention by the researcher to correctly tag any words that were erroneously tagged by the tagging software. This work can be extremely time consuming depending on the size of the corpus being used and, as Hunston and Laviosa (2001:93) point out, it will inevitably be a significant factor in the size of corpus used.

This tagged corpus can now be easily searched for instances of words of any grammatical class. A key point to remember is that once the corpus has been annotated with the word tags for grammatical class it is no longer in its raw, unprocessed form. The result, according to Leech (2001:19), is that words are no longer searched for; instead it is ‘…grammatical abstractions…’ that are examined. This represents a slight shift in the assumed idea of how corpus research might normally be carried out. It allows for the comparison of categories, such as the usage of past and present tense in a selected corpus. This method is best represented by the pioneering work of Biber (1986 and 1988). It is worth noting that once a corpus has been tagged, it cannot be untagged. Therefore, it may be advisable for the researcher to make a back-up copy of the corpus before taking the step of tagging it.

Word-form based

According to Hunston and Laviosa (2001), this approach differs from the category based approach in that there is very minimal tagging of the corpus and any tagging done is fully automated; there is no manual intervention by the researcher to do, or amend, any tagging. The overall result of this difference in approach is that the subject of the study is moved away from the grammatical abstractions of the category based approach and instead the focus is placed on the individual words, or phrases, and the ways in which they act within the text.

The word-form based approach can help a researcher determine the different meanings which a word has and, furthermore, the patterns in which each meaning tends to occur. To help with this research, collocation is used.

Collocation

Hunston and Laviosa (2001) state that collocation is the propensity for words to occur near each other in a text. In other words, they co-occur, or they are co-located. However, they also point out that just because two words frequently occur near each other, this does not necessarily mean that there is a high significance to this co-occurrence. For instance, for any given word whose collocates are searched for, there is a high probability that it will collocate with some of the most frequently occurring words in the English language, e.g. the, a, etc. Therefore, the collocate list should not be taken at face value. Hunston (2002:68) states that collocation is: ‘…the tendency of words to be biased in the way they co-occur.’ To gain a true idea of the important collocates which a word has, two measurements are applied; these are mutual information and T-score. These will be discussed in a little more detail later.

When calculating the collocates of a word, the search is usually performed within the four words to the left and four words to the right of the search, or node, word. This space within which the search is performed is known as the span, an idea put forward by Sinclair et al (1970). As noted by Baker (2006:103), the size of the span will have a bearing on the collocates found. In other words, venturing into a bigger span increases the chances that words which are not true collocates will be included in the results.

Mutual Information

Mutual Information, henceforth referred to as MI score, is used to compare the number of actual occurrences of a word with the number of times that word was predicted to occur. Hunston (2002:71) says that ‘…MI score measures the amount of non-randomness present when two words occur.’ Hunston and Laviosa (2000:16-17) state that this gives a more accurate idea of the relationship between two words. They go on to say that MI score assesses the importance of a collocation and that it shows a clearer picture of the relationship between words than that given by a simple collocation list alone. It is a measurement of two-way attraction. Walter (2010:435) states that when a word that occurs infrequently collocates with another word, it is unlikely that this collocation happens by chance. However, according to Baker (2006:102), one drawback of MI score is that it tends to attach a high significance to words that occur rarely in a text, therefore giving somewhat misleading results. It is therefore not immediately clear how accurate, or usable, the results are. According to Hunston (2002), only MI scores of 3 or higher should be considered to be important. To help verify the importance of any given collocation, another measurement called T-score is calculated alongside MI score.

T-Score

This measurement takes into account evidence for the collocation throughout the corpus. Hunston (2002:72) points out that T-score is used to analyse and validate a collocation when we: ‘…need to know how much evidence there is for it…how certain we can be that the collocation is the result of more than vagaries of a particular corpus.’ This differs from MI score in that it gives a clearer insight into which words have a strong attraction to the node word, and words which do not occur frequently in the corpus are not given a high significance. Therefore, it is more explicit about the importance of a collocation. But as Hunston and Laviosa (2000) point out, T-score only shows the words which are important to the node word, not which words the node word is important to. It is a measurement of one-way attraction. According to Hunston (2002), a T-score of 2 or higher should be considered important.

Patterns

When talking about patterns in text, Hunston and Laviosa (2000) state that this refers to the grammatical patterns in which a word occurs. Regardless of whether a word is a noun, adjective, adverb, pronoun, preposition etc., it occurs in some form of grammatical pattern. These patterns can be analysed and coded into a standardised form. The coding used by Hunston and Laviosa (2000) is that which is also employed by Collins.

The analysing and coding of grammatical patterns helps to show how a word is used and ultimately shows the meaning, or meanings, which a word has in a given pattern, or context. According to Hunston and Laviosa (2000:29), Hunston (2002:138-139) and Sinclair (1991), these different meanings are generally highlighted by being part of differing grammatical patterns. Furthermore, as Hunston (2002:139) points out, a pattern is not necessarily exclusive to one meaning of a word. Differing meanings may share the same pattern; however, Hunston (2002:139) reassures us that the relationship between the pattern and the meaning still holds true.

Hunston and Laviosa (2000:28) also say that the study of patterns affords us the opportunity to verify whether or not our native speaker intuition is correct and allows for the recognising of a possible change in language behaviour earlier than may otherwise be possible.

Implications for the language teacher

Corpus linguistics has the potential to be a powerful tool in the arsenal of a teacher, whether or not the course in question is specifically linguistics related. In particular, a writing class is ideally suited to such study, as the teacher could set out rules for the type of files that students submit and dictate the format that file names should take. These files would be immediately ready for inclusion in a specialised corpus for both individual classes and a group of classes. This would allow the teacher to tailor future lessons to the needs of the students, as the corpus would help highlight any common or frequent errors and, hopefully, aid in discovering why this type of error was made. The corpus could also be student specific, which would greatly enhance the feedback that a teacher gives.

Creating a corpus for a communication course would, naturally, be more time consuming, but would also offer the same potential benefits. However, it would be quite difficult to create the type of student specific corpus mentioned above.

Bibliography

Aijmer, K. & Altenberg, B. (2001), ‘English Corpus Linguistics.’ Longman, London.

Baker, P. (2006), ‘Using Corpora in Discourse Analysis.’ Bloomsbury Academic, London.

Hunston, S. (2002), ‘Corpora in Applied Linguistics.’ Cambridge University Press, Cambridge.

Hunston, S. & Laviosa, S. (2000), ‘Corpus Linguistics.’ Birmingham: School of English, CELS.

Kennedy, G. (1998), ‘An Introduction to Corpus Linguistics.’ Longman, London.

Leech, G. (2001), ‘The state of the art in corpus linguistics.’ In Aijmer & Altenberg (2001), ‘English Corpus Linguistics.’ Longman, London.

Sinclair, J. (1991), ‘Corpus Concordance Collocation.’ Oxford University Press, Oxford.

Sinclair, J., Daley, R. and Jones, S. (1970), ‘English lexical studies.’ Report No. 5060, Office of Scientific and Technical Information, London.

Walter, E. (2010), ‘Using a corpus to write dictionaries.’ In O’Keeffe & McCarthy (eds.) ‘The Routledge Handbook of Corpus Linguistics.’ (2010).
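
The Mutual Information and T-Score sections give thresholds (an MI score of 3 or higher, a T-score of 2 or higher) but no formulas. As a hedged illustration only, the sketch below uses one common formulation, which is an assumption here and is not attributed to the authors cited: with O the observed number of co-occurrences inside the span, the expected count is E = f(node) × f(collocate) × window size / N, then MI = log2(O / E) and T = (O − E) / √O. The function names, the default span of four words either side, and the toy token list are hypothetical.

```python
import math
from collections import Counter


def collocate_counts(tokens, node, span=4):
    """Count every word found within `span` tokens to the left or right of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts


def mi_and_t(tokens, node, collocate, span=4):
    """Return (observed, expected, MI, T) for one node/collocate pair, or None.

    Assumed formulas (not given in the article):
      expected = f(node) * f(collocate) * (2 * span) / N
      MI       = log2(observed / expected)
      T        = (observed - expected) / sqrt(observed)
    """
    total = len(tokens)
    freq = Counter(tokens)
    observed = collocate_counts(tokens, node, span)[collocate]
    if observed == 0:
        return None  # the pair never co-occurs inside the span
    expected = freq[node] * freq[collocate] * (2 * span) / total
    mi = math.log2(observed / expected)
    t = (observed - expected) / math.sqrt(observed)
    return observed, expected, mi, t


if __name__ == "__main__":
    # Toy token list of 16 tokens in which 'strong' directly precedes 'tea' twice.
    sample = ["strong", "tea"] + ["x"] * 6 + ["strong", "tea"] + ["y"] * 6
    print(mi_and_t(sample, "tea", "strong", span=1))
```

Scores from a sample this small carry no weight; the thresholds quoted from Hunston (2002) are meant for real corpora. The sketch also ignores edge effects where the span runs past the start or end of the text.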
