Summary of Using Corpora in Discourse Analysis by Paul Baker
1. INTRODUCTION
This book is about using corpora and corpus processes in order to uncover linguistic patterns which can enable us to make sense of the ways that language is used in the construction of discourses. Some people may know a lot about discourse analysis but not about corpus linguistics; for others the opposite may be the case; for others still, both areas might be equally opaque. We will begin by giving a description of corpus linguistics and discourse.
Corpus linguistics
Corpus linguistics is the study of language based on examples of real-life language use. Corpora are generally large (consisting of thousands or even millions of words), representative samples of a particular type of naturally occurring language, and they can therefore be used as a standard reference against which claims about language can be measured. Electronic corpora are often annotated with additional linguistic information. Other types of information can also be encoded within corpora; for example, in spoken corpora (containing transcripts of dialogue), attributes such as sex, age, socio-economic group and region can be encoded for each participant. This allows comparisons to be made between the language of different types of speakers. Up until the 1970s only a small number of studies utilized corpus-based approaches, and it was in the 1980s that corpus linguistics became popular as a methodology. Between 1976 and 1991 corpus linguistics was employed in a number of areas of linguistics, including dictionary creation, as an aid to the interpretation of literary texts, forensic linguistics, language description, language variation studies and language teaching materials.
Discourse
The term discourse is used in social and linguistic research in a number of inter-related yet different ways. In traditional linguistics it is defined as language above the sentence or above the clause. The term discourse is also sometimes applied to different types of language use or topic; for example, we can talk about political discourse, colonial discourse, media discourse and environmental discourse. A number of researchers have used corpora to examine the discourse styles of learners of English. Discourse can also be defined as practices which systematically form the objects of which they speak. To expand: a discourse is a system of statements which constructs an object, a set of meanings, metaphors, representations, images, stories, statements and so on that in some way together produce a particular version of events. Therefore, discourses are not valid descriptions of people's beliefs or opinions, and they cannot be taken as representing an inner aspect of identity such as personality or attitude. They are connected to practices and structures that are lived out in society from day to day. Discourses can therefore be difficult to pin down or describe: they are constantly changing, interacting with each other, breaking off and merging. One way that discourses are constructed is via language. Language is not the same as discourse, but we can carry out analyses of language in texts in order to uncover traces of discourses.
The shift to post-structuralism
Discourse analysts have used corpora in order to analyse data such as political texts, teaching materials, scientific writing and newspaper articles. Such studies have shown how corpus analysis can uncover ideologies and evidence of disadvantage. Corpus-based techniques have also been employed in studies which have attempted to analyse differences in language usage based on identity. There are still only a small number of researchers applying corpus methodologies to discourse analysis; this is a cross-disciplinary field which is somewhat under-subscribed, and it appears to be subject to some resistance. All methods of research have associated problems which need to be addressed, and all are limited in terms of what they can and cannot achieve. One criticism of corpus-based approaches is that they are too broad. Some researchers have problematized corpora as constituting "linguistics applied" rather than applied linguistics; for example, Widdowson claims that corpus linguistics only offers a partial account of real language because it does not address the lack of correspondence between corpus findings and native speaker intuitions. Researchers should therefore be encouraged to carry out corpus-based work which takes such potential problems into account, perhaps supplementing their approach with other methodologies. There is no reason why corpus-based research on lexical items should not use diachronic corpora in order to track changes in word meaning and usage over time, and several large-scale corpus-building projects have been carried out with the aim of creating historical corpora from different time periods.
Corpus linguistics also tends to be conceptualized as a quantitative method of analysis. Before the 1980s, corpus linguistics had struggled to make an impact upon linguistic research because computers were not sufficiently powerful or widely available to put its theoretical principles into practice. By the 1980s, an alternative means of producing knowledge had become available, loosely based around the concept of post-modernism and referred to as post-structuralism or social constructionism. Post-structuralists have developed close formulations between the concepts of language, ideology and hegemony, based on the work of a number of writers (for example Gramsci). One area in which corpus linguistics has excelled has been in generating descriptive grammars of languages based on naturally occurring language use, but this focuses on language as an abstract system. The corpus linguistics approach can also be perceived as time consuming: large numbers of texts must first be collected, their analysis often requires learning how to use computer programs to manipulate data, access to corpora is not always easy, and it is often simply less effort to collect a smaller sample of data which can be transcribed and analysed by hand, without the need for computers or mathematical formulae.
Advantages of the corpus-based approach to discourse analysis
As well as helping to restrict bias, corpus linguistics is a useful way to approach discourse analysis because of the incremental effect of discourse. Discourses are circulated via language use, and the task of discourse analysis is to uncover how language is employed so as to reveal the underlying discourses. By becoming more aware of how language is drawn on to construct discourses, or various ways of looking at the world, we should become more resistant to attempts by texts to manipulate us by suggesting what is "common sense" or "accepted wisdom". A single word, phrase or grammatical construction may suggest the existence of a discourse, but it can sometimes be difficult to tell whether such a discourse is typical or not. A word, phrase or construction may trigger a cultural stereotype. A lot of human communication is not a matter of free choice but is constrained by normativities which are determined by patterns of inequality. Consulting a large corpus of general British English, we find that the words "confined" and "wheelchair" have fairly strong patterns of co-occurrence: the phrase "confined to a wheelchair" occurs 45 times in the corpus, while the more neutral term "wheelchair user(s)" occurs 37 times. There are enough cases to suggest that one discourse of wheelchair users constructs them as being deficient in a range of ways. Every time we read or hear a phrase like "wheelchair bound" or "despite being in a wheelchair", our perception of wheelchair users is influenced in a certain way.
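The kind of co-occurrence count described above can be sketched in a few lines of Python. The tiny corpus and the counts below are invented for illustration; they stand in for the large reference corpus of British English that the text mentions.

```python
import re

# A toy stand-in for a large reference corpus (sentences invented).
corpus = " ".join([
    "He was confined to a wheelchair after the accident.",
    "She is a wheelchair user and a campaigner.",
    "The wheelchair user arrived early.",
    "Despite being confined to a wheelchair, he travelled widely.",
])

def phrase_count(text, phrase):
    """Count case-insensitive occurrences of a phrase in running text."""
    return len(re.findall(re.escape(phrase), text, flags=re.IGNORECASE))

print(phrase_count(corpus, "confined to a wheelchair"))  # 2
print(phrase_count(corpus, "wheelchair user"))           # 2
```

Run against a real corpus, the two counts could then be compared, as Baker does with the 45 versus 37 figures.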
Resistant and changing discourses
While repeated patterns of language use demonstrate evidence of particular hegemonic discourses, or majority "common-sense" ways of viewing the world, corpus data can also reveal the opposite. Discourses are not static; they continually shift position. A hegemonic discourse of ten years ago may be viewed as a resistant or unacceptable discourse today, and this can be shown by looking at changing frequencies of word use in a diachronic corpus, or by comparing more than one corpus containing texts from different time periods. If we compare two comparable corpora of British English containing written texts from the early 1960s and the 1990s, we see that in the 1990s corpus there are various types of words which occur much more frequently than in the 1960s corpus. In addition, we can find that certain terms have become less frequent: girl, Mr and Mrs were more popular in the 1960s than in the 1990s, suggesting that perhaps sexist discourses or formal ways of addressing people have become less common. It may also be that a word is no more or less frequent than it used to be, but that its meanings have changed over time. For example, in the early-1960s corpus the word "blind" appears in a literal sense, referring to people or animals who cannot see, while in the 1990s corpus it is used in a range of more metaphorical (and negative) ways.
Triangulation
Tognini-Bonelli (2001) makes a useful distinction between corpus-based and corpus-driven investigations. A corpus-driven analysis proceeds in a more inductive way: the corpus itself is the data, and the patterns in it are noted as a way of expressing regularities in language.
Triangulation is a term coined by Newby in 1977 and is now accepted by most researchers. There are several advantages to triangulation: it facilitates validity checks of hypotheses, it anchors findings in more robust interpretations and explanations, and it allows researchers to respond flexibly to unforeseen problems and aspects of their research.
Some concerns
Although corpus linguistics is a useful method of carrying out discourse analysis, there are still a few concerns which it is necessary to discuss.
First, corpus data is usually only language data (written or transcribed speech), and discourses are not confined to verbal communication. Discourses can be embedded within images; for example, pictures of heterosexual couples often occur in advertising. In many cases discourses are produced via the interaction between verbal and visual texts. The social conditions of production and interpretation of texts are important in helping the researcher understand the discourses surrounding them. Researchers may choose to interpret a corpus-based analysis of language in different ways, depending on their own positions. For example, people from socially disadvantaged groups tend to use more non-standard language and taboo terms than those from more advantaged groups: in this case such terms help to signal group membership and identity.
A corpus-based analysis will tend to place the focus on patterns. However, frequent patterns of language do not always imply underlying hegemonic discourses. The power of individual texts or speakers in a corpus may not be evenly distributed. General corpora are often composed of data from numerous sources (newspapers, novels, letters, etc.). We may be able to annotate the texts in a corpus to take into account aspects of production and reception, such as author occupation/status or readership, but this will not always be possible. A hegemonic discourse can be at its most powerful when it does not even have to be invoked, because it is simply taken for granted.
A corpus-based analysis of language is only one possible analysis out of many, and is open to contestation. It is an analysis which focuses on norms and frequent patterns within language. There can also be analyses of language that go against the norms of the corpus data. Corpus linguistics does not provide a single way of analysing data: there are numerous ways of making sense of linguistic patterns, including collocations, keywords, frequency lists, clusters and dispersion plots. We may decide, for example, to investigate co-occurrences in a corpus in relation to how discourses are formed.
2. CORPUS BUILDING
Introduction
One of the potential problems with using corpora in the analysis of discourse is that the data are decontextualized. The relationship between different texts in a corpus, or between sentences in the same file, may be obscured in quantitative analyses. On the other hand, the process of finding and selecting texts, obtaining permissions, transferring them to electronic format, and checking and annotating files may also provide the researcher with initial hypotheses as certain patterns are noticed, and such hypotheses can form the basis for the first stages of corpus research.
Capturing data
There are good reasons for building a specialized corpus. One of the easiest ways to collect corpus texts is to use data which already exist in electronic format. For example, the United Kingdom Parliament website contains full transcripts of daily debates from the British House of Lords and House of Commons. There are also many internet archives: Bibliomania, the Oxford Text Archive, the Electronic Text Centre, etc. It is possible to save files in formats which retain the images, styles and layout of the page. A problem with saving files as plain text is that we need to assume that all of the language data we are collecting will be recognizable in plain text, which is not always the case. One problem with saving an entire page from a website is that we may end up with unwanted text such as menus, titles or links to other pages. Once a site has been copied, it may still be necessary to strip the files of unwanted text, and some websites are constructed in order to prevent copiers from taking their content in this way.
Scanning and keying in
If texts cannot be obtained from the internet, there may be other ways that they can be collected electronically. For example, some British newspapers (such as The Guardian or The Independent) publish CD-ROM archives of their texts. If existing electronic sources are unavailable, two other options present themselves. The first involves converting paper documents by running them through a scanner with OCR (optical character recognition) software. In general, the texts which respond best to OCRing are those published in a straightforward format. The usual last resort of the corpus builder is to key in the text by hand; there are numerous companies which offer keying-in services.
Spoken texts
Certain types of texts will present problems to the corpus builder. Written data is generally much easier to obtain than spoken data. Conversations and monologues need to be transcribed by hand, and there is a range of information to consider: prosodic information, paralinguistic information, non-linguistic data, pauses, etc. Sometimes archives containing transcripts of spoken data are already in existence. These transcripts may have been cleaned or glossed in order to remove or limit the effects of interruptions such as false starts and hesitations. Scripted data does not always reflect how people really speak.
Online texts
Different types of problems need to be overcome when collecting data from the internet, particularly from archives of written texts that were originally published elsewhere. One growing area of interest is the language which occurs in text messages, emails, chat rooms, bulletin boards and newsgroups. It may be a good idea to save different versions of such a corpus: one which retains everything in the format it originally occurred in, and another which only contains unique "first-time" entries.
Permissions
Before texts are copied into a corpus database, compilers must seek and gain the permission of the authors and publishers who hold copyright for the work, or the informed consent of individuals whose right to privacy must be recognized. Obtaining signed permissions can often be a slow and complex task, as individual permission must be gained for each text that is placed in the corpus. Commercial corpora are often large and contain representative texts from many sources. Different publishers and funding bodies may vary in their attitude towards the necessity of permissions, but in all cases obtaining permission helps to safeguard the researcher.
Annotation
It is usually recommended that corpus builders apply some form of annotation scheme to their text files in order to aid analysis and keep track of the structure of the corpus, because the conventions for representing typographical features in electronic texts can vary depending on the software used to edit the text. One such system is Standard Generalized Markup Language (SGML), created in the 1980s as a standard way of encoding electronic text by using codes to define typeface, page layout, etc. In general the codes are enclosed between less-than and greater-than symbols: < >. The Text Encoding Initiative (TEI) is a related system, developed in the 1990s, which specifies a set of SGML codes to be used for different types of text mark-up. Corpus analysis packages tend to be capable of handling SGML codes, but they may be less equipped to deal with an ad hoc coding system created by a researcher working alone.
Headers can be a useful form of record keeping, particularly if a corpus consists of many files from different sources, created at different times. Headers may also contain information about the author or genre of the file. Other meta-linguistic information that could appear in headers for written files includes publication date, medium of text, level of difficulty, audience size, and the age and sex of the author and target audience.
Grammatical annotation is one procedure that is commonly applied to corpora at some stage towards the end of the building process; it can be useful in that it enables corpus users to make more specific analyses.
The important point is that different forms of annotation are often carried out on corpora and can result in more sophisticated analyses of data, but this is not compulsory.
Obtaining access to a reference corpus can be helpful for two reasons. First, reference corpora are large and representative enough of a particular genre of language that they can themselves be used to uncover evidence of particular discourses. Secondly, a reference corpus acts as a good benchmark of what is normal in language, against which your own data can be compared. We can compare a large reference corpus to a smaller corpus in order to examine which words occur in the smaller corpus more frequently than we would normally expect them to occur by chance alone. Access to a reference corpus is therefore potentially useful for carrying out discourse analysis, even if the corpus itself is not the main focus of the analysis. A perhaps more problematic issue is that of gaining access to corpora at all; some researchers will be at an advantage here. Some corpus builders allow users limited access for a trial period before buying, or offer smaller samples of their corpus.
3. FREQUENCY AND DISPERSION
Introduction
Frequency is one of the most central concepts underpinning the analysis of corpora. Frequency lists can be employed to direct the researcher to investigate various parts of a corpus; measures of dispersion can reveal trends across texts; and frequency data can help to give the user a sociological profile of a given word or phrase, enabling greater understanding of its use in particular contexts. Related to the concept of frequency is that of dispersion.
Join the club
Frequency and dispersion can be employed on a small corpus of data, for example a corpus consisting of 12 leaflets advertising holidays, published in 2005, with the goal of investigating discourses of tourism.
Holiday brochures are an interesting text type to analyse because they are an inherently persuasive form of discourse: their main aim is to ensure that potential customers will be sufficiently impressed to book a holiday.
Frequency counts
Using the corpus analysis software WordSmith, a word list of the 12 text files was obtained. A word list is a list of all of the words in a corpus, along with their frequencies and the percentage contribution that each word makes to the corpus. The most frequent words in the corpus are grammatical words: pronouns, determiners, conjunctions, prepositions. Among the lexical words there are words describing holiday residences (studios, facilities, apartments) and other attractions (beach, pool, club).
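A WordSmith-style word list is essentially a token-frequency table with percentages. A minimal sketch in Python, using an invented one-sentence stand-in for the brochure corpus:

```python
from collections import Counter
import re

# Toy stand-in for the holiday-brochure corpus; a real analysis would
# read the 12 leaflet text files instead.
text = "Relax by the pool, enjoy the beach bar and the club. The studios have facilities."

tokens = re.findall(r"[a-z']+", text.lower())
freq = Counter(tokens)
total = len(tokens)

# Word list: each word, its raw frequency, and its percentage of the corpus.
for word, count in freq.most_common(5):
    print(f"{word:12} {count:3} {100 * count / total:5.2f}%")
```

As in the brochure corpus, the grammatical words ("the" here) dominate the top of the list.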
Considering clusters
We also need to consider frequencies beyond single words. Using WordSmith it is possible to derive frequency lists for clusters of words. BAR and CLUB are the most frequent lexical lemmas in the holiday corpus, and they are also the only lemmas in the top ten that relate to alcohol. We can also consider another class of words, verbs, which play a particular role in tourist discourse: in the holiday corpus the most frequent verbs occur in imperative clusters.
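Cluster lists of the kind WordSmith produces are n-gram frequency counts, which can be sketched as follows (the example tokens are invented, not from the holiday corpus):

```python
from collections import Counter

def clusters(tokens, n):
    """Frequency of every n-word cluster (n-gram) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "have fun at the bar have fun at the pool".split()
for cluster, count in clusters(tokens, 3).most_common(2):
    print(" ".join(cluster), count)
```

Sorting the resulting counts then surfaces recurring multi-word patterns such as imperative clusters.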
Dispersion plots
Another way of looking at a word is to think about where it occurs within individual texts and within the corpus as a whole. A dispersion plot gives a visual representation of where a search term occurs in the corpus. The plot can also be standardized so that each file in the corpus appears to be of the same length; this is useful in that it allows us to compare where occurrences of the search term appear across multiple files.
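The standardization step can be sketched by mapping each hit position onto a 0-1 range, so files of different lengths become comparable (the tokens below are invented):

```python
def dispersion(tokens, word):
    """Positions of `word`, standardized to the 0-1 range so that texts of
    different lengths can be plotted side by side."""
    return [i / (len(tokens) - 1) for i, t in enumerate(tokens) if t == word]

tokens = "club night then beach then club again".split()
print(dispersion(tokens, "club"))  # one value near 0.0, one near the end
```

Plotting these values as tick marks on a line, one line per file, reproduces the familiar dispersion-plot display.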
By examining the frequency list, the most frequent informal terms in the corpus were collected and presented in a table; it was necessary to explore the context of some words in detail in order to remove occurrences that were not used in a colloquial or informal way. Most of the terms occurred more often in written rather than spoken British English, although the spoken texts tend to contain the more informal meanings of the words. To argue that the authors of the holiday leaflets use informal language in order to index youthful identities, we need to assume that they believed such language to be typical of this identity and that the target audience would read the leaflets in the same way. By using a form of language which is strongly associated with youthful identities, the audience may feel that they are being spoken to in a narrative voice that they find desirable, or are at least comfortable with. The use of colloquialisms also contributes to the normalization of certain types of youthful identities: it suggests a shared way of speaking for young people, and those who do not use informal language may be alerted to a discrepancy between their linguistic identities and those of the people featured in the brochure.
Conclusion
The analysis of frequent lexical lemmas revealed some of the most important concepts in the corpus, and a more detailed analysis of clusters and individual instances containing these terms revealed some of the ways that holidaymakers were constructed. By investigating how the high-frequency informal language occurred in a reference corpus of spoken British English, we were able to gain evidence with which to create hypotheses about how the readership of the holiday leaflets was constructed.
4. CONCORDANCES
Introduction
A concordance analysis is one of the most effective techniques for allowing researchers to carry out a sort of close examination. A concordance is simply a list of all the occurrences of a particular search term in a corpus, and is also sometimes referred to as key word in context or KWIC, although it should be noted that "key word in context" has a different meaning from the concept of keywords. Here key word simply means the word that is currently under examination, which can be any word that takes the interest of the researcher. In order to demonstrate how concordances can be of use to discourse analysis, we need to carry out an examination of a new set of data: a corpus of newspaper articles, one of the easiest text types to collect. The relative ease with which newspaper data can be appropriated for corpus use suggests that it should be employed with care rather than overused; nevertheless, newspapers are a very fertile site for the production and reproduction of discourses. Journalists are able to influence readers by producing their own discourses or helping to reshape existing ones, although texts can only take on meaning when consumers interact with them. Discourses within newspapers are usually the result of collaboration between multiple contributors, and single articles may express a variety of views on the same subject. When using a corpus of newspaper articles it is therefore important to bear in mind that the processes of production and reception of any particular article are complex and multiple.
Refugees are a particularly interesting subject to analyse in terms of discourse because they constitute one of the most relatively powerless groups in society. One aspect of this conceptualization of discourse, as relating to ways of looking at the world, is that it enables or encourages a critical perspective on language and society. Minority groups are frequent topics of political talk and text, but have very little control over their representation in political discourse. In the media, refugees are rarely able to construct their own identities and discourses; instead they have such identities and discourses constructed for them by more powerful spokespeople.
In order to construct the corpus of newspaper articles, an internet-based archive called NewsBank was used, which contains articles from a large variety of British broadsheet and tabloid newspapers, including the Daily Mail, Daily Mirror, The Guardian and The Times. Only articles published in 2003 which included the words refugee or refugees were considered. We first need to scan the concordance lines, trying to pick out similarities in language use, by looking at the words and phrases which occur to the left- and right-hand sides of the terms refugee and refugees.
Sorting concordances
The concordance lines are first presented to us in the order in which they occur in the corpus. We can then sort the list alphabetically on one or more places to the left or right of the search term.
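A KWIC concordance with a right-hand sort can be sketched as follows; the example sentence is invented, standing in for the refugee corpus:

```python
def concordance(tokens, node, width=4):
    """KWIC lines: the node word with `width` tokens of context on each side."""
    lines = []
    for i, t in enumerate(tokens):
        if t == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, t, right))
    return lines

tokens = "the refugees were housed in camps while more refugees arrived by boat".split()
lines = concordance(tokens, "refugees", width=2)
# Sort on the right-hand context (an R1 sort), grouping similar patterns.
lines.sort(key=lambda line: line[2])
for left, node, right in lines:
    print(f"{left:>15} | {node} | {right}")
```

Sorting in this way brings recurring right-hand patterns next to each other, which is what makes shared phraseology visible in a long concordance.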
Refugees are also constructed in terms of metaphors which present them as transported goods or packages, again as a token of their dehumanization.
Carrying out further sorts on the concordance didn't reveal any more interesting patterns or clues about discourses. One possible avenue of research is simply to consider all of the concordance lines that have not already been used to demonstrate the discourses of refugees. Some of the concordance lines are longer than usual, meaning that more context needed to be taken into account before patterns of meaning could be derived from the concordance.
Our concordance-based analysis of the terms refugee and refugees in the small corpus of newspaper articles has been useful in revealing a range of discourses: refugees as victims, as recipients of official attempts to help, as a natural disaster and as a criminal nuisance. A concordance analysis also elucidates semantic preference, the relation between a lemma or word form and a set of semantically related words. Semantic preference also occurs with multi-word units; it is therefore related to the concept of collocation, but focuses on a lexical set of semantic categories rather than a single word or a related set of grammatical words. Semantic preference is in turn related to the concept of discourse prosody, where patterns in discourse can be found between a word, phrase or lemma and a set of related words that suggest a discourse. The difference between semantic preference and discourse prosody is not always clear-cut. Semantic preference denotes aspects of meaning which are independent of speakers, whereas discourse prosody focuses on the relationship of a word to speakers and hearers, and is more concerned with attitude. Another term, semantic prosody, has been used by other researchers in a way which makes it akin to discourse prosody; we can use these terms when analysing the language used in particular types of phrases. A corpus-based approach is useful in that it helps to give a wider view of the range of possible ways of discussing refugees. Corpus data can help to establish which language strategies are most frequent or popular (for example, the refugees-as-water metaphor was found to be much more frequent than other metaphors).
Points of concern
A concordance analysis is one of the more qualitative forms of analysis associated with corpus linguistics: it is the responsibility of the analyst to recognize linguistic patterns and also to explain why they exist. One aspect of concordance analysis that needs to be considered is that, when carrying out searches on a particular subject, as well as euphemisms and similes for that subject, it may also be the case that the subject is referred to numerous times with determiners or pronouns. However, concordances of pronouns and determiners are likely to include many irrelevant examples.
Step-by-step guide to concordance analysis
5. COLLOCATES
Introduction
Carrying out a close analysis of search terms via a concordance can be helpful in revealing traces of discourses within texts; however, concordances can in some cases consist of hundreds or even thousands of lines. Researchers can rely on sampling methods, which are helpful in reducing the length of time spent on analysis, but a problem is that sampling may also fail to reveal salient aspects of the concordance. Another problem is that patterns are not always as clear-cut in a concordance as we would like them to be. In the British National Corpus all words co-occur with each other to some degree. When a word regularly appears near another word and the relationship is statistically significant, such co-occurrences are referred to as collocates, and the phenomenon of certain words frequently occurring next to or near each other is collocation. Collocation is a way of understanding meanings and associations between words which are otherwise difficult to ascertain from a small-scale analysis of a single text: words take on meaning from the contexts they occur in. The aim here is to explore how discourse analysis can be carried out by focusing primarily on collocation. In order to carry out such an analysis it is useful to examine the usage of words in a corpus; the British National Corpus is a large corpus which we can view as being more or less representative of general British English.
Deriving collocates
There are a number of different procedures by which collocation can be calculated. The simplest is to count the number of times a given word appears within, say, a five-word window to the left or right of a search term. If we use this procedure we get a list of words; one problem with this technique is that the highest-frequency words tend to be function words, which do not always reveal much of interest, particularly in terms of discourse. A number of statistical tests take into account the frequencies of words in a corpus and their relative number of occurrences both next to and away from each other. One such test is called mutual information (MI). Mutual information is calculated by examining all of the places where two potential collocates occur in a text or corpus; an algorithm then computes the expected probability of these two words occurring near to each other, and compares it with their observed co-occurrence.
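A simplified sketch of the MI score follows. This is the basic pointwise formula (real implementations such as WordSmith's also factor in the size of the collocation window), and all the counts in the example are invented for illustration:

```python
import math

def mutual_information(joint, f1, f2, corpus_size):
    """Pointwise mutual information: log2 of the observed co-occurrence
    frequency over the frequency expected if the two words were independent.
    Scores of 3 or above are conventionally taken to suggest collocation."""
    expected = f1 * f2 / corpus_size
    return math.log2(joint / expected)

# Invented counts: in a 1,000,000-word corpus, 'eligible' occurs 50 times,
# 'bachelor' 200 times, and the two co-occur 20 times.
mi = mutual_information(20, 50, 200, 1_000_000)
print(round(mi, 2))  # well above the usual cut-off of 3
```

Because the formula rewards rare-but-exclusive pairings, MI tends to highlight distinctive lexical collocates rather than frequent function words.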
The word bachelor occurs more frequently in the corpus than spinster. Examining concordances which contain bachelor along with some of its strongest collocates, it is clear that many of them relate to having a degree (e.g. bachelor of arts). Here the meaning of bachelor (a type of degree) is different from the meaning we are concerned with (a man who has not married). Homonyms are a rare and accidental phenomenon; polysemy, where two words with the same spelling have interrelated meanings, is much more common. While the collocates of bachelor which suggest a meaning of university education no longer have the same association with bachelor as unmarried man, the two meanings are perhaps due to historical polysemy rather than being accidental homonyms. What we see with the strongest collocates of bachelor is a somewhat dualistic picture of discourse. A young bachelor receives a positive discourse prosody, connected to living a happy, possibly urban existence; this is supported by an analysis of the collocates days, life, eligible and party. The positive discourse prosody is tied to the fact that the bachelor life is expected to be a short-term situation. When bachelorhood becomes a long-term state, it is viewed as more problematic: it is repeatedly characterized in the corpus by poverty, eccentricity, old age and loneliness. There is an implication that there is something wrong or unfortunate about a man who goes through his whole life without marrying.
Resistant discourses
A collocational analysis has shown us some of the most salient discourses around the different ways of referring to bachelors and spinsters. A collocational analysis is useful for two reasons. First, it provides a focus for our initial analysis, which is particularly helpful when a large number of concordance lines need to be sorted multiple times in order to reveal lexical patterns. Secondly, it gives us the most salient lexical patterns surrounding a subject, from which a number of discourses can be obtained. When two words frequently collocate, there is evidence that the discourses surrounding them are particularly powerful, perhaps to the point where even one half of the pair is likely to prime someone who hears or reads that word to think of the other half. Collocates can act as triggers, suggesting unconscious associations which are ways that discourses can be maintained.
Corpus data gives us one way of understanding language, based on what is typical.
Collocates may also contain traces of resistant discourses, which are worth exploring in the
remaining concordance lines.
Collocational networks
6 KEYNESS
Frequency revisited
A frequency list can help to provide researchers with the lexical foci of any given corpus. Investigating the reasons why a particular word appears so frequently in a corpus can help to reveal the presence of discourses, especially those of a hegemonic nature. Compiling a frequency list is an important first step in gaining an idea of what to focus on. Simple frequency, however, has limitations.
The next case study examines political debates on fox hunting in the British House of Commons. Politicians are aware that they are playing a language game with huge consequences, and that they must appear to speak with authority and conviction: they have often developed a style of speaking which is opaque, vague or empty. A corpus of parliamentary debates on the issue was built. The majority of Commons members voted for the ban to go ahead, although in each debate a range of options could be debated and voted upon. The first step is to create a word list of the fox-hunting corpus. The corpus size was 129,798 words. The most frequent words tend to be grammatical items such as determiners, prepositions and conjunctions. Sometimes grammatical items in themselves can be indicative of a particular discourse, for example if a conjunction like "and" is repeatedly used to stress a connection between two objects of discussion. The most frequent lexical words are perhaps more interesting; here we find words that we would have expected or guessed to appear. There are terms of address associated with the context of a parliamentary debate (Mr, friend, right), other words associated with the parliamentary context (house, minister) and words connected with the subject under discussion. One way of finding out which lexical items are interesting in a frequency list is to compare two or more lists. If a word occurs comparatively more often in, say, a corpus of modern English children's stories than in the British National Corpus, we could conclude that such a word has high saliency in the genre of children's stories and is worth investigating in further detail.
The corpus was therefore split into two: the speech of all of the people who voted to ban fox hunting, and the speech of those who voted for hunting to remain legal. Some words appear in both lists, and almost all are connected either to the subject under discussion or to the context where the debates took place: parliament.
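The first step described above, building a frequency word list and setting aside function words to surface lexical items, can be sketched as follows. This is a toy illustration (the sentence and stopword set are invented), not the implementation used by any corpus tool:

```python
from collections import Counter

def word_list(text, stopwords=frozenset()):
    """Build a frequency list, optionally filtering out
    high-frequency grammatical (function) words."""
    tokens = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    tokens = [w for w in tokens if w and w not in stopwords]
    return Counter(tokens).most_common()

debate = ("the house must decide whether hunting with dogs "
          "should be banned and whether the ban can be enforced")
function_words = {"the", "and", "with", "be", "can", "should", "must", "whether"}
print(word_list(debate))                  # grammatical items dominate the top
print(word_list(debate, function_words))  # lexical words now surface
```

The second call shows why filtering matters: once determiners, conjunctions and modals are removed, subject-related words like hunting, dogs and ban rise to the top of the list.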
Introducing keyness
Using WordSmith it is possible to compare the frequencies in one wordlist against another in order to determine which words occur statistically more often in wordlist A when compared with wordlist B, and vice versa. All of the words that occur more often than expected in one file when compared with the other are then compiled into another list, called a keyword list. This list gives a measure of saliency, whereas a simple word list only provides frequency. WordSmith takes into account the size of each sub-corpus and the frequencies of each word within them. Keyword lists tend to show up three types of words: first, proper nouns; secondly, nouns, verbs, adjectives and adverbs; and finally, high-frequency grammatical words.
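One common statistic for this comparison is Dunning's log-likelihood score (WordSmith's keyness statistic is configurable, so this is a representative formulation rather than its exact internals). The sub-corpus sizes below are invented for illustration; the 38-versus-2 figure for "criminal" comes from the analysis discussed in this chapter:

```python
import math

def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood keyness score for one word, given its raw
    frequency in sub-corpus A and sub-corpus B and the sizes of
    the two sub-corpora. Higher scores mean the word's frequency
    difference is less likely to be due to chance."""
    total = size_a + size_b
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# "criminal": 38 times in pro-hunt speech, twice in anti-hunt speech
# (sub-corpus sizes are invented; only their sum matches the 129,798 total)
print(log_likelihood(38, 2, 60000, 69798))
# an evenly distributed word scores near zero
print(log_likelihood(5, 5, 60000, 69798))
```

Note how the calculation normalizes by sub-corpus size, which is exactly why a keyword list measures saliency rather than raw frequency.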
Analysis of keywords
The majority of the keywords found are of the "aboutness" variety, in both parts of the list. The word "criminal" is used by those who were opposed to a ban on hunting: it occurs 38 times in the collective speech of the pro-hunters and only twice in the speech of the anti-hunters. It is necessary to examine individual keywords in more detail, by carrying out concordances of them and looking at their collocates. When a concordance of "criminal" was carried out on the corpus data, it was found that common phrases containing the word "criminal" included the criminal law, a criminal offence, criminal sanctions, etc.
The lemma MAKE seems to be a relatively important collocate of "criminal". Looking at the concordance of the word "criminal", there are other concordance lines which suggest a similar pattern but do not include MAKE.
Terms like "invoke" or "impose" are rhetorical strategies used to support a particular discourse position. The word "dogs" occurs 182 times in the speech of the anti-hunters and 74 times in the speech of those who want hunting to remain legal. A concordance of the phrase "use of dogs" was carried out for the whole corpus. The keyword list has given us a small number of words to examine, and once the proper nouns have been discounted this leaves us with just 16 words in total. Finally, consider another keyword used by pro-hunt speakers: practices. This word is interesting because it is difficult to determine exactly what it means. It occurs as a plural (veterinary practices, slaughter practices, livestock practices, etc.). The term is therefore used to refer to a multitude of techniques connected to animals, and it also creates an association with non-lethal ways of dealing with animals.
So far our keyword analysis has been based on the idea that there are two sides to the debate, and that by comparing one side against the other we are likely to find a list of keywords which will then act as signposts to the underlying discourses within the debate on fox hunting. Our analysis so far has uncovered some interesting differences between the two sides of the debate. We would need to separate all of the speech in the different debates into different files; the task of creating these files can be off-putting and in any case is not always necessary.
In terms of proportions, taking into account the relative sizes of the sub-corpora, the anti-hunt speakers actually used, for example, the term "cruelty" less than the pro-hunters. Examining this word in more detail, it becomes apparent that it occurs with notable frequency on each side of the debate.
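The proportional comparison behind this observation is a simple normalization: raw counts from sub-corpora of different sizes are converted to a common scale, such as occurrences per 10,000 words. The figures below are invented purely to illustrate how a larger raw count can be proportionally smaller:

```python
def per_10k(freq, corpus_size):
    """Normalize a raw frequency to occurrences per 10,000 words,
    so sub-corpora of unequal size can be compared fairly."""
    return freq * 10000 / corpus_size

# invented figures: a word used 120 times in an 80,000-word sub-corpus
# versus 70 times in a 40,000-word sub-corpus
anti = per_10k(120, 80000)  # 15.0 per 10,000 words
pro = per_10k(70, 40000)    # 17.5 per 10,000 words
print(anti < pro)           # the smaller raw count is proportionally higher
```

This is why raw frequency lists alone can mislead: the side that mentions a word more often in absolute terms may still use it less often relative to how much it speaks.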
Comparing a smaller corpus or set of texts to a larger reference corpus is therefore a useful way of determining key concepts across the smaller corpus as a whole. For many studies where the text or set of texts under scrutiny is relatively uniform, using a reference corpus may be all that is needed. Using a reference corpus may also be useful in revealing those words that are under-represented in the data. When comparing a smaller corpus with a reference corpus, WordSmith also gives a list of all the negative keywords, although this list does not take into account words which appeared zero times in the small corpus. Negative keywords can help to show topics or markers of style which are not favoured in a corpus, which in itself can be illuminating.
Key clusters
Another way of spotting words which occur frequently in two comparable sets of texts but may be used for different purposes is to focus on key clusters of words. Using WordSmith it is possible to derive wordlists of clusters of words, rather than single words. WordSmith allows the user to specify the size of the cluster under examination; generally, the larger the cluster size we specify, the smaller the number of key clusters that are produced. Taking a cluster size of three, a list of key clusters was obtained by comparing the speech of pro-hunters with that of those who were against hunting. This list contained some interesting clusters. When reporting an analysis of keyness, it is worth mentioning dispersion, particularly in cases like this where dispersion brings up something unexpected. This requires a closer analysis of words and phrases in the corpus, rather than simply recounting frequencies from wordlists.
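Deriving a cluster (n-gram) frequency list of the kind described above can be sketched as follows. The sliding-window approach is standard; the example sentence is invented, and this is not WordSmith's implementation:

```python
from collections import Counter

def clusters(tokens, size=3):
    """Frequency list of word clusters (n-grams) of a given size.
    Larger cluster sizes generally yield fewer repeated clusters."""
    grams = [" ".join(tokens[i:i + size])
             for i in range(len(tokens) - size + 1)]
    return Counter(grams)

speech = "the use of dogs in hunting and the use of dogs in racing".split()
print(clusters(speech, size=3).most_common(3))
```

Repeated three-word clusters such as "the use of" and "use of dogs" surface immediately, which is how recurrent phraseology (rather than single keywords) can be compared across the two sides of a debate.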
Key categories
While a simple keyword list will reveal differences between sets of texts or corpora, it is sometimes the case that lower-frequency words will not appear in the list because they do not occur often enough to make a sufficient impact. This may be a problem, as low-frequency synonyms tend to be overlooked in a keyword analysis. Finding key categories could help to point to the existence of particular discourse types, and they would be a useful way of revealing discourse prosodies. In order for such analyses to be carried out, it is necessary to undertake the appropriate forms of annotation. The automatic semantic annotation system used to tag the fox-hunting corpus was USAS (the UCREL Semantic Analysis System). Tags can be assigned a number of plus or minus codes to show where a meaning resides on a binary or linear distinction. Once the semantic annotation had been carried out, word lists of the two sides of the fox-hunting debate were created and compared with each other to create a keyword list. From this list, the relevant key semantic tags were singled out for analysis.
Conclusion
A keyword list is a useful tool for directing researchers to significant lexical differences between texts. Carrying out comparisons between three or more sets of data, grouping infrequent keywords according to discursive similarity, showing awareness of keyword dispersion, and carrying out analyses of key clusters will enable researchers to obtain a more accurate picture of how keywords function in texts. Keywords can reveal a great deal about frequencies in texts which is unlikely to be matched by researchers' intuition. As with all statistical methods, how researchers choose to interpret the data is ultimately the most important aspect of corpus-based research.
7 BEYOND COLLOCATION
Introduction
In this chapter we focus on aspects of discourse analysis which are more concerned with grammatical rather than lexical patterns. We will be considering a number of ways in which a more grammar-based analysis can be of value to researchers looking at discourse via corpora. We look at a single term, the lemma ALLEGE, and its forms. The analysis of this lemma was inspired by reading an article on a news website about an alleged rape. The article acted as a springboard, raising a number of questions about ALLEGE. A corpus analysis would help us to establish whether or not the patterns of language found in the article are typical or atypical of general English usage. The verb allege and its related forms are therefore a key aspect in the discursive construction of stories about rape.
Nominalization
Nominalization involves a process being converted from a verb or adjective into a noun or a multi-noun compound (e.g. discover --> discovery, solve --> solution). Nominalizations often involve reductions or deletions in some way.
In the BNC, the forms allege, alleging, alleged, alleges, allegedly, allegation and allegations are unevenly distributed across domains. They also occur much more often in written-to-be-spoken texts than in written or spoken texts. The lemma ALLEGE is also associated with a variety of forms of news reporting.
It is important that we distinguish between the verb and adjectival forms of the word alleged. The BNC is grammatically tagged: the tag AJ0 stands for adjective, so we can find only the adjectival uses of alleged by carrying out a search on alleged=AJ0; the tag VVD marks the past tense and the tag VVN a past participle. The categorization of grammatical forms of alleged in the BNC is made even more complicated by the presence of portmanteau tags.
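Restricting a search to one grammatical form of a word, as in the alleged=AJ0 query above, can be sketched over a toy word/tag representation. Real BNC queries run through dedicated tools (e.g. BNCweb or CQP); the example sentence here is invented, though the C5 tags used (AJ0, VVD, etc.) are genuine BNC tags:

```python
def tagged_search(tagged_tokens, word, tag):
    """Return the positions where a word form carries a given
    POS tag, e.g. only the adjectival (AJ0) uses of 'alleged'."""
    return [i for i, (w, t) in enumerate(tagged_tokens)
            if w.lower() == word and t == tag]

sentence = [("the", "AT0"), ("alleged", "AJ0"), ("attacker", "NN1"),
            ("alleged", "VVD"), ("that", "CJT"), ("he", "PNP"),
            ("was", "VBD"), ("innocent", "AJ0")]
print(tagged_search(sentence, "alleged", "AJ0"))  # adjectival use only
print(tagged_search(sentence, "alleged", "VVD"))  # past-tense verb use only
```

The same surface form "alleged" is returned or excluded depending on its tag, which is exactly the distinction the verb/adjective analysis depends on.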
The adjectival form alleged is also quite popular in the corpus; this term occurs 1,687 times.
The verb forms of allege usually collocate with the adverbial form (allegedly) but also with verbs suggesting crimes: abducted, tortured, murdered, committed, etc. The adjectival form alleged collocates with words specifying groups of people who are accused (accomplices, perpetrators, collaborators) as well as crimes (infringements, atrocities, etc.). The nominalized form allegation(s) has a discourse prosody for denial which is not found with any of the other forms of allege. Allegations is a word which shows a semantic preference for the concept of denial in general English, coupled with the adjective "ludicrous". Since the word ludicrous was used by the spokeswoman in the article to describe the allegations, its concordances were examined to obtain an idea of what other things are commonly described as ludicrous. Most of the other collocates are adverbs (almost, rather, most, too, so, how, even) which relate to the scale to which something is described as ludicrous.
The pairing of ludicrous and allegations is a particularly powerful language strategy, as both words have strong associations of untruth or unreality embedded within them.
Modality
The analysis of allege in the BNC has shown how it has a strong association with news reporting and how the nominalized form has a strong discourse prosody for denial, while the non-nominalized forms do not. Modality relates to speaker or writer authority and is based around the use of a range of modal auxiliary verbs. A strong presence or absence of certain modal verbs is an indication of power relationships and status: relatively powerful groups tend to be paired with modal verbs which give them more freedom and choice, while more controlling modal verbs are used with less powerful groups. Using the BNC, we carried out searches on the noun, adjective, verb and adverb forms of allege and then counted the number of times modal verbs appeared within three spaces to the left and right of them.
The most popular constructions are:
- Noun forms: allegation would, could or will;
- Verb forms: may, would or could allege;
- Adjective forms: alleged .. should/will;
- Adverb form: would allegedly.
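The counting procedure described above (modal verbs within three spaces of any form of ALLEGE) can be sketched as follows. The sentence and target set are invented; real counts would be run over the whole BNC:

```python
from collections import Counter

MODALS = {"can", "could", "may", "might", "must",
          "shall", "should", "will", "would", "ought"}

def modals_near(tokens, targets, span=3):
    """Count modal verbs appearing within +/- span tokens
    of any target form (e.g. forms of ALLEGE)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in targets:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w for w in window if w in MODALS)
    return counts

tokens = ("the allegations would be denied and no one could "
          "allege that the minister should resign").split()
print(modals_near(tokens, {"allegations", "allege"}))
```

Note that should in the example falls outside the three-word window of both target forms, so it is not counted; the window size directly controls which modal/target pairings the analysis picks up.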
In general British English, the modal verbs would, will, can and could are the most popular in overall language usage, while modals like shall, ought and need are much less common. Could appears more often than expected in conjunction with allege, while can occurs less often than expected. The adjective form alleged tends to be connected to two of the most certain modal verbs: should and will.
The nominalized forms of allege tended to correlate strongly with words related to denial, and this suggests that these forms of denial are made with more certain modal verbs.
Attribution
The presence (or absence) of different types of actors in narratives of rape can have consequences which relate to both the focus of the story and the way the agency of those involved is represented. In the news article, both the person who is accused of rape and the person who claims to have been raped are mentioned several times and in a variety of ways. If we only carry out a search, in the corpus, on sentences which contain allege and rape, we may miss additional information which comes earlier or later in the text.
Metaphor
We have seen that the word choice allegations is of particular salience to the news article, because it carries a strong association with denial. In the article it was used in a direct quote by a spokeswoman for the person about whom the allegations were being made, but it was not used by the actual narrative voice of the article, with the more neutral term alleged occurring three times instead. We have used a reference corpus to look at the collocations of the word allegations, as well as patterns of modal use and the presence or absence of various types of actor associated with allegations of rape.
Another way of understanding some of the hidden associations of the word allegation is to consider it in terms of metaphor. Metaphors are a particularly revealing way of helping to uncover the discourses surrounding a subject. Looking at the presence of metaphors in a corpus, and noting their frequencies relative to each other, should provide researchers with a different way of focusing on discourse.
There is no simple way of carrying out a metaphor-based analysis on a corpus: the researcher carries out a close reading of a sample of texts in order to identify candidate metaphors, and corpus contexts are then examined to determine whether keywords are metaphoric or literal. In our corpus, abstract concepts are often constructed via metaphors which reference concrete entities, and it is likely that allegation(s) will have metaphors in common with similar terms like "accusation" or "claim".
The corpus not only helps to uncover the possible metaphors surrounding a word or concept, but it can also be useful in revealing how a metaphor works in a range of other cases, enabling researchers to gain a greater understanding of its meaning. We see allegations referred to in terms of weight, violence, penetration, waste, fire, flight and horses. Some of these metaphors appear to be more frequent than others. Since the term allegation is found in a range of general metaphorical patterns in British English, it is not possible to say that any single metaphor dominates the way we think of the term.
Further directions
We could have expanded our analysis of the term allegations to consider other linguistic phenomena, such as a range of lexical, semantic and grammatical features, or co-ordination. There are some techniques in critical discourse analysis which are more difficult to carry out on corpora. At present, a great deal of corpus-based discourse analysis is still focused at the lexical level. The challenge for future researchers is to find ways to make grammar- and semantics-based analysis of corpora a more feasible proposition.
8 CONCLUSION
This book has identified some of the most useful methodological techniques of corpus-based research (frequencies, collocations, keywords, concordances, dispersion) and shown how they can be effectively used in the analysis of discourse. The main points about language and discourse that our corpus-based analyses have revealed are:
- Corpus-based discourse analysis is not simply a quantitative procedure but one which involves a great deal of human choice at every stage: research questions, designing and building corpora, deciding which techniques to use, interpreting the results and framing explanations for them.
- Attitudes and discourses are embedded in language via our cumulative, lifelong exposure to language patterns and choices: collocations, semantic preferences and discourse prosodies.
- We are often unconscious of the patterns of language we encounter across our lifetimes, but corpora are useful in identifying them: they emulate and reveal this cumulative exposure.
Corpus building
The design and availability of corpora are paramount to their analysis. Diachronically, language and society are constantly changing, and discourses are changing as well; there is an urgent need to build more up-to-date corpora in order to reflect this passing of time. Some aspects of language use do not change as rapidly as others. The contents of the BNC are a testament to the way that people wrote and spoke in the early 1990s. Using corpora of texts that were created decades or centuries ago will help researchers to explore the ways that language was once used, shedding light on the reasons behind the current meanings, collocations and discourse prosodies of particular words, phrases or grammatical constructions. Comparing a range of corpora from different historical time periods will give us a series of linguistic "snapshots" which will allow discourses to appear to come to life. An aspect of corpus building which is particularly relevant for discourse analysis is the fact that context is so important. Corpora that include both the annotated electronic text and the original texts would be useful for making sense of individual texts within them. In the case of newspaper or magazine articles it would be useful to make references back to the original page(s), so we could note aspects such as font size and style, colours, layout and visuals.
Corpus analysis
It is important to note that a corpus-based analysis will not simply give researchers a list of the discourses around a subject. The analysis will point to patterns in language which must then be interpreted in order to suggest the existence of discourses. A corpus-based analysis can only show what is in the corpus; although it may be far-reaching, it can never be exhaustive. Because corpora are so large, we may be tempted to think that our analysis has covered every potential discursive construction around a given subject. The wide variety of alternative statistical measures available to the corpus user might mean that data can be subtly massaged in order to produce results that are interesting, controversial, or that confirm our suspicions. When using a general corpus, issues surrounding the varying types of production and reception for all of the texts within it can become highly problematic. One option could be to recognize that the general corpus consists of a multitude of voices, and to use such data sparingly, instead carrying out the analysis of discourses on more specialized corpora, where issues of production and reception can be more easily articulated. Another possibility could simply be to argue from the perspective that society is inter-connected and all texts influence each other. A corpus-based analysis of discourse provides researchers with patterns and trends in language. People are not computers, though, and their ways of interacting with texts are very different, both from computers and from each other. Corpus-based discourse analysis should play an important role in terms of removing bias, testing hypotheses, identifying norms and outliers, and raising new research questions.