
Corpus Analysis and Linguistic Theories

Today we will discuss:
• corpus analysis;
• methods and tools.

Naved Nawaz
What is corpus linguistics?

• Corpus linguistics is a field which focuses upon a
set of procedures, or methods, for studying language.
• We can take a corpus-based approach to many
areas of linguistics.
• Importantly, the development of corpus linguistics
has also spawned new theories of language –
theories which draw their inspiration from attested
language use and the findings drawn from it.
• But corpus linguistics is not a monolithic,
consensually agreed set of methods and
procedures.
• It is in fact a heterogeneous field – although
there are some basic generalisations that we
can make.
The main features of corpus linguistics
• Research in corpus linguistics deals with some
set of machine-readable texts which is
deemed an appropriate basis on which to
study a particular research question.
• The set of texts or corpus is usually of a size
which defies analysis by hand and eye alone
within any reasonable timeframe.
• For this reason, corpora are invariably
exploited using software search tools.
• Concordancers allow users to look at words in
context.
• Other tools allow the production of frequency data,
for example a word frequency list, which lists all
words appearing in a corpus and specifies how
many times each one occurs in that corpus.
• Concordances and frequency data exemplify
respectively the two forms of analysis, namely
qualitative and quantitative, that are equally
important to corpus linguistics.
Different types of corpus study
The following features effectively distinguish
different types of studies in corpus linguistics:
• Mode of communication;
• Corpus-based versus corpus-driven linguistics;
• Data collection regimes;
• The use of annotated versus unannotated
corpora;
• Multilingual versus monolingual corpora.
Mode of communication
• Corpora may encode language produced in
any mode of communication – for example
there are corpora of spoken language and
there are corpora of written language.
• Many corpora contain data from more than
one mode, such as the
British National Corpus (BNC).
Written corpora
• Corpora representing written language usually
present the smallest technical challenge to
build, since much data already exists in
electronic format (e.g. on the web).
• Until recently, encoding writing systems other
than the Roman alphabet was prone to error
(Baker et al. 2000). However, with the advent of
Unicode, this problem is being consigned to
history.
• Written corpora can still be time-consuming to
produce when the materials have to be
scanned or typed from printed or handwritten
original documents.
• But in general, the construction of written
corpora has never been easier.
Spoken corpora
• Spoken corpus data is typically produced by
recording interactions and then transcribing
them.
• These transcriptions may be linked back
systematically to the original recording
through a process called time-alignment so
that concordance results can be connected to
the correct location in the sound file.
• Orthographically transcribed material is rarely
a reliable source of evidence for research into
variation in pronunciation; phonemically
transcribed material is of much more use in
this respect.
Other modes of communication
• Corpora which include gesture, either as the primary
channel for language (as in sign language corpora) or
as a means of communication parallel to speech, are
relatively new.
• Corpus linguistic studies focusing on the visual
medium are only just beginning to be undertaken on
a truly large scale, for example investigating the
relationship between gesture and speech (Carter and
Adolphs 2008), or constructing large corpora of sign
language material (Johnston and Schembri 2006).
Corpus-based versus corpus-driven
linguistics
• The distinction between corpus-based and
corpus-driven language study was introduced
by Tognini-Bonelli (2001).
• Corpus-based studies typically use corpus data
in order to explore a theory or hypothesis,
aiming to validate it, refute it or refine it. The
definition of corpus linguistics as
a method underpins this approach.
• Corpus-driven linguistics rejects the
characterisation of corpus linguistics as a
method and claims instead that the
corpus itself should be the sole source of our
hypotheses about language.
• It is thus claimed that the corpus itself
embodies a theory of language (Tognini-
Bonelli 2001: 84-5).
Data collection regimes
• Two broad approaches to the issue of
choosing what data to collect have emerged:
• the monitor corpus approach, where the
corpus continually expands to include more
and more texts over time; and
• the balanced corpus or sample
corpus approach.
Monitor corpora
• A monitor corpus is a dataset which grows in size
over time and contains a variety of materials.
• The relative proportions of different types of
materials may vary over time.
• The Bank of English (BoE), developed at the
University of Birmingham, is the best known
example of a monitor corpus.
• The BoE was started in the 1980s (Hunston 2002:
15) and has expanded since then to well over half a
billion words.
• The BoE represents one approach to the monitor corpus;
the Corpus of Contemporary American English (COCA;
Davies 2009b) represents another.
• COCA expands over time like a monitor corpus, yet it
does so according to a much more explicit design than
the BoE.
• Each extra section added to COCA conforms to the
same, fixed breakdown of text varieties.
• This corpus represents something of a halfway house – a
monitor corpus that proceeds according to a sampling
frame and regular sampling regime.
Balanced corpora
• In contrast to monitor corpora, balanced corpora, also
known as sample corpora, try to represent a particular
type of language over a specific span of time.
• In doing so they seek to be
balanced and representative within a
particular sampling frame.
• So, for example, if we want to look at the language of
service interactions in shops in Pakistan in the late 1990s,
the sampling frame is clear: we would only accept data
into our corpus which represents interactions of this
sort.
• A good example of a corpus that seeks balance
and representativeness within a given
sampling frame is the Lancaster-Oslo/Bergen
(LOB) corpus. This represents a ‘snapshot’ of
the standard written form of modern British
English in the early 1960s, across a range of
2,000-word samples.
Opportunistic corpora
• There are many corpora that do not necessarily match the
description of either a monitor or a sample corpus comfortably.
• Such corpora are best described as opportunistic corpora.
• These corpora do not adhere to a rigorous sampling frame.
Rather, they represent nothing more nor less than the data that
it was possible to gather for a specific task.
• Sometimes technical restrictions prevent the collection of data
to populate an idealised sampling frame. This was particularly
common prior to widespread electronic publishing and the web.
• Today, an opportunistic approach is often needed with spoken
data in particular: converting spoken recordings into machine-
readable transcriptions is a very time-consuming task.
Annotated versus unannotated corpora

• The tree diagram – a commonplace of (corpus)
linguistics – is a familiar illustration of the kind of
analysis that annotation makes explicit and recoverable.
What is corpus annotation?
• Linguistic analyses encoded in the corpus data itself
are usually called corpus annotation.
• For example, we may wish to annotate a corpus to
show parts of speech, assigning to each word a
grammatical category label.
• So when we see the word talk in the sentence I
heard John's talk and it was the same old thing, we
would assign it the category noun in that context.
• This would often be done using some mnemonic
code or tag such as N.
• While the phrase corpus annotation may be
unfamiliar, the basic operation it describes is not – it
is just like the analyses of data that have been done
using hand, eye, and pen for decades.
• For example, in Chomsky (1965), 24 invented
sentences are analysed; in the parsed version of LOB,
a million words are annotated with parse trees.
• So corpus annotation is largely the process of
recording such familiar analyses in a systematic and
accessible form.
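• To make this concrete, here is a minimal sketch of automatic
part-of-speech tagging in Python using NLTK. This is only a
stand-in: the examples above assume CLAWS-style tags such as N,
whereas NLTK's default tagger uses Penn Treebank tags, and the
exact resource names to download can vary between NLTK versions.

import nltk

# One-off model downloads (names may differ in newer NLTK releases).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "I heard John's talk and it was the same old thing"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # list of (word, tag) pairs

print(tagged)
# 'talk' is tagged as a noun (NN) in this context:
# the pair ('talk', 'NN') appears in the output.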
Annotating data: how to get started

• The CLAWS tagger can be used for grammatical
tagging of a small-to-medium text via its
web interface.
• This tagger, created by UCREL at Lancaster
University, is the software that was used to tag
the BNC.
• It can be set to use either of two tagsets, the
standard C7 and the less-complex C5.
• A more complex form of grammatical
annotation is parsing.
• One easy way to try out parsing is to use the
online Stanford Parser.
• This program does two different types of
parsing – dependency parsing and constituency
parsing – and is also
openly available to download and use on your
own computer.
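• As a quick way to see what a dependency analysis looks like,
here is a hedged sketch in Python. It uses spaCy rather than the
Stanford Parser mentioned above, purely because spaCy is easy to
install (pip install spacy, then
python -m spacy download en_core_web_sm).

import spacy

# Load spaCy's small English pipeline, which includes a dependency parser.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

for token in doc:
    # Each word points to its grammatical head via a labelled relation.
    print(f"{token.text:<5} --{token.dep_}--> {token.head.text}")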
Monolingual versus multilingual corpora
• Many corpora are monolingual – they contain data in only
one language. But there are two types of multilingual
corpora.
Comparable corpora
• A comparable corpus contains components in two or more
languages that have been collected using the same sampling
method, e.g. the same proportions of the texts of the same
genres in the same domains in a range of different
languages in the same sampling period.
• The subcorpora of a comparable corpus are not translations of
each other. Rather, their comparability lies in the similarity of
their sampling frames.
• An example is the use of the
LOB corpus sampling frame for the
Lancaster Corpus of Mandarin Chinese (McEnery et
al. 2003), making these corpora comparable.
Parallel corpora
• By contrast, a parallel corpus contains native language
(L1) source texts and their (L2) translations.
• In this case, the sampling frame is automatically the
same for all the languages in the corpus.
• Examples include the Canadian Hansard corpus
(Brown et al. 1991) and the CRATER corpus (McEnery
and Oakes 1995).
Accessing and analysing corpus data
• In this section, we'll be looking at three
important issues that arise when we access and
analyse corpus data.
• How can we make use of
corpus metadata, markup and annotation?
• What kinds of corpus analysis software are
available, and what can they do?
• What do we need to know about
statistics in corpus linguistics?
Metadata and markup
• Metadata is information that tells you something
about the text itself.
• For example, the metadata may tell you who wrote a
text and when it was published.
• The metadata can be encoded in the corpus text, or
held in a separate document or database.
• Textual markup encodes information within the text
other than the actual words.
• For example, the sentence breaks or paragraph
breaks in a written text.
• In spoken corpora, the information conveyed
by the metadata and textual markup may be
very important to the analysis.
• The metadata would typically identify the
speakers in the text and give some useful
background information on each of them,
such as their age and sex.
• Textual markup would then be used to
indicate utterance boundaries.
• For example, in the BNC, each utterance is marked up and is linked to the
metadata for a particular speaker. For each speaker, the following
metadata is stored:
• Name (anonymised)
• Sex
• Age
• Social class
• Education
• First language
• Dialect/Accent
• Occupation
• We can use this metadata to limit searches in the BNC in a linguistically
motivated way — for example, to extract all examples of the word surely as
spoken by females aged between 35 and 44.
• A system of encoding called XML (the eXtensible Markup Language) is
often used for both markup and metadata.
• It is based on angle-bracket tags such as <u> and </u> for the beginning
and end of an utterance, respectively.
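• To illustrate, here is a simplified sketch in Python of how
utterance markup plus speaker metadata supports restricted
searches like the surely example above. The XML sample and the
metadata table are invented for illustration and are far simpler
than the BNC's real encoding.

import xml.etree.ElementTree as ET

# Invented sample: two utterances, each linked to a speaker via who=.
sample = """
<text>
  <u who="S1">Surely you knew that already?</u>
  <u who="S2">I had no idea, honestly.</u>
</text>
"""

# Hypothetical speaker metadata keyed by speaker ID.
speakers = {
    "S1": {"sex": "f", "age": 38},
    "S2": {"sex": "m", "age": 22},
}

root = ET.fromstring(sample)
for u in root.iter("u"):
    meta = speakers[u.get("who")]
    # Keep only utterances by female speakers aged 35-44.
    if meta["sex"] == "f" and 35 <= meta["age"] <= 44:
        print(u.text)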
Linguistic annotation
• Annotation typically uses the same encoding
conventions as textual markup. For instance,
the angle-bracket tags of XML can easily be
used to indicate where a noun phrase begins
and ends, with a tag for the start (<np>) and
the end (</np>) of a noun phrase:
• <np>The cat</np> sat on <np>the mat</np> .
A wide range of annotations have been applied automatically
to English text, by analysis software (also called taggers)
such as:
• constituency parsers such as Fidditch (Hindle 1983)
• dependency parsers such as the Constraint Grammar
system (Karlsson et al. 1995)
• part-of-speech taggers such as CLAWS (Garside et al. 1987)
• semantic taggers such as USAS (Rayson et al. 2004)
• lemmatisers or morphological stemmers
• Biber’s (1988) tagger for studying linguistic variation
• The virtue of all these forms of annotation is
that, when they exist in a corpus, we can run
searches for the tags rather than word-forms.
• For example, we could run a grammatically
aware search retrieving all words tagged as
past participles in the BNC.
Tools for corpus analysis
• The single most important tool available to the corpus
linguist is the concordancer.
• A concordancer allows us to search a corpus and
retrieve from it a specific sequence of characters of any
length — perhaps a word, part of a word, or a phrase.
• This is then displayed, typically in one-example-per-line
format, as an output where the context before and
after each example can be clearly seen.
• The appearance of concordances does vary between
different tools.
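• To show the idea, here is a toy key-word-in-context (KWIC)
concordancer sketched in Python; real concordancers add pattern
matching, sorting and much more.

# Print each hit with a window of context, one example per line.
def concordance(tokens, node, width=4):
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>30}  [{tok}]  {right}")

text = "the cat sat on the mat and the dog sat by the door".split()
concordance(text, "sat")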
• As well as concordances, three other functions are
available in most modern corpus search tools:
• Frequency lists — the ability to generate
comprehensive lists of words or annotations (tags) in
a corpus, ordered either by frequency or
alphabetically
• Collocations — statistical calculation of the words or
tags that most typically co-occur with the node word
you have searched for
• Keywords (or key tags) — lists of items which are
unusually frequent in the corpus or text you are
investigating, in comparison to a reference corpus;
like collocation, calculated with statistical tests
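• A frequency list is equally simple in outline, as this Python
sketch shows; it tallies every word-form and prints the list in
descending order of frequency.

from collections import Counter

tokens = "the cat sat on the mat and the dog sat".split()
freq = Counter(tokens)

# Most frequent first; use sorted(freq) for an alphabetical list instead.
for word, count in freq.most_common():
    print(f"{word}\t{count}")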
Statistics in corpus linguistics
• Corpora are an unparalleled source of quantitative data for
linguists.
• So corpus linguists often test or summarise their quantitative
findings through statistics.
• Some other areas of linguistics also frequently appeal to
statistical notions and tests.
• Psycholinguistic experiments, grammatical elicitation tests
and survey-based investigations, all commonly involve
statistical tests of some sort.
• However, frequency data are so regularly produced in corpus
analysis that most corpus-based studies undertake some form
of statistical analysis, even if it is relatively basic and
descriptive, e.g. using percentages to describe the data in
some way.
Descriptive statistics
• Most studies in corpus linguistics use basic descriptive
statistics if nothing else.
• Descriptive statistics are statistics which do not seek
to test for significance. Rather they simply describe
the data in some way.
• The most basic statistical measure is a frequency
count, a simple tallying of the number of instances of
something that occurs in a corpus.
• For example, there are 1,103 examples of the
word Lancaster in the written section of the BNC.
• A special type of ratio called the type-token ratio is
another basic corpus statistic.
• A token is any instance of a particular wordform in a
text. Comparing the number of tokens in the text to
the number of types of tokens — where
each type is a particular, unique wordform — can
tell us how large a range of vocabulary is used in
the text.
• We determine the type-token ratio by dividing the
number of types in a corpus by the number of
tokens.
• The result is sometimes multiplied by 100 to
express the type-token ratio as a percentage.
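• As a worked example, here is the type-token ratio computed in
Python for a six-word text.

tokens = "the cat sat on the mat".lower().split()
types = set(tokens)  # unique word-forms: the, cat, sat, on, mat

ttr = len(types) / len(tokens) * 100  # expressed as a percentage
print(f"{len(types)} types / {len(tokens)} tokens = TTR of {ttr:.1f}%")
# 5 types / 6 tokens = TTR of 83.3%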
Beyond descriptive statistics
• To better understand the frequency data arising from a
corpus, corpus linguists appeal to statistical measures
which allow them to test the significance of any
differences observed.
• Significance tests can be used to assess how likely it is
that a particular result is a coincidence, simply due to
chance.
• Typically, if there is a 95% chance that our result is not a
coincidence, then we say that the result is significant. A
result which is not significant cannot be relied on.
• The two most common uses of significance tests
in corpus linguistics are calculating keywords (or
key tags) and calculating collocations.
• To extract keywords, we need to test for
significance every word that occurs in a corpus,
comparing its frequency with that of the same
word in a reference corpus.
• When looking for a word's collocations, we test
the significance of the co-occurrence frequency
of that word and everything that appears near it
once or more in the corpus.
Doing a significance test
• A significance test of this kind is based on four simple
figures. Let's assume we are testing a difference between
Corpus 1 and Corpus 2 in the frequency of some linguistic
phenomenon X. In this case, the figures you need are:
• The frequency of X in Corpus 1;
• The total number of opportunities for X to occur in
Corpus 1;
• The frequency of X in Corpus 2;
• The total number of opportunities for X to occur in
Corpus 2.
• When we have our four figures, we can insert
them into the following 2-by-2 form:

                                                 Corpus 1 | Corpus 2
  Frequency of X (e.g. frequency of the word)             |
  Total opportunities for X (e.g. corpus size)            |
• Imagine, for example, that you are
investigating a word that occurs 52 times in
Corpus 1, which has 50,000 tokens in total;
but occurs 57 times in Corpus 2, which
is 75,000 tokens in size. Obviously, this word is
noticeably rarer, in relative terms, in Corpus 2;
but is the difference significant?
• Enter the figures into the web-form above to
conduct the log-likelihood test of significance!
Don't include any commas in the numbers you
type in.
• You should get results that look like this:

  Item   O1   %1     O2   %2         LL
  Word   52   0.10   57   0.08   +   2.65
Here's how to interpret this result:
• O1 and O2 are observed frequencies, the
numbers you entered
• %1 and %2 are the observed frequencies in
normalised (percentage) form
• The + sign indicates that the word is more
frequent, on average, in Corpus 1 (a minus sign
would indicate it is more frequent in Corpus 2)
• The LL score is the log-likelihood, which tells us
whether the result can be treated as significant
• The higher the LL is, the less likely it is that the
result is a random fluke. The LL must be above
3.84 for the difference to be significant at
the p < 0.05 level (also called the 95% level).
So this difference is not statistically significant.
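• If you would rather compute the statistic yourself, here is a
sketch of the log-likelihood calculation in Python, following the
standard two-corpus contingency formulation; it reproduces the
LL = 2.65 figure above.

from math import log

def log_likelihood(freq1, size1, freq2, size2):
    """LL = 2 * sum(O * ln(O / E)), where each expected frequency E
    is proportional to that corpus's share of the combined data."""
    total = freq1 + freq2
    e1 = size1 * total / (size1 + size2)  # expected frequency, Corpus 1
    e2 = size2 * total / (size1 + size2)  # expected frequency, Corpus 2
    return 2 * (freq1 * log(freq1 / e1) + freq2 * log(freq2 / e2))

ll = log_likelihood(52, 50_000, 57, 75_000)
print(f"LL = {ll:.2f}")  # 2.65: below 3.84, so not significant at p < 0.05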
• A keyword analysis basically consists of doing
this analysis for every word-type in the
corpus!
