
Topics

The document discusses corpus linguistics, highlighting its significance in language studies through the analysis of large, computer-readable text collections. It outlines various types of corpora, methodologies such as corpus-based and corpus-driven linguistics, and the importance of data collection and annotation. Additionally, it addresses the benefits and limitations of corpus analysis, as well as the tools and techniques used for linguistic research.

Uploaded by

Breet Eyes
Copyright
© © All Rights Reserved

Lecture 7.

Corpus
linguistics
© Kushneruk Svetlana
Leonidovna

Doctor of Philology, Professor of


Chelyabinsk State University
1. The notion of a
corpus. Typology of
corpus linguistic
research

Corpus linguistics
revolutionized language
studies because it has
provided new ways of
analyzing and describing
the use of language
• Corpus linguistics
• a powerful methodology
that can be employed to
explore a wide variety of
issues related to the use of
vocabulary
• Corpus linguistics
• an area which focuses upon
a set of procedures, or
methods, for studying
language
Corpora can be defined as
large, principled and
computer-readable collections
of texts that allow analysis of
patterns of language use
across different contexts.

Corpora consist of texts stored in an electronic format,
which enables researchers to
use special software to
conduct automatic searches
and gain insights into the
structure and regularity of
naturally occurring language.
Important features of corpus-based analysis
• it is empirical, analyzing the actual patterns of use in natural texts
• it utilizes a large collection of natural texts as the basis for analysis
• it makes extensive use of computers for analysis
• it depends on both quantitative and qualitative analytical techniques
Typology of corpus linguistic research.
1.1. Mode of communication

• corpora of written language
• corpora of spoken language
www.publications.parliament.uk/pa/cm/cmhansrd.htm
ICE-GB is the British component of the
International Corpus of English (ICE)

www.ucl.ac.uk/english-usage/projects/ice-gb/
Video corpora
http://sourceforge.net/projects/thedrs
1.2. Corpus-based versus
corpus-driven linguistics

Corpus-based
• corpus linguistics is perceived as a methodology ⇒
corpus data are used to verify the existing theories of
language
Corpus-driven
• tends to view corpus linguistics as a theory which
offers a new way of looking at the creation of
meaning in a narrow sense and different aspects of
the use of language in a broader sense
Corpus-based
studies
use corpus data in order to
explore a theory or
hypothesis, established in
the current literature, in
order to validate it, refute
it or refine it

[McEnery & Hardie 2012: 6]


Corpus-driven
linguistics
➢ claims that the corpus
itself should be the sole
source of hypotheses
about language
➢ the corpus itself embodies
its own theory of language

[McEnery & Hardie 2012: 6].


1.3. Data collection
regime
The monitor corpus
approach
• seeks to develop a dataset which
grows in size over time and which
contains a variety of materials
The Bank of English (BoE)
many books on corpus linguistics
have suggested that the BoE could be
used as a ‘monitor corpus’ to look
at ongoing changes in English
The Web as Corpus
• It takes as its starting point a
massive collection of data that
is ever-growing, and uses it for
the study of language.
• The content of the web is not
divided by genre ⇒ the
material returned from a web
search tends to be an
undifferentiated mass, which
requires a great deal of
processing to sort into
meaningful groups of texts.
The sample corpus
approach
✓ The sample corpora
represent a particular type of
language over a specific span
of time
✓ A balanced corpus covers a
wide range of text categories
which are supposed to be
representative of the language
variety under consideration.
✓ Representativeness refers to
the extent to which a sample
includes the full range of
variability in a population.
1.4. Annotated versus unannotated corpora

Corpus annotation is largely the process of
providing those analyses which a linguist
would carry out anyway on whatever data
they worked with.

Annotation is an umbrella term that refers
to procedures such as tagging and parsing
which are carried out to add linguistic
information to a corpus
Tony McEnery & Andrew Hardie
distinguish between three types of
information that can accompany a
corpus
• metadata
details about a given text such as the name
of the author
• textual markup
information about the formatting of the text
such as where italics starts and ends or when
a given speaker starts speaking
• linguistic annotation
assigning grammatical categories or tags to
all the words within a corpus
Layers of annotation

• part-of-speech (PoS)
tagging
• syntactic (grammatical)
parsing
• error annotation
• semantic annotation
• phonetic annotation
CLAWS
Constituent Likelihood Automatic Word-tagging System
http://ucrel.lancs.ac.uk/claws/
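Once a corpus carries part-of-speech tags, the tags can be read back and counted. The sketch below assumes the tagged text uses a `word_TAG` layout; the helper name `parse_tagged`, the sample sentence, and its tags are illustrative assumptions, not actual tagger output.

```python
from collections import Counter

def parse_tagged(line, sep="_"):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    return [tuple(tok.rsplit(sep, 1)) for tok in line.split() if sep in tok]

# Illustrative tagged sentence (assumed layout, hand-written tags).
tagged = "The_AT cat_NN1 sat_VVD on_II the_AT mat_NN1 ._."
pairs = parse_tagged(tagged)

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in pairs)
print(tag_counts.most_common())
```

Counting tags in this way is what makes searches such as "all singular nouns" possible in an annotated corpus.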
1.5. Total accountability versus
data selection

The principle of total accountability
• we must not select a favourable subset of the data
• one way of satisfying falsifiability is to use the entire
corpus to test the hypothesis
1.6. Multilingual versus
monolingual corpora

Many corpora are monolingual
➢ may represent a range
of varieties and genres
of a particular language
➢ limited to that one
language
https://www.ucl.ac.uk/english-usage/projects/ice.htm
the English-Norwegian Parallel Corpus (ENPC)
https://www.hf.uio.no/ilos/english/services/knowledge-resources/omc/enpc/
Type A: Source texts in one language plus
translations into one or more other languages
• the Canadian Hansard
➢ consisting of debates from the Canadian Parliament published in
the country's official languages, English and French

• CRATER
➢ Corpus Resources and Terminology Extraction is a project involving
three languages: English, French and Spanish.
➢ consists entirely of technical texts from the International
Telecommunications Union ⇒ 5.5 million words
➢ texts are tagged with part-of-speech and morphological annotation
Type B: Pairs or groups of monolingual corpora
designed using the same sampling frame
https://www.lancaster.ac.uk/fass/projects/corpus/LCMC/
Type C: A combination of A and B

EMILLE
Enabling Minority Language Engineering
was a 3-year project at Lancaster University and
Sheffield University
Its end product was a 97 million word electronic
corpus of South Asian languages, especially those
spoken in the UK
http://www.emille.lancs.ac.uk/about.php
2. Providing data on linguistic
phenomena

Lexical
• Frequency and distribution of specific words and phrases
• Lists of all common words in a language or genre
• processes involving
word formation
nouns formed with
suffixes *ism or
*ousness
• contrasts in the use
of grammatical
alternatives
HAVE + proven/proved
sincerest/most sincere
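The word-formation query above (nouns formed with *ism or *ousness) can be approximated with a simple suffix match over tokenized text. This is a minimal sketch with a naive tokenizer; the helper name `words_with_suffix` and the sample sentence are assumptions made for illustration. A pure string match also returns forms such as "prism" that are not derived nouns, so the output still needs manual checking.

```python
import re
from collections import Counter

def words_with_suffix(text, suffixes=("ism", "ousness")):
    """Count word forms ending in any of the given suffixes (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(
        t for t in tokens
        if t.endswith(tuple(suffixes)) and t not in suffixes
    )

sample = "Realism and nervousness are nouns; so is consciousness. Realism again."
print(words_with_suffix(sample).most_common())
```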
High-frequency grammatical features ➮ modals,
passives, perfect or progressive aspect
➮ Less frequent grammatical variation
John started to walk / walking
She’d like (for) him to stay overnight
Phraseological patterns
• Collocational preferences
for specific words
true feelings, true story
• Constructions
[V NP into V-ing]
they talked him into staying
[V POSS way PREP]
he elbowed his way through
the crowd
• Collocates as a guide to meaning and usage
(n) highbrow, (adj) highbrow
• Semantic prosody
the types of words preceding the verb
outweigh
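Collocational preferences like those above (true feelings, true story) are typically found by counting the words that occur near a node word. The sketch below counts collocates within a fixed window of tokens; the function name `collocates`, the window size, and the sample text are illustrative assumptions, and real studies usually add an association measure on top of raw counts.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count words occurring within `window` tokens to the left or right of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

sample = "She hid her true feelings. It is a true story about true feelings."
print(collocates(sample, "true", window=1).most_common(3))
```

The same counts, restricted to the words preceding a verb such as outweigh, give a first view of its semantic prosody.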
3. Corpus design

In 2005, Sinclair
proposed a set of
principles that should be
considered with regard to
the process of
developing a corpus

John Sinclair
(1933-2007)
• 1. The contents of a corpus should be selected without regard for
the language they contain, but according to their communicative
function in the community in which they arise.
• 2. Corpus builders should strive to make their corpus as
representative as possible of the language from which it is chosen.
• 3. Only those components of corpora which have been designed
to be independently contrastive should be contrasted.
• 4. Criteria for determining the structure of a corpus should be
small in number, clearly separate from each other and efficient as a
group in delineating a corpus that is representative of the language
variety under examination.
• 5. Any information about a text other than the alphanumeric
string of its words and punctuation should be stored separately from
the plain text and merged when required in applications.
• 6. Samples of language for a corpus should consist of entire
documents or transcriptions of complete speech events.
• 7. The design and composition of a corpus should be documented
fully with information about the contents and arguments in
justification of the decisions taken.
• 8. The corpus builder should retain, as target notions,
representativeness and balance. While these are not precisely
definable and attainable goals, they must be used to guide the
design of a corpus and the selection of its components.
• 9. Any control of subject matter in a corpus should be imposed by
the use of external, and not internal, criteria.
• 10. A corpus should aim for homogeneity in its components while
maintaining adequate coverage, and rogue texts should be avoided.
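Principle 5 above (information about a text is stored separately from the plain text and merged when required) can be pictured with two stores linked by a text ID. This is only a sketch: the IDs, field names, sample texts, and the helpers `select` and `merged` are all hypothetical.

```python
# Plain texts and their metadata live in separate stores, linked by a text ID.
texts = {
    "t001": "Corpus linguistics is an empirical approach.",
    "t002": "well you know it depends on the context",
}
metadata = {
    "t001": {"mode": "written", "genre": "academic"},
    "t002": {"mode": "spoken", "genre": "conversation"},
}

def select(**criteria):
    """Return IDs of texts whose metadata match all given external criteria."""
    return [
        tid for tid, meta in metadata.items()
        if all(meta.get(key) == value for key, value in criteria.items())
    ]

def merged(text_id):
    """Merge plain text and metadata only when an application needs both."""
    return {"id": text_id, "text": texts[text_id], **metadata.get(text_id, {})}

print(select(mode="spoken"))
print(merged("t001"))
```

Selecting by metadata rather than by the words inside the texts also matches principle 9, which asks for external rather than internal criteria.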
Corpus research: size, balance, representativeness
• Representativeness concerns the issue of
how well a corpus represents a given
language or variety that is under study.
• Balance refers to the structure and type
of data used to build a corpus.
• A well-balanced corpus should consist of
several subsections that represent different
types of language use.
4. Benefits of corpus analysis
• 1. One can use corpus data to explore different aspects of
language.
• 2. Corpus linguistics is an empirical approach which relies on
frequency-based analyses.
• 3. Corpus linguistics focuses on the phraseological nature of
language.
• 4. Corpus investigations highlight different functions of language
and demonstrate the central role of context in the analysis of
linguistic behavior.
• 5. Corpus linguistics presents us with powerful tools for
exploring the distribution of specific linguistic features across a
wide range of domains of language use.
5. Limitations of corpus analysis

1. A corpus can show us only what it contains.
2. A corpus may be too small.
3. A corpus presents language out of its context.
4. A corpus cannot interpret data.
6. Types of
corpora
6.1. General and specialized corpora
General corpora consist of a wide range of
texts that represent natural language as it
is used across a variety of contexts.

Specialized corpora do not aim to
comprehensively represent a language as a
whole, but only specialized segments of it.
BNC https://www.english-corpora.org/bnc/
Michigan Corpus of Academic Spoken English
(MICASE) https://quod.lib.umich.edu/m/micase/
British Academic Written English (BAWE) Corpus of proficient
student writing
https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2539#
Vienna-Oxford International Corpus of English,
VOICE (https://www.univie.ac.at/voice/)
English as a Lingua Franca in Academic Settings Corpus,
ELFA. https://www.kielipankki.fi/corpora/elfa/
6.2. Written and spoken
corpora

The majority of corpora represent written language

CANCODE (Cambridge and Nottingham Corpus of Discourse in English)
is a large collection of spoken British English, and it
has been used as a basis for a number of studies
into the specific nature of spoken language
Santa Barbara Corpus of Spoken American English
https://www.linguistics.ucsb.edu/research/santa-barbara-corpus
Hong Kong Corpus of Spoken English
http://rcpce.engl.polyu.edu.hk/HKCSE/default.htm
6.3. Historical (diachronic) corpora
• Represent data from specific historical periods and are
particularly useful if scholars are interested in the
process of language change
Corpus of Historical American English, COHA
https://www.english-corpora.org/coha/
ARCHER, A Representative Corpus of Historical English
Registers
https://www.projects.alc.manchester.ac.uk/archer/
6.4. Parallel and comparable corpora
• Often employed by researchers working in the area of
translation studies, who use them to make direct
comparisons between the same texts written in
different languages
Oslo Multilingual Corpus
http://www.hf.uio.no/ilos/english/services/omc/
German, French and Finnish source texts, and their respective translations
Digital Corpus of the European Parliament
https://ec.europa.eu/jrc/en/language-technologies/dcep
6.5. Web as a corpus

• The Internet is constantly growing and the
number of websites is increasing.
• A corpus can be regularly updated,
which makes it very similar to monitor corpora.
WebCorp Linguist’s Search Engine (http://wse1.webcorp.org.uk/)
is an example of an interface that can
be used to explore data found on the web
Mark Davies’s Google Books interface
https://support.google.com/websearch/answer/9523832
https://books.google.com/ngrams/info
An example of a database composed of web-based
data is the NOW Corpus
https://www.english-corpora.org/now/
7. Corpus tools and types
of analysis
• Corpora are computer-readable
collections of texts which enable
linguistic analysis by means of
special computer programs called
concordancers
• The most popular concordancers
are WordSmith Tools, Sketch
Engine, MonoConc and AntConc
• https://www.lexically.net/wordsmith/
• https://www.laurenceanthony.net/software/antconc/
7.1. Frequency analysis and concordancing
The most basic type of corpus analysis is checking the frequency
of occurrence of a given word or a phrase
a search by means of a web-based
interface
https://www.english-corpora.org/
• use a search box located on the left-hand side of the
interface
• type in a word or a phrase that you want to explore (a
node)
• ‘Chelyabinsk’
• word as a lemma = [chelyabinsk]
Information about the number of the
occurrences of ‘Chelyabinsk’ in the
whole corpus
Lines of texts which demonstrate how the
word is used in context: concordances
All frequency values for the word ‘Chelyabinsk’ across the
different portions of the corpus (Chart option)
Choose a different subcorpus
News on the Web
Examples of concordances for the
word ‘Chelyabinsk’
Key Word In Context (KWIC)
• a standard format for displaying corpus data
• one can easily analyze the co-text of the node: all the words
that precede and follow it
• By analyzing the immediate
company of words, one can
explore patterns of co-
occurrence between words
and study how words tend to
form various kinds of
lexical, grammatical, lexico-
grammatical combinations
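The KWIC display described above can be sketched in a few lines: find each occurrence of the node and print it with a fixed amount of co-text on either side. The function name `kwic`, the character-based context width, and the sample sentence are assumptions for illustration; concordancers such as AntConc offer far more options.

```python
import re

def kwic(text, node, width=30):
    """Return KWIC lines: the node in the middle, `width` characters of co-text on each side."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", flags=re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left co-text and left-align the right co-text
        # so that the node forms a vertical column.
        lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines

sample = "A meteor exploded over Chelyabinsk in 2013. Chelyabinsk is a city in the Urals."
for line in kwic(sample, "chelyabinsk", width=20):
    print(line)
```

Aligning the node in a single column is what lets the eye pick out recurring words to its left and right.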
Frequency of words
across different sections
Frequency of
the word
‘Chelyabinsk’
by country
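When frequencies are compared across sections or countries, as in the charts above, raw counts can mislead because the sections differ in size. A common remedy is to normalize to a rate per million words. The sketch below shows the arithmetic; the counts and section sizes are invented for illustration and are not taken from any corpus.

```python
def per_million(raw_count, section_size):
    """Normalized frequency: occurrences per one million tokens."""
    return raw_count / section_size * 1_000_000

# Hypothetical (raw count, section size in tokens) pairs, for illustration only.
sections = {"news": (120, 40_000_000), "fiction": (15, 30_000_000)}

for name, (count, size) in sections.items():
    print(f"{name}: {per_million(count, size):.2f} per million")
```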
7.2. Wordlists

• lists of words or phrases ranked according to their
frequency or the number of their occurrences in a
given corpus
• wordlists are a powerful tool for making
comparisons between corpora that represent
different language uses

if we assume that the most frequent words are also the most useful
ones, language teachers can use this information to decide which words
should be addressed first where English is taught as a second/foreign
language
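A wordlist as described above is simply a frequency count sorted from most to least frequent. This minimal sketch assumes a naive lowercase tokenizer; the function name `wordlist` and the sample text are illustrative.

```python
import re
from collections import Counter

def wordlist(text, top=None):
    """Return (word, frequency) pairs ranked from most to least frequent."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top)

sample = "The cat saw the dog. The dog ran."
for rank, (word, freq) in enumerate(wordlist(sample), start=1):
    print(rank, word, freq)
```

Running the same function over two corpora and comparing the top ranks is the simplest form of the cross-corpus comparison mentioned above.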
COCA
Corpus of Contemporary American English
search for words by meaning ‘industrial’
7.3. Word combinations and
n-gram analysis / cluster analysis

chunks, n-grams, lexical bundles

Words tend to co-occur and form collocations,
colligations and other examples of word
combinations

An n-gram is a technical term used to denote word
combinations which consist of two or more words
that repeatedly occur consecutively in a corpus
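Following the definition above, n-grams can be extracted by sliding a window of n tokens over the text and keeping the sequences that recur. This is a minimal sketch; the names `ngrams` and `frequent_ngrams`, the frequency threshold, and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def frequent_ngrams(text, n=2, min_freq=2):
    """Return n-grams that occur at least `min_freq` times, with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(ngrams(tokens, n))
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}

sample = "on the other hand it rained and on the other hand it was warm"
for gram, freq in frequent_ngrams(sample, n=4).items():
    print(" ".join(gram), freq)
```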
Corpus software AntConc
https://www.laurenceanthony.net/software/antconc/
7.4. Keyness analysis
and keywords

• Keyword: a word which
occurs with unusual frequency
in a given text.
• Such words are useful because
they provide information
about the keyness or
specificity of a given corpus in
terms of what it is about.
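"Unusual frequency" is usually judged against a reference corpus. The slides do not name a statistic, so the sketch below swaps in log-likelihood, one commonly used keyness measure, as an assumption. The function names, the toy token lists, and the choice to keep only words that are relatively more frequent in the study text are illustrative.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness score. a, b: frequency of a word in the study / reference corpus;
    c, d: sizes of the study / reference corpus in tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens, top=10):
    """Rank words that are relatively more frequent in the study text than in the reference."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Toy data, for illustration only.
study = ["america"] * 5 + ["the"] * 5
reference = ["the"] * 10 + ["a"] * 10
print(keywords(study, reference))
```

A word with the same relative frequency in both corpora scores zero, which is why function words such as "the" rarely surface as keywords.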
Keywords tool on Lextutor
http://www.lextutor.ca/key/
Example .txt format files
Trump inaugural address
lists of keywords might be a starting point
for a qualitative analysis

concordance lines from the inaugural address corpus ➮
investigate how the words ‘america’
and ‘prosper’ are used in specific
contexts
