Topics
Corpus
linguistics
© Kushneruk Svetlana
Leonidovna
Corpus linguistics
revolutionized language
studies because it has
provided new ways of
analyzing and describing
the use of language
• Corpus linguistics
• a powerful methodology
that can be employed to
explore a wide variety of
issues related to the use of
vocabulary
• Corpus linguistics
• an area which focuses upon
a set of procedures, or
methods, for studying
language
Corpora can be defined as
large, principled and
computer-readable collections
of texts that allow analysis of
patterns of language use
across different contexts.
Corpus linguistics depends on both
quantitative and
qualitative analytical
techniques.
Typology of corpus linguistic research.
1.1. Mode of communication
www.ucl.ac.uk/english-usage/projects/ice-gb/
Video corpora
http://sourceforge.net/projects/thedrs
1.2. Corpus-based versus
corpus-driven linguistics
Corpus-based
• corpus linguistics is perceived as a methodology ⇒
corpus data are used to verify the existing theories of
language
Corpus-driven
• tends to view corpus linguistics as a theory which
offers a new way of looking at the creation of
meaning in a narrow sense and different aspects of
the use of language in a broader sense
Corpus-based
studies
use corpus data in order to
explore a theory or
hypothesis, established in
the current literature, in
order to validate it, refute
it or refine it
• part-of-speech (PoS)
tagging
• syntactic (grammatical)
parsing
• error annotation
• semantic annotation
• phonetic annotation
CLAWS
Constituent Likelihood Automatic Word-tagging System
http://ucrel.lancs.ac.uk/claws/
1.5. Total accountability versus
data selection
The principle of total accountability:
we must not select a favourable subset of the data.
One way of satisfying falsifiability is to use the entire
corpus to test the hypothesis.
1.6. Multilingual versus
monolingual corpora
• CRATER
Ø Corpus Resources and Terminology Extraction is a project involving
three languages: English, French and Spanish.
Ø consists entirely of technical texts from the International
Telecommunications Union ⇒ 5.5 million words
Ø texts are tagged with part-of-speech and morphological annotation
Type B: Pairs or groups of monolingual corpora
designed using the same sampling frame
https://www.lancaster.ac.uk/fass/projects/corpus/LCMC/
Type C: A combination of A and B
EMILLE
Enabling Minority Language Engineering
was a 3-year project at Lancaster University and
Sheffield University
Its end product was a 97 million word electronic
corpus of South Asian languages, especially those
spoken in the UK
http://www.emille.lancs.ac.uk/about.php
2. Providing data on linguistic
phenomena
In 2005, Sinclair
proposed a set of
principles that should be
considered with regard to
the process of
developing a corpus
John Sinclair
(1933-2007)
• 1. The contents of a corpus should be selected without regard for
the language they contain, but according to their communicative
function in the community in which they arise.
• 2. Corpus builders should strive to make their corpus as
representative as possible of the language from which it is chosen.
• 3. Only those components of corpora which have been designed
to be independently contrastive should be contrasted.
• 4. Criteria for determining the structure of a corpus should be
small in number, clearly separate from each other and efficient as a
group in delineating a corpus that is representative of the language
variety under examination.
• 5. Any information about a text other than the alphanumeric
string of its words and punctuation should be stored separately from
the plain text and merged when required in applications.
• 6. Samples of language for a corpus should consist of entire
documents or transcriptions of complete speech events.
• 7. The design and composition of a corpus should be documented
fully with information about the contents and arguments in
justification of the decisions taken.
• 8. The corpus builder should retain, as target notions,
representativeness and balance. While these are not precisely
definable and attainable goals, they must be used to guide the
design of a corpus and the selection of its components.
• 9. Any control of subject matter in a corpus should be imposed by
the use of external, and not internal, criteria.
• 10. A corpus should aim for homogeneity in its components while
maintaining adequate coverage, and rogue texts should be avoided.
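Principle 5 (keeping annotation apart from the plain text and merging it on demand) can be illustrated with a small sketch. The sample sentence, the offsets and the tag labels below are assumptions made for illustration only; they do not reproduce any particular tagset or corpus format.

```python
# Plain text is stored on its own, untouched by markup.
text = "Corpora are collections of texts"

# Stand-off layer: (start offset, end offset, tag).
# The tag labels are illustrative, not taken from a real tagset.
annotations = [
    (0, 7, "NOUN"),
    (8, 11, "VERB"),
    (12, 23, "NOUN"),
    (24, 26, "ADP"),
    (27, 32, "NOUN"),
]

def merge(text, annotations):
    """Combine the plain text with the separate layer into word_TAG form."""
    return " ".join(f"{text[start:end]}_{tag}" for start, end, tag in sorted(annotations))

merged = merge(text, annotations)
print(merged)
```

Because the annotation refers to the text only by character offsets, the same plain text can carry several independent layers, and any of them can be replaced without editing the text itself.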
Corpus research: size, balance, representativeness
• Representativeness concerns the issue of
how well a corpus represents a given
language or variety that is under study.
• Balance refers to the structure and type
of data used to build a corpus.
• A well-balanced corpus should consist of
several subsections that represent different
types of language use.
4. Benefits of corpus analysis
• 1. One can use corpus data to explore different aspects of
language.
• 2. Corpus linguistics is an empirical approach which relies on
frequency-based analyses.
• 3. Corpus linguistics focuses on the phraseological nature of
language.
• 4. Corpus investigations highlight different functions of language
and demonstrate the central role of context in the analysis of
linguistic behavior.
• 5. Corpus linguistics presents us with powerful tools for
exploring the distribution of specific linguistic features across a
wide range of domains of language use.
5. Limitations of corpus analysis
The Internet is constantly growing and the
number of websites is increasing.
A corpus can be regularly updated,
which makes it very similar to monitor corpora.
WebCorp Linguist’s Search Engine
http://wse1.webcorp.org.uk/
is an example of an interface that can
be used to explore data found on the web
Mark Davies’s Google Books interface
https://support.google.com/websearch/answer/9523832
https://books.google.com/ngrams/info
An example of a database composed of web-based
data is the NOW Corpus
https://www.english-corpora.org/now/
7. Corpus tools and types
of analysis
• Corpora are computer-readable
collections of texts which enable
linguistic analysis by means of
special computer programs called
concordancers
• The most popular concordancers
are WordSmith Tools, Sketch
Engine, MonoConc and AntConc
https://www.lexically.net/wordsmith/
• https://www.laurenceanthony.net/software/antconc/
7.1. Frequency analysis and concordancing
The most basic type of corpus analysis is checking the frequency
of occurrence of a given word or a phrase
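Checking the frequency of a word or a phrase can be sketched in a few lines of Python. This is only a minimal illustration: the tokenizer is deliberately crude, and the sample text is invented for the example rather than drawn from a real corpus.

```python
import re

def tokenize(text):
    """Lowercase the text and extract alphabetic word tokens (a simplification)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency(text, query):
    """Count occurrences of a word or a multi-word phrase in the text."""
    tokens = tokenize(text)
    target = tokenize(query)
    n = len(target)
    if n == 0:
        return 0
    # Slide a window of the phrase length over the token sequence.
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)

# Invented sample text, used only to demonstrate the function.
sample = (
    "A meteor exploded over Chelyabinsk in 2013. "
    "Chelyabinsk is an industrial city, and the Chelyabinsk meteor was widely filmed."
)

print(frequency(sample, "Chelyabinsk"))
print(frequency(sample, "Chelyabinsk meteor"))
```

A real concordancer works on the same principle but over millions of tokens, usually with an index built in advance so that the corpus does not have to be rescanned for every query.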
a search by means of a web-based
interface
https://www.english-corpora.org/
• use a search box located on the left-hand side of the
interface
• type in a word or a phrase that you want to explore (a
node)
• ‘Chelyabinsk’
• word as a lemma = [chelyabinsk]
Information about the number of the
occurrences of ‘Chelyabinsk’ in the
whole corpus
Lines of texts which demonstrate how the
word is used in context: concordances
All frequency values for the word ‘Chelyabinsk’ across the
different portions of the corpus (Chart option)
Choose a different subcorpus
News on the Web
Examples of concordances for the
word ‘Chelyabinsk’
a standard format for displaying
corpus data
Key Word In Context
(KWIC)
one can easily analyze the co-
text of the node – all the words
that precede and follow it
• By analyzing the immediate
company of words, one can
explore patterns of co-
occurrence between words
and study how words tend to
form various kinds of
lexical, grammatical, lexico-
grammatical combinations
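The KWIC format described above can be sketched as follows. The function names, the `width` parameter and the sample sentences are assumptions made for illustration; this is not how any specific concordancer is implemented.

```python
import re

def kwic(text, node, width=3):
    """Return (left co-text, node, right co-text) for every hit of the node."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def format_kwic(lines):
    """Right-align the left co-text so that the node forms a centre column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return "\n".join(f"{left:>{pad}}  [{node}]  {right}" for left, node, right in lines)

# Invented sample sentences for demonstration.
sample = "The plant in Chelyabinsk produces steel. Trains to Chelyabinsk run daily."
hits = kwic(sample, "chelyabinsk", width=2)
print(format_kwic(hits))
```

Aligning the node in one column is what makes recurring neighbours on either side easy to spot when many lines are read vertically.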
Frequency of words
across different sections
Frequency of
the word
‘Chelyabinsk’
by country
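When frequencies are compared across sections of different sizes, raw counts are usually converted to a normalised rate such as occurrences per million words. A minimal sketch follows; the section names and all figures are hypothetical and were invented purely to show the arithmetic.

```python
def per_million(raw_count, section_size):
    """Normalised frequency: occurrences per one million words of the section."""
    if section_size <= 0:
        raise ValueError("section size must be positive")
    return raw_count * 1_000_000 / section_size

# Hypothetical figures: (raw hits, number of words in the section).
sections = {
    "Section A": (120, 40_000_000),
    "Section B": (30, 5_000_000),
}

normalised = {name: per_million(hits, size) for name, (hits, size) in sections.items()}
for name, rate in normalised.items():
    print(f"{name}: {rate:.2f} per million words")
```

In this invented example Section A has four times as many raw hits, yet the word is twice as frequent in Section B once section size is taken into account.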
7.2. Wordlists
If we assume that the most frequent words are also the most useful
ones, language teachers can use this information to decide which words
should be addressed first where English is taught as a second/foreign
language.
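A frequency-ranked wordlist with cumulative coverage shows how much of a text the top words account for, which is the kind of information a teacher could use to prioritise vocabulary. The sketch below assumes a crude tokenizer and an invented sample sentence.

```python
import re
from collections import Counter

def wordlist(text, top=None):
    """Return (word, count, cumulative coverage) rows, most frequent first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    rows, running = [], 0
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for word, count in ranked[:top]:
        running += count
        rows.append((word, count, running / total))
    return rows

# Invented sample text for demonstration.
rows = wordlist("The cat sat on the mat and the dog sat", top=2)
for word, count, coverage in rows:
    print(f"{word:<5} {count:>3} {coverage:.0%}")
```

Here two word types already cover half of the ten tokens, a small-scale version of the pattern that makes high-frequency wordlists attractive for teaching.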
COCA
Corpus of Contemporary American English
search for words by meaning ‘industrial’
7.3. Word combinations and
n-gram analysis / cluster analysis
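An n-gram (cluster) analysis counts recurring sequences of n adjacent words. The following is a minimal sketch under simplifying assumptions: the tokenizer ignores punctuation, so sequences may run across sentence boundaries, and the sample text is invented.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def top_clusters(text, n=2, k=5):
    """Return the k most frequent n-grams with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(ngrams(tokens, n)).most_common(k)

# Invented sample text for demonstration.
sample = "on the other hand on the one hand on the whole"
clusters = top_clusters(sample, n=2, k=1)
print(clusters)
```

Raising `n` to 3 or 4 surfaces longer recurrent sequences, at the cost of each individual sequence becoming rarer.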