Coca2020 Overview

The document describes an update to the Corpus of Contemporary American English (COCA). It details the size and genres included in the corpus, and explains new features that allow comparisons across genres, time periods, and related words. These features help analyze frequency, meaning, and syntactic patterns in American English over the past 30 years.


The COCA corpus (new version released March 2020)

The corpora from English-Corpora.org are the world’s most widely used corpora. The Corpus of Contemporary
American English (COCA) is by far the most widely used of these corpora. In early 2020, we dramatically
expanded the scope, size, and features of COCA to make it even more useful for researchers, teachers, and
learners.

The corpus contains more than one billion words of data, including 20 million words each year from 1990-2019
(with the same genre balance year by year). This makes COCA the only corpus of English that is 1) large, 2) recent,
and 3) drawn from a wide range of genres. The following table shows the genres in the corpus.

Genre         # texts    # words          Explanation
Spoken        44,803     127,396,932      Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Oprah)
Fiction       25,992     119,505,305      Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, and fan fiction.
Magazines     86,292     127,352,030      Nearly 100 different magazines, with a good mix between specific domains like news, health, home and gardening, women, financial, religion, sports, etc.
Newspapers    90,243     122,958,016      Newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. Good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc.
Academic      26,137     120,988,361      More than 200 different peer-reviewed journals. These cover the full range of academic disciplines, with a good balance among education, social sciences, history, humanities, law, medicine, philosophy/religion, science/technology, and business.
Web (Genl)    88,989     129,899,427      Classified into the web genres of academic, argument, fiction, info, instruction, legal, news, personal, promotion, review web pages (by Serge Sharoff). Taken from the US portion of the GloWbE corpus.
Web (Blog)    98,748     125,496,216      Texts that were classified by Google as being blogs. Further classified into the web genres of academic, argument, fiction, info, instruction, legal, news, personal, promotion, review web pages. Taken from the US portion of the GloWbE corpus.
TV/Movies     23,975     129,293,467      Subtitles from OpenSubtitles.org, and later the TV and Movies corpora. Studies have shown that the language from these shows and movies is even more colloquial / core than the data in actual "spoken corpora".
Total         485,179    1,002,889,754

GENRES Because COCA has so much data from each of these eight genres, it provides useful information
about the frequency of words, phrases, and grammatical constructions across the genres – whether they are
very informal (e.g. TV and movie subtitles or in spoken transcripts), more formal (e.g. academic articles), or
somewhere in between (e.g. magazines and newspapers). For example, the following two charts show the
frequency of the word lucky in the different genres, as well as the “get passive” (e.g. he got promoted).

[Two charts: frequency across genres of lucky and of the “get passive”]
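Because the eight genres differ slightly in size, comparing them fairly means normalizing raw hit counts to a common base. The following is a minimal sketch of per-million normalization, using the genre sizes from the table above. The function name and the raw counts are assumptions made up for illustration; they are not COCA results or part of any COCA interface.

```python
# Genre sizes (in words) are taken from the genre table in this overview.
GENRE_WORDS = {
    "Spoken": 127_396_932,
    "Fiction": 119_505_305,
    "Magazines": 127_352_030,
    "Newspapers": 122_958_016,
    "Academic": 120_988_361,
    "Web (Genl)": 129_899_427,
    "Web (Blog)": 125_496_216,
    "TV/Movies": 129_293_467,
}


def per_million(raw_count: int, genre: str) -> float:
    """Normalize a raw hit count to occurrences per million words."""
    return raw_count / GENRE_WORDS[genre] * 1_000_000


# Hypothetical raw counts, invented purely for illustration (not COCA data).
example_hits = {"TV/Movies": 12_000, "Academic": 600}

for genre, hits in example_hits.items():
    print(f"{genre:<10} {per_million(hits, genre):8.2f} per million")
```

Normalized rates like these, rather than raw counts, are what make a bar for one genre comparable with a bar for another.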


The great general balance (and large size) of COCA also allows you to see the frequency of related phrases
across genres. For example, the following table shows “soft NOUN”.

You can also compare two genres (or sets of genres). For example, the following table shows collocates
(nearby words) of chair in ACADEMIC (left) and FICTION (right).

You can even compare synonyms of a given word in different genres. For example, this table shows the
synonyms of strong in TV/MOVIES (left) and ACADEMIC (right).

The ability to focus on specific genres means that you can find “just the right word” for a particular
concept in a particular genre. For example, suppose that you want to know which word related to potent is
the most frequent with a form of argument in academic English. As the following image shows, you simply
search for =potent ARGUMENT and limit the search to academic. You would then see the following results
and (as with any search results) can see the matching phrases in context.

In addition to comparing across genres, you can also compare words, to “tease out” differences between
related words. For example, the following chart shows words that occur with deep and profound, showing that
(for example) deep breath is common but profound breath never occurs, or that profound effect is common
but deep effect is not. You can also compare other related words, such as adjectives used with the words men
and women, or verbs occurring near Obama or Trump.
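One way to picture this kind of word comparison is as a ratio of collocate rates. The sketch below ranks collocates by how strongly they prefer one word over the other. It is an illustration under stated assumptions, not COCA's actual scoring: the function name, the smoothing constant, and all counts are hypothetical.

```python
def compare_collocates(counts_a, counts_b, smoothing=0.5):
    """Rank collocates by how strongly they prefer word A over word B.

    counts_a / counts_b map a collocate to its frequency near word A / word B.
    The smoothing constant avoids division by zero when a collocate never
    occurs with one of the two words.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    scores = {}
    for collocate in set(counts_a) | set(counts_b):
        rate_a = (counts_a.get(collocate, 0) + smoothing) / total_a
        rate_b = (counts_b.get(collocate, 0) + smoothing) / total_b
        scores[collocate] = rate_a / rate_b
    # Highest ratio first: most typical of word A at the top,
    # most typical of word B at the bottom.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical collocate counts, invented for illustration only.
deep = {"breath": 900, "effect": 5, "sleep": 300}
profound = {"breath": 0, "effect": 400, "impact": 350}

ranking = compare_collocates(deep, profound)
for collocate, ratio in ranking:
    print(f"{collocate:<8} {ratio:10.4f}")
```

A ratio far above 1 marks a collocate that belongs with the first word, and a ratio near 0 marks one that belongs with the second.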

HISTORICAL COCA is the only large corpus of English that has extensive data from the entire period of the last
30 years – 20 million words per year from 1990-2019 (with the same genre balance year by year). This means
that in addition to seeing variation by genre, you can also map out recent changes in English in ways that are
not possible with any other corpus, such as the frequency of awesome from 1990-2019.

And of course you can look at much more than just simple words or phrases. COCA is the only corpus that
allows you to map out changes in syntactic constructions over the past 30 years, as with the “like construction”
(and I’m like, no way) or the “end up V-ing” construction (you’ll end up paying way too much) – both of which
have increased in each five-year period since the early 1990s.

[Two charts: the “like construction” (search: CONJ PRON BE like ‘|,) and the END up V-ing construction]

Just as we compared collocates (nearby words) in different genres above (for chair in ACAD and FIC), we can
also compare collocates in different periods, to look at semantic change. For example, the following table shows
collocates of green in the 1990s (left) and the 2010s (right).

We can also see what is being said about a topic over time. For example, the following shows the collocates of crisis
in each five-year period (and genre) from 1990-2019, which shows what we were worrying about in these different
periods.

As with all of the other corpora from English-Corpora.org, there is a very wide range of searches, including:
words, phrases, substrings, lemmas, part of speech, synonyms, and customized wordlists. For example, the
search WEAR * ADJ @CLOTHES takes just about one second to search through the billion words to find
strings like the following (and it doesn’t require learning unnecessarily convoluted search syntax).
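To make the four slots of WEAR * ADJ @CLOTHES concrete, here is a toy matcher over a handful of hand-tagged tokens. It is only a sketch of one plausible reading of the query (capitalized word = lemma, * = any word, ADJ = part of speech, @NAME = user wordlist) and says nothing about how COCA itself is implemented. The tag set, the CLOTHES list, and the sample tokens are all assumptions.

```python
# Assumed part-of-speech labels and a hypothetical user-defined wordlist.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}
WORDLISTS = {"CLOTHES": {"shirt", "dress", "hat", "jeans"}}


def slot_matches(slot, token):
    """Check one query slot against one (word, lemma, pos) token."""
    word, lemma, pos = token
    if slot == "*":                      # wildcard: any single word
        return True
    if slot.startswith("@"):             # @NAME: lemma must be in that wordlist
        return lemma in WORDLISTS.get(slot[1:], set())
    if slot in POS_TAGS:                 # part-of-speech slot
        return pos == slot
    if slot.isupper():                   # other capitalized slot: any form of the lemma
        return lemma == slot.lower()
    return word.lower() == slot          # lowercase slot: exact word form


def search(query, tokens):
    """Return every token sequence that matches the query slot by slot."""
    slots = query.split()
    hits = []
    for start in range(len(tokens) - len(slots) + 1):
        window = tokens[start:start + len(slots)]
        if all(slot_matches(s, t) for s, t in zip(slots, window)):
            hits.append(" ".join(word for word, _, _ in window))
    return hits


# Hand-tagged sample tokens, invented for illustration.
tagged = [
    ("She", "she", "PRON"), ("wore", "wear", "VERB"), ("a", "a", "DET"),
    ("red", "red", "ADJ"), ("dress", "dress", "NOUN"), ("and", "and", "CONJ"),
    ("wears", "wear", "VERB"), ("the", "the", "DET"), ("old", "old", "ADJ"),
    ("car", "car", "NOUN"),
]

print(search("WEAR * ADJ @CLOTHES", tagged))
```

A linear scan like this would be far too slow for a billion words; it is meant only to show what each slot in the query asks for.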

Because of COCA’s advanced architecture, even very general searches like NOUN + NOUN or
VERB ADJ NOUN take just 1-2 seconds to search through the billion-word corpus:

A unique feature of COCA, which makes it very useful for language learners and teachers, is the ability to
browse through a list of the top 60,000 words (lemmas) in the corpus, and then to see an extremely wide
range of information on each of these words.

For example, the following are just a few examples of high frequency (about word #5000 in the 60,000
word list), medium frequency (~25,000), and low frequency (~45,000) words. For each word in the list, users
can hear the word pronounced, see videos with that word in the text, find related images from Google
Images, and see a translation for their preferred target language.

It is even possible to search the 60,000 word list by pronunciation (very helpful, because of difficult English
spelling). For example, the following is a partial list of two-syllable words (accented on the second syllable)
that rhyme with stay:
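A pronunciation search of this kind can be sketched with a small pronouncing dictionary. The example below assumes ARPAbet-style transcriptions in which a digit on each vowel marks stress (1 = primary). The entries are hand-entered for illustration, and the function names are hypothetical; this is not COCA's data or method.

```python
# Hand-entered ARPAbet-style pronunciations (assumed for illustration).
# A trailing digit marks a vowel and its stress level: 1 = primary stress.
PRONUNCIATIONS = {
    "stay":    ["S", "T", "EY1"],
    "delay":   ["D", "IH0", "L", "EY1"],
    "today":   ["T", "AH0", "D", "EY1"],
    "obey":    ["OW0", "B", "EY1"],
    "display": ["D", "IH0", "S", "P", "L", "EY1"],
    "Monday":  ["M", "AH1", "N", "D", "EY2"],
    "play":    ["P", "L", "EY1"],
    "story":   ["S", "T", "AO1", "R", "IY0"],
}


def stress_pattern(phones):
    """One stress digit per syllable, e.g. [0, 1] for 'delay'."""
    return [int(p[-1]) for p in phones if p[-1].isdigit()]


def rhyme_part(phones):
    """Phones from the primary-stressed vowel to the end of the word."""
    for i, phone in enumerate(phones):
        if phone.endswith("1"):
            return phones[i:]
    return phones


def rhymes(target, syllables, stressed_syllable):
    """Words with the given syllable count and stress that rhyme with target."""
    target_rhyme = rhyme_part(PRONUNCIATIONS[target])
    results = []
    for word, phones in PRONUNCIATIONS.items():
        pattern = stress_pattern(phones)
        if (word != target
                and len(pattern) == syllables
                and pattern[stressed_syllable - 1] == 1
                and rhyme_part(phones) == target_rhyme):
            results.append(word)
    return sorted(results)


print(rhymes("stay", syllables=2, stressed_syllable=2))
```

Matching on sounds rather than letters is what lets a search like this find obey alongside delay, despite the different spellings.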

Each of the top 60,000 words in the corpus has a “home page” such as the following, with links to other pages
with more information. Users can save words to a “favorites” list for later review, and go back through a
history of all of their “word” pages.

[Image: sample word “home page”, with these labeled elements: Favorites list; History; Rank #1-60,000;
Words that co-occur in 22 million web pages; Distribution across genres; Words that co-occur nearby;
Pronunciation; Definition; Links to videos; Links to translations; Google images; 2, 3, 4 word strings;
Texts that have this as a “keyword”; The word in context (to see patterns of use)]


Each of the top 60,000 words also has more detailed pages, including a “dictionary” page, related topics,
collocates, clusters, websites, and concordance lines. Samples of each of these pages are given below.

DICTIONARY page

Includes a definition, links to Google Images, pronunciation, videos, and translations (to desired language) at
(up to) four different sites. Also includes synonyms and words with more specific meanings (hyponyms) and
more general meanings (hypernyms) (both from WordNet). It also includes frequency information (including
rank order, number of tokens, and two “range” measures of how well the word is spread throughout the
~500,000 texts), as well as its distribution across the eight main genres in the corpus. Finally, it also includes the
frequency of the different forms of the lemma (e.g. verb forms), and related words.

[Image callouts: “one-click” link to Google Images (alternative for China); “one-click” links to translations
(4 different sites); “one-click” links to videos from PlayPhrase, YouGlish, and Yarn]

TOPICS page

Shows words that tend to co-occur in the 22 million webpages in the corpus. In many cases these provide better
insight into meaning and usage than collocates (the standard tool for finding textually related words), and yet
we’re not aware of any other corpus that has these. Users can click on any of these words to follow a “chain” of
related words.

COLLOCATES page

Words that occur nearby. Click on any word to see the collocates for the new word, or click on the “text” icon to see
the word in context. The gray boxes show the most normal placement of the node word (e.g. Christmas tree, but
tree trunk). You can also sort by Mutual Information score and set frequency thresholds (Advanced Options).
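For readers unfamiliar with Mutual Information, the sketch below uses the standard pointwise formulation, the base-2 log of observed over expected co-occurrence, together with a minimum-frequency threshold. COCA's exact formula may differ (for example, by adjusting for the collocate span), so treat this as a general illustration. All counts, names, and the toy corpus size are assumptions.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise mutual information (base 2) of a node word and a collocate.

    Compares the observed co-occurrence count with the count expected if the
    two words were distributed independently across the corpus.
    """
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)


def rank_collocates(node_freq, candidates, corpus_size, min_pair_freq=5):
    """Drop rare pairs, then sort the remaining collocates by MI (highest first)."""
    ranked = []
    for collocate, pair_freq, collocate_freq in candidates:
        if pair_freq < min_pair_freq:
            continue  # frequency threshold: very rare pairs give unstable scores
        score = mutual_information(pair_freq, node_freq, collocate_freq, corpus_size)
        ranked.append((collocate, round(score, 2)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical counts for the node word "tree" in a toy 1,000,000-word corpus.
# Each candidate is (collocate, co-occurrence count, collocate's total count).
candidates = [("trunk", 50, 200), ("the", 400, 60_000), ("rare", 2, 10)]

for collocate, score in rank_collocates(1_000, candidates, 1_000_000):
    print(f"{collocate:<6} MI = {score}")
```

The threshold matters because MI rewards rarity: without it, a pair seen only once or twice can outrank far more informative collocates.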

CLUSTERS page

The most common 2, 3, and 4 word strings. You can decide how “wide” or “tight” you want the clusters to be
(e.g. whether they include highly frequent words like articles and prepositions).
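Clusters of this kind are word n-grams counted over running text. The following minimal sketch counts 2-, 3-, or 4-word strings and optionally excludes any string containing a function word. The stoplist, the flag name, and the sample sentence are assumptions for illustration, and the flag is only one possible reading of the “wide” versus “tight” setting.

```python
from collections import Counter

# Tiny illustrative stoplist of highly frequent function words (assumed).
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and"}


def clusters(tokens, n, allow_function_words=True):
    """Count every n-word string; optionally skip strings with function words."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if not allow_function_words and any(w in FUNCTION_WORDS for w in gram):
            continue
        counts[gram] += 1
    return counts


# Invented sample text.
text = "the end of the day is the end of the road".split()

print(clusters(text, 3).most_common(2))
print(clusters(text, 2, allow_function_words=False).most_common(2))
```

Excluding function words trades coverage for content: fixed phrases like the end of disappear, leaving only strings made of lexical words.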

KWIC / concordance page

See 100 – 1000 random lines, with the surrounding words highlighted for part of speech. You can re-sort the
lines by the words to the left or right, all with the goal of seeing the patterns in which the word occurs.
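A keyword-in-context (KWIC) display can be sketched in a few lines: collect a window of words around each hit, then re-sort the lines by the words to the right so that recurring patterns line up. The function names and the sample sentence are hypothetical, and the part-of-speech highlighting is omitted.

```python
def kwic(tokens, node, width=3):
    """Return (left context, node word, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, token, right))
    return lines


def sort_by_right(lines):
    """Re-sort concordance lines alphabetically by the words to the right."""
    return sorted(lines, key=lambda line: [w.lower() for w in line[2]])


def render(lines):
    """Align the node word in a fixed column so patterns are easy to scan."""
    return [f"{' '.join(left):>25}  [{node}]  {' '.join(right)}"
            for left, node, right in lines]


# Invented sample text.
sample = ("he sat in the chair near the window while the chair of the "
          "committee spoke and the chair by the door stayed empty").split()

for line in render(sort_by_right(kwic(sample, "chair"))):
    print(line)
```

Sorting by the right-hand context groups lines such as chair of the committee together, which is what makes a sense or construction visible at a glance.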

TEXTS page

Move between the page showing the texts in which the word is a keyword (see above) and the page showing
all keywords from a specific text (right).

Summary

COCA has a number of features that set it apart from any other corpus. These include its size (1.0 billion
words), how up-to-date it is (texts through Dec 2019), the wide range of genres (TV/Movie subtitles, spoken,
blogs, web, fiction, magazine, newspaper, academic), and its searches (range of query types, and the ease
and speed of its searches), including the ability to limit to and compare across genres and time periods.

In addition, it is different from most of the other corpora from English-Corpora.org in the attention that it
gives to the top 60,000 words in the corpus, and the wide range of information for each word, including
frequency information, definitions, synonyms, WordNet entries, related topics, collocates (new display in
COCA), clusters, websites that have the word as a “keyword”, and KWIC/concordance lines.

All of these features make COCA the ideal corpus for researchers, teachers, and learners.
