Tutorial CQPweb
Hannah Kermes
Corpus linguistics
What is a corpus?
Tagsets
• http://cwb.sourceforge.net
• tools for encoding, indexing, compression, decoding, and
frequency distribution
• corpus query processor (CQP)
• fast corpus search (regular expression syntax)
• can work efficiently with large corpora (up to 500 million words,
depending on annotations)
• use in interactive or batch mode
• results displayed in terminal windows
• CQPweb - web interface for CQP
• CWB/Perl interface for post-processing, scripting
CQPweb
• corpus query
• simple query syntax and CQP syntax
• standard query vs. restricted query
• frequency lists
• keywords
• post-processing
• sorting
• frequencies
• frequency distribution
• collocation
• categorization
CQPweb at UdS
• https://fedora.clarin-d.uni-saarland.de/cqpweb/
Choosing a corpus
• word form
"word"
[word="word"]
• lemma
[lemma="use"]
• part-of-speech
[pos="VV"]
Concordance (def)
Definition
A concordance is an alphabetical listing of the words in a text, given
together with the contexts in which they appear. The most common
form of concordance today is the Keyword-in-Context (KWIC) index, in
which each word is centered in a fixed-length field (e.g., 80 characters).
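The KWIC layout described above can be sketched in a few lines of Python. This is only an illustration, not how CQPweb builds its concordances: the function name `kwic`, the `width` parameter, and the sample sentence are all invented for this example, and context is measured in characters, as in the fixed-length field mentioned in the definition.

```python
def kwic(tokens, keyword, width=30):
    """Return one KWIC line per occurrence of keyword (case-insensitive)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != keyword.lower():
            continue
        # Left context is cut from the right end, right context from the left end,
        # so the keyword always starts in the same column.
        left = " ".join(tokens[:i])[-width:]
        right = " ".join(tokens[i + 1:])[:width]
        lines.append(f"{left:>{width}}  {tok}  {right:<{width}}")
    return lines


# Invented sample text, split naively on whitespace.
text = "the corpus is large and the corpus query is fast"
for line in kwic(text.split(), "corpus", width=20):
    print(line)
```

Each line pads the left and right context to the same width, so the keyword lines up in one column, which is what makes a concordance easy to scan.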
Concordances
• character set
"[whm]orse"
"[^whm]orse"
• character range
"[a-zA-Zäöüß]orse"
"[^a-z]orse"
• alternatives
"im(portant|possible)"
[lemma="research|investigation"]
• conjunction
[lemma = "research" & pos="NN"]
• optionality
"(re)?search"
• 0 - inf times (Kleene star)
"search.*"
• 1 - inf times
"search.+"
• n times
".*search.{2}"
• n - m times
".*search.{2,3}"
• adjectives or nouns
[pos = "JJ|NN"]
• verbs
[pos = "V.*"]
• full verbs
[pos = "VV.*"]
• past tense full verbs
[pos="VV[DN]"]
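To get a feeling for what these patterns accept, they can be tried out on a small word list with Python's `re` module. This is a rough stand-in, not CQP itself: it assumes that a CQP regular expression has to match the whole token (hence `fullmatch`), and the helper `matches` as well as the word and tag lists are invented for the example.

```python
import re


def matches(pattern, values):
    """Return the values that the pattern matches in full (whole-token match)."""
    rx = re.compile(pattern)
    return [v for v in values if rx.fullmatch(v)]


# Invented test data: a few word forms and a few part-of-speech tags.
words = ["horse", "worse", "morse", "Norse",
         "search", "research", "searches", "researched"]
tags = ["JJ", "NN", "NNS", "VV", "VVD", "VVN", "VVG", "VB"]

print(matches(r"[whm]orse", words))     # character set
print(matches(r"[^whm]orse", words))    # negated character set
print(matches(r"(re)?search", words))   # optionality
print(matches(r"search.+", words))      # one or more further characters
print(matches(r".*search.{2}", words))  # exactly two characters after "search"
print(matches(r"VV[DN]", tags))         # past tense full verbs
print(matches(r"JJ|NN", tags))          # adjectives or nouns (exact tags only)
```

Because the whole token has to match, "JJ|NN" does not pick up "NNS"; a pattern such as "NN.*" is needed for that.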
Repetition operators
Operators
Simple patterns
• word sequence
"it" "is" "possible" "that"
• adjective noun collocations
[pos="JJ.*"][pos="NN.*"]
• semantically related words
[word="\w+"]"and|or"[word="\w+"]
• verbally derived adjectives
[(word=".+(ed|ing)") & (pos="JJ")][pos="NN"]
• an optional token
"it" "is" []? "possible" "that"
• a range of tokens
"it" "is" []{1,3} "possible" "that"
• optional tokens
"it" "is" []* "possible" "that"
• avoid crossing sentence boundaries
"it" "is" []* "possible" "that" within s
• titles
/region[title];
<title>[]*</title>
• an n-gram including a full verb
<ngram>[]*[pos="VV.*"][]*</ngram>
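How a token-level pattern such as "it" "is" []{1,3} "possible" "that" is applied to a sentence can be illustrated with a small backtracking matcher. This is a simplified sketch and not the CQP implementation: it only knows word forms, it is greedy (CQP has its own matching strategies, which are not reproduced here), and `within s` is imitated by passing one sentence at a time. All names (`match_at`, `find_all`, `word`, `any_token`) are invented for this example.

```python
def word(w):
    """Predicate for a token whose lower-cased form equals w."""
    return lambda tok: tok.lower() == w


def any_token(tok):
    """Predicate for the empty token pattern []: every token matches."""
    return True


def match_at(tokens, start, pattern):
    """Return the end index of a match beginning at start, or None.

    pattern is a list of (predicate, min_count, max_count) triples.
    """
    if not pattern:
        return start
    pred, lo, hi = pattern[0]
    # Count how many consecutive tokens satisfy the predicate (at most hi).
    n = 0
    while n < hi and start + n < len(tokens) and pred(tokens[start + n]):
        n += 1
    # Try the longest repetition first, then back off down to the minimum.
    for count in range(n, lo - 1, -1):
        end = match_at(tokens, start + count, pattern[1:])
        if end is not None:
            return end
    return None


def find_all(tokens, pattern):
    """Return every match in one sentence as a string."""
    hits = []
    for i in range(len(tokens)):
        end = match_at(tokens, i, pattern)
        if end is not None and end > i:
            hits.append(" ".join(tokens[i:end]))
    return hits


# "it" "is" []{1,3} "possible" "that"
pattern = [(word("it"), 1, 1), (word("is"), 1, 1), (any_token, 1, 3),
           (word("possible"), 1, 1), (word("that"), 1, 1)]
sentence = "We think it is not at all possible that this fails".split()
print(find_all(sentence, pattern))
```

Changing the third element to `(any_token, 0, 1)` gives the behaviour of the optional token `[]?`, which also accepts "it is possible that" with nothing in between.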
Frequency breakdown
Sorting results
Frequency distribution
• frequency of query
• distribution across text categories
• distribution table and bar charts
• cross tables
• (download of) text frequencies
• frequencies and frequency per million (words)
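Frequency per million words makes hit counts from text categories of different sizes comparable. A minimal sketch of the calculation follows; the category names and all numbers are invented for illustration and do not come from any corpus.

```python
def per_million(hits, corpus_size):
    """Normalise an absolute frequency to hits per million words."""
    return hits * 1_000_000 / corpus_size


# Invented figures: (number of hits, number of words in the category).
categories = {
    "academic": (120, 2_000_000),
    "fiction": (45, 1_500_000),
}

for name, (hits, size) in categories.items():
    print(f"{name}: {hits} hits, {per_million(hits, size):.1f} per million words")
```

With these made-up numbers, the category with more than twice as many hits has exactly twice the relative frequency, because it is also the larger category.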
Collocation: a definition
Definition
“lexically and/or pragmatically constrained recurrent co-occurrences of
at least two lexical items which are in a direct syntactic relation with
each other” Bartsch (2004)
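In practice, collocation candidates are often collected by counting which words occur within a window of a few tokens around a node word. The sketch below shows only this counting step, as a deliberately crude approximation: it does not check the syntactic relation the definition asks for, and it computes no association measure. The function name `window_cooccurrences`, the `span` parameter, and the sample text are invented for this example.

```python
from collections import Counter


def window_cooccurrences(tokens, node, span=2):
    """Count the tokens occurring within `span` positions of each node occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i:  # do not count the node word itself
                counts[tokens[j]] += 1
    return counts


# Invented sample text.
tokens = "strong tea and strong coffee but weak tea".split()
counts = window_cooccurrences(tokens, "tea", span=2)
print(counts.most_common())
```

Raw counts like these favour frequent words such as "and", which is why collocation tools rank candidates with a statistical measure rather than by frequency alone.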
Collocation in CQPweb
Collocations
Anchor points
Download tabulate
• frequencies of matches
• based on tabulate command
• TABULATE OPTIONS:
groups any number and combination of tokens
• SORTING OPTIONS:
sorting, counting, and display options
• at the moment: download function
TABULATE OPTIONS :
• range of tokens
match .. matchend lemma
• separate tokens
match lemma, matchend pos
• structural attributes
match .. matchend lemma, match text_reg, match
ngram_len
SORTING OPTIONS :
• sorting and counting
• sorting and counting; display as matrix
• sorting and counting; display as matrix; include second column in
sort string
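What "tabulate, then sort and count" amounts to can be imitated outside CQPweb. The sketch below is an assumption-laden illustration, not the CQPweb download function: the matches are invented, each token is a (word, lemma, pos) triple, and the two columns imitate the options `match .. matchend lemma` and `match pos`.

```python
from collections import Counter

# Invented query matches; every token is a (word, lemma, pos) triple.
matches = [
    [("strong", "strong", "JJ"), ("teas", "tea", "NNS")],
    [("Strong", "strong", "JJ"), ("tea", "tea", "NN")],
    [("weak", "weak", "JJ"), ("tea", "tea", "NN")],
]


def tabulate(matches):
    """Build one row per match: lemmas of the whole match, pos of the first token."""
    rows = []
    for m in matches:
        lemma_range = " ".join(lemma for _, lemma, _ in m)  # match .. matchend lemma
        first_pos = m[0][2]                                 # match pos
        rows.append((lemma_range, first_pos))
    return rows


# Sorting and counting: identical rows are collapsed and ordered by frequency.
row_counts = Counter(tabulate(matches))
table = sorted(row_counts.items(), key=lambda kv: (-kv[1], kv[0]))
for (lemmas, pos), n in table:
    print(f"{n}\t{lemmas}\t{pos}")
```

Grouping by lemma instead of word form is what merges "strong teas" and "Strong tea" into a single row.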
Saved queries
Query history
Categorise queries
• http://cqpweb.lancs.ac.uk/
Present-day English; Historical English; Learner English; Arabic;
European languages; South and East Asian languages; LACITO
corpora
• BNC web: http://bncweb.lancs.ac.uk/cgi-binbncXML/BNCquery.pl?theQuery=search&urlTest=yes
• OPUS: http://opus.lingfil.uu.se/bin/opuscqp.pl
Parallel corpora
• A collection of English Corpora
(http://corpus.leeds.ac.uk/protected/query.html)
• find more on http://cwb.sourceforge.net/demos.php
• COSMAS: http://www.ids-mannheim.de/cosmas2/
• Digitales Wörterbuch der Deutschen Sprache: http://www.dwds.de/
• FALKO: http://korpling.german.hu-berlin.de/falko-suche/
• BYU corpora: http://corpus.byu.edu/
Contemporary and Historical American English; BNC; ...
• Concordancier-corpus français: http://www.lextutor.ca/concordancers/concord_f.html