CQPweb tutorial

Querying, postprocessing, exporting

Hannah Kermes

July 21, 2013


Outline
1 Introduction
Corpus Linguistics
The IMS Corpus Workbench
2 Queries
Basic queries
Using regular expressions
Regular expressions summary
Regular Expressions
Combination of attribute constraints
Simple patterns
Querying structures
3 Post-processing
Frequency breakdown
Sorting
Frequency distribution
Collocations
Frequency distribution II

Introduction Corpus Linguistics

Corpus linguistics

• investigate language use and usage on the basis of natural language data
• keyword in context (kwic)
• quantitative and qualitative analysis


What is a corpus?

Lemnitzer & Zinsmeister (2006), p. 7:


A corpus is a collection of written or spoken utterances. The corpus
data are typically digitized, i.e. stored on computers and
machine-readable. The components of the corpus, the texts, consist of
the data themselves and possibly of metadata that describe these data
and of linguistic annotations that are assigned to these data.


What is a corpus?

Sinclair (1991), p. 171:


• A collection of naturally occurring language text, chosen to
characterize a state or variety of a language
• A body of naturally-occurring (authentic) language data which can
be used as a basis for linguistic research
Leech in Garside et al. (1997), p. 1:
• In the past thirty-five years, the term corpus has been increasingly
applied to a body of language material which exists in electronic
form, and which may be processed by computer for various
purposes


What is a Corpus?

• a collection of written or spoken utterances


• typically electronic and machine readable
• generally assembled with particular purposes in mind
• data, metadata, linguistic annotation


Basic annotation on token level

• raw text: only sequences of characters without explicit information
  about words or sentences
• tokenizing: segmentation of raw text into sentences and words
  (tokens)
    • the sequence of characters is divided into sentences
    • sentences are divided into tokens
• stemming/lemmatizing
    • stemming: cutting off suffixes (no lexicon involved)
    • lemmatization: base form taken from a lexicon
• PoS tagging
    • labeling each word in a sequence of words with the appropriate
      part-of-speech (pos)
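
The difference between tokenizing, stemming and lemmatizing can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the sentence splitter, the suffix list and the mini lexicon LEXICON are made up for the example and are far cruder than what real annotation tools use.

```python
import re

def tokenize(raw):
    """Split raw text into sentences, then split each sentence into tokens."""
    # Naive sentence split: break at whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", raw.strip())
    # A token is a run of word characters or a single punctuation mark.
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s]

def stem(token):
    """Cut off a few common English suffixes; no lexicon is consulted."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Hypothetical mini lexicon that maps word forms to base forms.
LEXICON = {"went": "go", "studies": "study", "mice": "mouse"}

def lemmatize(token):
    """Look the base form up in the lexicon; fall back to the token itself."""
    return LEXICON.get(token.lower(), token)

print(tokenize("The mice went home. They slept!"))
print(stem("studies"), lemmatize("studies"))
```

The stemmer turns "studies" into the non-word "studi", while the lexicon lookup returns "study"; irregular forms such as "went" are only reachable through the lexicon.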


Tagsets

• A tagset is a set of part-of-speech tags
• The size and choice of the tagset vary
    • from a dozen to several hundred tags
    • usually between 50 and 200
• Classical 8 classes: noun, verb, article, particle, pronoun,
  preposition, adverb, conjunction
• large tagsets can include inflection information
  (number, gender, case features)
• problem: sparse data
  (1) Er hat seine Frau gesprochen ("He spoke to his wife") → direct object
  (2) Dann hat seine Frau gesprochen ("Then his wife spoke") → subject

Penn Treebank Tagset

Tag   Description                 Tag   Description
CC    Coordinating conjunction    TO    to
CD    Cardinal number             UH    Interjection
DT    Determiner                  VV    verb, base form
EX    Existential there           VVD   verb, past tense
FW    Foreign word                VVG   verb, gerund or pres. part.
IN    Preposition / subjunction   VVN   verb, past participle
JJ    Adjective                   VVP   verb, non-3rd p sing. pres.
JJR   Adjective, comparative      VVZ   verb, 3rd person sing. pres.
JJS   Adjective, superlative      VB    aux be, base form
LS    List item marker            VBD   aux be, past tense
MD    Modal                       VBG   aux be, gerund or pres. part.
NN    Noun, singular or mass      VBN   aux be, past participle
NNS   Noun, plural                VBP   aux be, non-3rd p sing. pres.
NNP   Proper noun, singular       VBZ   aux be, 3rd person sing. pres.
NNPS  Proper noun, plural         VH    aux have, base form
PDT   Predeterminer               VHD   aux have, past tense
POS   Possessive ending           VHG   aux have, gerund or pres. part.
PP    Personal pronoun            VHN   aux have, past participle
PP$   Possessive pronoun          VHP   aux have, non-3rd p sing. pres.
RB    Adverb                      VHZ   aux have, 3rd person sing. pres.
RBR   Adverb, comparative         WDT   Wh-determiner
RBS   Adverb, superlative         WP    Wh-pronoun
RP    Particle                    WP$   Possessive wh-pronoun
SYM   Symbol                      WRB   Wh-adverb

Stuttgart-Tübingen Tagset (STTS)

Tag      Description                         Tag     Description
ADJA     attributive adj.                    PRF     reflexive personal pronoun
ADJD     predicative or adverbial adj.       PWS     substituting interrogative pron.
ADV      adverb                              PWAT    attributing interrogative pron.
APPR     preposition / left circumpos.       PWAV    adverbial interrog./rel. pron.
APPRART  preposition with fused article      PAV     pronominal adverb
APPO     postposition                        PTKZU   zu preceding infinitive
APZR     right circumpos.                    PTKNEG  negation particle
ART      definite or indef. article          PTKVZ   separated verb particle
CARD     cardinal number                     PTKANT  reply particle
FM       foreign-language material           PTKA    particle with adj. or adv.
ITJ      interjection                        TRUNC   truncated word
KOUI     subjunction with zu+Inf.            VVFIN   full verb, finite
KOUS     subjunction with clause             VVIMP   full verb, imperative
KON      coordinating conjunction            VVINF   full verb, infinitive
KOKOM    comparative conjunction             VVIZU   full verb, zu-infinitive
NN       common noun                         VVPP    full verb, past participle
NE       proper noun                         VAFIN   auxiliary, finite
PDS      substituting demonstrative pronoun  VAINF   auxiliary, infinitive
PDAT     attributive demonstrative pronoun   VAPP    auxiliary, past participle
PIS      substituting indef. pronoun         VMFIN   modal, finite
PIAT     attributing indef. pronoun          VMINF   modal, infinitive
PIDAT    PIAT allowing determiner            VMPP    modal, past participle
PPER     personal pronoun                    XY      non-word
PPOSS    substituting possessive pronoun     $,      comma
PPOSAT   attributive possessive pronoun      $.      end-of-sentence punctuation
PRELS    substituting relative pronoun       $(      other punctuation
PRELAT   attributive relative pronoun


Corpus linguistic research

• What kind of corpus do I need?
    • corpus type
    • annotation type
• Are there corpora available or do I have to compile a new corpus?
• What kind of linguistic evidence is relevant for my research
  question?
• How can I break down the linguistic evidence into a corpus query?
    • linguistic triggers (lexical, morpho-syntactic, syntactic, semantic)
    • word level, pattern-matching, syntactic/functional/semantic analysis
    • go for the low-hanging fruit first!
• Qualitative vs. quantitative analysis


Introduction The IMS Corpus Workbench

The IMS Corpus Workbench (CWB)

• http://cwb.sourceforge.net
• tools for encoding, indexing, compression, decoding, and
frequency distribution
• corpus query processor (CQP)
• fast corpus search (regular expression syntax)
• can work efficiently with large corpora (up to 500 million words,
depending on annotations)
• use in interactive or batch mode
• results displayed in terminal windows
• CQPweb - web interface for CQP
• CWB/Perl interface for post-processing, scripting


What can you do with CQP?

• extract linguistic evidence


• examples (concordances, KWIC - keyword in context)
• frequency distributions
• processing of results
• sorting
• grouping
• counting


CQPweb

• corpus query
• simple query syntax and CQP syntax
• standard query vs. restricted query
• frequency lists
• keywords
• post-processing
• sorting
• frequencies
• frequency distribution
• collocation
• categorization


CQPweb at UdS

• https://fedora.clarin-d.uni-saarland.de/cqpweb/


Queries Basic queries

Choosing a corpus


Basic queries (CQP Syntax)

• word form
"word"
[word="word"]
• lemma
[lemma="use"]
• part-of-speech
[pos="VV"]
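
As a rough mental model, a corpus can be pictured as a list of tokens that each carry the attributes word, lemma and pos. The Python sketch below is not CQP itself; the toy corpus and the helper names matches and query are assumptions made for illustration. It uses re.fullmatch because CQP compares a pattern with the whole attribute value, not with a substring of it.

```python
import re

# Hypothetical three-token corpus; each token carries word, lemma and pos.
corpus = [
    {"word": "She", "lemma": "she", "pos": "PP"},
    {"word": "uses", "lemma": "use", "pos": "VVZ"},
    {"word": "corpora", "lemma": "corpus", "pos": "NNS"},
]

def matches(token, **constraints):
    """True if every constrained attribute matches its pattern as a whole string."""
    return all(
        re.fullmatch(pattern, token[attr]) is not None
        for attr, pattern in constraints.items()
    )

def query(tokens, **constraints):
    """Return the corpus positions of all tokens that satisfy the constraints."""
    return [i for i, tok in enumerate(tokens) if matches(tok, **constraints)]

print(query(corpus, lemma="use"))   # like [lemma="use"]
print(query(corpus, pos="VV"))      # like [pos="VV"]
print(query(corpus, pos="VV.*"))    # like [pos="VV.*"]
```

In this toy corpus the query for pos "VV" finds nothing, because the tag VVZ is not identical to VV; the pattern "VV.*" is needed to cover all full-verb tags.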


Standard Query: CQP Syntax


Concordance (def)

Definition
A concordance is an alphabetical listing of the words in a text, given
together with the contexts in which they appear. The most common
form of concordance today is the Keyword-in-Context (KWIC) index, in
which each word is centered in a fixed-length field (e.g., 80 characters).
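
A KWIC display of this kind can be sketched in Python as follows. The function name kwic and the width parameter are assumptions for the example; the context is cut to a fixed number of characters on each side so that the keyword lines up in one column.

```python
def kwic(tokens, keyword, width=30):
    """Return one line per hit, with the keyword between fixed-width contexts."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Keep only the last `width` characters of the left context
            # and the first `width` characters of the right context.
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}}  {tok}  {right:<{width}}")
    return lines

for line in kwic("the cat sat on the mat".split(), "the", width=10):
    print(line)
```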


Concordances


Queries Using regular expressions

Using regular expressions

• character set
"[whm]orse"
"[^whm]orse"
• character range
"[a-zA-Zäöüß]orse"
"[^a-z]orse"
• alternatives
"im(portant|possible)"
[lemma="research|investigation"]
• conjunction
[lemma = "research" & pos="NN"]
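
The effect of character sets, negated sets and alternatives can be tried out with Python's re module, whose syntax for these constructs is the same. The word list and the helper hits are made up for the example; re.fullmatch mimics the fact that a pattern has to cover the whole word form.

```python
import re

words = ["worse", "horse", "Morse", "gorse", "important", "impossible", "improbable"]

def hits(pattern, candidates=words):
    """Return the candidates that the pattern matches as a whole word."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

print(hits("[whm]orse"))             # character set
print(hits("[^whm]orse"))            # negated character set
print(hits("[a-zA-Z]orse"))          # character range
print(hits("im(portant|possible)"))  # alternatives
```

Note that "[whm]orse" does not find "Morse", since the set only contains lower-case letters, and that the negated set accepts any other character in first position, including upper-case letters.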


Using regular expressions: repetition operators

• optionality
"(re)?search"
• 0 - inf times (Kleene star)
"search.*"
• 1 - inf times
"search.+"
• n times
".*search.{2}"
• n - m times
".*search.{2,3}"


Using regular expressions: pos examples

• adjectives or nouns
[pos = "JJ|NN"]
• verbs
[pos = "V.*"]
• full verbs
[pos = "VV.*"]
• past tense or past participle full verbs
[pos="VV[DN]"]


Queries Regular Expressions

Characters and tokens

.          matchall (any character)        a...z, A...Z, 0...9, ?, !, ...
[aeiou]    character set                   a, e, i, o, u
[0-9]      character range                 0, 1, 2, ..., 9
[a-zA-Z]   (but NOT [a-Z] or [A-z])        a...z, A...Z
[a-zäöüß]  with diacritics                 a...z, ä, ö, ü, ß
[^aeiou]   negation of character set       b, c, d, ... but also: A, E, ...
\w         word character                  a-z, A-Z, ä, ö, ü, 0-9, _
\W         non-word character              , ; : . ! ? ...
\s         whitespace
\d         digit [0-9]                     0, 1, 2, ..., 9
\. \[ \]   escaped metacharacter           literal . [ ]


Repetition operators

?      0 - 1 times                  years?     year, years
*      0 - ∞ times (Kleene star)    want.*     want, wants, wanting
+      1 - ∞ times                  ha+        ha, haa, haaa, ha...
{n}    n times                      ha{2}      haa
{n,}   n - ∞ times                  (ha){2,}   haha, hahaha, ...
{,n}   0 - n times                  ha{,2}     h, ha, haa
{n,m}  n - m times                  ha{2,3}    haa, haaa
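
The table can be checked mechanically. The following Python sketch lists, for each pattern, strings that should and should not match as a whole token; the dictionary examples and the function check are assumptions made for this illustration. The {,n} form is left out here because regular expression engines do not all treat it the same way.

```python
import re

# pattern -> (strings that must match, strings that must not match)
examples = {
    "years?":   (["year", "years"], ["yearss"]),
    "want.*":   (["want", "wants", "wanting"], ["wan"]),
    "ha+":      (["ha", "haa", "haaa"], ["h"]),
    "ha{2}":    (["haa"], ["ha", "haaa"]),
    "(ha){2,}": (["haha", "hahaha"], ["ha"]),
    "ha{2,3}":  (["haa", "haaa"], ["ha", "haaaa"]),
}

def check(table):
    """True if every pattern accepts and rejects exactly the listed strings."""
    for pattern, (accepted, rejected) in table.items():
        if not all(re.fullmatch(pattern, s) for s in accepted):
            return False
        if any(re.fullmatch(pattern, s) for s in rejected):
            return False
    return True

print(check(examples))
```

The contrast between ha{2} and (ha){2,} shows that a repetition operator applies to the single preceding character unless parentheses group a longer unit.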


Operators

=      equals              [word = "apple"]
!=     not equals          [word != "apple"]
(...)  grouping            (im)?possible
|      disjunction (or)    ([word = "apple"] | [word = "pear"])
&      conjunction (and)   [word = "house" & pos = "NN"]


Queries Combination of attribute constraints

Combination of attribute constraints

• operators: & (and), | (or), ! (not)
• verbs with the prefix under-, over-
[(pos = "V.*") & (word = "(under|over).+")]
• word with the prefix under- where word and lemma are not the
same
[(lemma="under.+") & (word != lemma)]
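
The logic of the two combined constraints can be spelled out in Python. This is only a sketch on a made-up token list; the function names prefixed_verbs and inflected_under are hypothetical, and each function restates one of the queries above as a Boolean condition on a single token.

```python
import re

# Hypothetical annotated tokens.
tokens = [
    {"word": "underestimated", "lemma": "underestimate", "pos": "VVD"},
    {"word": "understand", "lemma": "understand", "pos": "VV"},
    {"word": "overview", "lemma": "overview", "pos": "NN"},
    {"word": "overlooks", "lemma": "overlook", "pos": "VVZ"},
]

def prefixed_verbs(toks):
    """Mirror of [(pos = "V.*") & (word = "(under|over).+")]."""
    return [
        t["word"] for t in toks
        if re.fullmatch(r"V.*", t["pos"]) and re.fullmatch(r"(under|over).+", t["word"])
    ]

def inflected_under(toks):
    """Mirror of [(lemma = "under.+") & (word != lemma)]."""
    return [
        t["word"] for t in toks
        if re.fullmatch(r"under.+", t["lemma"]) and t["word"] != t["lemma"]
    ]

print(prefixed_verbs(tokens))
print(inflected_under(tokens))
```

Both conditions are evaluated on the same token, which is what distinguishes & inside one pair of square brackets from a sequence of two bracketed tokens.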


Queries Simple patterns

Simple patterns

• word sequence
"it" "is" "possible" "that"
• adjective noun collocations
[pos="JJ.*"][pos="NN.*"]
• semantically related words
[word="\w+"] "and|or" [word="\w+"]
• verbally derived adjectives
[(word=".+(ed|ing)") & (pos="JJ")][pos="NN"]


Simple patterns: finding nearby words

• an optional token
"it" "is" []? "possible" "that"
• a range of tokens
"it" "is" []{1,3} "possible" "that"
• any number of optional tokens
"it" "is" []* "possible" "that"
• avoid crossing sentence boundaries
"it" "is" []* "possible" "that" within s
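
How a token pattern with a flexible gap is matched, and what the restriction to one sentence changes, can be illustrated with a small backtracking matcher in Python. Everything here is an assumption made for the example (the names word, any_token, match_at and find, the triple format, and the choice to prefer the shortest match); it is not a description of how CQP is implemented.

```python
def word(form):
    """Constraint that accepts exactly one word form, like "it" in a query."""
    return lambda tok: tok == form

def any_token(tok):
    """Constraint that accepts every token, like [] in a query."""
    return True

# "it" "is" []{1,3} "possible" "that" as (constraint, min, max) triples
PATTERN = [
    (word("it"), 1, 1),
    (word("is"), 1, 1),
    (any_token, 1, 3),
    (word("possible"), 1, 1),
    (word("that"), 1, 1),
]

def match_at(tokens, start, pattern):
    """Return the end (exclusive) of the shortest match at start, or None."""
    def step(pos, k):
        if k == len(pattern):
            return pos
        test, lo, hi = pattern[k]
        for n in range(lo, hi + 1):      # try short repetitions first
            chunk = tokens[pos:pos + n]
            if len(chunk) < n or not all(test(t) for t in chunk):
                break                    # longer repetitions cannot succeed either
            end = step(pos + n, k + 1)
            if end is not None:
                return end
        return None
    return step(start, 0)

def find(sentences, pattern):
    """Search sentence by sentence, so no match crosses a sentence boundary."""
    hits = []
    for sent in sentences:
        for start in range(len(sent)):
            end = match_at(sent, start, pattern)
            if end is not None:
                hits.append(" ".join(sent[start:end]))
    return hits

sentences = [
    ["it", "is", "quite", "possible", "that", "he", "left"],
    ["it", "is"],
    ["very", "possible", "that"],
]
print(find(sentences, PATTERN))             # one sentence at a time
print(find([sum(sentences, [])], PATTERN))  # boundaries ignored
```

When the three sentences are treated as one token stream, a second hit appears that starts in one sentence and ends in the next; searching sentence by sentence rules it out.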


Queries Querying structures

Querying structural attributes

• titles
/region[title];
<title>[]*</title>
• an ngram including a full verb
<ngram>[]*[pos="VV.*"][]*</ngram>

• present participle at the beginning of a sentence
<s>[pos="VVG"]
• present participle at the end of a sentence
[pos="VVG"][pos="SENT"]? </s>
• an adjective which is not inside an evaluative pattern
[pos="JJ.*" & !evaluation]
• a full verb in an evaluative pattern of possibility
[pos="VV.*" & _.evaluation_meaning="possibility"]


Post-processing Frequency breakdown

Frequency breakdown

• frequencies of matches in the corpus


• words
• annotations (part-of-speech)
• words + annotations (part-of-speech)
• frequencies and percentage
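
The arithmetic behind a frequency breakdown is simple counting. A minimal Python sketch, assuming the hits of a query are available as a plain list of strings (the function name breakdown is made up for the example):

```python
from collections import Counter

def breakdown(hits):
    """Return (item, frequency, percentage) rows, most frequent first."""
    counts = Counter(hits)
    total = sum(counts.values())
    return [(item, n, round(100 * n / total, 1)) for item, n in counts.most_common()]

# Hypothetical word forms matched by a lemma query
for row in breakdown(["use", "uses", "use", "used"]):
    print(row)
```

The same function works for annotations or for word and annotation pairs if the list holds tags or tuples instead of word forms.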


Post-processing Sorting

Sorting results


Post-processing Frequency distribution

Frequency distribution

• frequency of query
• distribution across text categories
• distribution table and bar charts
• cross tables
• (download of) text frequencies
• frequencies and frequency per million (words)
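
Frequency per million words normalizes the hit count by the size of each text category, which makes categories of different size comparable. A small Python sketch with invented category names and sizes:

```python
def per_million(hits, size):
    """Relative frequency: hits per one million words."""
    return hits * 1_000_000 / size

def distribution(hit_counts, category_sizes):
    """Return {category: (absolute frequency, frequency per million words)}."""
    table = {}
    for category, size in category_sizes.items():
        hits = hit_counts.get(category, 0)
        table[category] = (hits, round(per_million(hits, size), 2))
    return table

# Hypothetical category sizes (in words) and hit counts
sizes = {"news": 2_000_000, "fiction": 500_000, "academic": 1_000_000}
hits = {"news": 50, "fiction": 10}
print(distribution(hits, sizes))
```

In this invented example "news" has five times as many hits as "fiction", but the normalized values, 25 and 20 per million, are close to each other.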


Frequency distribution: cross table


Frequency distribution: bar chart


Post-processing Collocations

“You shall know a word by the company it keeps” Firth (1957)

• semantic aspects of a word


• usage of a word
• terminology
• character of the text (genre, register)
• important for translation


Collocation: a definition

Definition
“lexically and/or pragmatically constrained recurrent co-occurrences of
at least two lexical items which are in a direct syntactic relation with
each other” Bartsch (2004)

• words that show a tendency to co-occur


• statistically salient patterns
• predictability of word combinations - “mutual expectancy” Firth
(1957)
• semi-compositional and lexically determined word combination
Grossmann & Tutin (2003)


Collocation in CQPweb

• based on word forms/part-of-speech


• different statistics: mutual information, log-likelihood, t-score,
z-score, ...;
• window size 4-10 in both directions
• total frequency, expected/observed collocate frequency
• filtering according to collocate or tags
• download collocation results
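
Several of these statistics compare the observed co-occurrence frequency O with the frequency E that would be expected if node and collocate were independent. The Python sketch below uses common textbook definitions of mutual information, t-score and z-score; whether CQPweb computes exactly these variants, and how it scales E by the window size, is not stated here, so the window parameter and the function names are assumptions.

```python
import math

def expected(f_node, f_collocate, corpus_size, window=1):
    """Expected co-occurrence frequency under independence (window is an assumption)."""
    return f_node * f_collocate * window / corpus_size

def mutual_information(observed, exp):
    """MI = log2(O / E)."""
    return math.log2(observed / exp)

def t_score(observed, exp):
    """t = (O - E) / sqrt(O)."""
    return (observed - exp) / math.sqrt(observed)

def z_score(observed, exp):
    """z = (O - E) / sqrt(E)."""
    return (observed - exp) / math.sqrt(exp)

# Invented frequencies: node 1,000, collocate 2,000, corpus of 1,000,000 tokens
e = expected(1000, 2000, 1_000_000)
print(e, mutual_information(32, e), t_score(16, e))
```

Mutual information tends to favour rare collocates, since a small E inflates the ratio O / E, while the t-score gives more weight to frequent pairs; this is one reason to compare more than one statistic.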


Collocations: choose settings


Collocations


Collocations: detailed information


Post-processing Frequency distribution II

Anchor points

1. match (beginning of match)


2. matchend (end of match)
3. target (marked by @)


Anchor points: target

• the target position is set using @
"it" [lemma="be"] @[pos="JJ.*"] "to|that"
• a target set on an optional element will also be optional
"it" [lemma="be"] @[]? [pos="JJ.*"] "to|that"
• a target set on an element with flexible length will be set on the
last item
"it" [lemma="be"] @[word!="to|that"]{1,3} [pos="JJ.*"] "to|that"


Download tabulate

• frequencies of matches
• based on tabulate command
• TABULATE OPTIONS:
groups any number and combination of tokens
• SORTING OPTIONS:
sorting, counting and display option
• at the moment: download function

TABULATE OPTIONS :
• range of tokens
match .. matchend lemma
• separate tokens
match lemma, matchend pos
• structural attributes
match .. matchend lemma, match text_reg, match
ngram_len
SORTING OPTIONS :
• sorting and counting
• sorting and counting; display as matrix
• sorting and counting; display as matrix; include second column in
sort string
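
What "sorting and counting" and "display as matrix" amount to can be sketched in Python. The rows below stand for a hypothetical tabulate result with two columns (match lemma, matchend pos); the function names sort_and_count and as_matrix are made up for the example.

```python
from collections import Counter

# Hypothetical tabulate output: one (match lemma, matchend pos) pair per hit
rows = [("be", "JJ"), ("be", "JJ"), ("be", "JJR"), ("seem", "JJ")]

def sort_and_count(pairs):
    """Count identical rows and list them by descending frequency."""
    return Counter(pairs).most_common()

def as_matrix(pairs):
    """Cross-tabulate the first column (rows) against the second (columns)."""
    counts = Counter(pairs)
    row_keys = sorted({first for first, _ in pairs})
    col_keys = sorted({second for _, second in pairs})
    return col_keys, {r: [counts[(r, c)] for c in col_keys] for r in row_keys}

print(sort_and_count(rows))
columns, matrix = as_matrix(rows)
print(columns)
for key, cells in matrix.items():
    print(key, cells)
```

The matrix view makes empty combinations visible as zeros, which the plain sorted list does not show.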


Sorting and counting


Sorting and counting, display as matrix


Extras

Saving and exporting

• save query results in personal workspace


• exporting
• (sorted) query results
• frequency breakdown
• collocation results
• text frequencies
• frequency distribution of query results


Saved queries


Query history

• queries are stored in a query history


• insert query string into query window
• recreate query results


Categorise queries

• categorize query results


• develop set of categories
• assign categories to query results


Online corpora based on CQP

• http://cqpweb.lancs.ac.uk/
Present-day English; Historical English; Learner English; Arabic;
European languages; South and East Asian languages; LACITO
corpora
• BNC web: http://bncweb.lancs.ac.uk/cgi-binbncXML/BNCquery.pl?theQuery=search&urlTest=yes
• OPUS: http://opus.lingfil.uu.se/bin/opuscqp.pl
Parallel corpora
• A collection of English Corpora
(http://corpus.leeds.ac.uk/protected/query.html)
• find more on http://cwb.sourceforge.net/demos.php


Other online corpora

• COSMAS: http://www.ids-mannheim.de/cosmas2/
• Digitales Wörterbuch der Deutschen Sprache:
http://www.dwds.de/
• FALKO:
http://korpling.german.hu-berlin.de/falko-suche/
• BYU corpora: http://corpus.byu.edu/
Contemporary and Historical American English; BNC; ...
• Concordancier-corpus français:
http://www.lextutor.ca/concordancers/concord_f.html


References

Bartsch, S. 2004. Structural and functional properties of collocations in English. A
corpus study of lexical and pragmatic constraints on lexical co-occurrence. Narr,
Tübingen.
Burger, H. 2007. Phraseologie. Eine Einführung am Beispiel des Deutschen. Schmidt,
Berlin.
Evert, S. 2004. The statistics of word cooccurrences – word pairs and collocations.
Ph.D. thesis, University of Stuttgart.
Evert, S. 2008. Corpora and collocations. In A. Lüdeling and M. Kytö, editors, Corpus
Linguistics. An International Handbook, volume I. Mouton de Gruyter, Berlin.
Firth, J. 1957. A synopsis of linguistic theory 1930–55. In Studies in linguistic
analysis, pages 1–32. The Philological Society, Oxford.
Garside, R., Leech, G., & McEnery, A. 1997. Corpus Annotation. Addison Wesley
Longman, Harlow.
Grossmann, F. & Tutin, A., editors. 2003. Les collocations - analyse et traitement,
volume E1 of Travaux et Recherches en Linguistique Appliquée. De Werelt,
Amsterdam.
Hausmann, F. J. 2004. Was sind eigentlich Kollokationen. In K. Steyer, editor,
Wortverbindungen – mehr oder weniger fest, Jahrbuch 2003, pages 309–334.
Institut für Deutsche Sprache.
Heid, U. 2007. Computational linguistic aspects of phraseology II. In H. Burger,
D. Dobrovol’skij, P. Kühn, and N. R. Norrick, editors, Phraseologie/Phraseology. Ein
internationales Handbuch der zeitgenössischen Forschung/An International
Handbook of Contemporary Research, pages 1036–1044. Mouton de Gruyter,
Berlin.
Heid, U. 2008. Computational Phraseology: an Overview. In S. Granger and
F. Meunier, editors, Phraseology: An interdisciplinary perspective, pages 337–360.
John Benjamins Publishing Company.
Lemnitzer, L. & Zinsmeister, H. 2006. Korpuslinguistik: Eine Einführung. Narr,
Tübingen.
Lüdeling, A. & Kytö, M., editors. 2008a. Corpus Linguistics. An International
Handbook, volume 1 of Handbücher zur Sprach- und
Kommunikationswissenschaft. Mouton de Gruyter, Berlin.
Lüdeling, A. & Kytö, M., editors. 2008b. Corpus Linguistics. An International
Handbook, volume 2 of Handbücher zur Sprach- und
Kommunikationswissenschaft. Mouton de Gruyter, Berlin.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford University Press, Oxford.
Storrer, A. 2011. Korpusgestützte Sprachanalyse in Lexikographie und Phraseologie.
In K. Knapp, editor, Angewandte Linguistik. Ein Lehrbuch. Francke Verlag. 3.
Auflage.
