CQPweb tutorial

Querying, postprocessing, exporting

Hannah Kermes

July 21, 2013

Outline
1 Introduction
    Corpus Linguistics
    The IMS Corpus Workbench
2 Queries
    Basic queries
    Using regular expressions
    Regular expressions summary
    Regular Expressions
    Combination of attribute constraints
    Simple patterns
    Querying structures
3 Post-processing
    Frequency breakdown
    Sorting
    Frequency distribution
    Collocations
    Frequency distribution II
Introduction Corpus Linguistics

Corpus linguistics

• investigate language use and usage on the basis of natural
  language data
• keyword in context (kwic)
• quantitative and qualitative analysis


What is a corpus?

Lemnitzer & Zinsmeister (2006), p. 7:

A corpus is a collection of written or spoken utterances. The data of
the corpus are typically digitized, i.e. stored on computers and
machine-readable. The components of the corpus, the texts, consist of
the data themselves and possibly of metadata that describe these data,
and of linguistic annotations that are assigned to these data.


What is a corpus?

Sinclair (1991), p. 171:

• A collection of naturally occurring language text, chosen to
characterize a state or variety of a language
• A body of naturally-occurring (authentic) language data which can
be used as a basis for linguistic research
Leech in Garside et al. (1997), p. 1:
• In the past thirty-five years, the term corpus has been increasingly
applied to a body of language material which exists in electronic
form, and which may be processed by computer for various
purposes


What is a Corpus?

• a collection of written or spoken utterances

• typically electronic and machine readable
• generally assembled with particular purposes in mind
• data, metadata, linguistic annotation


Basic annotation on token level

• RAW text: only sequences of characters without explicit
  information about words or sentences
• tokenizing: segmentation of RAW text into sentences and words
  (tokens)
  • sequence of characters is divided into sentences
  • sentences are divided into tokens
• stemming/lemmatizing
  • stemming: cutting off suffixes (no lexicon involved)
  • lemmatization: base form taken from a lexicon
• PoS-tagging
  • labeling each word in a sequence of words with the appropriate
    part-of-speech (pos)
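
The annotation steps above can be sketched in a few lines of Python. This is only a toy illustration: the suffix list, the `LEXICON` dictionary, and the function names are assumptions made for this example and are not part of CWB or CQPweb. It shows the difference the slide describes: stemming only cuts off suffixes, while lemmatization looks the base form up in a lexicon.

```python
import re

# Hypothetical toy lexicon mapping word forms to lemmas (illustration only).
LEXICON = {"went": "go", "mice": "mouse", "studies": "study", "was": "be"}

# Hypothetical suffix list for the naive stemmer (illustration only).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(raw):
    """Split raw text into sentences, then split each sentence into tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", raw.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s]


def stem(token):
    """Cut off the first matching suffix; no lexicon is consulted."""
    lower = token.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return lower[: -len(suffix)]
    return lower


def lemmatize(token):
    """Look the base form up in the lexicon; fall back to the lowercased form."""
    return LEXICON.get(token.lower(), token.lower())


sentences = tokenize("The mice went home. She studies corpora!")
for sentence in sentences:
    for token in sentence:
        print(token, stem(token), lemmatize(token), sep="\t")
```

Note that the stemmer turns "studies" into "studi" and leaves "went" untouched, whereas the lexicon lookup yields "study" and "go".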


Tagsets

• A tagset is a set of part-of-speech tags
• The size and choice of the tagset vary
  • a dozen or several hundred tags
  • usually between 50 and 200
• Classical 8 classes: noun, verb, article, particle, pronoun,
  preposition, adverb, conjunction
• large tagsets can include inflection information
  (number, gender, case features)
• problem: sparse data
(1) Er hat seine Frau gesprochen ("He spoke to his wife") → "seine Frau" is the direct object
(2) Dann hat seine Frau gesprochen ("Then his wife spoke") → "seine Frau" is the subject


Penn Treebank Tagset

Tag   Description                 Tag   Description
CC    Coordinating conjunction    TO    to
CD    Cardinal number             UH    Interjection
DT    Determiner                  VV    verb, base form
EX    Existential there           VVD   verb, past tense
FW    Foreign word                VVG   verb, gerund or pres. part.
IN    Preposition / subjunction   VVN   verb, past participle
JJ    Adjective                   VVP   verb, non-3rd p. sing. pres.
JJR   Adjective, comparative      VVZ   verb, 3rd person sing. pres.
JJS   Adjective, superlative      VB    aux be, base form
LS    List item marker            VBD   aux be, past tense
MD    Modal                       VBG   aux be, gerund or pres. part.
NN    Noun, singular or mass      VBN   aux be, past participle
NNS   Noun, plural                VBP   aux be, non-3rd p. sing. pres.
NNP   Proper noun, singular       VBZ   aux be, 3rd person sing. pres.
NNPS  Proper noun, plural         VH    aux have, base form
PDT   Predeterminer               VHD   aux have, past tense
POS   Possessive ending           VHG   aux have, gerund or pres. part.
PP    Personal pronoun            VHN   aux have, past participle
PP$   Possessive pronoun          VHP   aux have, non-3rd p. sing. pres.
RB    Adverb                      VHZ   aux have, 3rd person sing. pres.
RBR   Adverb, comparative         WDT   Wh-determiner
RBS   Adverb, superlative         WP    Wh-pronoun
RP    Particle                    WP$   Possessive wh-pronoun
SYM   Symbol                      WRB   Wh-adverb

Stuttgart-Tübingen Tagset (STTS)

Tag      Description                         Tag     Description
ADJA     attributive adj.                    PRF     reflexive personal pronoun
ADJD     predicative or adverbial adj.       PWS     substituting interrogative pron.
ADV      adverb                              PWAT    attributive interrogative pron.
APPR     preposition / left circumpos.       PWAV    adverbial interrog./rel. pron.
APPRART  preposition with fused article      PAV     pronominal adverb
APPO     postposition                        PTKZU   zu preceding infinitive
APZR     right circumpos.                    PTKNEG  negation particle
ART      definite or indef. article          PTKVZ   separated verb particle
CARD     cardinal number                     PTKANT  reply particle
FM       foreign-language material           PTKA    particle with adj. or adv.
ITJ      interjection                        TRUNC   truncated word
KOUI     subjunction with zu+Inf.            VVFIN   full verb, finite
KOUS     subjunction with clause             VVIMP   full verb, imperative
KON      coordinating conjunction            VVINF   full verb, infinitive
KOKOM    comparative conjunction             VVIZU   full verb, zu-infinitive
NN       common noun                         VVPP    full verb, past participle
NE       proper noun                         VAFIN   auxiliary, finite
PDS      substituting demonstrative pronoun  VAINF   auxiliary, infinitive
PDAT     attributive demonstrative pronoun   VAPP    auxiliary, past participle
PIS      substituting indef. pronoun         VMFIN   modal, finite
PIAT     attributive indef. pronoun          VMINF   modal, infinitive
PIDAT    PIAT allowing determiner            VMPP    modal, past participle
PPER     personal pronoun                    XY      non-word
PPOSS    substituting possessive pronoun     $,      comma
PPOSAT   attributive possessive pronoun      $.      end-of-sentence punctuation
PRELS    substituting relative pronoun       $(      other punctuation
PRELAT   attributive relative pronoun


Corpus linguistic research

• What kind of corpus do I need?
  • corpus type
  • annotation type
• Are there corpora available or do I have to compile a new corpus?
• What kind of linguistic evidence is relevant for my research
  question?
• How can I break down the linguistic evidence into a corpus query?
  • linguistic triggers (lexical, morpho-syntactic, syntactic, semantic)
  • word level, pattern-matching, syntactic/functional/semantic analysis
  • go for the low-hanging fruit first!
• Qualitative vs. quantitative analysis


Introduction The IMS Corpus Workbench

The IMS Corpus Workbench (CWB)

• http://cwb.sourceforge.net
• tools for encoding, indexing, compression, decoding, and
  frequency distribution
• corpus query processor (CQP)
  • fast corpus search (regular expression syntax)
  • can work efficiently with large corpora (up to 500 million words,
    depending on annotations)
  • use in interactive or batch mode
  • results displayed in terminal windows
• CQPweb - web interface for CQP
• CWB/Perl interface for post-processing, scripting


What can you do with CQP?

• extract linguistic evidence
  • examples (concordances, KWIC - keyword in context)
  • frequency distributions
• processing of results
  • sorting
  • grouping
  • counting


CQPweb

• corpus query
  • simple query syntax and CQP syntax
  • standard query vs. restricted query
• frequency lists
• keywords
• post-processing
  • sorting
  • frequencies
  • frequency distribution
  • collocation
  • categorization


CQPweb at UdS

• https://fedora.clarin-d.uni-saarland.de/cqpweb/


Queries Basic queries

Choosing a corpus


Basic queries (CQP Syntax)

• word form
  "word"
  [word="word"]
• lemma
  [lemma="use"]
• part-of-speech
  [pos="VV"]


Standard Query: CQP Syntax


Concordance (def)

Definition
A concordance is an alphabetical listing of the words in a text, given
together with the contexts in which they appear. The most common
form of concordance today is the Keyword-in-Context (KWIC) index, in
which each word is centered in a fixed-length field (e.g., 80 characters).
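
A KWIC display of this kind can be sketched in Python as follows. This is a minimal illustration under simple assumptions (whitespace-tokenized input, a context width measured in characters); the function name `kwic` and its parameters are hypothetical and do not reflect how CQPweb builds its concordances.

```python
def kwic(tokens, keyword, width=30):
    """Return one KWIC line per occurrence of keyword, with the keyword column aligned."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != keyword.lower():
            continue
        # Keep at most `width` characters of context on each side.
        left = " ".join(tokens[:i])[-width:]
        right = " ".join(tokens[i + 1:])[:width]
        # Right-align the left context so that all keywords line up.
        lines.append(f"{left:>{width}}  {token}  {right}")
    return lines


text = "it is possible that the result is possible to reproduce"
concordance = kwic(text.split(), "possible", width=20)
for line in concordance:
    print(line)
```

Because the left context is padded to a fixed width, every occurrence of the keyword starts in the same column, which is what makes the listing easy to scan.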


Concordances


Queries Using regular expressions

Using regular expressions

• character set
  "[whm]orse"
  "[^whm]orse"
• character range
  "[a-zA-Zäöüß]orse"
  "[^a-z]orse"
• alternatives
  "im(portant|possible)"
  [lemma="research|investigation"]
• conjunction
  [lemma = "research" & pos="NN"]
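
The character-set and alternation patterns above can be tried out with Python's `re` module. This is only an approximation of CQP's behaviour: `re.fullmatch` is used because a CQP regular expression has to match the whole token, and the word list and the helper name `matching` are made up for this example.

```python
import re

# Made-up word list for illustration.
words = ["horse", "worse", "Morse", "Norse", "important", "impossible", "imperfect"]


def matching(pattern, tokens):
    """Return the tokens that the pattern matches in full."""
    return [t for t in tokens if re.fullmatch(pattern, t)]


char_set = matching(r"[whm]orse", words)          # one of w, h, m, then "orse"
negated = matching(r"[^whm]orse", words)          # any other character, then "orse"
alternatives = matching(r"im(portant|possible)", words)

print(char_set)
print(negated)
print(alternatives)
```

The negated set also matches the capitalized "Morse", since the set `[whm]` only lists lowercase letters.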


Using regular expressions: repetition operators

• optionality
  "(re)?search"
• 0 - inf times (Kleene star)
  "search.*"
• 1 - inf times
  "search.+"
• n times
  ".*search.{2}"
• n - m times
  ".*search.{2,3}"


Using regular expressions: pos examples

• adjectives or nouns
  [pos = "JJ|NN"]
• verbs
  [pos = "V.*"]
• full verbs
  [pos = "VV.*"]
• past tense or past participle full verbs
  [pos="VV[DN]"]
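
To see which tags each of these patterns selects, the expressions can be run against a list of tags in Python. The tag list below is a small hand-picked sample from the Penn Treebank tagset shown earlier, and `select` is a hypothetical helper; full-string matching is used to mirror how CQP compares a pattern with an attribute value.

```python
import re

# Hand-picked sample of tags (illustration only).
tags = ["JJ", "JJR", "NN", "NNS", "VV", "VVD", "VVN", "VVZ", "VBD", "VHZ", "MD"]


def select(pattern):
    """Return the tags matched in full by the pattern."""
    return [t for t in tags if re.fullmatch(pattern, t)]


adjectives_or_nouns = select(r"JJ|NN")
all_verbs = select(r"V.*")
full_verbs = select(r"VV.*")
past_full_verbs = select(r"VV[DN]")

print(adjectives_or_nouns)
print(all_verbs)
print(full_verbs)
print(past_full_verbs)
```

The first pattern selects only the exact tags JJ and NN; to include comparatives or plurals (JJR, NNS) a pattern such as "JJ.*|NN.*" would be needed.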


Queries Regular Expressions

Characters and tokens

Pattern     Meaning                                      Matches
.           matchall (any character)                     a...z, A...Z, 0...9, ?, !, ...
[aeiou]     character set                                a, e, i, o, u
[0-9]       character range                              0, 1, 2, ... 9
[a-zA-Z]    character ranges (but NOT [a-Z] or [A-z])    a...z, A...Z
[a-zäöüß]   character range with diacritics              a...z, ä, ö, ü, ß
[^aeiou]    negation of character set                    b, c, d, ... but also: A, E, ...
\w          word character                               a-z, A-Z, ä, ö, ü, 0-9, _
\W          non-word character                           , ; : . ! ? ...
\s          whitespace
\d          digit, same as [0-9]                         0, 1, 2, ... 9
\. \[ \]    escaped metacharacter (literal)              . [ ]


Repetition operators

Operator  Meaning                     Example    Matches
?         0 - 1 times                 years?     year, years
*         0 - ∞ times (Kleene star)   want.*     want, wants, wanting
+         1 - ∞ times                 ha+        ha, haa, haaa, ...
{n}       n times                     ha{2}      haa
{n,}      n - ∞ times                 (ha){2,}   haha, hahaha, ...
{,n}      0 - n times                 ha{,2}     h, ha, haa
{n,m}     n - m times                 ha{2,3}    haa, haaa
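
The repetition operators can be checked with a short Python sketch. It uses Python's `re` module as a stand-in for the regular expression engine behind CQP, so details may differ; the candidate list and the helper `matches` are invented for this example. The main point is that an operator applies to the single preceding character unless parentheses group a longer sequence.

```python
import re

# Invented candidate strings for illustration.
candidates = ["h", "ha", "haa", "haaa", "haha", "hahaha", "year", "years"]


def matches(pattern):
    """Return the candidates matched in full by the pattern."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


optional = matches(r"years?")             # the final "s" may be missing
one_or_more = matches(r"ha+")             # "+" repeats only the "a"
exactly_two = matches(r"ha{2}")
two_to_three = matches(r"ha{2,3}")
group_twice_or_more = matches(r"(ha){2,}")  # the group "ha" is repeated

print(optional, one_or_more, exactly_two, two_to_three, group_twice_or_more)
```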


Operators

Operator  Meaning             Example
=         equals              [word = "apple"]
!=        not equals          [word != "apple"]
(...)     grouping            "(im)?possible"
|         disjunction (or)    ([word = "apple"] | [word = "pear"])
&         conjunction (and)   [word = "house" & pos = "NN"]


Queries Combination of attribute constraints

Combination of attribute constraints

• operators: & (and), | (or), ! (not)
• verbs with the prefix under- or over-
  [(pos = "V.*") & (word = "(under|over).+")]
• words with the prefix under- whose word form and lemma differ
  [(lemma="under.+") & (word != lemma)]
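
The logic of these two queries can be mimicked in Python by treating each token as a record of attribute values. The token list and the helper `full` are assumptions made for this sketch; CQP evaluates such constraints on its indexed corpus rather than on dictionaries.

```python
import re

# Hypothetical annotated tokens (word form, lemma, part-of-speech).
tokens = [
    {"word": "underestimated", "lemma": "underestimate", "pos": "VVD"},
    {"word": "overview", "lemma": "overview", "pos": "NN"},
    {"word": "understand", "lemma": "understand", "pos": "VV"},
    {"word": "overlooks", "lemma": "overlook", "pos": "VVZ"},
]


def full(pattern, value):
    """True if the pattern matches the whole attribute value."""
    return re.fullmatch(pattern, value) is not None


# [(pos = "V.*") & (word = "(under|over).+")]
prefixed_verbs = [
    t["word"] for t in tokens
    if full(r"V.*", t["pos"]) and full(r"(under|over).+", t["word"])
]

# [(lemma = "under.+") & (word != lemma)]
inflected_under = [
    t["word"] for t in tokens
    if full(r"under.+", t["lemma"]) and t["word"] != t["lemma"]
]

print(prefixed_verbs)
print(inflected_under)
```

The noun "overview" is excluded by the part-of-speech constraint, and "understand" is excluded from the second result because its word form equals its lemma.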


Queries Simple patterns

Simple patterns

• word sequence
  "it" "is" "possible" "that"
• adjective noun collocations
  [pos="JJ.*"][pos="NN.*"]
• semantically related words
  [word="\w+"] "and|or" [word="\w+"]
• verbally derived adjectives
  [(word=".+(ed|ing)") & (pos="JJ")][pos="NN"]


Simple patterns: finding nearby words

• an optional token
  "it" "is" []? "possible" "that"
• a range of tokens
  "it" "is" []{1,3} "possible" "that"
• any number of tokens
  "it" "is" []* "possible" "that"
• avoid crossing sentence boundaries
  "it" "is" []* "possible" "that" within s
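
What a query such as "it" "is" []{1,3} "possible" "that" within s does can be sketched in Python. The function `find_with_gap` and the sample sentences are hypothetical; the sketch searches each sentence separately, which is the effect of the "within s" restriction, and allows between `min_gap` and `max_gap` arbitrary tokens between the two fixed sequences.

```python
def find_with_gap(sentences, before, after, min_gap=0, max_gap=3):
    """Find `before` + gap + `after` token sequences without leaving a sentence."""
    hits = []
    for sentence in sentences:
        for start in range(len(sentence)):
            if sentence[start:start + len(before)] != before:
                continue
            gap_start = start + len(before)
            # Try every permitted number of intervening tokens.
            for gap in range(min_gap, max_gap + 1):
                end = gap_start + gap + len(after)
                if sentence[gap_start + gap:end] == after:
                    hits.append(sentence[start:end])
    return hits


sentences = [
    ["it", "is", "quite", "possible", "that", "he", "left", "."],
    ["it", "is", "."],
    ["possible", "that", "is", "all", "."],
]

within_s = find_with_gap(sentences, ["it", "is"], ["possible", "that"], 1, 3)

# Without sentence boundaries: treat the corpus as one long token stream.
flat = [token for sentence in sentences for token in sentence]
across_s = find_with_gap([flat], ["it", "is"], ["possible", "that"], 1, 3)

print(within_s)
print(across_s)
```

Searching the flat token stream produces an extra hit that spans the boundary between the second and third sentence, which is exactly what the sentence restriction avoids.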


Queries Querying structures

Querying structural attributes

• titles
  /region[title];
  <title>[]*</title>
• an n-gram including a full verb
  <ngram>[]*[pos="VV.*"][]*</ngram>


Querying structural attributes

• present participle at the beginning of a sentence
  <s>[pos="VVG"]
• present participle at the end of a sentence
  [pos="VVG"][pos="SENT"]? </s>
• an adjective which is not inside an evaluative pattern
  [pos="JJ.*" & !evaluation]
• a full verb in an evaluative pattern of possibility
  [pos="VV.*" & _.evaluation_meaning="possibility"]


Post-processing Frequency breakdown

Frequency breakdown

• frequencies of matches in the corpus
  • words
  • annotations (part-of-speech)
  • words + annotations (part-of-speech)
• frequencies and percentages
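
A frequency breakdown of this kind boils down to counting the matched items and expressing each count as a share of all matches. The following Python sketch shows the idea on an invented list of matched word forms; the function name `frequency_breakdown` is hypothetical and the rounding to one decimal is an arbitrary choice.

```python
from collections import Counter


def frequency_breakdown(matches):
    """Return (item, frequency, percentage) triples, most frequent first."""
    counts = Counter(matches)
    total = sum(counts.values())
    return [(item, n, round(100 * n / total, 1)) for item, n in counts.most_common()]


# Invented word forms matched by a query such as [lemma="use"].
matches = ["use", "uses", "use", "used", "use", "uses", "used", "use"]
breakdown = frequency_breakdown(matches)

for item, n, percent in breakdown:
    print(f"{item}\t{n}\t{percent}%")
```

The same function works for annotations or word/annotation pairs: pass a list of tags, or a list of (word, tag) tuples, instead of word forms.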


Post-processing Sorting

Sorting results


Post-processing Frequency distribution

Frequency distribution

• frequency of query
• distribution across text categories
• distribution table and bar charts
• cross tables
• (download of) text frequencies
• frequencies and frequency per million (words)
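
Frequency per million words makes hit counts comparable across text categories of different sizes: the number of hits is divided by the number of words in the category and multiplied by one million. A minimal Python sketch, with invented category names and counts:

```python
def per_million(hits, corpus_size):
    """Normalize an absolute frequency to hits per million words."""
    return hits * 1_000_000 / corpus_size


# Invented figures: (hits, number of words) per text category.
categories = {"academic": (120, 400_000), "news": (90, 1_500_000)}

normalized = {
    name: per_million(hits, size) for name, (hits, size) in categories.items()
}

for name, value in normalized.items():
    print(f"{name}: {value:.1f} per million words")
```

In this made-up example the two categories have similar absolute frequencies, but the normalized figures differ by a factor of five because the second category is much larger.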


Frequency distribution: cross table


Frequency distribution: bar chart


Post-processing Collocations

“You shall know a word by the company it keeps” (Firth 1957)

• semantic aspects of a word

• usage of a word
• terminology
• character of the text (genre, register)
• important for translation


Collocation: a definition

Definition
“lexically and/or pragmatically constrained recurrent co-occurrences of
at least two lexical items which are in a direct syntactic relation with
each other” Bartsch (2004)

• words that show a tendency to co-occur

• statistically salient patterns
• predictability of word combinations - “mutual expectancy” Firth
(1957)
• semi-compositional and lexically determined word combination
Grossmann & Tutin (2003)


Collocation in CQPweb

• based on word forms/part-of-speech

• different statistics: mutual information, log-likelihood, t-score,
  z-score, ...
• window size 4-10 in both directions
• total frequency, expected/observed collocate frequency
• filtering according to collocate or tags
• download collocation results
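
The relation between observed and expected collocate frequency and some of the statistics listed above can be illustrated with common textbook formulas. This is a sketch under stated assumptions: the expected frequency is estimated as node frequency × collocate frequency × window size / corpus size, and the figures are invented. The exact formulas CQPweb applies may differ, so the function `association_scores` should not be read as its implementation.

```python
import math


def association_scores(observed, node_freq, collocate_freq, corpus_size, window_size):
    """Compute expected frequency and three association scores (textbook versions).

    `window_size` is the total number of token positions considered around
    each node occurrence (an assumption of this sketch). `observed` must be > 0.
    """
    expected = node_freq * collocate_freq * window_size / corpus_size
    return {
        "expected": expected,
        "mutual_information": math.log2(observed / expected),
        "t_score": (observed - expected) / math.sqrt(observed),
        "z_score": (observed - expected) / math.sqrt(expected),
    }


# Invented counts for illustration.
scores = association_scores(
    observed=40, node_freq=500, collocate_freq=2_000,
    corpus_size=1_000_000, window_size=8,
)

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

All three scores compare the same two numbers, observed and expected, but weight them differently, which is why they can rank collocates in different orders.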


Collocations: choose settings


Collocations