CQP Query Language Tutorial
Contents

1 Introduction
1.1
1.2
1.3
2.1 Getting started
2.2
2.3 Display options
2.4 Useful options
2.5
2.6
2.7
2.8
2.9
3.1
3.2
3.3 Anchor points
3.4 Frequency distributions
3.5
3.6 Random subsets
3.7
4.1 Using labels
4.2 Structural attributes
4.3
4.4
5.1
5.2 Word lists
5.3 Subqueries
5.4
5.5
5.6
6.1
6.2
6.3
7 Undocumented CQP
7.1 Zero-width assertions
7.2
7.3 Easter eggs
A Appendix
Introduction
1.1
Designing and evaluating Extraction Tools for Collocations in Dictionaries and Corpora
Technical aspects
CWB uses a proprietary token-based format for corpus storage:
binary encoding ⇒ fast access
1.2
The following steps illustrate the transformation of textual data with some XML markup into the
CWB data format.
1. Formatted text (as displayed on-screen or printed)
An easy example. Another very easy example. Only the easiest examples!
2. Text with XML markup (at the level of texts, words or characters)
<text id=42 lang="English"> <s>An easy example.</s><s> Another <i>very</i> easy
example.</s> <s><b>O</b>nly the <b>ea</b>siest ex<b>a</b>mples!</s> </text>
is not allowed in a CWB corpus (the embedded <np> region will automatically be dropped).2 In
the recommended encoding procedure, embedded regions (up to a pre-defined level of embedding)
are automatically renamed by adding digits to the element name:
<np>the man <pp>with <np1>the telescope</np1></pp> </np>
Token-level representation in the CWB data model (corpus positions, word forms, part-of-speech and lemma annotations with their lexicon IDs; s-attribute tags are shown in parentheses at the corpus positions where their regions begin or end):

corpus
position  word form   ID  part of speech  ID  lemma     ID
(0)   <text>       value = id=42 lang="English"
(0)   <text id>    value = 42
(0)   <text lang>  value = English
(0)   <s>
0     An            0  DET   0  a         0
1     easy          1  ADJ   1  easy      1
2     example       2  NN    2  example   2
3     .             3  PUN   3  .         3
(3)   </s>
(4)   <s>
4     Another       4  DET   0  another   4
5     very          5  ADV   4  very      5
6     easy          1  ADJ   1  easy      1
7     example       2  NN    2  example   2
8     .             3  PUN   3  .         3
(8)   </s>
(9)   <s>
9     Only          6  ADV   4  only      6
10    the           7  DET   0  the       7
11    easiest       8  ADJ   1  easy      1
12    examples      9  NN    2  example   2
13    !            10  PUN   3  !         8
(13)  </s>
(13)  </text lang>
(13)  </text id>
(13)  </text>

2
Recall that only the nesting of a <np> region within a larger <np> region constitutes recursion in the CWB data model.
The nesting of <pp> within <np> (and vice versa) is unproblematic, since these regions are encoded in two independent
s-attributes (named pp and np).
1.3
Pre-encoded versions of these corpora are distributed free of charge together with the IMS Corpus
Workbench. Perl scripts for encoding the British National Corpus (World Edition) can be provided
on request.
English corpus: DICKENS
- a collection of novels by Charles Dickens
- ca. 3.4 million tokens
- derived from Etext editions (Project Gutenberg)
- document-structure markup added semi-automatically
- part-of-speech tagging and lemmatisation with TreeTagger
- recursive noun and prepositional phrases from the Gramotron parser

German corpus: GERMAN-LAW
- a collection of freely available German law texts
- ca. 816,000 tokens
- part-of-speech tagging with TreeTagger
- morphosyntactic information and lemmatisation from IMSLex morphology
- partial syntactic analysis with the YAC chunker
See Appendix A.3 for a detailed description of the token-level annotations and structural markup of
the tutorial corpora (positional and structural attributes).
2.1
Getting started
> info;
displays the information file associated with the corpus, whose contents may vary; ideally, this
should give a description of the corpus composition, a summary of the positional and structural
annotations, and a brief overview of annotation codes such as the part-of-speech tagset used
activate corpus for subsequent queries (use TAB key for name completion)
[no corpus]> DICKENS;
DICKENS>
in the following examples, the CQP command prompt is indicated by a > character
list attributes of activated corpus (context descriptor)
> show cd;
2.2
search single word form (single or double quotes are required: ... or "...")
> "interesting";
The -e mode is not enabled by default for reasons of backward compatibility. When command-line editing is active,
multi-line commands are not allowed, even when the input is read from a pipe.
"\?" → ?
"." → . , ! ? a b c ...
"\$\." → $.
LaTeX-style escape sequences \", \', \` and \^, followed by an appropriate ASCII letter, are
used to represent characters with diacritics when they cannot be entered directly
"B\"ar" → Bär
"d\'ej\`a" → déjà
NB: this feature is deprecated; it works only for the Latin-1 encoding and cannot be deactivated
additional special escape sequences:
\"s → ß ; \,c → ç ; \,C → Ç ; \~n → ñ ; \~N → Ñ
if you need to match a word form containing single or double quotes (e.g. 'em or 12"-screen),
there are two possibilities:
if the string does not contain both single and double quotes, simply pick an appropriate
quote character: "'em" vs. '12"-screen'
otherwise, double every occurrence of the quote character inside the string; our two examples
could also be matched with the queries '''em' and "12""-screen"
2.3
Display options
if query results do not fit on screen, they will be displayed one page at a time
press SPC (space bar) to see next page, RET (return) for next line, and q to return to CQP
some pagers support b or the backspace key to go to the previous page, as well as the use of the
cursor keys, PgUp, and PgDn
at the command prompt, use cursor keys (← and →) to edit input and
repeat previous commands (↑ and ↓)
change context size
> set Context 20;       (20 characters)
> set Context 5 words;  (5 tokens)
> set Context s;        (entire sentence)
> set Context 3 s;      (same, plus 2 sentences each on left and right)
show or hide token-level annotations
> show +pos;  (show)
> show -pos;  (hide)
2.4
Useful options
2.5
values are interpreted as regular expressions, which the annotation string must match; add %l
flag to match literally:
> [word = "?" %l];
!= operator: annotation must not match regular expression
[pos != "N.*"] → everything except nouns
[] matches any token (⇒ "matchall" pattern)
see Appendix A.2 for a list of useful part-of-speech tags and regular expressions
or find out with the /codist[] macro (more on macros in Sections 5.4 and 5.5):
> /codist["whose", pos];
→ finds all occurrences of the word whose and computes the frequency distribution of the
part-of-speech tags assigned to it
→ finds all tokens whose lemma attribute has the value go and computes the frequency
distribution of the corresponding word forms
abort query evaluation with Ctrl-C
(does not always work, press twice to exit CQP immediately)
2.6
operators: & (and), | (or), ! (not), -> (implication, cf. Section 4.1)
> [(lemma="under.+") & (pos="V.*")];
→ verbs with the prefix under...
complex expressions:
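For illustration, a sketch of a more complex boolean expression (an invented example, simply combining the operators listed above):
> [((lemma = "under.+") | (lemma = "over.+")) & (pos = "V.*") & !(word = "understand" %c)];
→ verbs with prefix under... or over..., excluding the word form understand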
2.7
modelling of complex word sequences with regular expressions over patterns (i.e. tokens): every [...] expression is treated like a single character (or, more precisely, a character set) in
conventional regular expressions
token-level regular expressions use a subset of the POSIX syntax
repetition operators:
? (0 or 1), * (0 or more), + (1 or more), {n} (exactly n), {n,m} (between n and m)
grouping with parentheses: (...)
disjunction operator: | (separates alternatives)
parentheses delimit scope of disjunction: ( alt1 | alt2 | . . . )
Figure 2 shows simple queries matching prepositional phrases (PPs) in English and German.
The query strings are spread over multiple lines to improve readability, but each one has to be
entered on a single line in an interactive CQP session.
DICKENS>
[pos = "IN"]                      after
[pos = "DT"]?                     a
(
    [pos = "RB"]?                 pretty
    [pos = "JJ.*"]                long
)*
[pos = "N.*"]+ ;                  pause

GERMAN-LAW>
(
    [pos = "APPR"] [pos = "ART"]  nach dem
    |
    [pos = "APPRART"]             zum
)
(
    [pos = "ADJD|ADV"]?           wirklich
    [pos = "ADJA"]                ersten
)*
[pos = "NN"] ;                    Mal

Figure 2: Simple queries matching prepositional phrases in English (DICKENS) and German (GERMAN-LAW); the words shown to the right illustrate what each pattern line matches.
2.8
order-independent search
2.9
descending option affects ordering of word sequences with the same frequency; use reverse for
some amusing effects (note that these keywords go before the cut option)
sort by right or left context (especially useful for keyword searches)
> "interesting";
> sort by word %cd on matchend[1] .. matchend[42];     (right context)
> sort by word %cd on match[-1] .. match[-42];         (left context, by words)
> sort by word %cd on match[-42] .. match[-1] reverse; (same by characters)
see Sections 3.2 and 3.3 for an explanation of the syntax used in these examples and more
information about the sort and count commands
3.1
store query result in memory under specified name (should begin with capital letter)
> Go =
note that query results are not automatically displayed in this case
list named query results
> show named;
result of last query is implicitly named Last; commands such as cat, sort, and count operate
on Last by default; note that Last is always temporary and will be overwritten when a new
query is executed (or a subset command, cf. Section 3.5)
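For illustration, a short sketch of working with Last (an invented sequence, using only commands shown in this tutorial):
> "interesting";
> sort by word;    (operates on the implicit result Last)
> Results = Last;  (save Last under a permanent name before it is overwritten)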
display number of results
> size Go;
the count command also sorts the named query on which it operates:
> count Go by lemma cut 5;
13  go and see    [#128-#140]
10  go and sit    [#144-#153]
 9  go and do     [#29-#37]
 7  go and fetch  [#42-#48]
 7  go and look   [#87-#93]
 7  go and play   [#107-#113]
3.2
NB: you need to re-activate your working corpus after setting the DataDirectory option
© 2005-2010 Stefan Evert & The OCWB Development Team
md* flags show whether a named query is loaded in memory (m), saved on disk (d), or has been
modified from the version saved on disk (*)
> show named;
discard named query results to free memory
> discard Go;
use set PrintOptions hdr; to add header with information about the corpus and the query
(previous CQP versions did this automatically)
you can also write to a pipe (this example saves only matches that occur in questions, i.e.
sentences ending in ? )
> set Context 1 s;
> cat Go > "| grep '\?$' > go2.txt";
set PrintMode and PrintOptions for HTML output and other formats (see Section 2.4)
frequency counts for matches can also be written to a text file
> count Go by lemma cut 5 > "go.cnt";
3.3
Anchor points
the result of a (complex) query is a list of token sequences of variable length (⇒ matches)
each match is represented by two anchor points:
match (corpus position of first token) and matchend (corpus position of last token)
set additional target anchor with @ marker in query (prepended to a pattern)
> "in" @[pos="DT"] [lemma="case"];
→ shown in bold font in KWIC display
only a single token can be marked as target; if multiple @ markers are used (or if the marker
is in the scope of a repetition operator such as +), only the rightmost matching token will be
marked
> [pos="DT"] (@[pos="JJ.*"] ","?){2,} [pos="NNS?"];
when targeted pattern is optional, check how many matches have target anchor set
> A = [pos="DT"] @[pos="JJ"]? [pos="NNS?"];
> size A;
> size A target;
anchor points allow a flexible specification of sort keys with the general form
> sort by attribute on start point .. end point ;
both start point and end point are specified as an anchor, plus an optional offset in square
brackets; for instance, match[-1] refers to the token before the start of the match, matchend to
the last token of the match, matchend[1] to the first token after the match, and target[-2] to
a position two tokens left from the target anchor
NB: the target anchor should only be used in the sort key when it is always defined
example: sort noun phrases by adjectives between determiner and noun
> [pos="DT"] [pos="JJ"]{2,} [pos="NNS?"];
> sort by word %cd on match[1] .. matchend[-1];
if end point refers to a corpus position before start point, the tokens in the sort keys are compared
from right to left; e.g. sort on the left context of the match (by token)
> sort by word %cd on match[-1] .. match[-42];
whereas the reverse option sorts on the left context by character
> sort by word %cd on match[-42] .. match[-1] reverse;
complex sort operations can sometimes be speeded up by using an external helper program (the
standard Unix sort tool)4
the count command accepts the same specification for the strings to be counted
> count by lemma on match[1] .. matchend[-1];
the four columns correspond to the match, matchend, target and keyword (see Section 3.7)
anchors; a value of -1 means that the anchor has not been set:
1019887  1019888       -1  -1
1924977  1924979  1924978  -1
1986623  1986624       -1  -1
2086708  2086710  2086709  -1
2087618  2087619       -1  -1
2122565  2122566       -1  -1
note that a previous sort or count command affects the ordering of the rows (so that the n-th
row corresponds to the n-th line in a KWIC display obtained with cat)
4
External sorting may also allow a language-specific sort order (collation) if supported by the system's sort command.
To achieve this, set the LC_COLLATE or LC_ALL environment variable to an appropriate locale before running CQP. You
should not use the %c and %d flags in this case.
the output of a dump command can be written (>) or appended (>>) to a file; if the first character
of the filename is |, the output is sent to the pipe consisting of the following command(s); use
the following trick to display the distribution of match lengths in the query result A:
> A = [pos="DT"] [pos="JJ.*"]* [pos="NNS?"];
> dump A > "| gawk '{print $2 - $1 + 1}' | sort -nr | uniq -c | less";
see Section 6.2 for an opposite to the dump command, which may be useful for certain tasks such
as locating a specific corpus position
3.4
Frequency distributions
set a cut-off threshold with the cut option to reduce the size of the frequency table
> NP = [pos="DT"] @[pos="JJ"]? [pos="NNS?"];
> group NP target lemma cut 50;
add an optional offset to the anchor point, e.g. distribution of words preceding matches
> group NP match[-1] lemma cut 100;
NB: despite what the command syntax and output format suggest, results are sorted by pair
frequencies (not grouped by the second item); also note that the order of the two items in the
output is opposite to the order in the group command
you can write the output of the group command to a text file (or pipe)
> group NP target lemma cut 10 > "adjectives.go";
3.5
named queries can be copied, especially before destructive modification (see below)
> B = A;
> C = Last;
compute subset of named query result by constraint on one of the anchor points
> PP = [pos="IN"] [pos="JJ"]+ [pos="NNS?"];
> group PP matchend lemma by match word;
A = B ∩ C
A = B ∪ C
A = B \ C
intersection (or inter) yields matches common to B and C; union (or join) matches from
either B or C; difference (or diff) matches from B that are not in C
for more complex queries, this modifier may not return exactly n matches (because of internal
technical reasons); its main purpose is to limit the number of query matches in Web interfaces
and similar applications, reducing memory consumption and query execution time; if a precise
reduction is desired, the cut operator should be applied to the named query result
3.6
Random subsets
it is often desirable to look at a random selection to get a quick overview (rather than just seeing
matches from the first part of the corpus); one possibility is to do a sort randomize and then
go through the first few pages of random matches:
> sort A randomize;
however, this cannot be combined with other sort options such as alphabetical sorting on match or
left/right context; it also doesn't speed up frequency lists, set target and other post-processing
operations
as an alternative to randomized ordering, the reduce command randomly selects a given number
or proportion of matches, deleting all other matches from the named query; since this operation
is destructive, it may be necessary to make a copy of the original query results first (see above)
> reduce A to 10%;
> size A;
> sort A by word %cd on match .. matchend[42];
> reduce A to 100;
> size A;
> sort A by word %cd on match .. matchend[42];
this allows arbitrary further operations to be carried out on a representative sample rather than
the full query result
set random number generator seed before reduce for reproducible selection
> randomize 42; (use any positive integer as seed)
a second method for obtaining a random subset of a named query result is to sort the matches
in random order and then take the first n matches from the sorted query; the example below
has the same effect as reduce A to 100; (though it will not select exactly the same matches)
> sort A randomize;
> cut A 100;
> sort A; (restore corpus order, as with reduce command)
reproducible subsets can be obtained with a suitable randomize command before the sort; the
main difference from the reduce command is that cut cannot be used to select a percentage of
matches (i.e., you have to determine the number of matches in the desired subset yourself)
the most important advantage of the second method is that it can produce stable and incremental
random samples
for a stable random ordering, specify a positive seed value directly in the sort command:
> sort A randomize 42;
different seeds give different, reproducible orderings; if you randomize a subset of A with the
same seed value, the matches will appear exactly in the same order as in the randomized version
of A:
in order to build incremental random samples from a query result, sort it randomly (but with a
seed value to ensure reproducibility) and then take the first n matches as sample #1, the next
n matches as sample #2, etc.; unlike two subsets generated with reduce, the first two samples
are disjoint and together form a random sample of size 2n:
> A = "time";
> sort A randomize 7;
> Sample1 = A;
> cut Sample1 0 99;     (random sample of 100 matches)
> Sample2 = A;
> cut Sample2 100 199;  (random sample of 100 matches)
note that the cut removes the randomized ordering; you can reapply the stable randomization
to achieve full correspondence to the randomized query result A:
> sort Sample2 randomize 7;
> cat Sample2;
> cat A 100 199;
stability of the randomization ensures that random samples are reproducible even after the initial
query has been refined or spurious matches have been deleted manually
3.7
additional keyword anchor can be set after query execution by searching for a token that matches
a given search pattern (see Figure 3)
example: find noun near adjective modern
keyword should be underlined in KWIC display (may not work on some terminals)
search starts from the given anchor point (excluding the anchored token itself), or from the left
and right boundaries of the match if match is specified
with inclusive, search includes the anchored token, or the entire match, respectively
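For the modern example above, a plausible command sketch (illustrative only; the search strategy and window size are chosen arbitrarily here, so check the exact syntax against the CQP manual):
> A = [pos = "NNS?"];
> set A keyword nearest [word = "modern"] within left 5 words;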
(Figure 3: syntax of the set target/keyword command: the anchor to set (keyword | target), a search strategy, a search pattern, a search direction, a search window, and an optional inclusive flag to include the start token in the search)
5
The keyword and target anchors are set to undefined (-1) when no match is found for the search pattern, while the
match and matchend anchors retain their previous values. In this way, a set match or set matchend command may only
modify some of the matches in a named query result.
4.1
Using labels
the label adj then refers to the corresponding token (i.e. its corpus position)
label references are usually evaluated within the global constraint introduced by ::
> adj:[pos = "ADJ."] :: adj < 500;
→ adjectives among the first 500 tokens
to avoid error messages, test whether label is defined before accessing attributes
> [pos="DT"] a:[]? [pos="NNS?"] :: a -> a.pos="JJ";
(-> is the logical implication operator !, cf. Section 2.6)
labels are used to specify additional constraints that are beyond the scope of ordinary regular
expressions
> a:[] "and" b:[] :: a.word = b.word;
labels allow modelling of long-distance dependencies
however, a label cannot be used within the pattern it refers to; use the special this label,
represented by a single underscore (_), to refer to the current corpus position instead
[_.pos = "NPS"] ⇔ [pos = "NPS"]
the built-in functions distance() and distabs() compute the (absolute) distance between 2
tokens (referenced by labels)
> a:[pos="DT"] [pos="JJ"]* b:[pos="NNS?"] :: distabs(a,b) >= 5;
→ simple NPs containing 6 or more tokens
the standard anchor points (match, matchend, and target) are also available as labels (with the
same names)
> [pos="DT"] [pos="JJ"]* [pos="NNS?"] :: distabs(match, matchend) >= 5;
4.2
Structural attributes
XML tags match start/end of s-attribute region (shown as XML tags in Figure 1)
> <s> [pos = "VBG"];
> [pos = "VBG"] [pos = "SENT"]? </s>;
→ present participle at start or end of sentence
pairs of start/end tags enclose single region (if StrictRegions option is enabled)
> <np> []* ([pos="JJ.*"] []*){3,} </np>;
→ NP containing at least 3 adjectives
(when StrictRegions are switched off, XML tags match any region boundaries and may skip
intervening boundaries as well as material outside the corresponding regions)
/region[] macro matches entire region
/region[np]; ⇔ <np> []* </np>;
the name of a structural attribute (e.g. np) used within a pattern evaluates to true iff the
corresponding token is contained in a region of this attribute (here, a <np> region)
> [(pos = "NNS?") & !np];
→ noun that is not contained in a noun phrase (NP)
built-in functions lbound() and rbound() test for start/end of a region
> [(pos = "VBG") & lbound(s)];
→ present participle at start of sentence
most linguistic queries should include the restriction within s to avoid crossing sentence boundaries; note that only a single within clause may be specified
query matches can be expanded to containing regions of s-attributes
> A = [pos="JJ.*"] ([]* [pos="JJ.*"]){2} within np;
> B = A expand to np;
4.3
XML markup of NPs and PPs in the DICKENS corpus (cf. Appendix A.3)
<s len=9>
<np h="it" len=1> It </np>
is
<np h="story" len=6> the story
<pp h="of" len=4> of
<np h="man" len=3> an old man </np>
</pp>
</np>
.
</s>
key-value pairs within XML start tags are accessible in CQP as additional s-attributes with
annotated values (marked [A] in the show cd; listing): s_len, np_h, np_len, pp_h, pp_len (cf.
Section 1.2)
s-attribute values can be accessed through label references
> <np> a:[] []* </np> :: a.np_h = "bank";
→ NPs with head lemma bank
an equivalent, but shorter version:
> <np_h "bank"> []* </np_h>;
(recall that np_h within a token pattern would merely return an integer value indicating whether
the current token is contained in a <np> region, not the desired annotation string)
typecast numbers to int() for numerical comparison
> /region[np,a] :: int(a.np_len) > 30;
within np;
CQP queries typically use maximal NP and PP regions (e.g. to model clauses)
find any NP (regardless of embedding level):
observe how results depend on matching strategy (see Section 5.1 for details)
> set MatchingStrategy shortest;
> set MatchingStrategy longest;
> set MatchingStrategy standard;
(re-run the previous query after each set and watch out for duplicate matches)
when the query expression shown above is embedded in a longer query, the matching strategy
usually has no influence
annotations of a region at an arbitrary embedding level can only be accessed through constraints
on key-value pairs in the start tags:
> (<np_h "bank">|<np_h1 "bank">|<np_h2 "bank">) []*
(</np_h2>|</np_h1>|</np_h>);
4.4
use set PrintStructures command to display novel, chapter, . . . for each match
> set PrintStructures "novel_title, chapter_num";
> A = [lemma = "ghost"];
> cat A;
5.1
5.2
Word lists
use TAB key to complete word list names (e.g. type show $we + TAB)
word lists can be used to simulate type hierarchies, e.g. for part-of-speech tags
> define $common_noun = "NN NNS";
> define $proper_noun = "NP NPS";
> define $noun = $common_noun;
> define $noun += $proper_noun;
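A defined word list can then be used in place of a regular expression inside token patterns; a minimal sketch (invented example):
> [pos = $noun];
→ matches any token whose part-of-speech tag is an element of $noun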
search pattern:
DET? ADJ* NN (PREP DET? ADJ* NN)*
input:
the old book on the table in the room
(depending on the matching strategy, the pattern can match the full input as well as shorter
sequences such as the room, in the room, or just room)
5.3
Subqueries
XML tag notation can also be used for the temporary match regions
> <match> [pos = "W.*"];
if target/keyword anchors are set in the activated query result, the corresponding XML tags
(<target>, <keyword>, . . . ) can be used, too
> </target> []* </match>;
→ range from target anchor to end of match, but excluding the target
<target> and <keyword> regions always have length 1 !
5.4
complex queries (or parts of queries) can be stored as macros and re-used
define macros in text file (e.g. macros.txt):
# this is a comment and will be ignored
MACRO np(0)
[pos = "DT"]
# another comment
([pos = "RB.*"]? [pos = "JJ.*"])*
[pos = "NNS?"]
;
(defines macro np with no arguments)
load macro definitions from file
> define macro < "macros.txt";
macro invocation as part of a CQP command (use TAB key for macro name completion)
> <s> /np[] @[pos="VB.*"] /np[];
macros are interpolated as plain strings (not as elements of a query expression) and may have
to be enclosed in parentheses for proper scoping
> <s> (/np[])+ [pos="VB.*"];
it is safest to put parentheses around macro definitions:
MACRO np(0)
(
[pos = "DT"]
([pos = "RB.*"]? [pos = "JJ.*"])*
[pos = "NNS?"]
)
;
NB: The start (MACRO ...) and end (;) markers must be on separate lines in a macro definition
file.
macros accept up to 10 arguments; in the macro definition, the number of arguments must be
specified in parentheses after the macro name
in the macro body, each occurrence of $0, $1, . . . is replaced by the corresponding argument
value (escapes such as \$1 will not be recognised)
e.g. a simple PP macro with 2 arguments: the initial preposition and the number of adjectives
in the embedded noun phrase
MACRO pp(2)
[(pos = "IN") & (word="$0")]
[pos = "DT"]
[pos = "JJ.*"]{$1}
[pos = "NNS?"]
;
invoking macros with arguments
> /pp["under", 2];
> /pp["in", 3];
macro arguments are character strings and must be enclosed in (single or double) quotes; they
may be omitted around numbers and simple identifiers
the quotes are not part of the argument value and hence will not be interpolated into the macro
body; nested macro invocations will have to specify additional quotes
define macro with prototype ⇒ named arguments
MACRO pp ($0=Prep $1=N_Adj)
...
;
argument names serve as reminders; they are used by the show command and the macro name
completion function (TAB key)
argument names are not used during macro definition and evaluation
in interactive definitions, prototypes must be quoted
> define macro pp($0=Prep $1=N_Adj) ... ;
CQP macros can be overloaded by the number of arguments (i.e. there can be several macros
with the same name, but with different numbers of arguments)
this feature is often used for unspecified or default values, e.g.
MACRO pp($0=Prep, $1=N_Adj)
...
MACRO pp($0=Prep)    (any number of adjectives)
...
MACRO pp()           (any preposition, any number of adjs)
...
macro calls can be nested (non-recursively) ⇒ macro file defines a context-free grammar (CFG)
without recursion (see Figure 5)
note that string arguments need to be quoted when they are passed to nested macros (since
quotes from the original invocation are stripped before interpolating an argument)
single or double quote characters in macro arguments should be avoided whenever possible; while
the string s can be enclosed in double quotes ("s") in the macro invocation, the macro body
may interpolate the value between single quotes, leading to a parse error
MACRO adjp()
[pos = "RB.*"]?
[pos = "JJ.*"]
;
MACRO np($0=N_Adj)
[pos = "DT"]
( /adjp[] ){$0}
[pos = "NNS?"]
;
MACRO np($0=Noun $1=N_Adj)
[pos = "DT"]
( /adjp[] ){$1}
[(pos = "NN") & (lemma = "$0")]
;
MACRO pp($0=Prep $1=N_Adj)
[(word = "$0") & (pos = "IN|TO")]
/np[$1]
;
Figure 5: A sample macro definition file.
in macro definitions, use double quotes which are less likely to occur in argument values
5.5
CQP ensures that the generalised start and end tags nest properly
(if the StrictRegions option is enabled, cf. Sections 4.2 and 4.3)
extending built-in macros: view definitions
> show macro region(1);
> show macro codist(3);
5.6
feature set attributes use special notation, separating set members by | characters
e.g. for the alemma (ambiguous lemma) attribute
|Zeug|Zeuge|Zeugen|  (three elements)
|Baum|               (unique lemma)
|                    (not in lexicon)
see Appendix A.3 for lists of properties annotated in the GERMAN-LAW corpus
define macro for easy experimentation with property features
> define macro find($0=Tag $1=Property)
<$0_f contains "$1"> []* </$0_f>;
> /find[np, brac];
> /find[advp, temp];
etc.
nominal agreement features of determiners, adjectives and nouns are stored in the agr attribute,
using the pattern case:gender:number:determination shown in Figure 7 (see Figure 8 for an
example); sample agr values:
|Dat:F:Sg:Def|Gen:F:Pl:Def|Gen:F:Sg:Def
|Gen:M:Pl:Def|Gen:N:Pl:Def|Nom:M:Sg:Def|
|Akk:M:Pl:Def|Dat:M:Sg:Def|Gen:M:Pl:Def|Nom:M:Pl:Def
|Akk:M:Pl:Ind|Dat:M:Sg:Ind|Gen:M:Pl:Ind|Nom:M:Pl:Ind
|Akk:M:Pl:Nil|Dat:M:Sg:Nil|Gen:M:Pl:Nil|Nom:M:Pl:Nil|
both contains and matches use regular expressions and accept the %c and %d flags
unification of agreement features ⇒ intersection of feature sets
use built-in /unify[] macro:
in the GERMAN-LAW corpus, NPs and other phrases are annotated with partially disambiguated
agreement information; these features sets can also be tested with the contains and matches
operators, either indirectly through label references or directly in XML start tags
> /region[np, a] :: a.np_agr matches "Dat:.:Pl:.*";
> <np_agr matches "Dat:.:Pl:.*"> []* </np_agr>;
6.1
CQP is a useful tool for interactive work, but many tasks become tedious when they have to
be carried out by hand; macros can be used as templates, providing some relief; however, full
scripting is still desirable (and in some cases essential)
similarly, the output of CQP requires post-processing at times: better formatting of KWIC lines
(especially for HTML output), different sort options for frequency tables, frequency counts on
normalised word forms (or other transformations of the values)
for both purposes, an external scripting tool or programming language is required, which has to
interact dynamically with CQP (which acts as a query engine)
CQP provides some support for such interfaces: when invoked with the -c flag, it switches to
child mode (which could also be called slave mode):
the init file ~/.cqprc is not automatically read at startup
CQP prints its version number after initialisation
all interactive features are deactivated (paged display and highlighting)
query results are not automatically displayed (set AutoShow off;)
after the execution of a command, CQP flushes output buffers (so that the interface will
not hang waiting for output from the command)
in case of a syntax error, the string PARSE ERROR is printed on stderr
the special command .EOL.; inserts the line
-::-EOL-::- as a marker into CQP's output
when the ProgressBar option is activated, progress messages are not echoed in a single
screen line (using carriage returns) on stderr, but rather printed in separate lines on
stdout; these lines have the standardized format
-::-PROGRESS-::- TAB pass TAB no. of passes TAB progress message
the CWB/Perl interface makes use of all these features to provide an efficient and robust interface
between a Perl script and the CQP backend
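The same line-based protocol can be consumed from any scripting language, not just Perl. The sketch below is a minimal, hypothetical Python illustration (not part of CWB/Perl): the helper collects output up to the -::-EOL-::- marker that CQP prints in response to the .EOL.; command, and start_cqp launches the backend in child mode.

```python
import subprocess

EOL_MARKER = "-::-EOL-::-"

def read_until_eol(lines):
    """Collect output lines up to (but not including) the
    -::-EOL-::- marker that CQP prints for the .EOL.; command."""
    collected = []
    for line in lines:
        if line.rstrip("\n") == EOL_MARKER:
            break
        collected.append(line.rstrip("\n"))
    return collected

def start_cqp(registry_dir, cqp_binary="cqp"):
    """Launch CQP in child mode; -c enables the features listed above,
    -r points CQP at the corpus registry directory."""
    return subprocess.Popen(
        [cqp_binary, "-c", "-r", registry_dir],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
```

An interface would send a command followed by .EOL.; and then call read_until_eol on the backend's stdout to know where the command's output ends.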
the output of many CQP commands is neatly formatted for human readers; this pretty-printing
feature can be switched off with the command
> set PrettyPrint off;
the output of the show, group and count commands now has a simple and standardized format
that can more easily be parsed by the invoking program; output formats for the different uses
of the show command are documented below; see Section 6.3 for the output formats of group
and count
show corpora; prints the names of all available corpora on separate lines, in alphabetical
order
show named; lists all named query results on separate lines in the format
flags TAB query name TAB no. of matches
36
6.2
An important aspect of interfacing CQP with other software is to exchange the corpus positions of
query matches (as well as target and keyword anchors). This is a prerequisite for the extraction
of further information about the matches by direct corpus access, and it is the most efficient
way of relating query matches to external data structures (e.g. in a SQL database or spreadsheet
application).
The dump command (Section 3.3) prints the required information in a tabular ASCII format that
can easily be parsed by other tools or read into a SQL database.7 Each row of the resulting table
corresponds to one match of the query, and the four columns give the corpus positions of the
match, matchend, target and keyword anchors, respectively. The example below is reproduced
from Section 3.3.
1019887  1019888       -1  -1
1924977  1924979  1924978  -1
1986623  1986624       -1  -1
2086708  2086710  2086709  -1
7 Since this command dumps the matches of a named query in their current sort order, the natural order should first
be restored by calling sort without a by clause. One exception is a CGI interface that uses the dumped corpus positions
for a KWIC display of the query results in their sorted order.
Undefined target anchors are represented by -1 in the third column. Even though no keywords
were set for the query, the fourth column is included in the dump table, but all values are set to
-1.
The table created by the dump command is printed on stdout by default, where it can be
captured by a program running CQP as a backend (e.g. the CWB/Perl interface, cf. Sec. 6.1).
The dump table can also be redirected to a file or pipe:
> dump A > "dump.tbl";
Common uses of pipes are to create a dump file without the superfluous keyword column
> dump A > "| awk -F'\t' 'BEGIN {OFS=FS} {print $1, $2, $3}' > dump.tbl";
or to compress the dump file on the fly
> dump A > "| gzip > dump.tbl.gz";
Sometimes it is desirable to reload a dump file into CQP after it has been modified by an external
program (e.g. a database may have filtered the matches against a metadata table). The undump
command creates a new named query result (B in the example below) for the currently activated
corpus from a dump file:
> undump B < "mydump.tbl";
Note that B is silently overwritten if it already exists.
The format of the file mydump.tbl is almost identical to the output of dump, but it contains only
two columns for the match and matchend positions (in the default setting). The example below
shows a valid dump file for the DICKENS corpus, which can be read with undump to create a query
result containing 5 matches:
20681    20687
379735   379741
1915978  1915983
2591586  2591591
2591593  2591598
Save these lines to a text file named dickens.tbl, then enter the following commands:
> DICKENS;
> undump Twas < "dickens.tbl";
> cat Twas;
Further columns for the target and keyword anchors (in this order) can optionally be added. In
this case, you must append the modifier with target or with target keyword to the undump
command:
> undump B with target keyword < "mydump.tbl";
Dump files can also be read from a pipe or from standard input. However, in this case the
table of corpus positions has to be preceded by a header line that specifies the total number of
matches:
5
20681    20687
379735   379741
1915978  1915983
2591586  2591591
2591593  2591598
CQP uses this information to pre-allocate internal storage for the query result, as well as to
validate the file format. This format can also be used as a more efficient alternative if the dump
is read from a regular file. In this case, CQP automatically detects which of the two formats is
used.
Pipes are used e.g. to read a dump table from a compressed file:
> undump B < "gzip -cd mydump.tbl.gz |";
In an interactive CQP session, the input file can be omitted and the undump table can then be
entered directly on the command line. This feature works only if command-line editing support
is enabled with the -e switch.8 Since the dump table is read from standard input here, only the
second format is allowed, i.e. you have to enter the total number of matches first. Try entering
the example table above after typing
> undump B;
If the rows of the undump table are not sorted in their natural order (i.e. by corpus position),
they have to be re-ordered internally so that CQP can work with them. However, the original
sort order is recorded automatically and will be used by the cat and dump commands (until it
is reset by a new sort command). If you sort a query result A, save it with dump to a text file,
and then read this file back in as named query B, then A and B will be sorted in exactly the same
order.
In many cases, overlapping or unsorted matches are not intentional but rather errors in an
automatically generated dump table. In order to catch such errors, the additional keyword
ascending (or asc) can be specified before the < character:
> undump B with target ascending < "mydump.tbl";
This command will abort with an error message (indicating the row number where the error
occurred) unless the corpus matches in mydump.tbl are non-overlapping and sorted in corpus
order.
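The condition that the ascending keyword enforces can be stated compactly. This hypothetical helper mirrors CQP's behaviour by returning the 1-based number of the first offending row (or None if the table is valid):

```python
def check_ascending(rows):
    """Check that (match, matchend) rows are sorted by corpus
    position and non-overlapping. Returns the 1-based row number
    of the first violation, or None if the table is valid."""
    prev_end = -1
    for i, (match, matchend) in enumerate(rows, start=1):
        if match <= prev_end or matchend < match:
            return i
        prev_end = matchend
    return None
```

Running such a check before handing the table to CQP makes it easier to locate errors in an automatically generated dump.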
A typical use case for dump and undump is to link CQP queries to corpus metadata stored in
an external SQL database. Assume that a corpus consists of a large collection of transcribed
dialogues, which are marked as <dialogue> regions. A rich amount of metadata (about the
speakers, setting, topic, etc.) is available in a SQL database. The database entries can be linked
directly to the <dialogue> regions by recording their start and end corpus positions in the
database.9 The following commands generate a dump table with the required information, which
can easily be loaded into the database (ignoring the third and fourth columns of the table):
> A = <dialogue> [] expand to dialogue;
> dump A > "dialogues.tbl";
Corpus queries will often be restricted to a subcorpus by specifying constraints on the metadata.
Having resolved the metadata constraints in the SQL database, they can be translated to the
corresponding regions in the corpus (again represented by start and end corpus position). The
positions are then sorted in ascending order and saved to a TAB-delimited text file. Now they can
8
For this reason, CWB/Perl and similar interfaces cannot use the direct input option and have to create a temporary
file with the dump information.
9
Of course, it is also possible to establish an indirect link through document IDs, which are annotated as <dialogue
id=XXXX> .. </dialogue>. If the corpus contains a very large number of dialogues, the direct link approach is usually
much more efficient, though.
39
be loaded into CQP with the undump command, and the resulting query result can be activated
as a subcorpus for subsequent queries. It is recommended to specify the ascending option in order
to ensure that the loaded query result forms a valid subcorpus:
> undump SubCorpus ascending < "subcorpus.tbl";
> SubCorpus;
Subcorpus[..]> A = ... ;
6.3
For many applications it is important to compute frequency tables for the matching strings,
tokens in the immediate context, attribute values at different anchor points, different attributes
for the same anchor, or various combinations thereof.
frequency tables for the matching strings, optionally normalised to lowercase and extended or
reduced by an offset, can easily be computed with the count command (cf. Sections 2.9 and 3.3);
when pretty-printing is deactivated (cf. Section 6.1), its output has the form
frequency TAB first line TAB string (type)
advantages of the count command:
strings of arbitrary length can be counted
frequency counts can be based on normalised strings (%cd flags)
the instances (tokens) for a given string type can easily be identified, since the underlying
query result is automatically sorted by the count command, so that these instances appear
as a block starting at match number first line
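With pretty-printing disabled, these rows can be consumed mechanically; the following is a small, hypothetical parser for the format shown above:

```python
def parse_count(lines):
    """Parse 'frequency TAB first line TAB string' rows produced
    by the count command when PrettyPrint is off."""
    out = []
    for line in lines:
        freq, first, string = line.rstrip("\n").split("\t", 2)
        out.append((int(freq), int(first), string))
    return out
```

Splitting with a maximum of two separators keeps string types intact even if they happen to contain TAB-like whitespace after normalisation.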
an alternative solution is the group command (cf. Section 3.4), which computes frequency distributions over single tokens (i.e. attribute values at a given anchor position) or pairs of tokens
(recall the counter-intuitive command syntax for this case); when pretty-printing is deactivated,
its output has the form
[ attribute value TAB ] attribute value TAB frequency
advantages of the group command:
can compute joint frequencies for non-adjacent tokens
faster when there are relatively few different types to be counted
supports frequency distributions for the values of s-attributes
the advantages of these two commands are for the most part complementary (e.g., it is not
possible to normalise the values of s-attributes, or to compute joint frequencies of two
non-adjacent multi-token strings); in addition, they have some common weaknesses, such as
relatively slow execution, no options for filtering and pooling data, and limitations on the types
of frequency distributions that can be computed (only simple joint frequencies, no nested groupings)
therefore, it is often necessary (and usually more efficient) to generate frequency tables with
external programs such as dedicated software for statistical computing or a relational database;
these tools need a data table as input, which lists the relevant feature values (at specified anchor
positions) and/or multi-token strings for each match in the query result; such tables can often
be created from the output of cat (using suitable PrintOptions, Context and show settings)
40
this procedure involves a considerable amount of re-formatting (e.g. with Unix command-line
tools or Perl scripts) and can easily break when there are unusual attribute values in the data;
both cat output and the re-formatting operations are expensive, making this solution inefficient
when there is a large number of matches
in most situations, the tabulate command provides a more convenient, more robust and faster
solution; the general form is
> tabulate A column spec, column spec, . . . ;
this will print a TAB-separated table where each row corresponds to one match of the query result
A and the columns are described by one or more column spec(ification)s
just as with dump and cat, the table can be restricted to a contiguous range of matches, and the
output can be redirected to a file or pipe
> tabulate A 100 119 column spec, column spec, . . . ;
> tabulate A column spec, column spec, . . . > "data.tbl";
each column specification consists of a single anchor (with optional offset) or a range between
two anchors, using the same syntax as the sort and count commands; without an attribute
name, this will print the corpus positions for the selected anchor:
> tabulate A match, matchend, target, keyword;
produces exactly the same output as dump A when targets and keywords are defined for the query
result A; otherwise, it will print an error message (and you need to leave out the column specs
target and/or keyword)
when an attribute name is given after the anchor, the values of this attribute for the selected
anchor point will be printed; both positional and structural attributes with annotated values
can be used; the following example prints a table of novel title, book number and chapter title
for a query result from the DICKENS corpus
> tabulate A match novel_title, match book_num, match chapter_title;
note that undefined values (for the book_num and chapter_title attributes) are represented by
the empty string; the same happens when an anchor point is not defined or outside the corpus
range (because of an offset)
a range between two anchor points prints the values of the selected attribute for all tokens in the
specified range; usually, this only makes sense for positional attributes; the following example
prints the lemma values of 5 tokens to the left and right of each match, which can be used to
identify collocates of the matching string(s)
> tabulate A match[-5]..match[-1] lemma, matchend[1]..matchend[5] lemma;
note that the attribute values for tokens within each range are separated by blanks rather than
TABs, in order to avoid ambiguities in the resulting data table
attribute values can be normalised with the flags %c (to lowercase) and %d (remove diacritics);
the command below uses Unix shell commands to compute the same frequency distribution as
count A by word %c; in a much more efficient manner
> tabulate A match .. matchend word %c > "| sort | uniq -c | sort -nr";
note that in contrast to sort and count, a range is considered empty when the end point lies
before the start point and will always be printed as an empty string
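The same aggregation can also be done in the consuming program instead of a shell pipeline. A sketch with Python's Counter, equivalent in spirit to sort | uniq -c | sort -nr (the input lines stand for the word %c column of the tabulate output):

```python
from collections import Counter

def frequency_table(lines):
    """Count match strings (lowercased here, mimicking the %c flag)
    and return (string, frequency) pairs, most frequent first."""
    counts = Counter(line.rstrip("\n").lower() for line in lines)
    return counts.most_common()
```

For very large result sets this avoids spawning external processes and keeps the frequency table in a data structure that can be filtered or pooled directly.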
41
UNDOCUMENTED CQP
Undocumented CQP
7.1
Zero-width assertions
constraints involving labels have to be tested either in the global constraint or in one of the token
patterns; this means that macros cannot easily specify constraints on the labels they define: such
a macro would have to be interpolated in two separate places (in the sequence of token patterns
as well as in the global constraint)
zero-width assertions allow constraints to be tested during query evaluation, i.e. at a specific
point in the sequence of token patterns; an assertion uses the same Boolean expression syntax
as a pattern, but is delimited by [: ... :] rather than simple square brackets ([...]); unlike
an ordinary pattern, an assertion does not consume a token when it is matched; it can be
thought of as a part of the global constraint that is tested in between two tokens
with the help of assertions, NPs with agreement checks can be encapsulated in a macro
DEFINE MACRO np_agr(0)
a:[pos="ART"]
b:[pos="ADJA"]*
c:[pos="NN"]
[: ambiguity(/unify[agr, a,b,c]) >= 1 :]
;
(in this simple case, the constraint could also have been added to the last pattern)
when the 'this' label (_) is used in an assertion, it refers to the corpus position of the following
token; the same holds for direct references to attributes
in this way, assertions can be used as look-ahead constraints, e.g. to match maximal sequences
of tokens without activating longest match strategy
> [pos = "NNS?"]{2,} [:pos != "NNS?":];
assertions also allow the independent combination of multiple constraints that are applied to a
single token; for instance, the region(5) macro from Section 5.5 could also have been defined
as
MACRO region($0=Att $1=Key1 $2=Val1 $3=Key2 $4=Val2)
<$0> [: _.$0_$1="$2" :] [: _.$0_$3="$4" :] []* </$0>
;
like the matchall pattern [], the matchall assertion [::] is always satisfied; since it does not
consume a token either, it is a no-op that can freely be inserted at any point in a query
expression; in this way, a label or target marker can be added to positions which are otherwise
not accessible, e.g. an XML tag or the start/end position of a disjunction
> ... @[::] /region[np] ... ;
> ... a:[::] ( ... | ... | ... ) b:[::] ...;
starting a query with a matchall assertion is extremely inefficient: use the match anchor or the
implicit match label instead
42
7.2
returning to the np_agr macro from Section 7.1, we note a problem with this query:
> A = /np_agr[] [pos = "VVFIN"] /np_agr[];
when the second NP does not contain any adjectives but the first does, the b label will still point
to an adjective in the first NP; consequently, the agreement check may fail even if both NPs are
really valid
in order to solve this problem, the two NPs should use different labels; for this purpose, every
macro has an implicit $$ argument, which is set to a unique value for each interpolation of the
macro; in this way, we can construct unique labels for each NP:
DEFINE MACRO np_agr(0)
$$_a:[pos="ART"]
$$_b:[pos="ADJA"]*
$$_c:[pos="NN"]
[: ambiguity(/unify[agr, $$_a,$$_b,$$_c]) >= 1 :]
;
a comparison with the previous results shows that this version of the /np_agr[] macro finds
additional matches that were incorrectly rejected by the first implementation
> B = /np_agr[] [pos = "VVFIN"] /np_agr[];
> diff B A;
however, the problem still persists in queries where the macro is interpolated only once, but may
be matched multiple times
> A = ( /np_agr[] ){3};
here, a solution is only possible when the scope of labels can be limited to the body of the macro
in which they are defined; i.e., the labels must be reset to undefined values at the end of the
macro block; this can be achieved with the built-in /undef[] macro, which resets the labels
passed as arguments and returns a true value
DEFINE MACRO np_agr(0)
a:[pos="ART"]
b:[pos="ADJA"]*
c:[pos="NN"]
[: ambiguity(/unify[agr, a,b,c]) >= 1 :]
[: /undef[a,b,c] :]
;
> B = ( /np_agr[] ){3};
> diff B A;
note that it may still be wise to construct unique label names (either in the form np_agr_a etc.,
or with the implicit $$ argument) in order to avoid conflicts with labels defined in other macros
or in the top-level query
7.3
Easter eggs
starting with version 3.0 of the Corpus Workbench, CQP comes with a built-in regular expression
optimiser; this optimiser detects simple regular expressions commonly used for prefix, suffix or
infix searches such as
© 2005–2010 Stefan Evert & The OCWB Development Team
43
> "under.+";
> ".+ment";
> ".+time.+";
and replaces the normal regexp evaluation with a highly efficient Boyer-Moore search algorithm
the optimiser will also recognise some slightly more complex regular expressions; if you want to
test whether a given expression can be optimised or not, switch on debugging output with
> set CLDebug on;
some beta releases of CQP may contain hidden optimisations and/or functionality that are
disabled by default because they have not been tested thoroughly; such hidden features will
usually be documented in the release notes and can be activated with the option
> set Optimize on;
the official release v3.0 of CQP has no hidden features
44
APPENDIX
Appendix
A.1
At the character level, CQP supports POSIX 1003.2 regular expressions (as provided by the system
libraries). A full description of the regular expression syntax can be found on the regex(7) manpage.
Various books such as Mastering Regular Expressions give a gentle introduction to writing regular
expressions and provide a lot of additional information.
A regular expression is a concise description of a set of character strings (which are called
words in formal language theory). Note that only certain sets of words with a relatively simple
structure can be represented in such a way. Regular expressions are said to match the words
they describe. The following examples use the notation:
<reg.exp.> → word1, word2, . . .
In many programming languages, it is customary to enclose regular expressions in slashes (/).
CQP uses a different syntax where regular expressions are written as (single- or double-quoted)
strings. The examples below omit any delimiters.
Basic syntax of regular expressions
letters and digits are matched literally (including all non-ASCII characters)
word → word;  C3PO → C3PO;  déjà → déjà
. matches any single character (matchall)
r.ng → ring, rung, rang, rkng, r3ng, . . .
\(\) → ();  .{3} → . . . ;  \$\. → $.
\^ and \$ must be escaped although ^ and $ anchors are not useful in CQP
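Since CQP regular expressions must always match the entire token, Python's re.fullmatch mimics this behaviour and is handy for experimenting with the patterns above (note that Python's regex dialect accepts some constructs that POSIX 1003.2 does not, so not every expression transfers both ways):

```python
import re

def cqp_matches(pattern, word):
    """Check whether a token matches a CQP-style regular expression,
    i.e. the pattern must cover the whole word, as if anchored with
    ^ and $ on both sides."""
    return re.fullmatch(pattern, word) is not None
```

For example, "r.ng" matches "rung" but not "ringing", because the pattern must consume the entire token.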
45
A.2
46
A.3
word
pos
lemma
novel          individual novels
novel_title    title of the novel
book
book_num
chapter        chapters
chapter_num    number of the chapter
chapter_title  optional title of the chapter
p              paragraphs
p_len          length of the paragraph (in words)
s              sentences
s_len          length of the sentence (in words)
np             noun phrases
np_h           head lemma of the noun phrase
np_len         length of the noun phrase (in words)
pp             prepositional phrases
pp_h           functional head of the PP (preposition)
pp_len         length of the PP (in words)
agr = case : gender : number : determination
47
<s len="..">                           sentences
<pp f=".." h=".." agr=".." len="..">   prepositional phrases
<np f=".." h=".." agr=".." len="..">   noun phrases
<ap f=".." h=".." agr=".." len="..">   adjectival phrases
<advp f=".." len="..">                 adverbial phrases
<vc f=".." len="..">                   verbal complexes
<cl f=".." h=".." vlem=".." len="..">  subclauses
len = length of region (in tokens)
f = properties (feature set, see next page)
h = lexical head of phrase (<pp h>: prep:noun)
agr = nominal agreement features (feature set, partially disambiguated)
vlem = lemma of main verb
<pp f>   <ap f>   <advp f>   <vc f>   <cl f>
48
A.4
a:  b:  c:  d:  e:  f:  g:  h:  i:  j:  k:  l:  m:  n:  o:  r:  s:  t:  u:  w:  y:
49