4 - Slides Regular Expression

The document discusses various tools and methods for annotating and searching linguistic corpora, focusing on the use of regular expressions (REs) for querying. It emphasizes the importance of understanding metadata, annotation layers, and the structure of the corpus before conducting searches. Additionally, it highlights the need for careful formulation of queries and the handling of potential annotation errors during the research process.

Regular expressions

Search tools

Tools for Annotating and Searching Corpora
4: Searching

Stefanie Dipper

Institute of Linguistics
Ruhr-University Bochum

Corpus Linguistics Fest (CLiF)


June 6-10, 2016
Indiana University, Bloomington

Stefanie Dipper Tools for annotating and searching 1 / 38



Today’s session

Yesterday: automatic tools

Today: How can we search for annotations?


different kinds of search tools
first: some general considerations


Searching corpora: some general considerations

Suppose you have a research question
. . . and there is an annotated corpus available, e.g. via the internet, that seems to contain data relevant to this question
How would you proceed?
First look at the metadata
which primary data does the corpus contain?
which annotation layers does it provide?
→ are they useful for your research question?
how were the annotations produced? (automatically, semi-automatically, manually)
how good are the annotations? (inter-annotator agreement (IAA) or tool accuracy, if available)
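
Inter-annotator agreement is often reported as Cohen's kappa, which corrects the raw agreement for agreement expected by chance. The following is a minimal sketch in Python, assuming two annotators tagged the same tokens; the tag sequences are made up for illustration and do not come from any corpus.

```python
from collections import Counter

def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two equally long sequences of labels."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(tags_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement, computed from each annotator's label distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    p_expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical annotations of six tokens by two annotators.
annotator_1 = ["RP", "RP", "IN", "RB", "RP", "IN"]
annotator_2 = ["RP", "RB", "IN", "RB", "RP", "RP"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

A value near 1 indicates near-perfect agreement, a value near 0 agreement at chance level.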


Searching corpora: some general considerations

If you think that the data is useful for you, check
which data structures (spans, trees, . . . ) and tagsets (Penn tagset, . . . ) are used
which search tool can be used
→ how to formulate your queries


Searching corpora: some general considerations


Before you start searching the data, browse through the corpus to get a first impression of the data and its annotations
When you start searching for your specific phenomenon
start with rather general queries
with high recall and low precision
if you find some relevant instances, it is a good idea to record e.g. their sentence IDs and use them as “test cases”
refine your query stepwise
by adding more and more constraints
check whether your test cases are still covered
check whether the number of false positives (FPs) decreases
The result set need not be perfectly clean
→ you can export the matches and filter them offline, e.g. manually in Excel
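
The stepwise refinement described above can be sketched in Python as follows. The mini-corpus, the sentence IDs and both queries are hypothetical. The toy corpus is also small enough that the recorded test cases are all of its relevant sentences, which is what makes the precision figure meaningful here. On a real corpus you would only know recall on the test cases.

```python
import re

# Hypothetical mini-corpus: sentence ID -> POS-tagged text (word/TAG).
corpus = {
    "s1": "he/PRP took/VBD off/RP his/PRP$ cravat/NN",
    "s2": "I/PRP took/VBD it/PRP off/RP again/RB",
    "s3": "she/PRP fell/VBD off/IN the/DT chair/NN",
    "s4": "they/PRP took/VBD the/DT covers/NNS off/IN",
}
# Sentence IDs recorded earlier as relevant instances ("test cases").
test_cases = {"s1", "s2", "s4"}

def run_query(pattern):
    """Return the IDs of all sentences matched by the regular expression."""
    return {sid for sid, text in corpus.items() if re.search(pattern, text)}

def evaluate(pattern):
    """Return (precision, recall) of a query, measured against the test cases."""
    hits = run_query(pattern)
    covered = hits & test_cases
    recall = len(covered) / len(test_cases)
    precision = len(covered) / len(hits) if hits else 0.0
    return precision, recall

general = r"\boff/"                    # any tagged occurrence of "off"
refined = r"\btook/VBD\b.*\boff/"      # "off" somewhere after "took"

for name, query in [("general", general), ("refined", refined)]:
    precision, recall = evaluate(query)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

After each refinement step, recall on the test cases should stay at 1.0 while precision goes up; a drop in recall means the new constraint is too strict.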

Searching corpora: some general considerations


How do you deal with annotation errors?
Check confusion matrices, if available → which tags tend to be confused?
Anticipate errors that occur due to the architecture of the automatic tool
e.g. HMMs use a limited window of n words → problem for distant dependencies
e.g. phrasal verbs in the DICKENS corpus (see http://corpora.linguistik.uni-erlangen.de/demos/CQP/): what POS tags do you expect?
he took off/RP his cravat
but I took it off/RP again
and took the covers off/IN in such a
Take the top one off/RP , my love
for taking the postboy ’s hat off/RB ;
the Jack took one of his bloated shoes off/RB ,
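
If no confusion matrix is published for a tagger, a small one can be computed from a manually checked sample. A minimal sketch in Python, assuming you have gold tags and automatic tags for the same tokens; the two tag lists below are invented for illustration only.

```python
from collections import Counter

def confusion_counts(gold_tags, predicted_tags):
    """Count (gold, predicted) tag pairs; pairs with different tags are errors."""
    return Counter(zip(gold_tags, predicted_tags))

def most_confused(gold_tags, predicted_tags, top=3):
    """Return the most frequent (gold, predicted) pairs where the tags differ."""
    counts = confusion_counts(gold_tags, predicted_tags)
    errors = Counter({pair: n for pair, n in counts.items() if pair[0] != pair[1]})
    return errors.most_common(top)

# Hypothetical sample: manually corrected tags vs. tagger output.
gold      = ["RP", "RP", "RP", "RP", "IN", "RB", "RP"]
predicted = ["RP", "RP", "IN", "RP", "IN", "RB", "RB"]

for (gold_tag, predicted_tag), n in most_confused(gold, predicted):
    print(f"gold {gold_tag} tagged as {predicted_tag}: {n}x")
```

The tags that show up here as frequent confusions are the ones a query should additionally allow for.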


Searching corpora: some general considerations

Phrasal verbs: if you are interested in phrasal verbs, you shouldn’t limit your search to RP but also include RB and IN
If your phenomenon concerns a closed class of words, you could just as well search for the word forms rather than their POS tags
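
The effect of both recommendations can be checked on the six DICKENS examples shown earlier. This is a minimal sketch using Python's `re` module on plain word/TAG strings; it is not how CQP stores or queries a corpus.

```python
import re

# The six examples from the DICKENS corpus, with the POS tag of "off".
lines = [
    "he took off/RP his cravat",
    "but I took it off/RP again",
    "and took the covers off/IN in such a",
    "Take the top one off/RP , my love",
    "for taking the postboy 's hat off/RB ;",
    "the Jack took one of his bloated shoes off/RB ,",
]

def count_hits(tag_pattern):
    """Count lines in which "off" carries one of the given POS tags."""
    query = re.compile(r"\boff/(?:" + tag_pattern + r")\b")
    return sum(1 for line in lines if query.search(line))

print(count_hits("RP"))          # particle tag only
print(count_hits("RP|RB|IN"))    # particle, adverb or preposition

# Closed class: search for the word form and ignore the tag altogether.
print(sum(1 for line in lines if re.search(r"\boff\b", line)))
```

Restricting the query to RP finds only half of the examples; allowing RB and IN, or searching for the word form, finds all six.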


Searching corpora: some general considerations

How can you be sure that you got all the instances in the corpus (recall = 100%)?
You never can (or you would have to go through the entire corpus manually and check each sentence carefully . . . )
Getting 0 matches does not mean:
↛ 0 instances in the corpus
↛ ungrammatical construction (remember Zipf!)
For rather frequent phenomena: you could manually check a random sample of the corpus and see how many instances you missed
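
The sample-based check can be sketched as follows, assuming sentence IDs are available and the manual inspection yields the set of IDs that contain the phenomenon. All IDs and sets below are hypothetical placeholders.

```python
import random

def estimate_recall(sample_ids, manually_found, query_hits):
    """Share of manually identified instances in the sample that the query also found."""
    relevant = set(manually_found) & set(sample_ids)
    if not relevant:
        return None  # no instance in the sample, so recall cannot be estimated
    return len(relevant & set(query_hits)) / len(relevant)

all_ids = list(range(1000))      # hypothetical sentence IDs of the corpus
rng = random.Random(42)          # fixed seed so that the sample is reproducible
sample = rng.sample(all_ids, 50)

# Placeholder for the manual check: pretend five sampled sentences contain
# the phenomenon and that the query returned four of them.
manually_found = set(sample[:5])
query_hits = set(sample[:4])

print(estimate_recall(sample, manually_found, query_hits))
```

The estimate is only as reliable as the sample is large; for rare phenomena the sample may contain no instance at all, which is why the function returns `None` in that case.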


Search tools

Most search tools come with their own query language
That is, if we want to search a corpus, we have to know
1 the annotated structures and the tagsets
2 and the syntax of the query language
Some tools also offer different ways of exporting the results
the results can then be further processed offline, e.g. in Excel


Query languages

All query languages make use of regular expressions (REs)
REs allow us to search for general patterns rather than specific words or tags
Example queries:
which words end with -atious?
are there words with five o’s in them?
are there phrases with three successive adjectives?
→ We first have a closer look at REs
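
To give a first impression, the three example queries could be written as follows with Python's `re` module. The syntax of corpus query languages differs in detail, and the word list and the tagged sentence are made up for illustration.

```python
import re

words = ["vexatious", "ostentatious", "gracious", "monotonous",
         "photophosphorylation"]

# Words ending in -atious (with at least one symbol before the ending).
atious = [w for w in words if re.fullmatch(r".+atious", w)]

# Words containing at least five o's: an "o" plus any non-o symbols, five times.
five_os = [w for w in words if re.search(r"(?:o[^o]*){5}", w)]

# Three successive adjectives in a word/TAG sequence (Penn tag JJ).
tagged = "a/DT big/JJ old/JJ red/JJ barn/NN"
three_adjectives = re.search(r"(?:\S+/JJ\s+){2}\S+/JJ\b", tagged)

print(atious)
print(five_os)
print(three_adjectives.group() if three_adjectives else None)
```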


Outline

1 Regular expressions

2 Search tools
Search tools for flat annotations
Search tools for deep annotations


Regular expressions

REs are a formal language (and equivalent in expressive power to finite-state automata)
Unfortunately, there is no single standardized syntax for formulating REs
we first have a look at a somewhat simplified version of REs
. . . and will see a full version of REs in this afternoon’s practical session


Regular expressions (simplified version)

RE matching an arbitrary symbol: ?
Arbitrarily many symbols: *
One or more symbols: +
+atious → vexatious, ostentatious
Alternatives: [x,y]
+[able,ability] → capable, capability
neighbo[u,]r → neighbour, neighbor
To match a special character literally: \x
\? → a literal question mark
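
To see how the simplified notation relates to a full RE syntax, here is a small sketch that translates it into Python regular expressions. The function names are made up for this illustration. It also assumes that `*` means zero or more symbols and that a pattern has to match the whole word. Unbalanced brackets are not handled.

```python
import re

def simple_to_regex(query):
    """Translate the simplified wildcard syntax into a Python regular expression."""
    out = []
    depth = 0                      # greater than 0 while inside [ ... ]
    i = 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            out.append(re.escape(query[i + 1]))   # \x: match x literally
            i += 2
            continue
        if ch == "?":
            out.append(".")        # exactly one arbitrary symbol
        elif ch == "*":
            out.append(".*")       # zero or more symbols (assumption)
        elif ch == "+":
            out.append(".+")       # one or more symbols
        elif ch == "[":
            depth += 1
            out.append("(?:")      # start of a list of alternatives
        elif ch == "]" and depth > 0:
            depth -= 1
            out.append(")")
        elif ch == "," and depth > 0:
            out.append("|")        # separator between alternatives
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)

def matches(query, word):
    """True if the whole word matches the simplified pattern."""
    return re.fullmatch(simple_to_regex(query), word) is not None

for query, word in [("+atious", "vexatious"), ("+[able,ability]", "capability"),
                    ("neighbo[u,]r", "neighbor"), ("\\?", "?")]:
    print(query, word, matches(query, word))
```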


A search tool for “flat annotations”: CQP

Ref: Christ, Schulze, Hofmann, and König (1999)

‘Corpus Query Processor’
Free, open source: http://cwb.sourceforge.net/
Very efficient, widely used corpus search tool
e.g. by BNCweb
Focus on token-based annotations: POS, lemma, . . .
but also some support for phrasal annotations
→ We now look at CQP inside BNCweb
in this afternoon’s practical session, we will work with the “original” CQP (and ANNIS)


A CQP-interface for the BNC: BNCweb

BNCweb: a graphical user interface for searching the BNC, which uses CQP as the underlying search tool
Two search modes
Simple search: easier to handle → for now
CQP search: more powerful → this afternoon
Remember: the BNC does not use the Penn Tagset


BNCweb: Simple Search

URL: http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php (free but requires registration)

Searches for:
Word forms: saw
Lemmas: {see}
POS tags: _NN1 (for singular nouns)
“Coarse” POS tags: _{N} (for nouns in general)
Both combined: saw_{N}
Using regular expressions: saw_N*
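
To make the meaning of these query types concrete, here is a toy matcher in Python that mimics them on a handful of tokens. It is not BNCweb's implementation: the token list is invented, matching is case-sensitive, and the coarse class `{N}` is approximated by "tag starts with N", which is an assumption made for this sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical tokens as (word form, lemma, POS tag).
tokens = [
    ("I", "i", "PNP"), ("saw", "see", "VVD"), ("the", "the", "AT0"),
    ("saw", "saw", "NN1"), ("sees", "see", "VVZ"),
]

def matches(term, token):
    """Check one token against a query term such as saw, {see}, _NN1 or saw_{N}."""
    word, lemma, pos = token
    form, _, tag = term.partition("_")
    if tag.startswith("{") and tag.endswith("}"):
        tag = tag[1:-1] + "*"      # toy stand-in for coarse classes: tag prefix
    if tag and not fnmatchcase(pos, tag):
        return False
    if form.startswith("{") and form.endswith("}"):
        return lemma == form[1:-1]             # {see}: search by lemma
    return form == "" or fnmatchcase(word, form)

def search(term):
    """Return all tokens that match the query term."""
    return [token for token in tokens if matches(term, token)]

for term in ["saw", "{see}", "_NN1", "_{N}", "saw_{N}", "saw_N*"]:
    print(term, search(term))
```

In this toy data, the word form query returns both the verb and the noun reading of "saw", the lemma query returns the inflected forms of "see", and the combined queries return only the noun.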


A search tool for “deep annotations”: ANNIS

Ref: Krause and Zeldes (2014)

Free, open source: http://corpus-tools.org/annis/
Mainly developed by Anke Lüdeling’s group (Berlin) plus Amir Zeldes (Georgetown)
Focus on deeply-annotated corpora rather than efficiency
Web-based tool
For searching and visualizing annotated data


A search tool for “deep annotations”: ANNIS

ANNIS focuses on corpora that are “deeply annotated”
Complex structures, e.g. syntax trees
Multi-level annotations
e.g. POS, syntax, information status, coreference, document structure (GUM corpus https://corpling.uis.georgetown.edu/gum/)
annotated units (markables) can overlap
e.g. sentence boundaries and line breaks
this is a problem for many tools that are designed for tree-like data structures, like syntactic annotations
Demo corpora: https://corpling.uis.georgetown.edu/annis-corpora/


ANNIS: multi-level annotations


ANNIS: some background information

ANNIS was originally created for investigating Information Structure
‘Information Structure’: refers to the way language conveys
information, by marking information
as new or old (‘information status’)
as being in the focus or background
as the topic of a sentence or as a comment about the topic
Information structure concerns all linguistic levels
prosody, referentiality, morphology, syntax, . . .
. . . and the levels interact


ANNIS: some background information

Hence, ANNIS was developed to support data annotated at various levels
data would be annotated by specialized tools
e.g. MMAX for coreference, EXMARaLDA for prosody, Arborator for syntax (http://mmax2.sourceforge.net/, http://www.exmaralda.org/, http://arborator.ilpga.fr/)
. . . and all annotations would come together within ANNIS
so that interactions between the levels can be investigated
There is also support for competing annotations, e.g. POS
tags from different taggers


SaltNPepper: format conversion


Ref: http://corpus-tools.org/pepper/
The ANNIS project has also developed a tool for converting between the formats of different annotation tools
Annotations from different annotation tools can be imported into ANNIS
Usually each tool comes with its own format
ANNIS uses SaltNPepper to convert from these formats into other formats
→ You can also use this software for your own conversion tasks


ANNIS: special operators (examples)

Operator for overlapping annotations:
[infstat="giv" _o_ cat="PP"]
Operator for including annotations:
[infstat="giv" _i_ cat="PP"]
Namespaces: annotations of the same type from different
sources can be marked by namespaces:
[ptb:pos="NN" & brown:pos="NN1"]
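
What overlap and inclusion mean for annotated spans can be illustrated with token offsets. The following Python sketch is not ANNIS code. It assumes that each annotation covers a contiguous range of tokens, and the spans are hypothetical.

```python
from typing import NamedTuple

class Span(NamedTuple):
    label: str
    start: int   # index of the first token covered
    end: int     # index of the last token covered (inclusive)

def overlaps(a, b):
    """True if the two spans share at least one token (cf. _o_)."""
    return a.start <= b.end and b.start <= a.end

def includes(a, b):
    """True if span a covers every token covered by span b (cf. _i_)."""
    return a.start <= b.start and b.end <= a.end

given = Span('infstat="giv"', 2, 6)
pp_crossing = Span('cat="PP"', 4, 8)   # starts inside the given span, ends outside
pp_inside = Span('cat="PP"', 3, 5)     # lies entirely within the given span

print(overlaps(given, pp_crossing), includes(given, pp_crossing))
print(overlaps(given, pp_inside), includes(given, pp_inside))
```

Every inclusion is also an overlap, but not the other way round. The first PP would be found only by the overlap query, the second by both.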


ANNIS: token layers

Competing tokenizations: one can even define several token layers
one layer contains the actual primary data
the other layer is a kind of annotation
E.g. in historical data from German:
word boundaries (marked by spaces) most of the time
correspond to modern words
but there can be mismatches


Token layers in REM

Token layers in the Reference Corpus of Middle High German (REM)
We use two token layers
1 diplomatic tokens: the form of the tokens is very close to the original manuscript
special characters, medieval abbreviations, original spaces
2 modernized tokens: the form of the tokens is somewhat closer to the modern standard
only ASCII characters and modern word boundaries


Token layers in REM

Both token layers (diplomatic and modernized) are reference points for further annotations
1 Diplomatic tokens: layout information (pages, columns, lines) is annotated with reference to diplomatic tokens
2 Modernized tokens: all further annotations (POS, morphology, lemma) are annotated with reference to modernized tokens


Token mismatches


ANNIS: Query Builder


For casual users who do not want to learn the query
language there is a graphical Query Builder
The graphical query is translated into an ANNIS QL
expression and can be edited


ANNIS: Simple search

We built an even simpler interface that only allows for a restricted set of queries
Again, the query is translated into an ANNIS QL
expression and can be edited


ANNIS: export options


ANNIS supports different kinds of exports
You can download information about the matches
e.g. in CSV format
either all information or a filtered subset
Alternatively, ANNIS gives you some simple statistics about the matches
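
Exported matches can also be processed with a few lines of code instead of a spreadsheet. This is a minimal sketch using Python's standard `csv` module. The column names and rows are assumptions made for illustration and do not reflect the actual layout of an ANNIS export.

```python
import csv
import io
from collections import Counter

# Hypothetical export with one row per match; column names are assumptions.
exported = """match_id,token,pos
1,off,RP
2,off,RB
3,off,RP
4,off,IN
"""

rows = list(csv.DictReader(io.StringIO(exported)))

# Simple statistics: how often does each POS tag occur among the matches?
pos_counts = Counter(row["pos"] for row in rows)

# Offline filtering: keep only the matches tagged as particle or adverb.
kept = [row for row in rows if row["pos"] in {"RP", "RB"}]

print(pos_counts)
print([row["match_id"] for row in kept])
```

For a real export, replace `io.StringIO(exported)` with an opened file and adapt the column names to the header of that file.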


Other tools for searching “deep annotations”


URL: http://clarino.uib.no/iness/

INESS: “Infrastructure for the Exploration of Syntax and Semantics”
Platform that offers a broad range of syntactically-annotated treebanks for searching
Languages:
Ancient Greek (to 1453) (7) · Arabic (2) · Basque (3) · Bulgarian (4) · Catalan (1) · Chinese (1) · Church Slavic (5) · Classical Armenian (1) · Croatian (4) · Czech (5) · Danish (4) · Dutch (3) · English (7) · Estonian (3) · Finnish (6) · French (3) · Galician (1) · Georgian (4) · German (11) · Gothic (3) · Hebrew (3) · Hindi (2) · Hungarian (4) · Icelandic (1) · Indonesian (4) · Irish (3) · Italian (3) · Kazakh (1) · Latin (10) · Latvian (1) · Modern Greek (1453-) (4) · (1) · Northern Sami (15) · Norwegian Bokmål (5) · Old English (ca. 450-1100) (5) · Old French (842-ca. 1400) (1) · Old Norse (4) · Old Russian (20) · Persian (3) · Polish (5) · Portuguese (7) · Romanian (2) · Russian (2) · Slovenian (4) · Spanish (3) · Swedish (5) · Tamil (2) · Turkish (2) · Urdu (1) · Wolof (1)

Frameworks: constituency: 13, dependency: 45, dependency-CG: 125, LFG: 19
Also parallel treebanks


Summary

Corpora and annotations
What resources are there
Annotations are highly useful for accessing data
Important aspect: annotation quality
guidelines, IAA, documentation
Automatic tools
Again: quality of their analyses
accuracy, confusion matrices
How to use that knowledge for sensible searches


Thank you

Stefanie Dipper
Ruhr-University Bochum
[email protected]


References

Christ, O., B. M. Schulze, A. Hofmann, and E. König (1999). The IMS Corpus Workbench: Corpus Query Processor (CQP) user’s manual. Technical report, IMS, University of Stuttgart, Germany.

Krause, T. and A. Zeldes (2014). ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities. http://dsh.oxfordjournals.org/cgi/content/abstract/fqu057?ijkey=GJBr0LhNfKW1g8i&keytype=ref.
