4 - Slides Regular Expression
Search tools
Stefanie Dipper
Institute of Linguistics
Ruhr-University Bochum
Today’s session
If you think that the data is useful for you, you check:
which data structures (spans, trees, ...) and tagsets (Penn
tagset, ...) are used
which search tool can be used
→ how to formulate your queries
How can you be sure that you got all the instances in the
corpus (recall = 100%)?
You never can (or you would have to go through the entire
corpus manually and check each sentence carefully ...)
Getting 0 matches does not mean:
↛ 0 instances in the corpus
↛ ungrammatical construction (remember Zipf!)
For rather frequent phenomena: you could manually check
a random sample of the corpus and see how many
instances you missed
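The sampling check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name estimate_recall, the sentence IDs, and the is_instance callback (which stands in for your manual judgement of each sampled sentence) are all hypothetical and not part of any search tool.

```python
import random

def estimate_recall(sentence_ids, query_hits, is_instance, sample_size, seed=0):
    """Estimate recall from a manually checked random sample.

    sentence_ids: all sentence IDs in the corpus (hypothetical IDs)
    query_hits:   set of sentence IDs returned by the query
    is_instance:  stand-in for the manual check; returns True if the
                  sentence really contains the phenomenon
    """
    rng = random.Random(seed)
    sample = rng.sample(sorted(sentence_ids), sample_size)
    # Sentences in the sample that truly contain the phenomenon
    relevant = [s for s in sample if is_instance(s)]
    if not relevant:
        # Too rare to estimate from this sample
        return None
    # How many of those did the query actually return?
    found = sum(1 for s in relevant if s in query_hits)
    return found / len(relevant)

# Toy data: every 5th sentence is a true instance, the query finds every 10th
corpus = list(range(100))
hits = set(range(0, 100, 10))
print(estimate_recall(corpus, hits, lambda i: i % 5 == 0, sample_size=40))
```

If the sample contains no true instance at all, the function returns None, which is the expected outcome for rare phenomena: a small sample cannot tell you how much you missed.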
Search tools
Query languages
Outline
1 Regular expressions
2 Search tools
Search tools for flat annotations
Search tools for deep annotations
Regular expressions
Search tools for flat annotations
URL: http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php
(free but requires registration)
Searches for:
Word forms: saw
Lemmas: {see}
POS tags: _NN1 (for singular nouns)
“Coarse” POS tags: _{N} (for nouns in general)
Both combined: saw_{N}
Using regular expressions: saw_N*
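To see what such queries select, here is a minimal Python sketch that emulates them on a toy list of word_TAG tokens. It is not the search tool's own query engine: the token list and the search helper are made up for illustration, the tags follow the BNC style shown above, and the sketch assumes that the * in saw_N* stands for "any remaining characters" (written .* in a Python regular expression). Lemma queries such as {see} are left out because the toy tokens carry no lemma information.

```python
import re

# Toy tagged tokens in word_TAG form (hypothetical data, BNC-style tags)
tokens = ["I_PNP", "saw_VVD", "a_AT0", "saw_NN1", "and_CJC", "saws_NN2", "see_VVI"]

def search(pattern, tagged_tokens):
    """Return all tokens that match the pattern as a whole."""
    regex = re.compile(pattern)
    return [t for t in tagged_tokens if regex.fullmatch(t)]

# Word form "saw" with any tag
print(search(r"saw_.*", tokens))   # ['saw_VVD', 'saw_NN1']

# Any word tagged as singular noun (NN1)
print(search(r".*_NN1", tokens))   # ['saw_NN1']

# Any word whose tag starts with N (nouns in general)
print(search(r".*_N.*", tokens))   # ['saw_NN1', 'saws_NN2']

# Word form and tag combined: "saw" used as a noun
print(search(r"saw_N.*", tokens))  # ['saw_NN1']
```

Using fullmatch rather than search keeps the pattern anchored to the whole token, so saw_.* does not accidentally match a longer word form that merely contains "saw".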
Search tools for deep annotations
Token mismatches
Summary
Thank you