Background

The document discusses background on formal language theory including finite state automata, regular expressions, context free grammar and dependency grammar. It then describes corpora, annotated corpora and other lexical resources used in natural language processing.

Uploaded by

saisuraj1510
Background

Terms to know
1. Finite-State Automata
2. Regular Expressions
3. Context-Free Grammar / Phrase Structure Grammar
4. Dependency Grammar
5. Corpus
6. Annotated Corpus
7. Other Lexical Resources
Formal language theory
• An alphabet is a finite, non-empty set.
• The elements of the set are called symbols.
• A string is a finite sequence of symbols a1a2...an drawn from an alphabet.

• Example: Σ = {0, 1} is an alphabet, and 011, 1010, and 1 are all strings over Σ.

• A finite-state automaton (FSA) defines a formal language by specifying the set of strings it accepts.


Formal Definition
An FSA is a 5-tuple consisting of
✓ Q : a finite set of states, e.g. {q0, q1, q2, q3, q4}
✓ Σ : an alphabet of symbols, e.g. {a, b, !}
✓ q0 : a start state
✓ F : a set of final states, F ⊆ Q, e.g. {q4}
✓ δ(q, i) : a transition function giving the next state for state q and input symbol i
[Figure: state diagram with states q0, q1, q2, q3, q4; arcs labeled b, a, a, ! in sequence, plus a loop arc labeled a]
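
The 5-tuple can be written out directly as data. The following is a minimal Python sketch, not a definitive implementation. It assumes the loop arc labeled a sits on state q3, which the slide text does not state explicitly, and the names DELTA and accepts are illustrative.

```python
# A sketch of the FSA as a 5-tuple (Q, SIGMA, START, FINAL, DELTA).
Q = {"q0", "q1", "q2", "q3", "q4"}
SIGMA = {"a", "b", "!"}
START = "q0"
FINAL = {"q4"}

# Transition function as a lookup table: (state, symbol) -> next state.
# Assumption: the loop on "a" is placed on q3.
DELTA = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}


def accepts(string):
    """Return True if the FSA ends in a final state after reading `string`."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False  # no transition defined: reject
        state = DELTA[(state, symbol)]
    return state in FINAL
```

With this table, accepts("baa!") and accepts("baaaa!") hold, while "ba!" and "baa" are rejected: the first has too few a symbols and the second never reaches q4.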
Finite State Automata
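FSAs recognize the strings described by regular expressions. For example, the automaton below accepts:
• baa!
• baaa!
• baaaa!
[Figure: state diagram with states q0, q1, q2, q3, q4; arcs labeled b, a, a, ! in sequence, plus a loop arc labeled a]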
Regular Expressions
Regular expression: a way of describing the structure of the strings in a
language (a formula in algebraic notation)
• Language (over alphabet Σ = {a, b})
• L = {x | x starts and ends with 'a'}.
• The regular expression a·(a|b)∗·a is a pattern that captures this
structure and matches every string in L of length two or more; the
one-symbol string a needs an extra alternative, giving a | a·(a|b)∗·a
• String: any sequence of characters
• Letters, numbers, spaces, tabs, punctuation marks
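
The language L can be checked with Python's standard re module. This is a small sketch with an illustrative function name, in_language. Since a·(a|b)∗·a alone does not match the one-symbol string a, the pattern here adds a as an alternative.

```python
import re

# a(a|b)*a covers strings of length >= 2 that start and end with "a";
# the alternative "a" covers the single-symbol string.
PATTERN = re.compile(r"a(a|b)*a|a")


def in_language(string):
    """Return True if the whole string starts and ends with 'a' over {a, b}."""
    return PATTERN.fullmatch(string) is not None
```

fullmatch is used rather than search so that the entire string must fit the pattern, which is what membership in a language means.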

Automata in Language
Automata are computational devices for solving language recognition
problems.

Language recognition problem: determining whether a word belongs to a language.
Context-free grammar (CFG)
• A context-free grammar (CFG) is a list of rules that define the set of all
well-formed sentences in a language.
• Each rewrite rule has a single symbol on its left-hand side, e.g.
S -> NP VP
• Syntactic analysis: a parsing algorithm uses the CFG to convert a
sentence into a parse tree.
• The parse tree breaks the sentence down into structured parts
that a computer can process easily.
CFG Parse Tree
S -> NP VP
NP -> DET N
NP -> DET ADJ N
VP -> V NP

DET -> the
ADJ -> big | fat
N -> cat | cats | rice
V -> eat | eats | ate
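
To show how a parser turns these rules into a parse tree, here is a minimal top-down, backtracking sketch in plain Python. It is one possible approach, not the parsing algorithm the slides have in mind. The names GRAMMAR, parse, expand, and parse_sentence are illustrative, and it relies on the grammar having no left-recursive rules.

```python
# The grammar above: each non-terminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
    "DET": [["the"]],
    "ADJ": [["big"], ["fat"]],
    "N": [["cat"], ["cats"], ["rice"]],
    "V": [["eat"], ["eats"], ["ate"]],
}


def parse(symbol, tokens, start):
    """Yield (tree, next_position) for each way `symbol` derives a prefix
    of tokens[start:]."""
    if symbol not in GRAMMAR:  # a terminal: must match the next word
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, start, [])


def expand(symbol, rhs, tokens, pos, children):
    """Match the remaining right-hand-side symbols one after another."""
    if not rhs:
        yield (symbol, children), pos
        return
    for child, next_pos in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, next_pos, children + [child])


def parse_sentence(sentence):
    """Return the first parse tree covering the whole sentence, or None."""
    tokens = sentence.lower().split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None
```

For "the fat cat ate the rice" this returns a nested (label, children) tree rooted at S. "the cat eats rice" returns None, because every NP in this grammar requires a DET. The grammar does not encode number agreement, so "the cats eats the rice" is also accepted.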
Dependency Grammar
• CFG: focuses on identifying phrases and their recursive structure
• Dependency grammar: focuses on relations between words.

[Figure: dependency parse tree]

[Figure: CFG parse tree]

Types of Dependencies
Typed: each dependency carries a label indicating the relationship between the words

Untyped: only records which word depends on which
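
The difference between typed and untyped dependencies can be seen in a small data-structure sketch. The sentence "the cat eats rice" and the labels nsubj, obj, and det (in the style of Universal Dependencies) are assumptions chosen for illustration; they do not come from the slides.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    head: str       # the word being depended on
    dependent: str  # the word that depends on the head
    label: str      # relation type (only meaningful for typed dependencies)


# Hypothetical typed analysis of "the cat eats rice".
TYPED = [
    Dependency("eats", "cat", "nsubj"),
    Dependency("eats", "rice", "obj"),
    Dependency("cat", "the", "det"),
]


def untyped(dependencies):
    """Drop the labels, keeping only which word depends on which."""
    return [(d.head, d.dependent) for d in dependencies]


def root(dependencies):
    """Return the one word that is a head but never a dependent."""
    heads = {d.head for d in dependencies}
    dependents = {d.dependent for d in dependencies}
    (root_word,) = heads - dependents
    return root_word
```

Here root(TYPED) is "eats", the verb that the other words attach to directly or indirectly.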


Corpus
A corpus is a large collection of texts.
• It is a body of written or spoken material upon which a linguistic analysis is
based.
• Text corpora are used as training data for many NLP applications.
Examples:
• Gutenberg Corpus
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus
• Google Books Ngram Corpus
• American National Corpus
• British National Corpus
• Corpus Resource Database (CoRD), covering more than 80 English-language corpora
• RE3D (Relationship and Entity Extraction Evaluation Dataset)
Annotated Corpus
Apart from pure text, a corpus can also be provided with additional
linguistic information, called 'annotation'.
Example: grammatically tagged corpus.
• In a grammatically tagged corpus, each word has been assigned a word class
label (part-of-speech tag).
• The Brown Corpus and the British National Corpus (BNC) are examples of
grammatically annotated corpora.
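
A grammatically tagged text is often stored as word/tag tokens. The sketch below reads such a string into (word, tag) pairs and counts the tags. The sample sentence and its tags (DET, ADJ, N, V) are hypothetical and are not taken from any real corpus; the function names are illustrative.

```python
from collections import Counter


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs.

    Assumes every token contains a slash; the last slash separates the tag.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def tag_counts(pairs):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for _, tag in pairs)


# Hypothetical tagged sentence for illustration.
SAMPLE = "the/DET fat/ADJ cat/N ate/V the/DET rice/N"
```

Calling tag_counts(parse_tagged(SAMPLE)) gives two DET tags, two N tags, and one each of ADJ and V.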
Corpora examples
Corpus                      Contents
Brown Corpus                1.15M words, tagged, categorized
CoNLL Named Entity          700k words, POS- and named-entity-tagged
Indian POS-Tagged Corpus    60k words, tagged (Bangla, Hindi, Marathi, Telugu)
Names Corpus                8k male and female names
Reuters Corpus              1.3M words, 10k news documents, categorized
Senseval Corpus             600k words, part-of-speech and sense tagged
SEMCOR                      880k words, part-of-speech and sense tagged

More resources: https://www.nltk.org/book/ch02.html


Corpora examples
• English stop words
• GUM (Georgetown University Multilayer corpus): multiple parses, coreference,
entities, sentence types and RST
• Groningen Meaning Bank: semantically annotated corpus
• HamleDT: harmonized dependency treebanks of many languages in a common
annotation style
• UMBC WebBase Corpus
• UN parallel corpora
• VP Ellipsis corpus
• TRAINS Dialogue Corpus
• Multiword Expression Resources
• Dialogue Diversity Corpus
Lexical Resources
A lexicon, or lexical resource, is a
collection of words and/or phrases with associated information
• such as part of speech and sense definitions.
WordNet (Princeton University)
• Semantically oriented dictionary of English.
• NLTK includes the English WordNet, with 155,287 words and 117,659
synonym sets.
Lexical Resources
Wordlist Corpora
• NLTK includes some corpora that are nothing more than wordlists.
• They can be used to find unusual or misspelled words in a text corpus.
Corpus of stop words
• A list of high-frequency words such as the, to and also
• These are usually filtered out of a document before further processing.
Comparative Wordlists
• Lists of about 200 common words in several languages
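
Both uses of wordlists can be sketched in a few lines of plain Python. The tiny STOP_WORDS and WORDLIST sets below are hypothetical stand-ins for the much larger lists that a toolkit such as NLTK distributes, and the function names are illustrative.

```python
# Hypothetical miniature lists; real stop-word lists and wordlists are far larger.
STOP_WORDS = {"the", "to", "also", "a", "of", "and"}
WORDLIST = {"the", "cat", "cats", "eat", "eats", "ate", "rice", "big", "fat", "also"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not high-frequency stop words."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def unusual_words(tokens):
    """Return alphabetic tokens missing from the wordlist (possible misspellings)."""
    vocabulary = {t.lower() for t in tokens if t.isalpha()}
    return sorted(vocabulary - WORDLIST)
```

For the tokens of "the cat also eats the rice", remove_stop_words keeps cat, eats, and rice. unusual_words flags a token such as "eets" because it is absent from the wordlist.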
References
• http://www.nltk.org/book/ch02.html
