Background

The document discusses background on formal language theory including finite state automata, regular expressions, context free grammar and dependency grammar. It then describes corpora, annotated corpora and other lexical resources used in natural language processing.

Uploaded by

saisuraj1510


Terminologies to know
1. Finite state Automata
2. Regular Expressions
3. Context Free Grammar/ Phrase structure grammar
4. Dependency Grammar
5. Corpus
6. Annotated corpus
7. Other Lexical resources
Formal language theory
• Alphabet is a finite, non-empty set.
• The elements of the set are called symbols.
• A finite sequence of symbols a1a2...an from an alphabet is a string.

• Σ={0,1} is an alphabet, and 011, 1010, and 1 are all strings over Σ.

• Strings are sequences of symbols.

• A finite state automaton (FSA) defines a formal language as the set of strings it accepts.
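
The definitions of alphabet and string above can be checked mechanically. The following is a minimal Python sketch; the names `SIGMA` and `is_string_over` are illustrative and do not come from the slides.

```python
# The alphabet from the example: a finite, non-empty set of symbols.
SIGMA = {"0", "1"}

def is_string_over(candidate, alphabet):
    """Return True if every symbol of `candidate` belongs to `alphabet`."""
    return all(symbol in alphabet for symbol in candidate)

for s in ["011", "1010", "1", "012"]:
    print(s, is_string_over(s, SIGMA))
```

The last string contains the symbol 2, which is not in Σ, so it is not a string over Σ.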

Formal Definition
FSA is a 5-tuple consisting of
✓Q : a finite set of states {q0,q1,q2,q3,q4}
✓Σ : an alphabet of symbols {a,b,!}
✓q0 : a start state
✓F : a set of final states, a subset of Q {q4}
✓δ(q,i) : a transition function that gives the next state for state q and input symbol i
[Figure: state diagram with states q0, q1, q2, q3, q4; arcs labeled b, a, a, ! in sequence, plus a loop labeled a]
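
As a rough sketch, the 5-tuple above can be written down as a transition table in Python. The exact arcs are an assumption, since the diagram did not survive extraction: the code follows the usual "sheep language" machine, with the loop on a placed at q3. The names `TRANSITIONS` and `accepts` are illustrative.

```python
# Assumed arcs: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus a loop q3 -a-> q3.
STATES = {"q0", "q1", "q2", "q3", "q4"}   # Q
ALPHABET = {"a", "b", "!"}                # Sigma
START = "q0"                              # q0
FINAL = {"q4"}                            # F
TRANSITIONS = {                           # delta(q, i)
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

def accepts(string):
    """Run the automaton over `string`; accept if it ends in a final state."""
    state = START
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False  # no arc for this symbol: reject
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL

for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(s, accepts(s))
```

Under these assumed arcs, the machine accepts b followed by two or more a's and a final !, and rejects everything else.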
Finite State Automata
FSAs recognize the sets of strings described by regular expressions, for example:
• /baa!/
• /baaa!/
• /baaaa!/
Regular Expressions
Regular Expression: A way of describing the structure of the strings in a language (a formula in algebraic notation)
• Language (over alphabet Σ={a, b})
• L={x|x starts and ends with ‘a’}.
• Regular expression a·(a|b)∗·a is a pattern that captures this structure and matches every string in L of length two or more. The one-symbol string a also starts and ends with ‘a’, so the full language is described by a | a·(a|b)∗·a.
• String: Any sequence of alphanumeric characters
• Letters, numbers, spaces, tabs, punctuation marks
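
The pattern for L can be tried out with Python's `re` module. This is a small sketch, not part of the original slides; `in_language` is a hypothetical helper that checks the set definition of L directly, so the two patterns can be compared against it.

```python
import re

# a.(a|b)*.a written in Python syntax; fullmatch requires the whole string to match.
PATTERN = re.compile(r"a(?:a|b)*a")
# Variant that also covers the one-symbol string "a".
WITH_SINGLE_A = re.compile(r"a|a(?:a|b)*a")

def in_language(s):
    """Direct check of L: a non-empty string over {a, b} that starts and ends with 'a'."""
    return bool(s) and set(s) <= {"a", "b"} and s[0] == "a" and s[-1] == "a"

for s in ["a", "aa", "aba", "abba", "ab", "ba", ""]:
    print(repr(s), in_language(s),
          PATTERN.fullmatch(s) is not None,
          WITH_SINGLE_A.fullmatch(s) is not None)
```

The only sample on which the first pattern disagrees with the definition of L is the single symbol a.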

Automata in Language
Automata are computational devices to solve language recognition
problems

Language recognition problem:

To determine whether a word belongs to a language.
Context-free grammar (CFG)
• A context-free grammar (CFG) is a list of rules that define the set of all well-formed sentences in a language.
• Each rewrite rule has a single symbol on its left-hand side.
S ---> NP VP
• Syntactic analysis: a parsing algorithm uses the CFG to convert a sentence into a parse tree.
• The parse tree breaks the sentence down into structured parts.
• A computer can then process the sentence more easily.
CFG Parse Tree
S -> NP VP
NP -> DET N
NP -> DET ADJ N
VP -> V NP

DET -> the
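
To illustrate how these rules produce a parse tree, here is a minimal top-down parser in plain Python. It is a sketch under stated assumptions: the slides only give the lexical rule DET -> the, so the entries for N, ADJ and V, the example sentence, and the function names are all made up for the illustration.

```python
# Grammar from the slide; the N, ADJ and V words are assumed for illustration.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
    "DET": [["the"]],
    "N": [["dog"], ["cat"]],   # assumed
    "ADJ": [["big"]],          # assumed
    "V": [["chased"]],         # assumed
}

def parse(symbol, tokens, start):
    """Yield (tree, end) for every way `symbol` derives tokens[start:end]."""
    if symbol not in GRAMMAR:  # terminal: must equal the next token
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, start, [])

def expand(symbol, rhs, tokens, pos, children):
    """Match the remaining right-hand-side symbols one after another."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for subtree, end in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, end, children + [subtree])

def parse_sentence(sentence):
    """Return all parse trees that cover the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]

for tree in parse_sentence("the dog chased the big cat"):
    print(tree)
```

Each tree is a nested tuple whose first element is the rule's left-hand symbol. This simple strategy works here because the grammar has no left-recursive rules; grammars with left recursion need a different algorithm, such as chart parsing.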

Dependency parse tree

CFG Parse tree

Types of Dependencies
Typed: Each dependency carries a label indicating the relationship between the two words

Untyped: Only records which word depends on which, without a label
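
The difference between the two kinds of dependency can be sketched with plain tuples. The sentence and the relation labels (nsubj, obj, det) below are assumptions chosen for illustration; the slides do not name a label scheme.

```python
# Hypothetical typed analysis of "the dog chased a cat":
# each arc is (head, relation label, dependent).
TYPED = [
    ("chased", "nsubj", "dog"),
    ("chased", "obj", "cat"),
    ("dog", "det", "the"),
    ("cat", "det", "a"),
]

def untyped(arcs):
    """Drop the labels, keeping only which word depends on which head."""
    return [(head, dependent) for head, _label, dependent in arcs]

def dependents(arcs, head):
    """List the words that depend directly on `head`."""
    return [dep for h, _label, dep in arcs if h == head]

print(untyped(TYPED))
print(dependents(TYPED, "chased"))
```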

Corpus
Corpus is a large collection of texts.
• It is a body of written or spoken material upon which a linguistic analysis is
based.
• Text corpus: used as training data for many NLP applications.
Examples:
• Gutenberg Corpus
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus.
• Google Books Ngram Corpus
• American National Corpus
• British National Corpus
• Corpus Resource Database (CoRD), which lists more than 80 English-language corpora.
• RE3D (Relationship and Entity Extraction Evaluation Dataset)
Annotated Corpus
Apart from the plain text, a corpus can also carry additional linguistic information, called 'annotation'.
Example: Grammatically tagged corpus.
• In a grammatically tagged corpus, each word has been assigned a word class label (part-of-speech tag).
• The Brown Corpus and the British National Corpus (BNC) are examples of grammatically annotated corpora.
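
A grammatically tagged text is often stored as word/tag tokens. The sketch below reads such a string into (word, tag) pairs and counts the tags; the sample sentence, its tags, and the function name `read_tagged` are assumptions for illustration and are not taken from any of the corpora above.

```python
from collections import Counter

# Assumed sample in word/TAG form; not an excerpt from a real corpus.
TAGGED_TEXT = "the/DET dog/NOUN chased/VERB the/DET cat/NOUN ./PUNCT"

def read_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last slash, so words containing '/' still work.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = read_tagged(TAGGED_TEXT)
tag_counts = Counter(tag for _word, tag in pairs)
print(pairs)
print(tag_counts.most_common())
```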
Corpora examples
Corpus                      Contents
Brown Corpus                1.15M words, tagged, categorized
CoNLL Named Entity          700k words, POS- and named-entity-tagged
Indian POS-Tagged Corpus    60k words, tagged (Bangla, Hindi, Marathi, Telugu)
Names Corpus                8k male and female names
Reuters Corpus              1.3M words, 10k news documents, categorized
Senseval Corpus             600k words, part-of-speech and sense tagged
SEMCOR                      880k words, part-of-speech and sense tagged

More resources: https://www.nltk.org/book/ch02.html

Corpora examples
• English stop words
• GUM - Georgetown University Multilayer corpus, multiple parses, coreference,
entities, sentence types and RST
• Groningen Meaning Bank semantically annotated corpus
• HamleDT, harmonized dependency treebanks of many languages, common
annotation style.
• UMBC WebBase Corpus
• UN parallel corpora
• VP Ellipsis corpus
• TRAINS Dialogue Corpus
• Multiword Expression Resources
• Dialogue Diversity Corpus
Lexical Resources
A lexicon, or lexical resource, is a collection of words and/or phrases with associated information
• For example, part of speech and sense definitions.
WordNet (Princeton University)
• Semantically oriented dictionary of English.
• NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets.
Lexical Resources
Wordlist Corpora
• NLTK includes some corpora that are nothing more than wordlists.
• They can be used to find unusual or misspelt words in a text corpus.
Corpus of stop words
• A list of high-frequency words such as the, to and also.
• These words are usually filtered out of a document before further processing.
Comparative Wordlists
• Lists of about 200 common words in several languages.
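
Both uses of wordlists described above, filtering stop words and spotting unusual or misspelt words, can be sketched in a few lines of Python. The tiny `STOPWORDS` and `WORDLIST` sets here are stand-ins invented for the example; a real application would load full lists such as those shipped with NLTK.

```python
# Stand-in lists for illustration only; real lists are far larger.
STOPWORDS = {"the", "to", "also"}
WORDLIST = {"the", "cat", "sat", "on", "mat", "to", "also"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def unusual_words(tokens):
    """Return the alphabetic tokens that do not appear in the wordlist."""
    vocabulary = {t.lower() for t in tokens if t.isalpha()}
    return sorted(vocabulary - WORDLIST)

tokens = "The cat also sat on the matt".split()
print(remove_stopwords(tokens))
print(unusual_words(tokens))
```

In this example the misspelt token matt is the only word missing from the wordlist, so it is the only one reported as unusual.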
References
• http://www.nltk.org/book/ch02.html
