Natural Language Processing Tutorial
Audience
This tutorial is designed to benefit graduates, postgraduates, and research students who
either have an interest in this subject or have this subject as a part of their curriculum.
The reader can be a beginner or an advanced learner.
Prerequisites
The reader must have a basic knowledge of Artificial Intelligence. He/she should also be
aware of the basic terminology used in English grammar and of Python programming
concepts.
All the content and graphics published in this e-book are the property of Tutorials Point (I)
Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing
or republishing any contents or a part of the contents of this e-book in any manner without
the written consent of the publisher.
We strive to update the contents of our website and tutorials as timely and as precisely as
possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt.
Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our
website or its contents, including this tutorial. If you discover any errors on our website or
in this tutorial, please notify us at [email protected]
1. Natural Language Processing — Introduction
Language is a method of communication with the help of which we can speak, read and
write. For example, we think, make decisions and plan in natural language; precisely, in
words. However, the big question that confronts us in this AI era is whether we can
communicate with computers in a similar manner. In other words, can human beings
communicate with computers in their natural language? Developing NLP applications is a
challenge because computers need structured data, whereas human speech is unstructured
and often ambiguous in nature.
In this sense, we can say that Natural Language Processing (NLP) is the sub-field of
Computer Science, especially Artificial Intelligence (AI), that is concerned with enabling
computers to understand and process human language. Technically, the main task of NLP
is to program computers to analyze and process huge amounts of natural language data.
History of NLP
We have divided the history of NLP into four phases. The phases have distinctive concerns
and styles.
Let us now look at what the first phase covered:
Research on NLP started in the early 1950s, after Booth and Richens' investigation
and Weaver's memorandum on machine translation in 1949.
In 1954, a limited experiment on automatic translation from Russian to English was
demonstrated in the Georgetown-IBM experiment.
In the same year, publication of the journal MT (Machine Translation) started.
The first international conference on Machine Translation (MT) was held in 1952
and the second in 1956.
In early 1961, work began on the problems of addressing and constructing data or
knowledge bases. This work was influenced by AI.
A much more advanced system was described in Minsky (1968). Compared with the
BASEBALL question-answering system, it recognized and provided for the need for
inference on the knowledge base when interpreting and responding to language input.
In this phase we also got some practical resources and tools, such as parsers (e.g. the
Alvey Natural Language Tools), along with more operational and commercial systems,
e.g. for database query.
Questions such as "What is meaning?" and "How are the objects identified by the words?"
were addressed using mathematical models such as logic and model theory.
Lexical Ambiguity
The ambiguity of a single word is called lexical ambiguity. For example, the word silver
can be treated as a noun, an adjective, or a verb.
Syntactic Ambiguity
This kind of ambiguity occurs when a sentence can be parsed in different ways. For
example, consider the sentence "The man saw the girl with the telescope". It is ambiguous
whether the man saw the girl carrying a telescope or saw her through his telescope.
Semantic Ambiguity
This kind of ambiguity occurs when the meaning of the words themselves can be
misinterpreted. In other words, semantic ambiguity happens when a sentence contains an
ambiguous word or phrase. For example, the sentence "The car hit the pole while it was
moving" has semantic ambiguity because it can be interpreted as "The car, while moving,
hit the pole" or "The car hit the pole while the pole was moving".
Anaphoric Ambiguity
This kind of ambiguity arises from the use of anaphoric entities in discourse. For example,
consider "The horse ran up the hill. It was very steep. It soon got tired." Here, the anaphoric
reference of "it" in the two sentences causes ambiguity.
Pragmatic Ambiguity
This kind of ambiguity refers to a situation where the context of a phrase gives it
multiple interpretations. In simple words, we can say that pragmatic ambiguity arises
when a statement is not specific. For example, the sentence "I like you too" can have
multiple interpretations: I like you (just as you like me), or I like you (just as someone
else does).
NLP Phases
The following diagram shows the phases or logical steps in natural language processing:
[Diagram: Input sentence → Morphological Processing (using a lexicon) → Syntax Analysis (using a grammar) → Semantic Analysis (using semantic rules) → Pragmatic Analysis (using contextual information) → Target representation]
Morphological Processing
It is the first phase of NLP. The purpose of this phase is to break chunks of language input
into sets of tokens corresponding to paragraphs, sentences and words. For example, a
word like "uneasy" can be broken into two sub-word tokens, "un" and "easy".
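As a minimal sketch of this step (assuming Python with NLTK installed and the 'punkt' tokenizer data downloaded), the following code splits raw text into sentence and word tokens; sub-word segmentation such as "un" + "easy" would need a separate morphological analyzer and is not shown here.

# Minimal tokenization sketch using NLTK (assumes nltk is installed and
# nltk.download('punkt') has been run for the sentence tokenizer data).
from nltk.tokenize import sent_tokenize, word_tokenize

text = "The team felt uneasy. They revised the plan quickly."

sentences = sent_tokenize(text)                 # sentence-level tokens
words = [word_tokenize(s) for s in sentences]   # word-level tokens per sentence

print(sentences)   # ['The team felt uneasy.', 'They revised the plan quickly.']
print(words)       # [['The', 'team', 'felt', 'uneasy', '.'], ...]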
Syntax Analysis
It is the second phase of NLP. The purpose of this phase is twofold: to check whether a
sentence is well formed, and to break it up into a structure that shows the syntactic
relationships between the different words. For example, a sentence like "The school
goes to the boy" would be rejected by the syntax analyzer or parser.
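As an illustration only, the sketch below (assuming NLTK is installed; the tiny grammar and the sentence are our own assumptions, not part of NLTK) parses a sentence with NLTK's chart parser and prints its syntactic structure; a token sequence the grammar cannot derive simply yields no parse trees.

# Toy syntax analysis with NLTK's chart parser.
import nltk

# A small hand-written grammar used only for this example.
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N  -> 'boy' | 'girl' | 'telescope'
V  -> 'saw'
""")

parser = nltk.ChartParser(grammar)
tokens = "the boy saw the girl".split()

for tree in parser.parse(tokens):   # ill-formed input produces no trees
    tree.pretty_print()             # draws the syntactic structure as text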
Semantic Analysis
It is the third phase of NLP. The purpose of this phase is to draw the exact meaning, or the
dictionary meaning, from the text. The text is checked for meaningfulness. For example,
a semantic analyzer would reject a sentence like "hot ice-cream".
Pragmatic Analysis
It is the fourth phase of NLP. Pragmatic analysis fits the actual objects or events that exist
in a given context to the object references obtained during the previous phase (semantic
analysis). For example, the sentence "Put the banana in the basket on the shelf" can have
two semantic interpretations, and the pragmatic analyzer will choose between these two
possibilities.
2. Natural Language Processing — Linguistic Resources
In this chapter, we will learn about the linguistic resources in Natural Language Processing.
Corpus
A corpus is a large and structured set of machine-readable texts that have been produced
in a natural communicative setting. Its plural is corpora. Corpora can be derived in different
ways, for example from text that was originally electronic, from transcripts of spoken
language, from optical character recognition, and so on.
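As a small practical illustration (assuming NLTK is installed and the Brown corpus data has been downloaded), the following sketch inspects a ready-made machine-readable corpus shipped with NLTK.

# Inspecting the Brown corpus with NLTK (assumes nltk.download('brown')).
from nltk.corpus import brown

print(brown.categories()[:5])               # a few of the genre categories
print(len(brown.words()))                   # total number of word tokens
print(brown.words(categories='news')[:10])  # first tokens of the 'news' genre
print(brown.sents(categories='news')[0])    # first sentence of the 'news' genre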
Let us now learn about some important elements for corpus design:
Corpus Representativeness
Representativeness is a defining feature of corpus design. Researchers such as Leech and
Biber have offered definitions that help us understand corpus representativeness. From
these, we can conclude that the representativeness of a corpus is determined by the
following two factors: corpus balance and sampling, both of which are discussed below.
Corpus Balance
Another very important element of corpus design is corpus balance, that is, the range of
genres included in a corpus. We have already seen that the representativeness of a general
corpus depends upon how balanced the corpus is. A balanced corpus covers a wide range
of text categories that are supposed to be representative of the language. We do not have
any reliable scientific measure of balance; rather, the best estimation and intuition apply
here. In other words, we can say that the accepted balance is determined by the intended
uses of the corpus only.
Sampling
Sampling unit: It refers to the unit on the basis of which a sample is taken. For example,
for written text, a sampling unit may be a newspaper, a journal or a book.
Corpus Size
Another important element of corpus design is its size. How large should the corpus be?
There is no specific answer to this question. The size of the corpus depends upon the
purpose for which it is intended, as well as on some practical considerations.
With the advancement of technology, corpus sizes have also increased. For example, by
the early 21st century, the Bank of English corpus had grown to about 650 million words.
TreeBank Corpus
A treebank may be defined as a linguistically parsed text corpus that annotates syntactic
or semantic sentence structure. Geoffrey Leech coined the term 'treebank', which reflects
that the most common way of representing the grammatical analysis is by means of a tree
structure. Generally, treebanks are created on top of a corpus that has already been
annotated with part-of-speech tags.
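As a hedged sketch (assuming NLTK is installed along with its 10% sample of the Penn Treebank), the following code reads one syntactically annotated sentence from that treebank.

# Reading a parsed sentence from NLTK's Penn Treebank sample
# (assumes nltk.download('treebank')).
from nltk.corpus import treebank

print(treebank.fileids()[:3])        # a few of the corpus files
tree = treebank.parsed_sents()[0]    # first syntactically annotated sentence
tree.pretty_print()                  # draw the parse tree as text
print(treebank.tagged_sents()[0])    # the same sentence with its POS tags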
Semantic Treebanks
These treebanks use a formal representation of a sentence's semantic structure. They vary
in the depth of their semantic representation. The Robot Commands Treebank, Geoquery,
the Groningen Meaning Bank and the RoboCup Corpus are some examples of semantic
treebanks.
Syntactic Treebanks
In contrast to semantic treebanks, syntactic treebanks annotate the syntactic structure of
sentences, typically in the form of parse trees, rather than meaning representations.
Various syntactic treebanks in different languages have been created so far. For example,
the Penn Arabic Treebank and the Columbia Arabic Treebank are syntactic treebanks
created for the Arabic language, the Sinica Treebank was created for Chinese, and Lucy,
Susanne and the BLLIP WSJ corpus are syntactic resources created for English.
In Computational Linguistics
In computational linguistics, the best use of treebanks is to engineer state-of-the-art
natural language processing systems such as part-of-speech taggers, parsers, semantic
analyzers and machine translation systems.
In Corpus Linguistics
In the case of corpus linguistics, the best use of treebanks is to study syntactic phenomena.
PropBank Corpus
PropBank, more specifically called "Proposition Bank", is a corpus that is annotated with
verbal propositions and their arguments. The corpus is a verb-oriented resource; the
annotations here are closely related to the syntactic level. It was developed by Martha
Palmer et al. at the Department of Linguistics, University of Colorado Boulder. The term
propbank can also be used as a common noun referring to any corpus that has been
annotated with propositions and their arguments.
In Natural Language Processing (NLP), the PropBank project has played a very significant
role. It helps in semantic role labeling.
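As a hedged sketch (assuming NLTK is installed together with its propbank and treebank data, since PropBank annotates Penn Treebank sentences), the following code looks at one annotated proposition through NLTK's PropBank corpus reader.

# Browsing PropBank annotations via NLTK
# (assumes nltk.download('propbank') and nltk.download('treebank')).
from nltk.corpus import propbank

inst = propbank.instances()[0]   # first annotated proposition
print(inst.roleset)              # identifier of the verb sense (frame)
print(inst.predicate)            # pointer to the verb in the Treebank sentence
print(inst.arguments)            # (pointer, label) pairs such as ARG0, ARG1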
VerbNet (VN)
VerbNet (VN) is the largest hierarchical, domain-independent lexical resource for English
that incorporates both semantic and syntactic information about its contents. VN is a
broad-coverage verb lexicon with mappings to other lexical resources such as WordNet,
XTAG and FrameNet. It is organized into verb classes that extend the Levin classes through
refinement and the addition of subclasses, in order to achieve syntactic and semantic
coherence among class members.
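As a hedged sketch (assuming NLTK is installed and its VerbNet data has been downloaded), the following code queries the verb classes for a lemma through NLTK's VerbNet corpus reader.

# Querying VerbNet through NLTK (assumes nltk.download('verbnet')).
from nltk.corpus import verbnet

class_ids = verbnet.classids(lemma='give')    # VerbNet classes containing 'give'
print(class_ids)
if class_ids:
    print(verbnet.lemmas(class_ids[0])[:10])  # other member verbs of that class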
WordNet
WordNet, created at Princeton University, is a lexical database for the English language. It
is part of the NLTK corpus collection. In WordNet, nouns, verbs, adjectives and adverbs are
grouped into sets of cognitive synonyms called synsets. All the synsets are linked by
conceptual-semantic and lexical relations. This structure makes WordNet very useful for
natural language processing (NLP).
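As a small sketch (assuming NLTK is installed and the WordNet data has been downloaded), the following code looks up the synsets of a word and a few of their relations.

# Exploring WordNet synsets with NLTK (assumes nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

for syn in wn.synsets('car')[:3]:
    print(syn.name(), '-', syn.definition())   # synset identifier and gloss
    print('  lemmas:', syn.lemma_names())      # member words of the synset
    print('  hypernyms:', syn.hypernyms())     # more general synsets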
3. Natural Language Processing — Word Level Analysis
In this chapter, we will understand word level analysis in Natural Language Processing.
Regular Expressions
A regular expression (RE) is a language for specifying text search strings. An RE helps us
match or find other strings or sets of strings, using a specialized syntax held in a pattern.
Regular expressions are used to search texts in UNIX as well as in MS Word in an identical
way. Various search engines also use a number of RE features.
A regular expression requires two things: the pattern that we wish to search for, and a
corpus of text to search in (a small Python sketch of this follows below).
Regular expressions can be built up from the following rules: every symbol of the input
alphabet (and the empty string ε) is a regular expression, and if X and Y are regular
expressions, then X.Y (the concatenation of X and Y), X+Y (the union of X and Y) and X*
(the Kleene closure of X) are also regular expressions. If a string is derived from the above
rules, then it is also a regular expression.
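As a small practical sketch of the two ingredients mentioned above, a pattern and a text to search in (the sample text and pattern are illustrative assumptions), Python's built-in re module can be used as follows.

# Searching a text with a regular expression pattern using Python's re module.
import re

text = "Natural language processing makes computers process natural language."
pattern = r"natural language"                     # the pattern we wish to search for

matches = re.findall(pattern, text, flags=re.IGNORECASE)
print(matches)                                    # all case-insensitive occurrences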
As an example of the formal notation, (aa + ab + ba + bb)* describes the set of strings of
a's and b's of even length, which can be obtained by concatenating any combination of the
strings aa, ab, ba and bb, including the null string, i.e. {ε, aa, ab, ba, bb, aaab, aaba, ...}.
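This property can be checked quickly in Python; note that Python's re syntax writes the union operator + of formal regular expressions as |. The short sketch below is only an illustration.

# Checking the even-length property of (aa + ab + ba + bb)* with Python's re,
# writing the union '+' as '|'.
import re

pattern = re.compile(r"(?:aa|ab|ba|bb)*")

for s in ["", "ab", "aaba", "aba", "b"]:
    print(repr(s), "accepted" if pattern.fullmatch(s) else "rejected")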
If we do the intersection of two regular sets then the resulting set would also be
regular.
If we do the complement of regular sets, then the resulting set would also be
regular.
If we do the difference of two regular sets, then the resulting set would also be
regular.
If we do the reversal of regular sets, then the resulting set would also be regular.
If we take the closure of regular sets, then the resulting set would also be regular.
If we do the concatenation of two regular sets, then the resulting set would also be
regular.
An automaton having a finite number of states is called a Finite Automaton (FA) or a Finite
State Automaton (FSA).
We can say that any regular expression can be implemented as an FSA, and any FSA
can be described by a regular expression.
The following diagram shows that finite automata, regular expressions and regular
grammars are equivalent ways of describing regular languages.
[Diagram: Regular Expressions, Finite Automata and Regular Grammars are equivalent descriptions of Regular Languages]
Formally, a deterministic finite automaton (DFA) can be defined as a 5-tuple (Q, Σ, δ, q0, F),
where Q is a finite set of states, Σ is a finite set of input symbols, δ is the transition function,
q0 is the initial state and F is the set of final states. Graphically, a DFA can be represented
by digraphs called state diagrams, in which the vertices represent the states, the arcs
labelled with input symbols show the transitions, the initial state is marked with an incoming
arrow and the final states are marked with double circles.
Example of DFA
Suppose a DFA is defined by
Q = {a, b, c},
Σ = {0, 1},
q0 = a,
F = {c},
and the transition function δ shown in the following table:
Current State    Next State for Input 0    Next State for Input 1
a                a                         b
b                c                         a
c                b                         c
[State diagram of the above DFA]
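As a hedged sketch, the DFA above can be simulated directly in Python; the transition table is copied from the example, and the helper function below is an illustration rather than a library API.

# Simulating the example DFA: states {a, b, c}, start state a, final states {c}.
DELTA = {
    ('a', '0'): 'a', ('a', '1'): 'b',
    ('b', '0'): 'c', ('b', '1'): 'a',
    ('c', '0'): 'b', ('c', '1'): 'c',
}

def dfa_accepts(string, start='a', finals=frozenset({'c'})):
    state = start
    for symbol in string:
        state = DELTA[(state, symbol)]   # exactly one next state per input symbol
    return state in finals

print(dfa_accepts("10"))   # a -1-> b -0-> c : accepted (True)
print(dfa_accepts("11"))   # a -1-> b -1-> a : rejected (False)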
Graphically (in the same way as a DFA), an NDFA can be represented by digraphs called
state diagrams, with the difference that for an NDFA a state may have more than one
outgoing transition for the same input symbol.
Example of NDFA
Suppose an NDFA is defined by
Q = {a, b, c},
Σ = {0, 1},
q0 = a,
F = {c},
and the transition function δ shown in the following table:
Current State    Next State for Input 0    Next State for Input 1
a                a, b                      b
b                c                         a, c
c                b, c                      c
[State diagram of the above NDFA]
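As a hedged sketch mirroring the DFA example above, the NDFA can be simulated in Python by tracking the set of states reachable after each input symbol; the code below is an illustration, not a library API.

# Simulating the example NDFA: a state/input pair may lead to several states.
DELTA = {
    ('a', '0'): {'a', 'b'}, ('a', '1'): {'b'},
    ('b', '0'): {'c'},      ('b', '1'): {'a', 'c'},
    ('c', '0'): {'b', 'c'}, ('c', '1'): {'c'},
}

def ndfa_accepts(string, start='a', finals=frozenset({'c'})):
    current = {start}
    for symbol in string:
        # follow every possible transition from every currently reachable state
        current = set().union(*(DELTA.get((s, symbol), set()) for s in current))
    return bool(current & finals)   # accepted if any reachable state is final

print(ndfa_accepts("0"))    # {a} -0-> {a, b} : no final state, rejected (False)
print(ndfa_accepts("00"))   # {a, b} -0-> {a, b, c} : contains c, accepted (True)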