Natural Language Processing (NLP) : Chapter 1: Introduction To NLP
School of Informatics
Department of Information Technology
by Akalenesh.A (MSc.)
07/06/2022, December 2021
Chapter Outline:
Introduction to NLP
Levels of NLP
Natural Language Generation
History of NLP
Applications of NLP
Study of Human Languages
Ambiguity and Uncertainty in Language
NLP Phases
Chapter 1
1. Introduction to Natural Language Processing
1. Introduction
• Language is a method of communication with the help of
which we can speak, read and write.
• A language can be defined as a set of rules or set of symbols.
• Symbols are combined and used for conveying or broadcasting information. Symbols are governed by the rules.
• For example, we think, we make decisions, plans and more in
natural language; precisely, in words.
• Can human beings communicate with computers in their
natural language?
• It is a challenge for us to develop NLP applications because :
• Computers need structured data,
• but human speech is unstructured and often ambiguous in
nature.
Classification of NLP
4. Syntactic
3. Natural Language Generation
4. Components of NLG
A. Speaker and Generator
5. History of NLP
We have divided the history of NLP into four phases. The
phases have distinctive concerns and styles.
First Phase (Machine Translation Phase): Late 1940s to late 1960s. The work done in this phase focused mainly on machine translation (MT).
This phase was a period of enthusiasm and optimism.
The research on NLP started in the early 1950s after Booth & Richens’ investigation and Weaver’s memorandum on machine translation in 1949.
1954 was the year when a limited experiment on automatic translation from Russian to English was demonstrated in the Georgetown-IBM experiment.
Second Phase (AI-Influenced Phase): Late 1960s to late 1970s
• In this phase, the work done related mainly to world knowledge and its role in the construction and manipulation of meaning representations.
• That is why this phase is also called the AI-flavored phase.
In early 1961, work began on the problems of addressing and constructing data or knowledge bases. This work was influenced by AI.
In the same year, the BASEBALL question-answering system was also developed. The input to this system was restricted and the language processing involved was simple.
A much more advanced system was described in Minsky (1968). Compared to the BASEBALL question-answering system, this system recognized and provided for the need for inference on the knowledge base in interpreting and responding to language input.
Third Phase (Grammatico-logical Phase): Late 1970s to late 1980s
• This phase can be described as the grammatico-logical phase.
• Due to the failure of practical system building in the last phase, researchers moved towards the use of logic for knowledge representation and reasoning in AI.
Towards the end of the decade, the grammatico-logical approach helped us with powerful general-purpose sentence processors like the Core Language Engine and Discourse Representation Theory, which offered a means of tackling more extended discourse.
This phase also brought practical resources and tools like parsers, e.g. the Alvey Natural Language Tools, along with more operational and commercial systems, e.g. for database query.
The work on the lexicon in the 1980s also pointed in the direction of the grammatico-logical approach.
6. Applications of NLP
• Natural Language Processing can be applied to various areas such as:
• Machine Translation,
• Email Spam detection,
• Information Extraction,
• Summarization,
• Question Answering etc.
6.1 Machine Translation
6.2 Text Categorization
• Categorization systems take as input a large flow of data, such as official documents, military casualty reports, market data, newswires etc., and assign the items to predefined categories or indices.
• Some companies have been using categorization systems to categorize trouble tickets or complaint requests and route them to the appropriate desks.
• Another application of text categorization is email spam filters. Spam filters are becoming important as the first line of defense against unwanted emails.
• The false-negative and false-positive issues of spam filters are at the heart of NLP technology; they come down to the challenge of extracting meaning from strings of text.
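A toy sketch of such a spam/ham categorizer: a word-count naive Bayes classifier written from scratch in Python. The training messages and labels below are invented for illustration; real filters train on large labeled corpora.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (label, text). Returns per-label word counts and doc totals."""
    counts, totals = {}, Counter()
    for label, text in docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Pick the label maximizing log P(label) + sum of log P(word | label)."""
    best, best_score = None, float("-inf")
    n_docs = sum(totals.values())
    for label, words in counts.items():
        vocab = sum(words.values())
        score = math.log(totals[label] / n_docs)
        for w in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the product.
            score += math.log((words[w] + 1) / (vocab + len(words) + 1))
        if score > best_score:
            best, best_score = label, score
    return best

# Invented training data for illustration only.
docs = [("spam", "win money now"), ("spam", "free money offer"),
        ("ham", "meeting agenda attached"), ("ham", "lunch tomorrow")]
counts, totals = train(docs)
print(classify("free money", counts, totals))  # spam
```

The false-negative/false-positive trade-off mentioned above shows up directly here: the smoothing constant and the training data decide how aggressively borderline messages are pushed to one label or the other.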
Ambiguity and Uncertainty in Language
Cont…
4. Anaphoric Ambiguity
This kind of ambiguity arises due to the use of anaphoric entities in discourse.
For example: the horse ran up the hill. It was very steep. It soon got tired.
Here, the anaphoric reference of “it” in the two situations causes ambiguity.
5. Pragmatic Ambiguity
This kind of ambiguity refers to the situation where the context of a phrase gives it multiple interpretations.
In simple words, we can say that pragmatic ambiguity arises when the statement is not specific.
For example, the sentence “I like you too” can have multiple interpretations like:
I like you (just like you like me),
I like you (just like someone else does).
7. NLP Phases
Morphological Processing
• It is the first phase of NLP.
• The purpose of this phase is to break chunks of language input into
sets of tokens corresponding to paragraphs, sentences and words.
• For example, a word like “uneasy” can be broken into two sub-
word tokens as “un-easy”.
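A minimal sketch of this first phase in Python. The regular expressions here are illustrative assumptions, not a complete tokenizer:

```python
import re

def tokenize(text):
    """Break raw text into sentences, then into word tokens."""
    # Naive sentence split: terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Word tokens: runs of letters, digits, apostrophes or hyphens.
    return [re.findall(r"[A-Za-z0-9'-]+", s) for s in sentences]

print(tokenize("The horse ran up the hill. It was very steep."))
# [['The', 'horse', 'ran', 'up', 'the', 'hill'], ['It', 'was', 'very', 'steep']]
```

Splitting “uneasy” into “un-easy” would need an additional morphological step (see the affix discussion later in the chapter); this sketch stops at sentence and word tokens.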
Syntax Analysis
• It is the second phase of NLP.
• The purpose of this phase is twofold:
• To check whether a sentence is well formed, and
• To break it up into a structure that shows the syntactic relationships between the different words.
• For example, a sentence like “The school goes to the boy” would be rejected by the syntax analyzer or parser.
Semantic Analysis
Linguistic Resources
2. Linguistic Resources
Corpus
A corpus is a large and structured set of machine-readable texts
that have been produced in a natural communicative setting.
Its plural is corpora. Corpora can be derived in different ways, like text that was originally electronic, transcripts of spoken language, optical character recognition output, etc.
Elements of Corpus Design
Language is infinite but a corpus has to be finite in size.
For the corpus to be finite in size, we need to sample and
proportionally include a wide range of text types to ensure a
good corpus design.
Corpus Representativeness
• Representativeness is a defining feature of corpus design.
• The following definitions from two great researchers, Leech and Biber, will help us understand corpus representativeness:
• According to Leech (1991), “A corpus is thought to be
representative of the language variety it is supposed to represent if
the findings based on its contents can be generalized to the said
language variety”.
• According to Biber (1993), “Representativeness refers to the
extent to which a sample includes the full range of variability in a
population”.
In this way, we can conclude that the representativeness of a corpus is determined by the following two factors:
Balance – the range of genres included in a corpus.
Sampling – how the chunks for each genre are selected.
Corpus Balance
Sampling
Corpus Size
• Another important element of corpus design is its size.
• How large should the corpus be?
There is no specific answer to this question. The size of the
corpus depends upon the purpose for which it is intended as well
as on some practical considerations as follows:
Kind of query anticipated from the user.
The methodology used by the users to study the data.
Availability of the source of data.
With the advancement in technology, the corpus size also
increases.
The following table of comparison will help you
understand how the corpus size works:
Corpus size
TreeBank Corpus
Cont…
• POS tagging is the task of labelling each word in a sentence with its appropriate part of speech.
• Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories.
• Most POS tagging falls under Rule-Based POS tagging, Stochastic POS tagging or Transformation-Based tagging.
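A minimal sketch of the rule-based approach: look each word up in a small hand-built lexicon, then fall back on suffix rules. The lexicon, tag names and rules here are illustrative assumptions, far smaller than any real tagger's:

```python
# Tiny hand-built lexicon mapping known words to coarse tags.
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "runs": "VERB",
           "quickly": "ADV", "happy": "ADJ"}

def rule_based_tag(words):
    """Tag each word: lexicon lookup first, then simple suffix rules."""
    tags = []
    for w in words:
        lw = w.lower()
        if lw in LEXICON:
            tags.append((w, LEXICON[lw]))
        elif lw.endswith("ly"):
            tags.append((w, "ADV"))   # adverbs often end in -ly
        elif lw.endswith("ing") or lw.endswith("ed"):
            tags.append((w, "VERB"))  # participle/past-tense endings
        else:
            tags.append((w, "NOUN"))  # default to the open noun class
    return tags

print(rule_based_tag(["The", "dog", "barked", "loudly"]))
# [('The', 'DET'), ('dog', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]
```

Stochastic taggers replace the hand-written rules with probabilities learned from a tagged corpus, and transformation-based taggers learn correction rules automatically; this sketch only covers the rule-based idea.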
Types of TreeBank Corpus
Cont..
B. Syntactic Treebanks
Opposite to the semantic Treebanks, inputs to the syntactic Treebank systems are expressions of the formal language obtained from the conversion of parsed Treebank data.
• The outputs of such systems are predicate-logic-based meaning representations.
• Various syntactic Treebanks in different languages have been created so far.
• For example,
• The Penn Arabic Treebank and the Columbia Arabic Treebank are syntactic Treebanks created in the Arabic language.
• The Sinica syntactic Treebank was created in the Chinese language.
• The Lucy, Susanne and BLLIP WSJ syntactic corpora were created in the English language.
Applications of TreeBank Corpus
• The following are some of the applications of TreeBanks:
In Computational Linguistics
• The best use of TreeBanks is to engineer state-of-the-art natural
language processing systems such as part-of-speech taggers,
parsers, semantic analyzers and machine translation systems.
In Corpus Linguistics
• In the case of corpus linguistics, the best use of Treebanks is to study syntactic phenomena.
In Theoretical Linguistics and Psycholinguistics
• The best use of Treebanks in theoretical and psycholinguistics
is interaction evidence.
PropBank Corpus
• PropBank, more specifically called “Proposition Bank”, is a corpus which is annotated with verbal propositions and their arguments.
• The corpus is a verb-oriented resource; the annotations here are more closely related to the syntactic level.
• Martha Palmer et al., Department of Linguistics, University of Colorado Boulder, developed it.
• We can use the term PropBank as a common noun referring
to any corpus that has been annotated with propositions
and their arguments.
• In Natural Language Processing (NLP), the PropBank
project has played a very significant role. It helps in
semantic role labeling.
VerbNet (VN)
WordNet
• Its structure makes it very useful for natural language
processing (NLP).
• In information systems, WordNet is used for various
purposes like word-sense disambiguation, information
retrieval, automatic text classification and machine
translation.
• One of the most important uses of WordNet is to find out
the similarity among words.
• For this task, various algorithms have been implemented
in various packages like Similarity in Perl, NLTK in
Python and ADW in Java.
Chapter 3
Word Level Analysis
Regular Expressions
Properties of Regular Expressions
Examples of Regular Expressions
Regular Sets & Their Properties
Cont..
• If we do the reversal of regular sets, then the resulting set would also be regular.
• If we take the closure of regular sets, then the resulting set would also be regular.
• If we do the concatenation of two regular sets, then the resulting set would also be regular.
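The closure and concatenation properties can be illustrated with Python's `re` module: concatenating two regular expressions, or applying the Kleene star to one, still yields a regular expression. The patterns below are toy assumptions:

```python
import re

a = r"ab"       # describes the regular set {ab}
b = r"(c|d)"    # describes the regular set {c, d}

concat = a + b          # concatenation: {abc, abd} is still regular
closure = f"({a})*"     # Kleene closure: {empty, ab, abab, ...} is still regular

assert re.fullmatch(concat, "abd")
assert re.fullmatch(closure, "ababab")
assert re.fullmatch(closure, "")   # the closure always contains the empty string
```

Reversal has no direct `re` operator, but reversing each pattern by hand (here `ba` and `(c|d)`) again gives an ordinary regular expression, which is the point of the property.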
Finite State Automata
Types of Finite State Automata (FSA)
Non-deterministic Finite Automaton (NDFA)
Morphological Parsing
Stems
• It is the core meaningful unit of a word. We can also say
that it is the root of the word.
• For example, in the word foxes, the stem is fox.
• Affixes − As the name suggests, they add some additional meaning and grammatical functions to words. For example, in the word foxes, the affix is -es.
• Further, affixes can also be divided into following four
types −
• Prefixes − As the name suggests, prefixes precede the
stem. For example, in the word unbuckle, un is the
prefix.
• Suffixes − As the name suggests, suffixes follow the
stem. For example, in the word cats, -s is the suffix.
Cont…
• Infixes − As the name suggests, infixes are inserted inside the stem. For example, the word cupful can be pluralized as cupsful by using -s- as the infix.
• Circumfixes − They precede and follow the stem. There are very few examples of circumfixes in the English language.
• A very common example is ‘A-ing’, where A- precedes and -ing follows the stem.
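The prefix/stem/suffix decomposition above can be sketched as a toy affix stripper. The affix lists are small illustrative assumptions, nothing like a full English morphology:

```python
# Illustrative affix inventories, not a complete list.
PREFIXES = ["un", "re"]
SUFFIXES = ["es", "s", "ing", "ed"]

def split_affixes(word):
    """Return (prefix, stem, suffix); empty strings where no affix matches."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

print(split_affixes("foxes"))     # ('', 'fox', 'es')
print(split_affixes("unbuckle"))  # ('un', 'buckle', '')
```

Listing "es" before "s" matters: the longer suffix must be tried first, a first taste of the morphotactic ordering questions discussed below.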
Word Order
• The order of the words would be decided by morphological parsing. Let us now see the requirements for building a morphological parser:
• Lexicon − The very first requirement for building a morphological parser is a lexicon, which includes the list of stems and affixes along with basic information about them.
• For example, information like whether the stem is a noun stem or a verb stem, etc.
Morphotactics
• It is basically the model of morpheme ordering.
• In other words, it is the model explaining which classes of morphemes can follow other classes of morphemes inside a word.
• For example, the morphotactic fact is that the English plural
morpheme always follows the noun rather than preceding it.
Orthographic rules
• These spelling rules are used to model the changes
occurring in a word.
• For example, the rule of converting y to ie in a word like city+s = cities, not citys.
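The y-to-ies rule can be sketched as a small function. The rules below are a simplified assumption covering only the cases mentioned in this chapter:

```python
def pluralize(noun):
    """Apply simple English orthographic rules for pluralization."""
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"   # city -> cities, not citys
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"         # fox -> foxes
    return noun + "s"              # dog -> dogs

print(pluralize("city"))  # cities
print(pluralize("fox"))   # foxes
```

Real orthographic rule systems are usually written as finite-state transducers so that the same rules run in both directions (generation and parsing).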
Assignment 2
What is the difference between a DFA and an NDFA?
Explain with appropriate examples.
• Deterministic Finite Automaton (DFA)
• Non-deterministic Finite Automaton (NDFA)
What are the functions of Rule-Based POS tagging, Stochastic POS tagging and Transformation-Based tagging?
Chapter 4
Syntactic Analysis
Syntactic analysis
• Syntactic analysis, or parsing, or syntax analysis is the third phase of NLP.
• The purpose of this phase is to draw exact meaning, or dictionary meaning, from the text.
• Syntax analysis checks the text for meaningfulness by comparing it to the rules of formal grammar.
• For example, a phrase like “hot ice-cream” would be rejected by a semantic analyzer.
• In this sense, syntactic analysis or parsing may be defined as the process of analyzing the strings of symbols in natural language conforming to the rules of formal grammar.
• The word ‘parsing’ originates from the Latin word ‘pars’, which means ‘part’.
Concept of Parser
• It is used to implement the task of parsing.
• It may be defined as the software component designed to take input data (text) and give a structural representation of the input after checking for correct syntax as per formal grammar.
• It also builds a data structure, generally in the form of a parse tree, an abstract syntax tree or another hierarchical structure.
The main roles of the parser
A. Top-down Parsing
• In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol into the input.
• The most common form of top-down parsing uses a recursive procedure to process the input.
• The main disadvantage of recursive descent parsing is backtracking.
B. Bottom-up Parsing
• In this kind of parsing, the parser starts with the input symbol and tries to construct the parse tree up to the start symbol.
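A minimal recursive-descent (top-down) sketch for a toy grammar S → NP VP, NP → DET N, VP → V. The grammar and lexicon are illustrative assumptions; note that it accepts "the school goes" because, as discussed earlier, a parser checks form, not meaning:

```python
# Toy lexicon: word category -> words of that category.
LEXICON = {"DET": {"the", "a"}, "N": {"school", "boy", "dog"}, "V": {"goes", "runs"}}

def parse(tokens):
    """Top-down: expand S -> NP VP -> DET N V and match against the input."""
    pos = 0
    def match(cat):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in LEXICON[cat]:
            pos += 1
            return True
        return False
    # S -> NP VP, NP -> DET N, VP -> V, expanded left to right.
    ok = match("DET") and match("N") and match("V")
    return ok and pos == len(tokens)

print(parse(["the", "school", "goes"]))  # True  (well formed)
print(parse(["school", "the", "goes"]))  # False (rejected by the parser)
```

With more than one production per non-terminal, a failed `match` would require backtracking to an earlier position and trying the next production, which is exactly the disadvantage noted above.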
Concept of Derivation
• In order to get the input string, we need a sequence of production rules.
• A derivation is a sequence of production rules used to get the input string.
• During parsing, we need to decide which non-terminal is to be replaced, as well as which production rule will replace it.
Types of Derivation
• In this section, we will learn about the two types of derivations, which can be used to decide which non-terminal is to be replaced with a production rule.
Left-most Derivation
• In the left-most derivation, the sentential form of an input is scanned and replaced from left to right.
• The sentential form in this case is called the left-sentential form.
Right-most Derivation
• In the right-most derivation, the sentential form of an input is scanned and replaced from right to left.
• The sentential form in this case is called the right-sentential form.
Concept of Parse Tree
• It may be defined as the graphical depiction of a derivation.
• The start symbol of derivation serves as the root of the
parse tree.
• In every parse tree, the leaf nodes are terminals and
interior nodes are non-terminals.
• A property of parse trees is that in-order traversal will produce the original input string.
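The leaf-traversal property can be checked on a small tree represented as nested tuples, (label, children...) for interior nodes and plain strings for terminal leaves. The parse below is a toy example assumed for illustration:

```python
# Parse tree for "the dog runs": interior nodes are (label, *children).
tree = ("S",
        ("NP", ("DET", "the"), ("N", "dog")),
        ("VP", ("V", "runs")))

def leaves(node):
    """Left-to-right (in-order) traversal collecting terminal leaves."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node[1:]:      # node[0] is the label; the rest are children
        out.extend(leaves(child))
    return out

print(" ".join(leaves(tree)))  # the dog runs
```

The root is the start symbol S, the interior nodes are non-terminals, and the leaves read back the original input, matching all three properties listed above.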
Concept of Grammar
Cont…
• A mathematical model of grammar was given by Noam
Chomsky in 1956, which is effective for writing
computer languages.
• Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P), where:
N or VN = set of non-terminal symbols, i.e., variables.
T or ∑ = set of terminal symbols.
S = start symbol, where S ∈ N.
P denotes the production rules for terminals as well as non-terminals. It has the form α → β, where α and β are strings on VN ∪ ∑ and at least one symbol of α belongs to VN.
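The 4-tuple can be written out directly for a toy grammar. The specific sets and productions are illustrative assumptions:

```python
# G = (N, T, S, P) for a toy grammar generating "the dog runs".
N = {"S", "NP", "VP"}                  # non-terminal symbols (variables)
T = {"the", "dog", "runs"}             # terminal symbols
S = "S"                                # start symbol, S is in N
P = {                                  # production rules alpha -> beta
    "S":  [["NP", "VP"]],
    "NP": [["the", "dog"]],
    "VP": [["runs"]],
}

def expand(symbol):
    """Left-most expansion using the first production for each non-terminal."""
    if symbol in T:
        return [symbol]
    out = []
    for s in P[symbol][0]:
        out.extend(expand(s))
    return out

print(" ".join(expand(S)))  # the dog runs
```

Each key of `P` is a single non-terminal, so this particular G is context-free; the general form α → β allows more than one symbol on the left.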
Phrase Structure or Constituency Grammar
• Phrase structure grammar, introduced by Noam
Chomsky, is based on the constituency relation.
• That is why it is also called constituency grammar. It is
opposite to dependency grammar.
• All the related frameworks view the sentence structure in
terms of constituency relation.
• The constituency relation is derived from the subject-
predicate division of Latin as well as Greek grammar.
• The basic clause structure is understood in terms of noun
phrase NP and verb phrase VP.
Cont..
• We can write the sentence “This tree is illustrating the
constituency relation” as follows
Dependency Grammar
Cont..
• We can write the sentence “This tree is illustrating the dependency relation” as follows:
Definition of CFG
Cont…
Set of Productions
• It is denoted by P.
• The set defines how the terminals and non-terminals can be combined.
• Every production (P) consists of non-terminals, an arrow, and terminals (the sequence of terminals).
• Non-terminals are called the left side of the production and terminals are called the right side of the production.
Start Symbol
• The production begins from the start symbol.
• It is denoted by the symbol S.
• A non-terminal symbol is always designated as the start symbol.
End of chapter 4