Natural Language Processing (NLP) : Chapter 1: Introduction To NLP
School of Informatics
Department of Information Technology
by Akalenesh.A (MSc.)
07/06/2022, December 2021
Chapter Outline:
Introduction to NLP
Levels of NLP
Natural Language Generation
History of NLP
Applications of NLP
Study of Human Languages
Ambiguity and Uncertainty in Language
NLP Phases
Chapter 1
1. Introduction to Natural Language Processing
1. Introduction
• Language is a method of communication with the help of
which we can speak, read and write.
• A language can be defined as a set of rules or set of symbols.
• Symbols are combined and used for conveying or broadcasting information. Symbols are governed by the rules.
• For example, we think, we make decisions, plans and more in
natural language; precisely, in words.
• Can human beings communicate with computers in their
natural language?
• It is a challenge for us to develop NLP applications because :
• Computers need structured data,
• but human speech is unstructured and often ambiguous in
nature.
Classification of NLP
4. Syntactic
3. Natural Language Generation
4. Components of NLG
A. Speaker and Generator
5. History of NLP
We have divided the history of NLP into four phases. The
phases have distinctive concerns and styles.
First Phase (Machine Translation Phase): Late 1940s to late 1960s. The work done in this phase focused mainly on machine translation (MT).
This phase was a period of enthusiasm and optimism.
The research on NLP started in the early 1950s after Booth & Richens’ investigation and Weaver’s memorandum on machine translation in 1949.
1954 was the year when a limited experiment on automatic translation from Russian to English was demonstrated in the Georgetown-IBM experiment.
Second Phase (AI-Influenced Phase): Late 1960s to late 1970s
• In this phase, the work done related mainly to world knowledge and its role in the construction and manipulation of meaning representations.
• That is why this phase is also called the AI-flavored phase.
In early 1961, work began on the problems of addressing and constructing data or knowledge bases. This work was influenced by AI.
In the same year, the BASEBALL question-answering system was also developed. The input to this system was restricted and the language processing involved was simple.
A much more advanced system was described in Minsky (1968). Compared to the BASEBALL question-answering system, this system recognized and provided for the need for inference on the knowledge base in interpreting and responding to language input.
Third Phase (Grammatico-logical Phase): Late 1970s to late 1980s
• This phase can be described as the grammatico-logical phase.
• Due to the failure of practical system building in the last phase, researchers moved towards the use of logic for knowledge representation and reasoning in AI.
Towards the end of the decade, the grammatico-logical approach helped us with powerful general-purpose sentence processors like the Core Language Engine and Discourse Representation Theory, which offered a means of tackling more extended discourse.
This phase also brought practical resources and tools like parsers, e.g. the Alvey Natural Language Tools, along with more operational and commercial systems, e.g. for database query.
The work on the lexicon in the 1980s also pointed in the direction of the grammatico-logical approach.
6. Applications of NLP
• Natural Language Processing can be applied to various areas such as:
• Machine Translation,
• Email Spam detection,
• Information Extraction,
• Summarization,
• Question Answering etc.
6.1 Machine Translation
6.2 Text Categorization
• Categorization systems take as input a large flow of data, such as official documents, military casualty reports, market data, newswires etc., and assign the items to predefined categories or indices.
• Some companies have been using categorization systems to categorize trouble tickets or complaint requests and route them to the appropriate desks.
• Another application of text categorization is email spam filters. Spam filters are becoming important as the first line of defense against unwanted emails.
• The false-negative and false-positive issues of spam filters are at the heart of NLP technology; they come down to the challenge of extracting meaning from strings of text.
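A toy sketch of such a spam/ham categorizer: a word-count naive Bayes classifier written from scratch in Python. The training messages and labels below are invented for illustration; real filters train on large labeled corpora.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (label, text). Returns per-label word counts and doc totals."""
    counts, totals = {}, Counter()
    for label, text in docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Pick the label maximizing log P(label) + sum of log P(word | label)."""
    best, best_score = None, float("-inf")
    n_docs = sum(totals.values())
    for label, words in counts.items():
        vocab = sum(words.values())
        score = math.log(totals[label] / n_docs)
        for w in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the product.
            score += math.log((words[w] + 1) / (vocab + len(words) + 1))
        if score > best_score:
            best, best_score = label, score
    return best

# Invented training data for illustration only.
docs = [("spam", "win money now"), ("spam", "free money offer"),
        ("ham", "meeting agenda attached"), ("ham", "lunch tomorrow")]
counts, totals = train(docs)
print(classify("free money", counts, totals))  # spam
```

The false-negative/false-positive trade-off mentioned above shows up directly here: the smoothing constant and the training data decide how aggressively borderline messages are pushed to one label or the other.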
Ambiguity and Uncertainty in Language
Cont…
4. Anaphoric Ambiguity
This kind of ambiguity arises due to the use of anaphoric entities in discourse.
For example: the horse ran up the hill. It was very steep. It soon got tired.
Here, the anaphoric reference of “it” in the two situations causes ambiguity.
5. Pragmatic Ambiguity
This kind of ambiguity refers to the situation where the context of a phrase gives it multiple interpretations.
In simple words, we can say that pragmatic ambiguity arises when the statement is not specific.
For example, the sentence “I like you too” can have multiple interpretations like:
I like you (just like you like me),
I like you (just like someone else does).
7. NLP Phases
Morphological Processing
• It is the first phase of NLP.
• The purpose of this phase is to break chunks of language input into
sets of tokens corresponding to paragraphs, sentences and words.
• For example, a word like “uneasy” can be broken into two sub-
word tokens as “un-easy”.
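A minimal sketch of this first phase in Python. The regular expressions here are illustrative assumptions, not a complete tokenizer:

```python
import re

def tokenize(text):
    """Break raw text into sentences, then into word tokens."""
    # Naive sentence split: terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Word tokens: runs of letters, digits, apostrophes or hyphens.
    return [re.findall(r"[A-Za-z0-9'-]+", s) for s in sentences]

print(tokenize("The horse ran up the hill. It was very steep."))
# [['The', 'horse', 'ran', 'up', 'the', 'hill'], ['It', 'was', 'very', 'steep']]
```

Splitting “uneasy” into “un-easy” would need an additional morphological step (see the affix discussion later in the chapter); this sketch stops at sentence and word tokens.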
Syntax Analysis
• It is the second phase of NLP.
• The purpose of this phase is twofold:
• To check whether a sentence is well formed, and
• To break it up into a structure that shows the syntactic relationships between the different words.
• For example, a sentence like “The school goes to the boy” would be rejected by the syntax analyzer or parser.
Semantic Analysis
Linguistic Resources
2. Linguistic Resources
Corpus
A corpus is a large and structured set of machine-readable texts
that have been produced in a natural communicative setting.
Its plural is corpora. Corpora can be derived in different ways, like text that was originally electronic, transcripts of spoken language, optical character recognition output, etc.
Elements of Corpus Design
Language is infinite but a corpus has to be finite in size.
For the corpus to be finite in size, we need to sample and
proportionally include a wide range of text types to ensure a
good corpus design.
Corpus Representativeness
• Representativeness is a defining feature of corpus design.
• The following definitions from two great researchers, Leech and Biber, will help us understand corpus representativeness:
• According to Leech (1991), “A corpus is thought to be
representative of the language variety it is supposed to represent if
the findings based on its contents can be generalized to the said
language variety”.
• According to Biber (1993), “Representativeness refers to the
extent to which a sample includes the full range of variability in a
population”.
In this way, we can conclude that the representativeness of a corpus is determined by the following two factors:
Balance – the range of genres included in a corpus.
Sampling – how the chunks for each genre are selected.
Corpus Balance
Sampling
Corpus Size
• Another important element of corpus design is its size.
• How large should the corpus be?
There is no specific answer to this question. The size of the
corpus depends upon the purpose for which it is intended as well
as on some practical considerations as follows:
Kind of query anticipated from the user.
The methodology used by the users to study the data.
Availability of the source of data.
With the advancement in technology, the corpus size also
increases.
The following table of comparison will help you
understand how the corpus size works:
Corpus size
TreeBank Corpus
Cont…
• POS tagging is the task of labelling each word in a sentence with its appropriate part of speech.
• Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories.
• Most POS tagging falls under Rule-Based POS tagging, Stochastic POS tagging or Transformation-Based tagging.
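A minimal sketch of the rule-based approach: look each word up in a small hand-built lexicon, then fall back on suffix rules. The lexicon, tag names and rules here are illustrative assumptions, far smaller than any real tagger's:

```python
# Tiny hand-built lexicon mapping known words to coarse tags.
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "runs": "VERB",
           "quickly": "ADV", "happy": "ADJ"}

def rule_based_tag(words):
    """Tag each word: lexicon lookup first, then simple suffix rules."""
    tags = []
    for w in words:
        lw = w.lower()
        if lw in LEXICON:
            tags.append((w, LEXICON[lw]))
        elif lw.endswith("ly"):
            tags.append((w, "ADV"))   # adverbs often end in -ly
        elif lw.endswith("ing") or lw.endswith("ed"):
            tags.append((w, "VERB"))  # participle/past-tense endings
        else:
            tags.append((w, "NOUN"))  # default to the open noun class
    return tags

print(rule_based_tag(["The", "dog", "barked", "loudly"]))
# [('The', 'DET'), ('dog', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]
```

Stochastic taggers replace the hand-written rules with probabilities learned from a tagged corpus, and transformation-based taggers learn correction rules automatically; this sketch only covers the rule-based idea.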
Types of TreeBank Corpus
Cont..
B. Syntactic Treebanks
Opposite to the semantic Treebanks, inputs to the syntactic Treebank systems are expressions of the formal language obtained from the conversion of parsed Treebank data.
• The outputs of such systems are predicate-logic-based meaning representations.
• Various syntactic Treebanks in different languages have been created so far.
• For example,
• The Penn Arabic Treebank and the Columbia Arabic Treebank are syntactic Treebanks created in the Arabic language.
• The Sinica syntactic Treebank was created in the Chinese language.
• The Lucy, Susanne and BLLIP WSJ syntactic corpora were created in the English language.
Applications of TreeBank Corpus
• The following are some of the applications of TreeBanks:
In Computational Linguistics
• The best use of TreeBanks is to engineer state-of-the-art natural
language processing systems such as part-of-speech taggers,
parsers, semantic analyzers and machine translation systems.
In Corpus Linguistics
• In the case of corpus linguistics, the best use of Treebanks is to study syntactic phenomena.
In Theoretical Linguistics and Psycholinguistics
• The best use of Treebanks in theoretical and psycholinguistics
is interaction evidence.
PropBank Corpus
• PropBank, more specifically called “Proposition Bank”, is a corpus which is annotated with verbal propositions and their arguments.
• The corpus is a verb-oriented resource; the annotations here are more closely related to the syntactic level.
• Martha Palmer et al., Department of Linguistics, University of Colorado Boulder, developed it.
• We can use the term PropBank as a common noun referring
to any corpus that has been annotated with propositions
and their arguments.
• In Natural Language Processing (NLP), the PropBank
project has played a very significant role. It helps in
semantic role labeling.
VerbNet (VN)
WordNet
• Its structure makes it very useful for natural language
processing (NLP).
• In information systems, WordNet is used for various
purposes like word-sense disambiguation, information
retrieval, automatic text classification and machine
translation.
• One of the most important uses of WordNet is to find out
the similarity among words.
• For this task, various algorithms have been implemented
in various packages like Similarity in Perl, NLTK in
Python and ADW in Java.
Chapter 3
Word Level Analysis
Regular Expressions
Properties of Regular Expressions
Examples of Regular Expressions
Regular Sets & Their Properties
Cont..
• If we do the reversal of regular sets, then the resulting set would also be regular.
• If we take the closure of regular sets, then the resulting set would also be regular.
• If we do the concatenation of two regular sets, then the resulting set would also be regular.
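The closure and concatenation properties can be illustrated with Python's `re` module: concatenating two regular expressions, or applying the Kleene star to one, still yields a regular expression. The patterns below are toy assumptions:

```python
import re

a = r"ab"       # describes the regular set {ab}
b = r"(c|d)"    # describes the regular set {c, d}

concat = a + b          # concatenation: {abc, abd} is still regular
closure = f"({a})*"     # Kleene closure: {empty, ab, abab, ...} is still regular

assert re.fullmatch(concat, "abd")
assert re.fullmatch(closure, "ababab")
assert re.fullmatch(closure, "")   # the closure always contains the empty string
```

Reversal has no direct `re` operator, but reversing each pattern by hand (here `ba` and `(c|d)`) again gives an ordinary regular expression, which is the point of the property.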
Finite State Automata
Types of Finite State Automata (FSA)
Non-deterministic Finite Automaton (NDFA)
Morphological Parsing
Stems
• It is the core meaningful unit of a word. We can also say
that it is the root of the word.
• For example, in the word foxes, the stem is fox.
• Affixes − As the name suggests, they add some additional meaning and grammatical functions to words. For example, in the word foxes, the affix is -es.
• Further, affixes can also be divided into following four
types −
• Prefixes − As the name suggests, prefixes precede the
stem. For example, in the word unbuckle, un is the
prefix.
• Suffixes − As the name suggests, suffixes follow the
stem. For example, in the word cats, -s is the suffix.
Cont…
• Infixes − As the name suggests, infixes are inserted inside the stem. For example, the word cupful can be pluralized as cupsful by using -s- as the infix.
• Circumfixes − They precede and follow the stem. There are very few examples of circumfixes in the English language.
• A very common example is ‘A-ing’, where A- precedes and -ing follows the stem.
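The prefix/stem/suffix decomposition above can be sketched as a toy affix stripper. The affix lists are small illustrative assumptions, nothing like a full English morphology:

```python
# Illustrative affix inventories, not a complete list.
PREFIXES = ["un", "re"]
SUFFIXES = ["es", "s", "ing", "ed"]

def split_affixes(word):
    """Return (prefix, stem, suffix); empty strings where no affix matches."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

print(split_affixes("foxes"))     # ('', 'fox', 'es')
print(split_affixes("unbuckle"))  # ('un', 'buckle', '')
```

Listing "es" before "s" matters: the longer suffix must be tried first, a first taste of the morphotactic ordering questions discussed below.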
Word Order
• The order of the words would be decided by morphological parsing. Let us now see the requirements for building a morphological parser:
• Lexicon − The very first requirement for building a morphological parser is a lexicon, which includes the list of stems and affixes along with basic information about them.
• For example, information like whether the stem is a noun stem or a verb stem, etc.
Morphotactics
• It is basically the model of morpheme ordering.
• In other words, it is the model explaining which classes of morphemes can follow other classes of morphemes inside a word.
• For example, the morphotactic fact is that the English plural
morpheme always follows the noun rather than preceding it.
Orthographic rules
• These spelling rules are used to model the changes
occurring in a word.
• For example, the rule of converting y to ie in a word like city+s = cities, not citys.
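The y-to-ies rule can be sketched as a small function. The rules below are a simplified assumption covering only the cases mentioned in this chapter:

```python
def pluralize(noun):
    """Apply simple English orthographic rules for pluralization."""
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"   # city -> cities, not citys
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"         # fox -> foxes
    return noun + "s"              # dog -> dogs

print(pluralize("city"))  # cities
print(pluralize("fox"))   # foxes
```

Real orthographic rule systems are usually written as finite-state transducers so that the same rules run in both directions (generation and parsing).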
Assignment 2
What is the difference between a DFA and an NDFA?
Explain with appropriate examples.
• Deterministic Finite Automaton (DFA)
• Non-deterministic Finite Automaton (NDFA)
What are the functions of Rule-Based POS tagging, Stochastic POS tagging and Transformation-Based tagging?
Chapter 4
Syntactic Analysis
Syntactic analysis
• Syntactic analysis, or parsing, or syntax analysis is the third phase of NLP.
• The purpose of this phase is to draw exact meaning, or dictionary meaning, from the text.
• Syntax analysis checks the text for meaningfulness by comparing it to the rules of formal grammar.
• For example, a phrase like “hot ice-cream” would be rejected by a semantic analyzer.
• In this sense, syntactic analysis or parsing may be defined as the process of analyzing the strings of symbols in natural language conforming to the rules of formal grammar.
• The word ‘parsing’ originates from the Latin word ‘pars’, which means ‘part’.
Concept of Parser
• It is used to implement the task of parsing.
• It may be defined as the software component designed to take input data (text) and give a structural representation of the input after checking for correct syntax as per formal grammar.
• It also builds a data structure, generally in the form of a parse tree, an abstract syntax tree or another hierarchical structure.
The main roles of the parser
A. Top-down Parsing
• In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol into the input.
• The most common form of top-down parsing uses a recursive procedure to process the input.
• The main disadvantage of recursive descent parsing is backtracking.
B. Bottom-up Parsing
• In this kind of parsing, the parser starts with the input symbol and tries to construct the parse tree up to the start symbol.
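A minimal recursive-descent (top-down) sketch for a toy grammar S → NP VP, NP → DET N, VP → V. The grammar and lexicon are illustrative assumptions; note that it accepts "the school goes" because, as discussed earlier, a parser checks form, not meaning:

```python
# Toy lexicon: word category -> words of that category.
LEXICON = {"DET": {"the", "a"}, "N": {"school", "boy", "dog"}, "V": {"goes", "runs"}}

def parse(tokens):
    """Top-down: expand S -> NP VP -> DET N V and match against the input."""
    pos = 0
    def match(cat):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in LEXICON[cat]:
            pos += 1
            return True
        return False
    # S -> NP VP, NP -> DET N, VP -> V, expanded left to right.
    ok = match("DET") and match("N") and match("V")
    return ok and pos == len(tokens)

print(parse(["the", "school", "goes"]))  # True  (well formed)
print(parse(["school", "the", "goes"]))  # False (rejected by the parser)
```

With more than one production per non-terminal, a failed `match` would require backtracking to an earlier position and trying the next production, which is exactly the disadvantage noted above.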
Concept of Derivation
• In order to get the input string, we need a sequence of production rules.
• A derivation is a sequence of production rules used to get the input string.
• During parsing, we need to decide which non-terminal is to be replaced, as well as which production rule will replace it.
Types of Derivation
• In this section, we will learn about the two types of derivations, which can be used to decide which non-terminal is to be replaced with a production rule.
Left-most Derivation
• In the left-most derivation, the sentential form of an input is scanned and replaced from left to right.
• The sentential form in this case is called the left-sentential form.
Right-most Derivation
• In the right-most derivation, the sentential form of an input is scanned and replaced from right to left.
• The sentential form in this case is called the right-sentential form.
Concept of Parse Tree
• It may be defined as the graphical depiction of a derivation.
• The start symbol of derivation serves as the root of the
parse tree.
• In every parse tree, the leaf nodes are terminals and
interior nodes are non-terminals.
• A property of parse trees is that in-order traversal will produce the original input string.
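The leaf-traversal property can be checked on a small tree represented as nested tuples, (label, children...) for interior nodes and plain strings for terminal leaves. The parse below is a toy example assumed for illustration:

```python
# Parse tree for "the dog runs": interior nodes are (label, *children).
tree = ("S",
        ("NP", ("DET", "the"), ("N", "dog")),
        ("VP", ("V", "runs")))

def leaves(node):
    """Left-to-right (in-order) traversal collecting terminal leaves."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node[1:]:      # node[0] is the label; the rest are children
        out.extend(leaves(child))
    return out

print(" ".join(leaves(tree)))  # the dog runs
```

The root is the start symbol S, the interior nodes are non-terminals, and the leaves read back the original input, matching all three properties listed above.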
Concept of Grammar
Cont…
• A mathematical model of grammar was given by Noam
Chomsky in 1956, which is effective for writing
computer languages.
• Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P), where:
N or VN = set of non-terminal symbols, i.e., variables.
T or ∑ = set of terminal symbols.
S = start symbol, where S ∈ N.
P denotes the production rules for terminals as well as non-terminals. It has the form α → β, where α and β are strings on VN ∪ ∑ and at least one symbol of α belongs to VN.
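The 4-tuple can be written out directly for a toy grammar. The specific sets and productions are illustrative assumptions:

```python
# G = (N, T, S, P) for a toy grammar generating "the dog runs".
N = {"S", "NP", "VP"}                  # non-terminal symbols (variables)
T = {"the", "dog", "runs"}             # terminal symbols
S = "S"                                # start symbol, S is in N
P = {                                  # production rules alpha -> beta
    "S":  [["NP", "VP"]],
    "NP": [["the", "dog"]],
    "VP": [["runs"]],
}

def expand(symbol):
    """Left-most expansion using the first production for each non-terminal."""
    if symbol in T:
        return [symbol]
    out = []
    for s in P[symbol][0]:
        out.extend(expand(s))
    return out

print(" ".join(expand(S)))  # the dog runs
```

Each key of `P` is a single non-terminal, so this particular G is context-free; the general form α → β allows more than one symbol on the left.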
Phrase Structure or Constituency Grammar
• Phrase structure grammar, introduced by Noam
Chomsky, is based on the constituency relation.
• That is why it is also called constituency grammar. It is
opposite to dependency grammar.
• All the related frameworks view the sentence structure in
terms of constituency relation.
• The constituency relation is derived from the subject-
predicate division of Latin as well as Greek grammar.
• The basic clause structure is understood in terms of noun
phrase NP and verb phrase VP.
Cont..
• We can write the sentence “This tree is illustrating the
constituency relation” as follows
Dependency Grammar
Cont..
• We can write the sentence “This tree is illustrating the dependency relation” as follows:
Definition of CFG
Cont…
Set of Productions
• It is denoted by P.
• The set defines how the terminals and non-terminals can be combined.
• Every production (P) consists of non-terminals, an arrow, and terminals (the sequence of terminals).
• Non-terminals are called the left side of the production and terminals are called the right side of the production.
Start Symbol
• The production begins from the start symbol.
• It is denoted by the symbol S.
• A non-terminal symbol is always designated as the start symbol.
End of chapter 4