Natural Language Processing Tutorial
Audience
This tutorial is designed to benefit graduates, postgraduates, and research students who
either have an interest in this subject or have this subject as a part of their curriculum.
The reader can be a beginner or an advanced learner.
Prerequisites
The reader must have a basic knowledge of Artificial Intelligence. He/she should also be
aware of the basic terminology used in English grammar and of Python programming
concepts.
All the content and graphics published in this e-book are the property of Tutorials Point (I)
Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing
or republishing any contents or a part of the contents of this e-book in any manner without
the written consent of the publisher.
We strive to update the contents of our website and tutorials as timely and as precisely as
possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt.
Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our
website or its contents, including this tutorial. If you discover any errors on our website or
in this tutorial, please notify us at [email protected]
1. Natural Language Processing — Introduction
Language is a method of communication with the help of which we can speak, read and
write. For example, we think, make decisions and plan in natural language; precisely, in
words. However, the big question that confronts us in this AI era is whether we can
communicate with computers in a similar manner. In other words, can human beings
communicate with computers in their natural language? Developing NLP applications is a
challenge because computers need structured data, whereas human speech is unstructured
and often ambiguous in nature.
In this sense, we can say that Natural Language Processing (NLP) is the sub-field of
Computer Science, especially Artificial Intelligence (AI), that is concerned with enabling
computers to understand and process human language. Technically, the main task of NLP
is to program computers to analyze and process huge amounts of natural language data.
History of NLP
We have divided the history of NLP into four phases. The phases have distinctive concerns
and styles.
Let us now look at what the first phase covered:
Research on NLP started in the early 1950s, after Booth and Richens' investigation
and Weaver's memorandum on machine translation in 1949.
In 1954, a limited experiment on automatic translation from Russian to English was
demonstrated in the Georgetown-IBM experiment.
In the same year, publication of the journal MT (Machine Translation) started.
The first international conference on Machine Translation (MT) was held in 1952
and the second in 1956.
In early 1961, work began on the problems of addressing and constructing data or
knowledge bases. This work was influenced by AI.
A much more advanced system was described in Minsky (1968). Compared with the
BASEBALL question-answering system, it recognized and provided for the need for
inference on the knowledge base when interpreting and responding to language input.
In this phase we also got some practical resources and tools, such as parsers (e.g. the
Alvey Natural Language Tools), along with more operational and commercial systems,
e.g. for database query.
Questions such as "What is meaning?" and "How are the objects identified by the words?"
were addressed using mathematical models such as logic and model theory.
Lexical Ambiguity
The ambiguity of a single word is called lexical ambiguity. For example, the word silver
can be treated as a noun, an adjective, or a verb.
Syntactic Ambiguity
This kind of ambiguity occurs when a sentence can be parsed in different ways. For
example, consider the sentence "The man saw the girl with the telescope". It is ambiguous
whether the man saw the girl carrying a telescope or saw her through his telescope.
Semantic Ambiguity
This kind of ambiguity occurs when the meaning of the words themselves can be
misinterpreted. In other words, semantic ambiguity happens when a sentence contains an
ambiguous word or phrase. For example, the sentence "The car hit the pole while it was
moving" has semantic ambiguity because it can be interpreted as "The car, while moving,
hit the pole" or "The car hit the pole while the pole was moving".
Anaphoric Ambiguity
This kind of ambiguity arises from the use of anaphoric entities in discourse. For example,
consider "The horse ran up the hill. It was very steep. It soon got tired." Here, the anaphoric
reference of "it" in the two sentences causes ambiguity.
Pragmatic Ambiguity
This kind of ambiguity refers to a situation where the context of a phrase gives it
multiple interpretations. In simple words, we can say that pragmatic ambiguity arises
when a statement is not specific. For example, the sentence "I like you too" can have
multiple interpretations: I like you (just as you like me), or I like you (just as someone
else does).
NLP Phases
The following diagram shows the phases or logical steps in natural language processing:
[Diagram: Input sentence → Morphological Processing (using a lexicon) → Syntax Analysis (using a grammar) → Semantic Analysis (using semantic rules) → Pragmatic Analysis (using contextual information) → Target representation]
Morphological Processing
It is the first phase of NLP. The purpose of this phase is to break chunks of language input
into sets of tokens corresponding to paragraphs, sentences and words. For example, a
word like "uneasy" can be broken into two sub-word tokens, "un" and "easy".
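As a minimal sketch of this step (assuming Python with NLTK installed and the 'punkt' tokenizer data downloaded), the following code splits raw text into sentence and word tokens; sub-word segmentation such as "un" + "easy" would need a separate morphological analyzer and is not shown here.

# Minimal tokenization sketch using NLTK (assumes nltk is installed and
# nltk.download('punkt') has been run for the sentence tokenizer data).
from nltk.tokenize import sent_tokenize, word_tokenize

text = "The team felt uneasy. They revised the plan quickly."

sentences = sent_tokenize(text)                 # sentence-level tokens
words = [word_tokenize(s) for s in sentences]   # word-level tokens per sentence

print(sentences)   # ['The team felt uneasy.', 'They revised the plan quickly.']
print(words)       # [['The', 'team', 'felt', 'uneasy', '.'], ...]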
Syntax Analysis
It is the second phase of NLP. The purpose of this phase is twofold: to check whether a
sentence is well formed, and to break it up into a structure that shows the syntactic
relationships between the different words. For example, a sentence like "The school
goes to the boy" would be rejected by the syntax analyzer or parser.
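As an illustration only, the sketch below (assuming NLTK is installed; the tiny grammar and the sentence are our own assumptions, not part of NLTK) parses a sentence with NLTK's chart parser and prints its syntactic structure; a token sequence the grammar cannot derive simply yields no parse trees.

# Toy syntax analysis with NLTK's chart parser.
import nltk

# A small hand-written grammar used only for this example.
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N  -> 'boy' | 'girl' | 'telescope'
V  -> 'saw'
""")

parser = nltk.ChartParser(grammar)
tokens = "the boy saw the girl".split()

for tree in parser.parse(tokens):   # ill-formed input produces no trees
    tree.pretty_print()             # draws the syntactic structure as text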
Semantic Analysis
It is the third phase of NLP. The purpose of this phase is to draw the exact meaning, or the
dictionary meaning, from the text. The text is checked for meaningfulness. For example,
a semantic analyzer would reject a sentence like "hot ice-cream".
Pragmatic Analysis
It is the fourth phase of NLP. Pragmatic analysis fits the actual objects or events that exist
in a given context to the object references obtained during the previous phase (semantic
analysis). For example, the sentence "Put the banana in the basket on the shelf" can have
two semantic interpretations, and the pragmatic analyzer will choose between these two
possibilities.
2. Natural Language Processing — Linguistic Resources
In this chapter, we will learn about the linguistic resources in Natural Language Processing.
Corpus
A corpus is a large and structured set of machine-readable texts that have been produced
in a natural communicative setting. Its plural is corpora. Corpora can be derived in different
ways, for example from text that was originally electronic, from transcripts of spoken
language, from optical character recognition, and so on.
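As a small practical illustration (assuming NLTK is installed and the Brown corpus data has been downloaded), the following sketch inspects a ready-made machine-readable corpus shipped with NLTK.

# Inspecting the Brown corpus with NLTK (assumes nltk.download('brown')).
from nltk.corpus import brown

print(brown.categories()[:5])               # a few of the genre categories
print(len(brown.words()))                   # total number of word tokens
print(brown.words(categories='news')[:10])  # first tokens of the 'news' genre
print(brown.sents(categories='news')[0])    # first sentence of the 'news' genre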
Let us now learn about some important elements for corpus design:
Corpus Representativeness
Representativeness is a defining feature of corpus design. Researchers such as Leech and
Biber have offered definitions that help us understand corpus representativeness. From
these, we can conclude that the representativeness of a corpus is determined by the
following two factors: corpus balance and sampling, both of which are discussed below.
Corpus Balance
Another very important element of corpus design is corpus balance, that is, the range of
genres included in a corpus. We have already seen that the representativeness of a general
corpus depends upon how balanced the corpus is. A balanced corpus covers a wide range
of text categories that are supposed to be representative of the language. We do not have
any reliable scientific measure of balance; rather, the best estimation and intuition apply
here. In other words, we can say that the accepted balance is determined by the intended
uses of the corpus only.
Sampling
Sampling unit: It refers to the unit on the basis of which a sample is taken. For example,
for written text, a sampling unit may be a newspaper, a journal or a book.
Corpus Size
Another important element of corpus design is its size. How large should the corpus be?
There is no specific answer to this question. The size of the corpus depends upon the
purpose for which it is intended, as well as on some practical considerations.
With the advancement of technology, corpus sizes have also increased. For example, by
the early 21st century, the Bank of English corpus had grown to about 650 million words.
TreeBank Corpus
A treebank may be defined as a linguistically parsed text corpus that annotates syntactic
or semantic sentence structure. Geoffrey Leech coined the term 'treebank', which reflects
that the most common way of representing the grammatical analysis is by means of a tree
structure. Generally, treebanks are created on top of a corpus that has already been
annotated with part-of-speech tags.
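As a hedged sketch (assuming NLTK is installed along with its 10% sample of the Penn Treebank), the following code reads one syntactically annotated sentence from that treebank.

# Reading a parsed sentence from NLTK's Penn Treebank sample
# (assumes nltk.download('treebank')).
from nltk.corpus import treebank

print(treebank.fileids()[:3])        # a few of the corpus files
tree = treebank.parsed_sents()[0]    # first syntactically annotated sentence
tree.pretty_print()                  # draw the parse tree as text
print(treebank.tagged_sents()[0])    # the same sentence with its POS tags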
Semantic Treebanks
These treebanks use a formal representation of a sentence's semantic structure. They vary
in the depth of their semantic representation. The Robot Commands Treebank, Geoquery,
the Groningen Meaning Bank and the RoboCup Corpus are some examples of semantic
treebanks.
Syntactic Treebanks
In contrast to semantic treebanks, syntactic treebanks annotate the syntactic structure of
sentences, typically in the form of parse trees, rather than meaning representations.
Various syntactic treebanks in different languages have been created so far. For example,
the Penn Arabic Treebank and the Columbia Arabic Treebank are syntactic treebanks
created for the Arabic language, the Sinica Treebank was created for Chinese, and Lucy,
Susanne and the BLLIP WSJ corpus are syntactic resources created for English.
In Computational Linguistics
In computational linguistics, the best use of treebanks is to engineer state-of-the-art
natural language processing systems such as part-of-speech taggers, parsers, semantic
analyzers and machine translation systems.
In Corpus Linguistics
In the case of corpus linguistics, the best use of treebanks is to study syntactic phenomena.
PropBank Corpus
PropBank, more specifically called "Proposition Bank", is a corpus that is annotated with
verbal propositions and their arguments. The corpus is a verb-oriented resource; the
annotations here are closely related to the syntactic level. It was developed by Martha
Palmer et al. at the Department of Linguistics, University of Colorado Boulder. The term
propbank can also be used as a common noun referring to any corpus that has been
annotated with propositions and their arguments.
In Natural Language Processing (NLP), the PropBank project has played a very significant
role. It helps in semantic role labeling.
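As a hedged sketch (assuming NLTK is installed together with its propbank and treebank data, since PropBank annotates Penn Treebank sentences), the following code looks at one annotated proposition through NLTK's PropBank corpus reader.

# Browsing PropBank annotations via NLTK
# (assumes nltk.download('propbank') and nltk.download('treebank')).
from nltk.corpus import propbank

inst = propbank.instances()[0]   # first annotated proposition
print(inst.roleset)              # identifier of the verb sense (frame)
print(inst.predicate)            # pointer to the verb in the Treebank sentence
print(inst.arguments)            # (pointer, label) pairs such as ARG0, ARG1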
VerbNet (VN)
VerbNet (VN) is the largest hierarchical, domain-independent lexical resource for English
that incorporates both semantic and syntactic information about its contents. VN is a
broad-coverage verb lexicon with mappings to other lexical resources such as WordNet,
XTAG and FrameNet. It is organized into verb classes that extend the Levin classes through
refinement and the addition of subclasses, in order to achieve syntactic and semantic
coherence among class members.
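As a hedged sketch (assuming NLTK is installed and its VerbNet data has been downloaded), the following code queries the verb classes for a lemma through NLTK's VerbNet corpus reader.

# Querying VerbNet through NLTK (assumes nltk.download('verbnet')).
from nltk.corpus import verbnet

class_ids = verbnet.classids(lemma='give')    # VerbNet classes containing 'give'
print(class_ids)
if class_ids:
    print(verbnet.lemmas(class_ids[0])[:10])  # other member verbs of that class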
WordNet
WordNet, created at Princeton University, is a lexical database for the English language. It
is part of the NLTK corpus collection. In WordNet, nouns, verbs, adjectives and adverbs are
grouped into sets of cognitive synonyms called synsets. All the synsets are linked by
conceptual-semantic and lexical relations. This structure makes WordNet very useful for
natural language processing (NLP).
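As a small sketch (assuming NLTK is installed and the WordNet data has been downloaded), the following code looks up the synsets of a word and a few of their relations.

# Exploring WordNet synsets with NLTK (assumes nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

for syn in wn.synsets('car')[:3]:
    print(syn.name(), '-', syn.definition())   # synset identifier and gloss
    print('  lemmas:', syn.lemma_names())      # member words of the synset
    print('  hypernyms:', syn.hypernyms())     # more general synsets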
3. Natural Language Processing — Word Level Analysis
In this chapter, we will understand word level analysis in Natural Language Processing.
Regular Expressions
A regular expression (RE) is a language for specifying text search strings. An RE helps us
match or find other strings or sets of strings, using a specialized syntax held in a pattern.
Regular expressions are used to search texts in UNIX as well as in MS Word in an identical
way. Various search engines also use a number of RE features.
A regular expression requires two things: the pattern that we wish to search for, and a
corpus of text to search in (a small Python sketch of this follows below).
Regular expressions can be built up from the following rules: every symbol of the input
alphabet (and the empty string ε) is a regular expression, and if X and Y are regular
expressions, then X.Y (the concatenation of X and Y), X+Y (the union of X and Y) and X*
(the Kleene closure of X) are also regular expressions. If a string is derived from the above
rules, then it is also a regular expression.
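As a small practical sketch of the two ingredients mentioned above, a pattern and a text to search in (the sample text and pattern are illustrative assumptions), Python's built-in re module can be used as follows.

# Searching a text with a regular expression pattern using Python's re module.
import re

text = "Natural language processing makes computers process natural language."
pattern = r"natural language"                     # the pattern we wish to search for

matches = re.findall(pattern, text, flags=re.IGNORECASE)
print(matches)                                    # all case-insensitive occurrences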
As an example of the formal notation, (aa + ab + ba + bb)* describes the set of strings of
a's and b's of even length, which can be obtained by concatenating any combination of the
strings aa, ab, ba and bb, including the null string, i.e. {ε, aa, ab, ba, bb, aaab, aaba, ...}.
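This property can be checked quickly in Python; note that Python's re syntax writes the union operator + of formal regular expressions as |. The short sketch below is only an illustration.

# Checking the even-length property of (aa + ab + ba + bb)* with Python's re,
# writing the union '+' as '|'.
import re

pattern = re.compile(r"(?:aa|ab|ba|bb)*")

for s in ["", "ab", "aaba", "aba", "b"]:
    print(repr(s), "accepted" if pattern.fullmatch(s) else "rejected")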
If we do the intersection of two regular sets then the resulting set would also be
regular.
If we do the complement of regular sets, then the resulting set would also be
regular.
If we do the difference of two regular sets, then the resulting set would also be
regular.
If we do the reversal of regular sets, then the resulting set would also be regular.
If we take the closure of regular sets, then the resulting set would also be regular.
If we do the concatenation of two regular sets, then the resulting set would also be
regular.
An automaton having a finite number of states is called a Finite Automaton (FA) or a Finite
State Automaton (FSA).
We can say that any regular expression can be implemented as an FSA, and any FSA
can be described by a regular expression.
The following diagram shows that finite automata, regular expressions and regular
grammars are equivalent ways of describing regular languages.
[Diagram: Regular Expressions, Finite Automata and Regular Grammars are equivalent descriptions of Regular Languages]
Formally, a deterministic finite automaton (DFA) can be defined as a 5-tuple (Q, Σ, δ, q0, F),
where Q is a finite set of states, Σ is a finite set of input symbols, δ is the transition function,
q0 is the initial state and F is the set of final states. Graphically, a DFA can be represented
by digraphs called state diagrams, in which the vertices represent the states, the arcs
labelled with input symbols show the transitions, the initial state is marked with an incoming
arrow and the final states are marked with double circles.
Example of DFA
Suppose a DFA is defined by
Q = {a, b, c},
Σ = {0, 1},
q0 = a,
F = {c},
and the transition function δ shown in the following table:
Current State    Next State for Input 0    Next State for Input 1
a                a                         b
b                c                         a
c                b                         c
[State diagram of the above DFA]
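As a hedged sketch, the DFA above can be simulated directly in Python; the transition table is copied from the example, and the helper function below is an illustration rather than a library API.

# Simulating the example DFA: states {a, b, c}, start state a, final states {c}.
DELTA = {
    ('a', '0'): 'a', ('a', '1'): 'b',
    ('b', '0'): 'c', ('b', '1'): 'a',
    ('c', '0'): 'b', ('c', '1'): 'c',
}

def dfa_accepts(string, start='a', finals=frozenset({'c'})):
    state = start
    for symbol in string:
        state = DELTA[(state, symbol)]   # exactly one next state per input symbol
    return state in finals

print(dfa_accepts("10"))   # a -1-> b -0-> c : accepted (True)
print(dfa_accepts("11"))   # a -1-> b -1-> a : rejected (False)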
Graphically (in the same way as a DFA), an NDFA can be represented by digraphs called
state diagrams, with the difference that for an NDFA a state may have more than one
outgoing transition for the same input symbol.
Example of NDFA
Suppose an NDFA is defined by
Q = {a, b, c},
Σ = {0, 1},
q0 = a,
F = {c},
and the transition function δ shown in the following table:
Current State    Next State for Input 0    Next State for Input 1
a                a, b                      b
b                c                         a, c
c                b, c                      c
[State diagram of the above NDFA]
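As a hedged sketch mirroring the DFA example above, the NDFA can be simulated in Python by tracking the set of states reachable after each input symbol; the code below is an illustration, not a library API.

# Simulating the example NDFA: a state/input pair may lead to several states.
DELTA = {
    ('a', '0'): {'a', 'b'}, ('a', '1'): {'b'},
    ('b', '0'): {'c'},      ('b', '1'): {'a', 'c'},
    ('c', '0'): {'b', 'c'}, ('c', '1'): {'c'},
}

def ndfa_accepts(string, start='a', finals=frozenset({'c'})):
    current = {start}
    for symbol in string:
        # follow every possible transition from every currently reachable state
        current = set().union(*(DELTA.get((s, symbol), set()) for s in current))
    return bool(current & finals)   # accepted if any reachable state is final

print(ndfa_accepts("0"))    # {a} -0-> {a, b} : no final state, rejected (False)
print(ndfa_accepts("00"))   # {a, b} -0-> {a, b, c} : contains c, accepted (True)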