NLP M1 Students
Computers “see” English text in the same way that you have seen Figure 1.
Normally, people have no trouble understanding natural language because they have
common-sense knowledge, reasoning capacity, and experience for understanding the
context of the text. This is not the case with computers: they have no built-in
common-sense knowledge, reasoning capacity, or experience. Unless we teach
computers to do so, they will not understand any natural language.
[Figure: NLP and related areas (database, intelligence, algorithms, networking, web), with NLP tasks: information retrieval (using ontology), machine translation (E-M & M-E), text categorization, summarization (abstractive and extractive), and language morphological analysis.]
5. History of NLP
NLP began in the 1950s as the intersection of artificial intelligence and linguistics.
NLP was originally distinct from text information retrieval (IR), which employs highly
scalable statistics-based techniques to index and search large volumes of text efficiently:
Manning et al. [1] provide an excellent introduction to IR. With time, however, NLP and IR
have converged somewhat. Currently, NLP borrows from several, very diverse fields,
requiring today's NLP researchers and developers to broaden their mental knowledge-
base significantly.
Early simplistic approaches, for example, word-for-word Russian-to-English machine
translation, were defeated by homographs—identically spelled words with multiple
meanings—and metaphor, leading to the apocryphal story of the Biblical, ‘the spirit is
willing, but the flesh is weak’ being translated to ‘the vodka is agreeable, but the meat is
spoiled.’
Chomsky's 1956 theoretical analysis of language grammars provided an estimate of
the problem's difficulty, influencing the creation (1963) of Backus-Naur Form (BNF)
notation. BNF is used to specify a context-free grammar (CFG), and is commonly used
Modern NLP consists of speech recognition, machine learning, machine text reading, and
machine translation. Combined, these parts would allow artificial intelligence to
gain real knowledge of the world, rather than just playing chess or moving around an obstacle
course. In the near future, computers may be able to read all of the information online,
learn from it, solve problems, and possibly cure diseases. The limit for NLP and
AI is humanity: research will not stop until both are at a human level of awareness and
understanding.
[Figure: A natural language processor takes message text or speech (via speech recognition) as input and produces outputs such as meaning, a database update, a spoken response, or other output.]
Any natural language processing system starts with some input and ends with effective
and accurate output. The input to a natural language processor can be text or speech.
A variety of outputs can be generated by the system. The output may be an
answer when the input is a question. Similarly, outputs can be a database update,
a spoken response, the part of speech or morphology of a word, or the semantics of a
word or sentence.
Module: 01 Introduction
7. Levels of NLP
Natural Language Processing works on multiple levels and, most often, these different
levels synergize well with each other. NLP can broadly be divided into various levels,
as shown in the figure.
[Figure: Levels of NLP. Legible labels: speech analysis, application reasoning and execution, pronunciation, morphological lexicon, domain.]
Phonology: It deals with the interpretation of speech sounds within and across words.
Morphology: It is the study of the way words are built up from smaller meaning-bearing units
called morphemes. For example, the word ‘fox’ has a single morpheme, while the word ‘cats’
has two morphemes: ‘cat’ and ‘-s’, where ‘-s’ marks the plural.
A morphological lexicon is the list of stems and affixes together with basic information, such as
whether a stem is a noun stem or a verb stem [21]. The detailed analysis of this level
is discussed in chapter 4.
Syntax: It is the study of formal relationships between words. It
is the study of how words are clustered into classes in the form of Part-of-Speech (POS),
how they are grouped with their neighbours into phrases, and the way words depend on
each other in a sentence.
Semantics: It is the study of the meaning of words as associated with grammatical
structure. It consists of two kinds of approaches: syntax-driven semantic analysis and
semantic grammar. The detailed explanation of this level is discussed in chapter 4.
Discourse: This level of NLP works with text longer than a sentence. There are two
types of discourse processing: anaphora resolution and discourse/text structure recognition. Anaphora
9. Ambiguity in NLP
Natural Language Processing (NLP) is an area of research and application that explores
how computers can be used to understand and manipulate natural language text or
speech to do useful things. Text-based NLP has been regarded as consisting of
various levels.
They are:
* Lexical Analysis: Analysis of word forms
* Syntactic Analysis: Structure processing
* Semantic Analysis: Meaning representation
* Discourse Analysis: Processing of interrelated sentences
* Pragmatic Analysis: The purposeful use of sentences in situations
Ambiguity can occur at all these levels. It is a property of linguistic expressions: if an
expression (word, phrase, or sentence) has more than one interpretation, we refer to it as
ambiguous.
For example, consider the sentence “The chicken is ready to eat”.
The interpretations of this sentence can be:
* The chicken (bird) is ready to be fed, or
* The chicken (food) is ready to be eaten.
Consider another sentence: “There was not a single man at the party”.
The interpretations in this case can be:
* Lack of bachelors at the party, or
* Lack of men altogether.
There are different types of ambiguity:
1. Lexical Ambiguity: This is the ambiguity of a single word. A word can be ambiguous
with respect to its syntactic class, e.g. book, study.
For example, the word “silver” can be used as a noun, an adjective, or a verb.
3. Semantic Ambiguity: This occurs when the meaning of the words themselves can
be misinterpreted. Even after the syntax and the meanings of the individual words
have been resolved, there are two ways of reading the sentence.
Consider the example: “Seema loves her mother and Sriya does too”.
The interpretations can be that Sriya loves Seema’s mother or that Sriya loves her own mother.
Semantic ambiguities arise from the fact that a computer is generally not in a position to
distinguish what is logical from what is not.
Consider the example: “The car hit the pole while it was moving”.
The interpretations can be:
— The car, while moving, hit the pole
— The car hit the pole while the pole was moving.
The first interpretation is preferred over the second because we have a model of the
world that helps us to distinguish what is logical (or possible) from what is not. Supplying
a computer with a model of the world is not so easy.
Consider the example: “We saw his duck”
Duck can refer to the person’s bird or to a motion he made.
Semantic ambiguity happens when a sentence contains an ambiguous word or phrase.
Waiter (running upstairs and coming back panting): Yes sir, they are there.
Clearly, the waiter is falling short of the expectation of the tourist, since he does not
understand the pragmatics of the situation.
Pragmatic ambiguity arises when the statement is not specific, and the context
does not provide the information needed to clarify the statement. Information is
missing, and must be inferred. Consider the example: “I love you too.”
“Duck”, for example, can take the form of a noun or a verb but its part-of-speech and lexical
meaning can only be derived in context with other words used in the phrase/sentence.
This, in fact, is an early step towards a more sophisticated Information Retrieval system
where precision is improved through part-of-speech tagging.
Syntactic Analysis (Parsing): It involves analyzing the words in a sentence for grammar
and arranging the words in a manner that shows the relationships among them. A
sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.
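As a minimal sketch of how a syntactic analyzer might accept one sentence and reject another, the following recognizer uses the CYK algorithm over a toy context-free grammar. The grammar, the lexicon, and the function name `is_grammatical` are assumptions made for illustration; they are not taken from the course material, and a real English grammar is far larger.

```python
from itertools import product

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
# Keys are (left child, right child); values are the possible parent categories.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "PP"): {"VP"},
    ("P", "NP"): {"PP"},
}
LEXICON = {
    "the": {"Det"},
    "boy": {"N"},
    "school": {"N"},
    "goes": {"V"},
    "to": {"P"},
}

def is_grammatical(sentence):
    """Return True if the toy grammar derives the sentence (CYK recognition)."""
    words = sentence.lower().split()
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the categories that can span words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span - 1
            for split in range(start, end):
                pairs = product(table[start][split], table[split + 1][end])
                for left, right in pairs:
                    table[start][end] |= BINARY_RULES.get((left, right), set())
    return "S" in table[0][n - 1]
```

Under this toy grammar, "The boy goes to the school" is accepted, while "The school goes to boy" is rejected because the bare noun "boy" cannot form a noun phrase without a determiner.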
Semantic Analysis: It concerns what words mean and how these meanings combine
in sentences to form sentence meanings. It draws the exact meaning, or the dictionary
meaning, from the text. The text is checked for meaningfulness. This is done by mapping
syntactic structures to objects in the task domain. The semantic analyzer rejects
phrases such as “hot ice-cream”. Another example is the word “plant”, which can mean an
industrial plant or a living organism.
Discourse Integration: This concerns how the immediately preceding sentences affect
the interpretation of the next sentence. The meaning of any sentence depends upon the
meaning of the sentence just before it. In addition, it may also influence the meaning of the
immediately succeeding sentence.
Pragmatic Analysis: This concerns how sentences are used in different situations
and how this affects the interpretation of the sentence. During this stage, what was said is
reinterpreted in terms of what was actually meant. It involves deriving those aspects of language
which require real-world knowledge.
[Figure: Stages of NLP. Syntactic Analysis: linear sequences of words are transformed into structures that show how the words relate to each other. Semantic Analysis: a transformation is made from the input text to an internal representation that reflects the meaning. Pragmatic Analysis: to reinterpret what was said as what was actually meant.]
Morphological Analysis:
The morphological level of linguistic processing deals with the study of word structures
and word formation, focusing on the analysis of the individual components of words. The
most important unit of morphology, defined as the “minimal unit of meaning”, is
the morpheme. For example, the word “unhappiness” can be broken
down into three morphemes (prefix, stem, and suffix), each conveying some form
of meaning: the prefix un- refers to “not being”, while the suffix -ness refers to “a state
of being”. The stem happy is considered a free morpheme since it is a “word” in its
own right. Bound morphemes (prefixes and suffixes) require a free morpheme to which
they can be attached, and therefore cannot appear as a “word” on their own.
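The decomposition into prefix, stem, and suffix can be sketched with a small affix-stripping routine. The affix lists, the stem lexicon, and the name `segment` are hypothetical and chosen only to cover the examples in these notes; a real morphological analyzer needs a much larger lexicon and more spelling rules.

```python
# Hypothetical toy affix lists and stem lexicon (free morphemes).
PREFIXES = ["un", "re"]
SUFFIXES = ["ness", "s"]
STEMS = {"happy", "cat", "fox", "kind"}

def segment(word):
    """Split a word into (prefix, stem, suffix); an empty string means 'absent'.

    Returns None when no combination of known affixes leaves a known stem.
    """
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            core = rest[:len(rest) - len(suffix)]
            # Undo the y -> i spelling change before a suffix (happi -> happy).
            candidates = [core]
            if suffix and core.endswith("i"):
                candidates.append(core[:-1] + "y")
            for stem in candidates:
                if stem in STEMS:
                    return prefix, stem, suffix
    return None
```

For instance, `segment("unhappiness")` yields the bound morphemes `un` and `ness` around the free morpheme `happy`, and `segment("cats")` separates the plural `s` from `cat`.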
In Information Retrieval, document and query terms can be stemmed to match the
morphological variants of terms between the documents and query; such that the singular
form of a noun in a query will match even with its plural form in the document, and vice
versa, thereby increasing recall.
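The effect of stemming on matching can be illustrated with a deliberately crude plural stripper. This is an assumption made for the sketch, not a standard stemming algorithm such as Porter's, and the function names are hypothetical.

```python
def stem(term):
    """Very crude plural stripper (an illustrative assumption, not Porter's algorithm)."""
    term = term.lower()
    if term.endswith("ies") and len(term) > 4:
        return term[:-3] + "y"          # queries -> query
    if term.endswith("s") and not term.endswith("ss") and len(term) > 3:
        return term[:-1]                # parsers -> parser
    return term

def matches(query, document, use_stemming):
    """Return True if the query and document share at least one (normalized) term."""
    normalize = stem if use_stemming else str.lower
    query_terms = {normalize(t) for t in query.split()}
    doc_terms = {normalize(t) for t in document.split()}
    return bool(query_terms & doc_terms)
```

Without stemming, the query "parser" does not match the document "Comparing statistical parsers"; with stemming both sides reduce to "parser" and the document is retrieved, which is how recall increases.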
[Figure: Morphological analysis example]
Surface form: I want to print Ali's .init file
Stems: I (pronoun), want (verb), to (prep), print (verb), Ali (noun), 's (possessive), .init (adj), file (noun)
Syntactic Analysis
The part-of-speech tagging output of the lexical analysis can be used at the syntactic
level of linguistic processing to group words into phrase and clause brackets. Syntactic
analysis, also referred to as “parsing”, allows the extraction of phrases which convey
more meaning than the individual words by themselves, such as a noun phrase.
In Information Retrieval, parsing can be leveraged to improve indexing, since phrases
can be used as representations of documents and provide better information than
single-word indices. In the same way, phrases that are syntactically derived from the
query offer better search keys to match with documents that are similarly parsed.
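As a rough sketch of how part-of-speech tags can be grouped into phrases for indexing, the following chunker collects maximal runs of noun-phrase tags. The tag names follow the running example in these notes; the set `NP_TAGS` and the function `noun_phrases` are assumptions for illustration, and a real parser would use a full grammar rather than tag runs.

```python
# POS-tagged tokens as they might come out of lexical analysis (tags are illustrative).
TAGGED = [
    ("I", "pronoun"), ("want", "verb"), ("to", "prep"), ("print", "verb"),
    ("Ali", "noun"), ("'s", "possessive"), (".init", "adj"), ("file", "noun"),
]
NP_TAGS = {"noun", "adj", "possessive"}

def noun_phrases(tagged):
    """Group maximal runs of noun-phrase tags; keep only runs that contain a noun."""
    phrases, current = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the final run
        if tag in NP_TAGS:
            current.append((word, tag))
        else:
            if any(t == "noun" for _, t in current):
                phrases.append(" ".join(w for w, _ in current))
            current = []
    return phrases
```

Applied to `TAGGED`, this returns the single phrase "Ali 's .init file", which could serve as one index key instead of four unrelated single-word keys.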
Nevertheless, syntax can still be ambiguous at times as in the case of the news headline:
“Boy paralyzed after tumor fights back to gain black belt”— which actually refers to how
a boy was paralyzed because of a tumor but endured the fight against the disease and
ultimately gained a high level of competence in martial arts.
[Figure: Syntactic analysis example. The stems I (pronoun), want (verb), to (prep), print (verb), Ali (noun), 's (possessive), .init (adj), file (noun) are arranged into a parse tree with S, NP, VP, and PRO nodes.]
Semantic Analysis
The semantic level of linguistic processing deals with the determination of what a
sentence really means by relating syntactic features and disambiguating words with
multiple definitions to the given context. This level entails the appropriate interpretation
of the meaning of sentences, rather than analysis at the level of individual words or
phrases.
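One simple way to disambiguate a word against its context is a simplified Lesk-style overlap count: pick the sense whose definition shares the most content words with the sentence. The sense inventory, glosses, and stopword list below are hypothetical stand-ins for a real lexical resource, and the function names are assumptions.

```python
# Hypothetical mini sense inventory; real systems use a resource such as WordNet.
SENSES = {
    "plant": {
        "industrial plant": "a factory or building where goods are manufactured by workers",
        "living organism": "a living organism with leaves and roots that grows in soil",
    }
}
STOPWORDS = {"a", "or", "the", "in", "that", "by", "with", "and", "where", "are", "of"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    scores = {
        sense: len(context & content_words(gloss))
        for sense, gloss in SENSES[word].items()
    }
    return max(scores, key=scores.get)
```

For "The plant hired more workers to build cars" the overlap on "workers" selects the industrial sense, while "Water the plant so its roots and leaves stay healthy" selects the living organism.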
In Information Retrieval, the query and document matching process can be performed
on a conceptual level, as opposed to simple terms, thereby further increasing system
precision. Moreover, by applying semantic analysis to the query, term expansion becomes
possible with the use of lexical sources, offering improved retrieval of the relevant
documents even if exact terms are not used in the query. Precision may increase with
query expansion, and recall will probably increase as well.
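Query expansion can be sketched as adding related terms from a lexical source before matching. The thesaurus, the documents, and the function names here are hypothetical, and the matching rule (any shared term) is a simplification.

```python
# Hypothetical thesaurus standing in for a lexical source.
SYNONYMS = {"car": {"automobile", "vehicle"}, "cheap": {"inexpensive"}}

DOCS = {
    1: "cheap car rentals",
    2: "automobile repair manual",
    3: "gardening tips",
}

def expand(query_terms):
    """Return the query terms plus any synonyms listed in the thesaurus."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded

def search(query_terms, documents):
    """Return sorted ids of documents sharing at least one term with the query."""
    return sorted(
        doc_id for doc_id, text in documents.items()
        if query_terms & set(text.lower().split())
    )
```

Searching for `{"car"}` alone finds only document 1, whereas the expanded query also finds document 2, which never uses the word "car".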
[Figure: Example network diagram derived from the syntactic analysis of the example sentence; details not recoverable.]
Pragmatic Analysis
The pragmatic level of linguistic processing deals with the use of real-world knowledge
and understanding how this impacts the meaning of what is being communicated. By
analyzing the contextual dimension of the documents and queries, a more detailed
representation is derived.
In Information Retrieval, this level of Natural Language Processing primarily engages
query processing and understanding by integrating the user’s history and goals as well
as the context in which the query is being made. Contexts may include time and
location.
This level of analysis can improve Information Retrieval substantially, as it
facilitates the conversation between the IR system and the users, allowing the system to
elicit the purpose for which the sought information will be used, thereby
ensuring that the information retrieval system is fit for purpose.
Discourse Analysis
The discourse level of linguistic processing deals with the analysis of the structure and meaning
of text beyond a single sentence, making connections between words and sentences. At
this level, Anaphora Resolution is also achieved by identifying the entity referenced by an
anaphor (most commonly in the form of, but not limited to, a pronoun). An example is shown
below.
aligned with my values," she said.
With the capability to recognize and resolve anaphora relationships, document and query
representations are improved, since, at the lexical level, the implicit presence of concepts is
accounted for throughout the document as well as in the query, while at the semantic and
discourse levels, an integrated content representation of the documents and queries is
generated.
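A very naive form of anaphora resolution links each pronoun to the nearest preceding entity of matching gender. This heuristic, the gender lexicons, and the name `resolve_pronouns` are assumptions for illustration only; practical resolvers use syntax, salience, and semantic features.

```python
# Hypothetical gender lexicons; real resolvers use much richer features.
ENTITY_GENDER = {"Seema": "f", "Sriya": "f", "Ravi": "m"}
PRONOUN_GENDER = {"she": "f", "her": "f", "he": "m", "him": "m", "his": "m"}

def resolve_pronouns(tokens):
    """Map each pronoun position to the nearest preceding entity of matching gender."""
    resolved = {}
    for i, token in enumerate(tokens):
        gender = PRONOUN_GENDER.get(token.lower())
        if gender is None:
            continue  # not a pronoun we know about
        for j in range(i - 1, -1, -1):  # scan backwards from the pronoun
            if ENTITY_GENDER.get(tokens[j]) == gender:
                resolved[i] = tokens[j]
                break
    return resolved
```

For the tokens of "Seema met Ravi . She thanked him", the pronoun "She" is linked to "Seema" and "him" to "Ravi", so both entities are counted as present in the second sentence.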
Structured documents also benefit from analysis at the discourse level, since sections
can be broken down into (1) title, (2) abstract, (3) introduction, (4) body, (5) results,
(6) analysis, (7) conclusion, and (8) references. Information Retrieval systems are
significantly improved, as the specific role of each piece of information is determined:
whether it is a conclusion, an opinion, a prediction, or a fact.
Information Retrieval (IR): It is a scientific discipline that deals with the analysis, design, and
implementation of computerized systems that address the representation, organization,
and access to large amounts of heterogeneous information encoded in digital format.
The search engine is a well-known application of IR: it accepts a query from the user and
returns the relevant documents. It returns documents, not answers;
users are left to extract answers from the returned documents. Research areas in IR
include information searching, information extraction, information categorization, and
information summarization from unstructured information.
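The core of a search engine can be sketched as an inverted index that maps terms to the documents containing them. The documents and function names below are hypothetical, and the boolean AND matching is one simple choice among many ranking schemes.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return sorted ids of documents containing every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

DOCS = {
    1: "natural language processing with statistical models",
    2: "information retrieval indexes large volumes of text",
    3: "language models for information retrieval",
}
INDEX = build_index(DOCS)
```

A query such as "information retrieval" returns documents 2 and 3. As the text notes, the result is a list of documents; extracting an answer from them is left to the user.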
Information Extraction: It is the extraction of structured information from unstructured
text, that is, the activity of filling a predefined template from natural language text. Research
areas in this category include identifying named entities, resolving anaphora, and identifying
relationships between entities.
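Template filling can be illustrated with a single regular-expression pattern that extracts a person, an organization, and a year. The template, the pattern, and the sample text are hypothetical; real extraction systems rely on named-entity recognizers rather than capitalization alone.

```python
import re

# Hypothetical template: who joined which organization, and in which year.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) joined "
    r"(?P<organization>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) in (?P<year>\d{4})"
)

TEXT = (
    "After graduating, Asha Rao joined Acme Labs in 2019. "
    "Later, Ben Lee joined Globex in 2021."
)

def fill_templates(text):
    """Return one filled template (a dict of slot values) per pattern match."""
    return [match.groupdict() for match in PATTERN.finditer(text)]
```

Running `fill_templates(TEXT)` produces two records, each with `person`, `organization`, and `year` slots, turning free text into rows that could be stored in a database.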
Question Answering (QA): It is passage retrieval in a specific domain. It is the process of
finding answers to a given question from a large collection of documents.
Natural Language Interface to Database (NLIDB): It is the process of finding answers
from a database by asking questions in natural language.
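A toy NLIDB can be sketched by translating one fixed question pattern into a parameterized SQL query. The table schema, the sample rows, the supported question shape, and the function names are all assumptions for illustration; real interfaces handle a much wider range of questions.

```python
import re
import sqlite3

def make_db():
    """Create a hypothetical in-memory employee table for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?)",
        [("Asha", "sales"), ("Ben", "sales"), ("Chen", "research")],
    )
    return conn

def answer(conn, question):
    """Translate one supported question pattern into a parameterized SQL query."""
    match = re.fullmatch(
        r"how many employees work in (\w+)\??", question.strip().lower()
    )
    if match is None:
        return None  # question shape not covered by this toy interface
    row = conn.execute(
        "SELECT COUNT(*) FROM employees WHERE department = ?", (match.group(1),)
    ).fetchone()
    return row[0]
```

With `conn = make_db()`, the question "How many employees work in sales?" is answered from the database with a count, while an unsupported question returns `None` instead of a guess.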