The discourse level of linguistic processing deals with the analysis of structure and meaning
beyond a single sentence, making connections between words and sentences. Anaphora
resolution is also achieved at this level by identifying the entity referenced by an
anaphor (usually in the form of, but not limited to, a pronoun). An example is shown below.
"... voted for Obama because he was most ..."
Figure 5: Anaphora Resolution Illustration
11. Applications of NLP
Applications of NLP include information extraction, machine learning systems,
question answering systems, dialogue systems, email routing, telephone banking, speech
management, multilingual query processing, and natural language interfaces to database
systems. Currently, interactive applications may be classified into the following categories:
Speech Recognition / Speech Understanding and Synthesis / Speech Generation:
A speech understanding system attempts to perform semantic and pragmatic processing
of a spoken utterance to understand what the user is saying and to act on what is being said.
The research area in this category includes linguistic analysis and the design and development
of efficient and effective algorithms for speech recognition and synthesis.
Language Translator: It is the task of automatically converting one natural language into
another, preserving the meaning of the input text and producing an equivalent text in the
output language. The research area in this category includes language modelling.
Information Retrieval (IR): It is a scientific discipline that deals with the analysis, design and
implementation of computerized systems that address the representation, organization,
and access to large amounts of heterogeneous information encoded in digital format.
The search engine is the best-known application of IR: it accepts a query from the user and
returns the relevant documents. It returns documents, not answers;
users are left to extract answers from the returned documents. The research area in IR
includes information searching, information extraction, information categorization and
information summarization from unstructured information.
Information Extraction: It includes extraction of structured information from unstructured
text. It is the activity of filling a predefined template from natural language text. The research
area in this category includes identifying named entities, resolving anaphora, and identifying
relationships between entities.
Question Answering (QA): It is passage retrieval in a specific domain. It involves
finding answers for a given question from a large collection of documents.
Natural Language Interface to Database (NLIDB): It is a process of querying a
database by asking questions in natural language.
[...] It determines the grammar and style of the sentence and, based on that, it gives [...]
Dialogue Systems: It is the study of dialog between humans and computers. The research
area in this category includes the design of conversational agents, human-robot dialog and
analysis of human-human dialog.
Text Generation: The task of [...]
Discourse Management / Story Understanding:
Identifying the discourse structure means identifying the nature of the discourse relationship
between sentences, such as elaboration, explanation and contrast, and also the
speech acts in a chunk of text (for example, yes-no question, statement and assertion).
Expected Questions
1. What is Natural Language Processing (NLP)? Discuss the various stages involved in the
NLP process with a suitable example.
2. What is Natural Language Understanding? Discuss the various levels of analysis
under it with examples.
3. What do you mean by ambiguity in natural language? Explain with a suitable
example. Discuss various ways to resolve ambiguity in NL.
4. What do you mean by lexical ambiguity and syntactic ambiguity in natural language?
What are the different ways to resolve these ambiguities?
5. List various applications of NLP and discuss any 2 applications in detail.
1. Morphology Analysis
What are words?
Words are the fundamental building block of language. Every human language,
spoken, signed, or written, is composed of words. Every area of speech and language
processing, from speech recognition to machine translation to information retrieval on
the web, requires extensive knowledge about words. Psycholinguistic models of human
language processing and models from generative linguistics are also heavily based on
lexical knowledge.
Words are orthographic tokens separated by white space. In some languages the
distinction between words and sentences is less clear.
Chinese, Japanese: no white space between words
nowhitespace -> no white space / no whites pace / now hit esp ace
Turkish: a single word can represent a complete "sentence"
E.g.: uygarlastiramadiklarimizdanmissinizcasina
"(behaving) as if you are among those whom we could not civilize"
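The "nowhitespace" ambiguity can be reproduced with a short sketch. It enumerates every split of an unsegmented string into words from a given vocabulary; the function name and the toy vocabulary are assumptions made for illustration, not part of any library.

```python
def segmentations(text, vocabulary):
    """Return every way to split `text` into words found in `vocabulary`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results


# Hypothetical toy vocabulary, chosen only to reproduce the readings above.
VOCAB = {"no", "now", "white", "whites", "space", "pace", "hit", "esp", "ace"}

for words in segmentations("nowhitespace", VOCAB):
    print(" ".join(words))
```

With a realistic vocabulary the number of candidate splits grows quickly, which is why segmenters for languages written without spaces usually rank candidates rather than list them.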
Morphology is the study of the structure and formation of words. Its most important unit
is the morpheme, which is defined as the "minimal unit of meaning".
Consider a word like "unhappiness". This has three parts (morphemes):
un + happy + ness
Prefix : un
Stem : happy
Suffix : ness
Affixes : un, ness
Bound Morphemes: These are lexical items incorporated into a word as a dependent
part. They cannot stand alone, but must be connected to another morpheme. Bound
morphemes operate in the connection processes by means of derivation, inflection, and
compounding. Free morphemes, on the other hand, are autonomous, can occur on their
own and are thus also words at the same time. Technically, bound morphemes and free
morphemes are said to differ in terms of their distribution or freedom of occurrence. As a
rule, lexemes consist of at least one free morpheme.
Morphology handles the formation of words by using morphemes: the base form (stem, lemma),
e.g., believe, and affixes (suffixes, prefixes, infixes), e.g., un-, -able, -ly.
Morphological parsing is the task of recognizing the morphemes inside a word, e.g., hands,
foxes, children. It is important for many tasks like machine translation and information
retrieval, and useful in parsing, text simplification, etc.
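As a rough illustration of morphological parsing, the sketch below splits a word into a prefix, stem and suffix using a tiny hand-written lexicon. The lexicon, the affix lists and the single spelling rule (i back to y) are all assumptions made for this example; a real parser would need a much larger lexicon and rule set.

```python
# Hypothetical mini-lexicon, for illustration only.
STEMS = {"hand", "fox", "child", "happy", "believe"}
IRREGULAR = {"children": ["child", "+PL"]}
PREFIXES = ["un"]
SUFFIXES = ["ness", "es", "s"]  # tried in this order


def restore_spelling(candidate):
    """Return a known stem, undoing the y -> i spelling change if needed."""
    if candidate in STEMS:
        return candidate
    if candidate.endswith("i") and candidate[:-1] + "y" in STEMS:
        return candidate[:-1] + "y"
    return None


def parse(word):
    """Split `word` into morphemes, or return [word] if no analysis is found."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        body = word[len(prefix):]
        for suffix in [""] + SUFFIXES:
            if suffix and not body.endswith(suffix):
                continue
            stem = restore_spelling(body[:len(body) - len(suffix)])
            if stem:
                parts = [prefix + "-"] if prefix else []
                parts.append(stem)
                if suffix:
                    parts.append("-" + suffix)
                return parts
    return [word]


for w in ["hands", "foxes", "children", "unhappiness"]:
    print(w, "->", parse(w))
```

Irregular forms such as children cannot be handled by suffix stripping, so the sketch looks them up in a separate table.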
Survey of English Morphology
Morphology is the study of the way words are built up from smaller meaning bearing
units, morphemes. A morpheme is often defined as the minimal meaning-bearing unit in
a language. So for example the word fox consists of a single morpheme (the morpheme
fox) while the word cats consists of two: the morpheme cat and the morpheme -s. As
this example suggests, it is often useful to distinguish two broad classes of morphemes:
stems and affixes. The exact details of the distinction vary from language to language,
but intuitively, the stem is the ‘main’ morpheme of the word, supplying the main meaning,
while the affixes add 'additional' meanings of various kinds. Affixes are further divided into
prefixes, suffixes, infixes, and circumfixes. Prefixes precede the stem, suffixes follow
the stem, circumfixes do both, and infixes are inserted inside the stem. For example, the
word eats is composed of a stem eat and the suffix -s. The word unbuckle is composed of
a stem buckle and the prefix un-. English doesn't have any good examples of circumfixes.
Infixes can be illustrated with Tagalog, where the affix um is infixed to the
stem hingi 'borrow' to produce humingi.
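The three affix positions can be sketched as simple string operations. The function names are hypothetical, and placing the infix before the first vowel of the stem is a simplifying assumption that happens to reproduce humingi; it is not a full account of Tagalog infixation.

```python
VOWELS = set("aeiou")


def add_prefix(stem, prefix):
    """Concatenate the affix before the stem."""
    return prefix + stem


def add_suffix(stem, suffix):
    """Concatenate the affix after the stem."""
    return stem + suffix


def add_infix(stem, infix):
    """Insert `infix` before the first vowel of `stem` (simplifying assumption)."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return stem[:i] + infix + stem[i:]
    return stem + infix  # no vowel found: fall back to suffixing


print(add_prefix("buckle", "un"))  # unbuckle
print(add_suffix("eat", "s"))      # eats
print(add_infix("hingi", "um"))    # humingi
```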
Prefixes and suffixes are often called concatenative morphology, since a word is
composed of a number of morphemes concatenated together. A number of languages
have extensive non-concatenative morphology, in which morphemes are combined in
more complex ways. The Tagalog infixation example above is one example of
non-concatenative morphology, since two morphemes (hingi and um) are intermingled.
Another kind of non-concatenative morphology is called templatic morphology or
root-and-pattern morphology. This is very common in Arabic, Hebrew and other Semitic
languages. In Hebrew, for example, a verb is [...]
Derivation is the combination of a word stem with a grammatical morpheme, usually
resulting in a word of a different class, often with a meaning hard to predict exactly.
For example, the verb computerize can take the derivational suffix -ation to produce
the noun computerization.
Inflectional Morphology & Derivational Morphology
Morphemes are defined as smallest meaning-bearing units. Morphemes can be
classified in various ways. One common classification is to separate those morphemes
that mark the grammatical forms of words (-s, -ed, -ing and others) from those that form
new lexemes conveying new meanings, e.g. un- and -ment. The former morphemes
are inflectional morphemes and form a key part of grammar, the latter are derivational
morphemes and play a role in word-formation, as we have seen. The following criteria
help you to distinguish the two types:
* Effect: Inflectional morphemes encode grammatical categories and relations, thus
marking word-forms, while derivational morphemes create new lexemes.
* Position: Derivational morphemes are closer to the stem than inflectional morphemes,
cf. amendments (amend[stem] - ment[derivational] - s[inflectional]) and legalized
(legal[stem] - ize[derivational] - ed[inflectional]).
* Productivity: Inflectional morphemes are highly productive, which means that they
can be attached to the vast majority of the members of a given class (say, verbs,
nouns or adjectives), whereas derivational morphemes tend to be more restricted
with regard to their scope of application. For example, the past morpheme can in
principle be attached to all verbs; suffixation by means of the adjective-forming
derivational morpheme -able, however, is largely restricted to dynamic transitive
verbs, which excludes formations such as *bleedable or *lieable.
* Class properties: Inflectional morphemes make up a closed and fairly stable class of
items which can be listed exhaustively, while derivational morphemes tend to be
much more numerous and more open to changes in their inventory.
Both inflectional and derivational morphemes must be attached to other morphemes; they
cannot occur by themselves, in isolation, and are therefore bound morphemes.
Inflected words are variations of already existing lexemes that only change their
grammatical shape. Therefore many of the inflected forms are not listed in a
dictionary. If you know the word surprise and look it up, you will not find a separate entry for
the word surprise-s, which simply expresses the [...]
Derivational morphology, on the other hand, uses affixes to create new words out of
already existing lexemes. Typical affixes are -ness, -ish, -ship and so on. These affixes
do not change the grammatical form of a word as inflectional affixes do, but instead
often create a new meaning of the base or change the word class of the base. An
example would be the word light. The plural form light-s would be considered inflectional
morphology, but if we consider de-light, the prefix de- has changed the meaning of the
word completely. We now do not think of light in the form of sunshine or lamps anymore
but instead about a feeling. Also, if we consider en-light, the prefix en- has changed the
word class of light from noun to verb.
INFLECTIONAL MORPHOLOGY
Inflection is a morphological process that adapts existing words so that they function
effectively in sentences without changing the category of the base morphemes.
Inflection can be seen as the "realization of morphosyntactic features through
morphological means". But what exactly does that mean? It means that inflectional
morphemes of a word do not have to be listed in a dictionary since we can guess their
meaning from the root word. We know when we see the word what it connects to and
most times can even guess the difference to its original. For example, let us consider
help-s, help-ed and help-er. According to what I have said about words listed in a
dictionary, all of these variants might be inflectional morphemes, but then on the other
hand does help-s really need an extra listing, or can we guess from the root help and
the suffix -s what it means? Does our natural feeling and instinct for language tell
us that the suffix -s indicates the third person singular and that help-s therefore is just
a variant of help (considering help as a verb and not a noun here)? Yes, it does. As a
native speaker one instantly knows that -s, as also the past form indicator -ed, marks a
grammatical variant of the root lexeme help.
[...] inflectional morpheme again, the root here being [...] remove all affixes.
To illustrate this, consider the following two sentences:
1. I help my grandmother in her garden.
2. He is my grandmother's help.
Here our general knowledge of words and their meaning shows us that in 1. help is used
as a verb and expresses us working with our grandmother in order to support her. In 2.
help is a noun and stands for the person that regularly supports my grandmother. This
variation of a word without actually changing its form is called a zero morpheme and
can not only distinguish verb and noun (which makes it a derivational morpheme) but
also singular and plural, which makes it an inflectional morpheme. I will talk about this
later in 2.2.: Inflection in nouns, though.
"We may define inflectional morphology as the branch of morphology that deals with
paradigms. It is therefore concerned with two things: on the one hand, with the semantic
oppositions among categories; and on the other, with the formal means, including
inflections, that distinguish them." (Matthews, 1991)
In short, inflectional morphology changes the word form, determines the grammar, and
does not form a new lexeme but rather a variant of a lexeme that does not need its own
entry in the dictionary.
word stem + grammatical morphemes, e.g. cat + s
only for nouns, verbs, and some adjectives
Nouns
plural:
regular: +s, +es
irregular: mouse - mice; ox - oxen
many spelling rules: e.g. -y -> -ies like: butterfly - butterflies
possessive: +'s, +'
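The plural rules listed above can be sketched as a small function. The irregular table holds only the two examples from the text, and the sibilant test for +es and the consonant check before -y are common textbook simplifications assumed here, not a complete account of English spelling.

```python
# Irregular plurals from the examples above; a real table would be much larger.
IRREGULAR_PLURALS = {"mouse": "mice", "ox": "oxen"}


def pluralize(noun):
    """Return the plural of `noun` using a few simplified English rules."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    # consonant + y -> ies (butterfly -> butterflies), but boy -> boys
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    # sibilant endings take +es (fox -> foxes)
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    return noun + "s"


for n in ["cat", "fox", "butterfly", "boy", "mouse", "ox"]:
    print(n, "->", pluralize(n))
```

The irregular lookup has to come first; otherwise ox would wrongly become oxes.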
Verbs
main verbs (sleep, eat, walk)
modal verbs (can, will, should)
primary verbs (be, have, do)
VERB INFLECTIONAL SUFFIXES
1. The suffix -s functions in the present simple as the third person singular marking of the verb:
to work - he work-s
2. The suffix -ed functions in the past simple as the past tense marker in regular
verbs: to love - lov-ed
3. The suffixes -ed (regular verbs) and -en (some irregular verbs) function in the
marking of the past participle and, in general, in the marking of the perfect aspect:
To study - studied - studied / To eat - ate - eaten
4. The suffix -ing functions in the marking of the present participle, the gerund and in
the marking of the continuous aspect: To eat - eating / To study - studying
NOUN INFLECTIONAL SUFFIXES
1. The suffix -s functions in the marking of the plural of nouns: dog - dogs
2. The suffix -'s functions as a possessive marker (Saxon genitive): Laura - Laura's
book
ADJECTIVE INFLECTIONAL SUFFIXES
The suffix -er functions as comparative marker: quick - quicker
The suffix -est functions as superlative marker: quick - quickest
DERIVATIONAL MORPHOLOGY
Derivation is concerned with the way morphemes are connected to existing lexical forms
as affixes. Derivational morphology is a type of word formation that creates new lexemes,
either by changing syntactic category or by adding substantial new meaning (or both) to
a free or bound base. Derivation may be contrasted with inflection on the one hand or
with compounding on the other. The distinctions between derivation and inflection and
between derivation and compounding, however, are not always clear-cut. New words
may be derived by a variety of formal means including affixation, reduplication, internal
modification of various sorts, subtraction, and conversion. Affixation is best attested
cross-linguistically, especially prefixation and suffixation. Reduplication is also widely found,
with various internal changes like ablaut and root and pattern derivation less common.
Derived words may fit into a number of semantic categories. For nouns, event and result,
personal and participant, collective and abstract noun are frequent. For verbs, causative
and applicative categories are well-attested, as are relational and qualitative derivations
for adjectives. Languages frequently also have ways of deriving negatives, relational words,
and evaluatives. Most languages have derivation of some sort. The study of derivation has
also been important in a number of psycholinguistic debates concerning the perception
and production of language.
Derivational morphology is defined as morphology that creates new lexemes, either by
changing the syntactic category (part of speech) of a base or by adding substantial,
non-grammatical meaning or both. On the one hand, derivation may be distinguished from
inflectional morphology, which typically does not change category but rather modifies
lexemes to fit into various syntactic contexts; inflection typically expresses distinctions
like number, case, tense, aspect, person, among others. On the other hand, derivation
may be distinguished from compounding, which also creates new lexemes, but by
combining two or more bases rather than by affixation, reduplication, subtraction, or
internal modification of various sorts. Although the distinctions are generally useful, in
practice applying them is not always easy.
We can distinguish affixes in two principal types:
1. Prefixes - attached at the beginning of a lexical item or base morpheme, e.g. un-,
pre-, post-, dis-, im-, etc.
2. Suffixes - attached at the end of a lexical item, e.g. -age, -ing, -ful, -able, -ness,
-hood, -ly, etc.
EXAMPLES OF MORPHOLOGICAL DERIVATION
a. Lexical item (free morpheme): like (verb)
+ prefix (bound morpheme) dis-
= dislike (verb)
b. Lexical [...]
Derivational affixes can cause semantic change:
- Prefix pre- means before; post- means after; un- means not [...]
- Prefix = fixed before; Unhappy = not happy [...]
- Prefix de- added to a verb conveys a sense of subtraction; dis- and un- have a
sense of negativity.
- To decompose; to defame; to unex[...]
Derivation Versus Inflection
The distinction between derivation and inflection is a functional one rather than a formal
one, as Booij (2000, p. 360) has pointed out. Either derivation or inflection may be
effected by formal means like affixation, reduplication, internal modification of bases,
and other morphological processes. But derivation serves to create new lexemes while
inflection prototypically serves to modify lexemes to fit different grammatical contexts. In
the clearest cases, derivation changes category, for example taking a verb like employ
and making it a noun (employment, employer, employee) or an adjective (employable),
or taking a noun like union and making it a verb (unionize) or an adjective (unio[...],
unionesque). Derivation need not change category, however. For example, the creation
of abstract nouns from concrete ones in English (king ~ kingdom; child ~ childhood) [...]
[...] imperfective, habitual), [...] categories that languages might [...]
[...] inflection from derivation; inflection is invariably relevant to syntax, derivation not. But
Booij (1996) has argued that even this criterion is problematic unless we are clear what
we mean by relevance to syntax. Case inflections, for example, mark grammatical
context, and are therefore clearly inflectional. Number-marking on verbs is arguably
inflectional when it is triggered by the number of subject or object, but number on nouns
or tense and aspect on verbs is a matter of semantic choice, independent of grammatical
configuration. Booij therefore distinguishes what he calls contextual inflection, inflection
triggered by distinctions elsewhere in a sentence, from inherent inflection, inflection
that does not depend on the syntactic context, the latter being closer to derivation than
the former. Some theorists (Bybee, 1985; Dressler, 1989) postulate a continuum from
derivation to inflection, with no clear dividing line between them. Another position is that
of Scalise (1984), who has argued that evaluative morphology is neither inflectional nor
derivational but rather constitutes a third category of morphology.
2. Stemming and Lemmatization
In natural language processing, there may come a time when you want your program to
recognize that the words "ask" and "asked" are just different tenses of the same verb.
This is the idea of reducing different forms of a word to a core root. Words that are
derived from one another can be mapped to a central word or symbol, especially if they
have the same core meaning.
Maybe this is in an information retrieval setting and you want to boost your algorithm's
recall. Or perhaps you are trying to analyze word usage in a corpus and wish to condense
related words so that you don't have as much variability. Either way, this technique of text
normalization may be useful to you.
This is where something like stemming or lemmatization comes in.
For grammatical reasons, documents are going to use different forms of a word, such as
organize, organizes, and organizing. Additionally, there are families of derivationally related
words with similar meanings, such as democracy, democratic, and democratization. In
many situations, it seems as if it would be useful for a search for one of these words to
return documents that contain another word in the set.
The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form. For instance:
am, are, is => be
car, cars, car's, cars' => car
Stemming Algorithms Examples
Porter stemmer: This stemming algorithm is an older one. It's from the 1980s and its
main concern is removing the common endings to words so that they can be resolved
to a common form. It's not too complex and development on it is frozen. Typically, it's
a nice starting basic stemmer, but it's not really advised to use it for any production or
complex application. Instead, it has its place in research as a nice, basic stemming
algorithm that can guarantee reproducibility. It also is a very gentle stemming algorithm
when compared to others.
Snowball stemmer: This algorithm is also known as the Porter2 stemming algorithm. It is
almost universally accepted as better than the Porter stemmer, even being acknowledged
as such by the individual who created the Porter stemmer. That being said, it is also more
aggressive than the Porter stemmer. A lot of the things added to the Snowball stemmer
were because of issues noticed with the Porter stemmer. There is about a 5% difference
in the way that Snowball stems versus Porter.
Lancaster stemmer: Just for fun, the Lancaster stemming algorithm is another algorithm
that you can use. This one is the most aggressive stemming algorithm of the bunch.
However, if you use the stemmer in NLTK, you can add your own custom rules to this
algorithm very easily. It's a good choice for that. One complaint around this stemming
algorithm though is that it sometimes is overly aggressive and can really transform words.
Lemmatization usually refers to doing things properly with the use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only
and to return the base or dictionary form of a word, which is known as the lemma.
Lemmatization is a more calculated process. It involves resolving words to their dictionary
form. In fact, a lemma of a word is its dictionary or canonical form.
Because lemmatization is more nuanced in this respect, it requires a little more to
make work. For lemmatization to resolve a word to its lemma, it needs to know its part
of speech. That requires extra computational linguistics power such as a part of speech
tagger. This allows it to do better resolutions (like resolving is and are to be).
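A minimal sketch of why a lemmatizer needs the part of speech: the lookup below is keyed on the word together with its tag. The table, the tag names VERB and NOUN, and the function name are assumptions made for illustration; a real lemmatizer would use a full vocabulary and a tagger rather than a hand-written dictionary.

```python
# Hypothetical lookup table keyed by (word, part of speech).
LEMMAS = {
    ("am", "VERB"): "be",
    ("are", "VERB"): "be",
    ("is", "VERB"): "be",
    ("saw", "VERB"): "see",   # "I saw the film"
    ("saw", "NOUN"): "saw",   # "a rusty saw"
}


def lemmatize(word, pos):
    """Return the dictionary form of `word` given its part-of-speech tag."""
    key = (word.lower(), pos)
    # Unknown words are returned unchanged (lower-cased).
    return LEMMAS.get(key, word.lower())


print(lemmatize("is", "VERB"))
print(lemmatize("saw", "VERB"))
print(lemmatize("saw", "NOUN"))
```

The same surface form saw maps to two different lemmas, which a suffix-stripping stemmer cannot distinguish.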
Another thing to note about lemmatization is that it is often harder to create a
lemmatizer in a new language than it is a stemming algorithm. Because lemmatizers
require a lot more knowledge about the structure of a language, it's a much more intensive
process than just trying to set up a heuristic stemming algorithm.
If confronted with the token saw, stemming might return just s, whereas lemmatization
would attempt to return either see or saw depending on whether the use of the token
was as a verb or a noun. The two may also differ in that stemming most commonly
collapses derivationally related words, whereas lemmatization commonly only collapses
the different inflectional forms of a lemma. Linguistic processing for stemming or
lemmatization is often done by an additional plug-in component to the indexing process,
and a number of such components exist, both commercial and open-source.
The most common algorithm for stemming English, and one that has repeatedly been
shown to be empirically very effective, is Porter's algorithm (Porter, 1980). The entire
algorithm is too long and intricate to present here, but we will indicate its general nature.
Porter's algorithm consists of 5 phases of word reductions, applied sequentially. Within
each phase there are various conventions to select rules, such as selecting the rule from
each rule group that applies to the longest suffix. In the first phase, this convention is
used with the following rule group:
Rule            Example
SSES -> SS      caresses -> caress
IES  -> I       ponies   -> poni
SS   -> SS      caress   -> caress
S    ->         cats     -> cat
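The longest-suffix convention for this first rule group can be sketched in a few lines. This covers only the rule group shown (SSES, IES, SS, S) and is not the full Porter algorithm; the function and variable names are assumptions made for illustration.

```python
# The first rule group: (suffix, replacement).
RULES_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def porter_step_1a(word):
    """Apply the rule whose suffix is the longest one matching `word`."""
    matches = [(suffix, repl) for suffix, repl in RULES_1A if word.endswith(suffix)]
    if not matches:
        return word
    suffix, repl = max(matches, key=lambda rule: len(rule[0]))
    return word[:len(word) - len(suffix)] + repl


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The SS -> SS rule looks like a no-op, but it matters: because it is longer than the bare S rule, it stops caress from being cut down to cares.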
Stemming increases recall while harming precision. As an example of what can go wrong, note that
the Porter stemmer stems all of the following words:
operate operating operates operation operative operatives operational
to oper.
However, since operate in its various forms is a common verb, we would expect to lose
considerable precision on queries such as the following with Porter stemming:
operational and research
operating and system
operative and dentistry
For a case like this, moving to using a lemmatizer would not completely fix the problem
because particular inflectional forms are used in particular collocations: a sentence with
the words operate and system is not a good match for the query operating and system.
Getting better value from term normalization depends more on pragmatic issues of word
use than on formal issues of linguistic morphology.
The situation is different for languages with much more morphology (such as Spanish,
German, Hindi and Finnish).
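The precision loss can be made concrete with a toy matcher. The stemmer below only hard-codes the conflation to oper described in the text (it is not the Porter algorithm), and the document and query strings are invented for illustration.

```python
# Conflation taken from the text: these forms all reduce to "oper".
OPER_FORMS = {"operate", "operating", "operates", "operation",
              "operative", "operatives", "operational"}


def toy_stem(token):
    """Stand-in stemmer: only reproduces the 'oper' conflation."""
    return "oper" if token in OPER_FORMS else token


def matches(query, document, normalize=lambda token: token):
    """True if every query term occurs in the document after normalization."""
    doc_terms = {normalize(t) for t in document.lower().split()}
    return all(normalize(t) in doc_terms for t in query.lower().split())


document = "the operative installed a new system"
query = "operating system"

print(matches(query, document))            # exact terms: no match
print(matches(query, document, toy_stem))  # stemmed terms: match
```

The stemmed match is a gain in recall, but the document is not about operating systems, so it counts against precision.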
Regular expression
A regular expression (RE) is a language for specifying text search strings or search patterns.
The regular expression languages used for searching texts in UNIX (vi, Perl, Emacs,
grep) and Microsoft Word are almost identical, and many RE features exist in the various
Web search engines. Usually search patterns are used by string searching algorithms for
"find" or "find and replace" operations on strings, or for input validation. It is a technique
developed in theoretical computer science and formal language theory.
Besides this practical use, the regular expression is an important theoretical
tool throughout computer science and linguistics. A regular expression is a formula in
a special language that is used for specifying simple classes of strings. A string is a
sequence of symbols; for the purpose of most text-based search techniques, a string is a
sequence of alphanumeric characters (letters, numbers, spaces, and punctuation).
For these purposes a space is just a character like any other, and we represent it with
the symbol ␣. Formally, a regular expression is an algebraic notation for characterizing
a set of strings. Thus, they can be used to specify search strings as well as to define a
language in a formal way.
Basically, a regular expression is a pattern describing a certain amount of text. Their
name comes from the mathematical theory on which they are based. But we will not
dig into that. You will usually find the name abbreviated to "regex" or "regexp". Regular
expressions (regex or regexp) are extremely useful in extracting information from any
text by searching for one or more matches of a specific search pattern (i.e. a specific
sequence of ASCII or unicode characters).
Pattern: /\b(\w*NLP\w*)\b/g
Text: Natural Language Processing, or NLP for short, is broadly defined as the
automatic manipulation of natural language, like speech and text, by
software. Statistical NLP aims to do statistical inference for the field of
natural language.
Figure 1: Matching the string NLP in the given text on the site https://www-regextester.com/
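The kind of match shown in Figure 1 can be approximated with Python's re module. Both the pattern and the sample text below are reconstructions assumed for this sketch, since the figure itself is not legible; re.findall plays the role of the global search.

```python
import re

# Sample text assumed to correspond to the text shown in Figure 1.
text = ("Natural Language Processing, or NLP for short, is broadly defined as "
        "the automatic manipulation of natural language, like speech and text, "
        "by software. Statistical NLP aims to do statistical inference for the "
        "field of natural language.")

# Assumed pattern: any whole word that contains the letters NLP.
pattern = re.compile(r"\b(\w*NLP\w*)\b")

# findall returns the captured group for every non-overlapping match.
print(pattern.findall(text))
```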
Regular Expression Patterns
Anchors: ^ and $
^The        matches any string that starts with The
end$        matches a string that ends with end
^The end$   exact string match (starts and ends with The end)
roar        matches any string that has the text roar in it
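The anchor patterns can be tried out with Python's re module. The helper name found and the test strings are assumptions made for this sketch; re.search looks for the pattern anywhere in the string unless an anchor restricts it.

```python
import re


def found(pattern, s):
    """True if `pattern` matches somewhere in `s`."""
    return re.search(pattern, s) is not None


print(found(r"^The", "The end"))        # starts with The
print(found(r"^The", "At The end"))     # The is not at the start
print(found(r"end$", "The end"))        # ends with end
print(found(r"^The end$", "The end"))   # exact match
print(found(r"^The end$", "The end."))  # trailing dot breaks the exact match
print(found(r"roar", "uproarious"))     # roar anywhere in the string
```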
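Quantifiers: * + ? and {}
abc*        matches a string that has ab followed by zero or more c
abc+        matches a string that has ab followed by one or more c
abc?        matches a string that has ab followed by zero or one c
abc{2}      matches a string that has ab followed by 2 c
OR operator: | or []
a(b|c)      matches a string that has a followed by b or c
a[bc]       same as previous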
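A short check of the quantifiers in Python. The sketch uses re.fullmatch so that the whole string has to fit the pattern, which makes the difference between the quantifiers easy to see; the helper name fits is an assumption of this example.

```python
import re


def fits(pattern, s):
    """True if the entire string `s` matches `pattern`."""
    return re.fullmatch(pattern, s) is not None


print(fits(r"abc*", "ab"), fits(r"abc*", "abccc"))      # zero or more c
print(fits(r"abc+", "ab"), fits(r"abc+", "abc"))        # one or more c
print(fits(r"abc?", "ab"), fits(r"abc?", "abcc"))       # zero or one c
print(fits(r"abc{2}", "abcc"), fits(r"abc{2}", "abc"))  # exactly two c
print(fits(r"a[bc]", "ab"), fits(r"a[bc]", "ad"))       # a followed by b or c
```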
Character classes: \d \w \s and .
\d          matches a single character that is a digit
\w          matches a word character (alphanumeric character plus underscore)
\s          matches a whitespace character (includes tabs and line breaks)
.           matches any character
\d, \w and \s also present their negations with \D, \W and \S respectively. For
example, \D will perform the inverse match with respect to that obtained with \d.
\D          matches a single non-digit character
In order to be taken literally, you must escape the characters ^.[$()|*+?{\ with a backslash \,
as they have special meaning.
\$\d        matches a string that has a $ before one digit
Notice that you can also match non-printable characters like tabs \t, new-lines \n, carriage returns \r.
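The character classes and escaping rules can be tried in Python as follows. The sample strings are invented for illustration. One detail specific to Python's re module: by default the dot matches any character except a newline.

```python
import re

digits = re.findall(r"\d", "Room 42")              # each digit separately
words = re.findall(r"\w+", "snake_case rocks!")    # runs of word characters
fields = re.split(r"\s+", "a\tb\nc  d")            # split on any whitespace
non_digits = re.findall(r"\D", "a1b2")             # negated class
any_chars = re.findall(r".", "a b")                # dot also matches the space
price = re.search(r"\$\d", "cost: $5 today").group()  # escaped dollar sign
literal_dots = re.findall(r"\.", "v1.2.3")         # escaped dot matches only "."

print(digits, words, fields, non_digits, any_chars, price, literal_dots)
```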
Flags
We are learning how to construct a regex but forgetting a fundamental concept: flags.
A regex usually comes within this form /abc/, where the search pattern is delimited by
two slash characters /. At the end we can specify a flag with these values (we can also
combine them with each other):
g (global) does not return after the first match, restarting the subsequent searches
from the end of the previous match
m (multi-line) when enabled ^ and $ will match the start and end of a line, instead of
the whole string
i (insensitive) makes the whole expression case-insensitive (for instance /aBc/i
would match AbC)
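Grouping and capturing: ()
a(bc)           parentheses create a capturing group with value bc
a(?:bc)*        using ?: we disable the capturing group
a(?<name>bc)    using ?<name> we put a name to the group
This operator is very useful when we need to extract information from strings or data using
your preferred programming language. Any multiple occurrences captured by several groups
will be exposed in the form of a classical array: we will access their values by specifying
an index on the result of the match.
If we choose to put a name to the groups (using (?<name>...)) we will be able to retrieve
the group values using the match result like a dictionary where the keys will be the name
of each group.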
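A short sketch of capturing groups in Python. Note that Python's re module writes named groups as (?P<name>...); the group names year and month and the sample strings are assumptions made for this example.

```python
import re

# Numbered capturing group: accessed by index.
m = re.search(r"a(bc)", "xabcx")
print(m.group(0), m.group(1), m.groups())

# Non-capturing group: it groups for the quantifier but stores nothing.
n = re.search(r"a(?:bc)*", "abcbc")
print(n.group(0), n.groups())

# Named groups: the match can be read like a dictionary.
d = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-05")
print(d.group("year"), d.groupdict())
```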
Bracket expressions: []
[abc]           matches a string that has either an a or a b or a c -> is the same as a|b|c
[a-c]           matches a string that has either an a or a b or a c -> is the same as a|b|c
[a-fA-F0-9]     a string that represents a single hexadecimal digit, case insensitively
[0-9]%          a string that has a character from 0 to 9 before a % sign
[^a-zA-Z]       a string that has not a letter from a to z or from A to Z. In this case the
                ^ is used as negation of the expression
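The bracket expressions can be checked in Python as follows; the sample strings are invented for illustration.

```python
import re

either = re.findall(r"[abc]", "back")           # any one of a, b, c
ranged = re.findall(r"[a-c]", "back")           # same set written as a range
is_hex = re.fullmatch(r"[a-fA-F0-9]+", "1aF3") is not None
not_hex = re.fullmatch(r"[a-fA-F0-9]+", "1g") is not None
percents = re.findall(r"[0-9]%", "up 5% then 12%")  # one digit before %
non_letters = re.findall(r"[^a-zA-Z]", "a1 b!")     # ^ negates the set

print(either, ranged, is_hex, not_hex, percents, non_letters)
```

Because [0-9]% consumes only a single digit, the second percentage is matched as 2% rather than 12%; writing [0-9]+% would capture the whole number.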
Greedy and Lazy match
The quantifiers (* + {}) are greedy operators, so they expand the match as far as they
can through the provided text.
For example, <.+> matches