
Unit-1 and unit-2 - good notes

Computer Science and Engineering (BMS College of Engineering)


Course Title: NATURAL LANGUAGE PROCESSING (NLP)

Course Code: 22AM5PE    Credits: 3    L-T-P: 3-0-0

CIE: 50 Marks    SEE: 100 Marks (50% Weightage)

Contact Hours/Week: 3    Total Lecture Hours: 36

UNIT – 1 7 Hrs

Introduction: What is Natural Language Processing (NLP),


Language and Knowledge, The Challenges of NLP, Language and Grammar,
Processing Indian Languages, NLP Applications.
Language Modeling: Introduction, Various Grammar-based Language Models,
Statistical Language Model.

UNIT – 2 8 Hrs

Word Level Analysis: Introduction, Regular Expressions, Finite-State Automata,


Morphological Parsing, Spelling Error Detection and Correction, Words and Word
Classes, Part-of-Speech Tagging.
Syntactic Analysis: Context-Free Grammar, Constituency, Parsing, Probabilistic
Parsing, Indian Languages.

UNIT – 3 6 Hrs

Semantic Analysis: Meaning Representation, Lexical Semantics, Ambiguity, Word


Sense Disambiguation.
Discourse Processing: Cohesion, Reference Resolution, Discourse Coherence
and Structure.

UNIT – 4 7 Hrs

Vector Semantics and Embeddings: Lexical Semantics, Vector Semantics, Words


and Vectors, Cosine for measuring similarity, TF-IDF: Weighing terms in the
vector, Applications of the TF-IDF vector model, Word2vec, Visualizing
Embeddings, Semantic properties of embeddings, Bias and Embeddings,
Evaluating Vector Models

UNIT – 5 8 Hrs

Applications of Natural Language Processing: Question Answering Systems,


Information Retrieval, Information Extraction, Automatic Text Categorization,
Machine Translation, Speech Technologies, Human and Machine Intelligence.

Text Books:
1. Natural Language Processing and Information Retrieval by Tanveer Siddiqui and U.S. Tiwary, 1st Edition, Oxford University Press, 2008.
2. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Daniel Jurafsky and James H. Martin, 3rd Edition, Prentice Hall, 2019.

Reference Books:
1. Natural Language Processing: An Information Access Perspective by Kavi Narayana Murthy, Ess Ess Publications, 2006.
2. Applied Text Analysis with Python by Benjamin Bengfort, Tony Ojeda, and Rebecca Bilbro, O'Reilly Media, June 2018.

1.1. What is Natural Language Processing (NLP)

Linguistics is concerned with language: its formation, syntax, meaning, and the different kinds of phrases (noun or verb phrases). Computer science is concerned with applying linguistic knowledge by transforming it into computer programs, with the help of sub-fields such as Artificial Intelligence (Machine Learning and Deep Learning).

Natural language processing (NLP) is the intersection of computer science, linguistics, and machine learning. The field focuses on communication between computers and humans in natural language; NLP is all about making computers understand and generate human language.

Advantages of NLP

o NLP helps users to ask questions about any subject and get a direct response within seconds.
o NLP offers exact answers to a question; it does not return unnecessary or unwanted information.
o NLP helps computers to communicate with humans in their own languages.
o It is very time efficient.
o Many companies use NLP to improve the efficiency and accuracy of documentation processes and to identify information in large databases.

Why NLP
• Huge amounts of data come from tweets, reviews, chats, queries, etc.
• Most of this data is unstructured.
• It is hard for humans to handle and manage it.
• We need a deep understanding of broad natural language.

History of NLP
(1940-1960) - Focused on Machine Translation (MT)

Work on natural language processing started in the 1940s.

1948 - The first recognisable NLP application was introduced at Birkbeck College, London.

1950s - In the 1950s, there was a conflict of views between linguistics and computer science. Chomsky published his first book, Syntactic Structures, and claimed that language is generative in nature.

In 1957, Chomsky also introduced the idea of Generative Grammar: rule-based descriptions of syntactic structures.

(1960-1980) - Flavored with Artificial Intelligence (AI)

Between 1960 and 1980, the key developments were:

Augmented Transition Networks (ATN)

An Augmented Transition Network extends a finite-state machine (which by itself can recognize only regular languages) with registers and recursive calls, allowing it to handle more complex sentence structures.

Case Grammar
Case Grammar was developed by the linguist Charles J. Fillmore in 1968. Case Grammar uses the prepositions of languages such as English to express the relationships between nouns and verbs.

In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.

For example: "Neha broke the mirror with the hammer". In this example case grammar
identify Neha as an agent, mirror as a theme, and hammer as an instrument.

Between 1960 and 1980, the key systems were:

SHRDLU

SHRDLU is a program written by Terry Winograd in 1968-70. It lets users communicate with the computer and move objects in a simple blocks world. It can handle instructions such as "pick up the green ball" and also answer questions such as "What is inside the black box?". The main importance of SHRDLU is that it showed that syntax, semantics, and reasoning about the world can be combined to produce a system that understands natural language.

LUNAR

LUNAR is the classic example of a natural language database interface system. It used ATNs and Woods' Procedural Semantics. It was capable of translating elaborate natural language expressions into database queries and handled 78% of requests without errors.

1980 - Current

Until about 1980, natural language processing systems were based on complex sets of hand-written rules. After 1980, NLP introduced machine learning algorithms for language processing.

In the early 1990s, NLP started growing faster and achieved good accuracy, especially for English grammar. Around 1990, electronic text collections were also introduced, which provided a good resource for training and evaluating natural language programs. Other factors include the availability of computers with fast CPUs and more memory. The major factor behind the advancement of natural language processing has been the Internet.
Disadvantages of NLP
A list of disadvantages of NLP is given below:

o NLP may not capture context.
o NLP can be unpredictable.
o NLP may require more keystrokes.
o NLP systems often cannot adapt to a new domain and have limited functionality; they are typically built for a single, specific task.

Components of NLP

There are the following two components of NLP -

1. Natural Language Understanding (NLU)

Natural Language Understanding (NLU) helps the machine to understand and analyse human language by extracting metadata from content, such as concepts, entities, keywords, emotion, relations, and semantic roles.

NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.

NLU involves the following tasks -

o It is used to map the given input into a useful representation.
o It is used to analyze different aspects of the language.

2. Natural Language Generation (NLG)

Natural Language Generation (NLG) acts as a translator that converts the computerized
data into natural language representation. It mainly involves Text planning, Sentence
planning, and Text Realization.

Note: NLU is more difficult than NLG.


Applications of NLP
There are the following applications of NLP -

1. Question Answering

Question Answering focuses on building systems that automatically answer the questions asked by humans in a natural language.

2. Spam Detection

Spam detection is used to detect unwanted e-mails before they reach a user's inbox.


3. Sentiment Analysis

Sentiment Analysis is also known as opinion mining. It is used on the web to analyse the attitude, behaviour, and emotional state of the sender. This application is implemented through a combination of NLP (Natural Language Processing) and statistics by assigning values to the text (positive, negative, or neutral) and identifying the mood of the context (happy, sad, angry, etc.).

4. Machine Translation

Machine translation is used to translate text or speech from one natural language to
another natural language.

Example: Google Translate


5. Spelling correction

Microsoft Corporation provides word-processing software such as MS Word and PowerPoint with built-in spelling correction.

6. Speech Recognition
Speech recognition is used for converting spoken words into text. It is used in applications such as mobile devices, home automation, video retrieval, dictation to Microsoft Word, voice biometrics, voice user interfaces, and so on.

7. Chatbot

Implementing chatbots is one of the important applications of NLP. Chatbots are used by many companies to provide chat services to their customers.

8. Information extraction

Information extraction is one of the most important applications of NLP. It is used for extracting structured information from unstructured or semi-structured machine-readable documents.

9. Natural Language Understanding (NLU)

It converts large amounts of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate.

1.2. Language and Knowledge


i. How to build an NLP pipeline

There are the following steps to build an NLP pipeline -



Step 1: Sentence Segmentation

Sentence segmentation is the first step in building the NLP pipeline. It breaks a paragraph into separate sentences.

Example: Consider the following paragraph -


Independence Day is one of the important festivals for every Indian citizen. It is
celebrated on the 15th of August each year ever since India got independence from
the British rule. The day celebrates independence in the true sense.

Sentence segmentation produces the following result:

1. "Independence Day is one of the important festivals for every Indian citizen."
2. "It is celebrated on the 15th of August each year ever since India got independence from the British rule."
3. "The day celebrates independence in the true sense."

Step 2: Word Tokenization

Word Tokenizer is used to break the sentence into separate words or tokens.

Example:

JavaTpoint offers Corporate Training, Summer Training, Online Training, and Winter
Training.

Word Tokenizer generates the following result:

"JavaTpoint", "offers", "Corporate", "Training", "Summer", "Training", "Online",


"Training", "and", "Winter", "Training", "."

Step 3: Stemming

Stemming is used to normalize words into their base or root form. For example, celebrates, celebrated, and celebrating all originate from the single root word "celebrate." The big problem with stemming is that it sometimes produces a root word which has no meaning.

For example, intelligence, intelligent, and intelligently all originate from the single root word "intelligen." In English, the word "intelligen" does not have any meaning.
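The sketch below illustrates this behaviour with NLTK's Porter stemmer (an assumed choice of stemmer, not named in the notes):

# Minimal sketch: stemming with NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["celebrates", "celebrated", "celebrating",
             "intelligence", "intelligent", "intelligently"]:
    print(word, "->", stemmer.stem(word))
# Both families are cut down to stems such as "celebr" and "intellig",
# which are not dictionary words - exactly the limitation described above.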

Step 4: Lemmatization

Lemmatization is quite similar to stemming. It is used to group the different inflected forms of a word under its lemma. The main difference between stemming and lemmatization is that lemmatization produces a root word which has a meaning.

For example: in lemmatization, the words intelligence, intelligent, and intelligently have the root word intelligent, which has a meaning.
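A minimal sketch with NLTK's WordNet lemmatizer (assumes nltk.download("wordnet") has been run; the example words are chosen to show dictionary-valid lemmas):

# Minimal sketch: lemmatization with NLTK's WordNetLemmatizer.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The pos argument tells the lemmatizer the part of speech:
# "v" = verb, "a" = adjective, "n" = noun.
print(lemmatizer.lemmatize("celebrating", pos="v"))  # celebrate
print(lemmatizer.lemmatize("celebrated", pos="v"))   # celebrate
print(lemmatizer.lemmatize("better", pos="a"))       # good
print(lemmatizer.lemmatize("mice", pos="n"))         # mouse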
Step 5: Identifying Stop Words

In English, there are a lot of words that appear very frequently like "is", "and", "the", and
"a". NLP pipelines will flag these words as stop words. Stop words might be filtered out
before doing any statistical analysis.

Example: He is a good boy.

Note: When you are building a rock-band search engine, you should not ignore the word "The."

Step 6: Dependency Parsing

Dependency Parsing is used to find how all the words in a sentence are related to each other.
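One way to see this is with the spaCy library (spaCy and its en_core_web_sm model are assumptions, not named in the notes):

# Minimal sketch: dependency parsing with spaCy.
# Assumes: pip install spacy, and python -m spacy download en_core_web_sm.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("It is celebrated on the 15th of August each year.")
for token in doc:
    # Each token points to its syntactic head along with a dependency label.
    print(f"{token.text:12} --{token.dep_}--> {token.head.text}")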

Step 7: POS tags

POS stands for part of speech, which includes noun, verb, adverb, and adjective. A POS tag indicates how a word functions, both in meaning and grammatically, within a sentence. A word can have one or more parts of speech depending on the context in which it is used.

Example: "Google" something on the Internet.

In the above example, Google is used as a verb, although it is a proper noun.
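A minimal sketch of POS tagging with NLTK (assumes the averaged_perceptron_tagger data has been downloaded):

# Minimal sketch: POS tagging with NLTK.
from nltk import pos_tag
from nltk.tokenize import word_tokenize

print(pos_tag(word_tokenize("Google something on the Internet.")))
# A statistical tagger may label "Google" as a verb (VB) or as a proper
# noun (NNP) here, depending on the context it has learned.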

Step 8: Named Entity Recognition (NER)

Named Entity Recognition (NER) is the process of detecting named entities such as person names, movie names, organization names, or locations.

Example: Steve Jobs introduced iPhone at the Macworld Conference in San Francisco,
California.
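A minimal sketch of NER on the example sentence using spaCy (an assumed library choice):

# Minimal sketch: named entity recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Steve Jobs introduced iPhone at the Macworld Conference "
          "in San Francisco, California.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typically "Steve Jobs" is tagged PERSON and "San Francisco" / "California"
# are tagged GPE (geo-political entity); exact labels depend on the model.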

Step 9: Chunking

Chunking is used to collect individual pieces of information and group them into larger units, such as phrases.
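A minimal sketch of noun-phrase chunking with NLTK's RegexpParser (the grammar rule is an illustrative assumption):

# Minimal sketch: chunking POS-tagged tokens into noun phrases (NP).
from nltk import pos_tag, RegexpParser
from nltk.tokenize import word_tokenize

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # optional determiner, adjectives, then nouns
chunker = RegexpParser(grammar)
tagged = pos_tag(word_tokenize("The little yellow dog barked at the cat."))
print(chunker.parse(tagged))   # prints a tree with NP chunks grouped together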
Phases of NLP
Natural Language Processing is separated into five primary stages or phases, starting
with simple word processing and progressing to identifying complicated phrase
meanings.

1. Lexical or Morphological Analysis

Lexical or Morphological Analysis is the initial step in NLP. It entails recognizing and analyzing word structures. The collection of words and phrases in a language is referred to as the lexicon. Lexical analysis is the process of breaking down a text file into paragraphs, phrases, and words. The source text is scanned as a stream of characters and converted into intelligible lexemes in this phase.

It includes techniques such as the following:

• Stop word removal (removing ‘and’, ‘of’, ‘the’, etc. from text)
• Tokenization (breaking the text into sentences or words)
  o Word tokenizer
  o Sentence tokenizer
  o Tweet tokenizer
• Stemming (removing ‘ing’, ‘es’, ‘s’ from the tail of words)
• Lemmatization (converting words to their base forms)

2. Syntax Analysis or Parsing


o Syntactic or syntax analysis is a technique for checking grammar, arranging words, and showing the relationships between them. It entails examining the syntax of the words in a phrase and arranging them in a way that demonstrates the relationship between them. Syntax analysis guarantees that the structure of a given piece of text is proper. It tries to parse the sentence in order to ensure that the grammar is correct at the sentence level. A syntax analyzer assigns POS tags based on the sentence structure, given the probable POS tags created in the preceding stage.

Some of the techniques used in this phase are:

o Dependency Parsing
o Parts of Speech (POS) tagging

3. Semantic Analysis

Semantic analysis is the process of looking for meaning in a statement. It concentrates mostly on the literal meaning of words, phrases, and sentences. It also deals with putting words together to form sentences. It extracts the text’s exact meaning or dictionary definition; the meaning of the text is examined. It is accomplished by mapping syntactic structures to objects in the task domain.

o Take the following sentence for example: “The guava ate an apple.” The line is syntactically valid, yet it is illogical because guavas cannot eat.

4. Discourse Integration

The term “discourse integration” refers to a sense of context. The meaning of any sentence is determined by the meaning of the sentence immediately preceding it; in addition, it establishes the meaning of the sentence that follows. The sentences that come before a sentence play a role in discourse integration; that is to say, a statement or word depends on the preceding sentence or words. The same applies to the use of proper nouns and pronouns.

o Example: "John got ready at 9 AM. Later he took the train to California."
o Here, the machine is able to understand that the word “he” in the second sentence refers to “John”.

5. Pragmatic Analysis

The fifth and final phase of NLP is pragmatic analysis. The overall communicative and social content, as well as its impact on interpretation, is the focus of pragmatic analysis. Pragmatic analysis uses a set of rules that describe cooperative dialogues to help find the intended result. It covers things like word repetition, who said what to whom, and so on. It comprehends how people communicate with one another, the context in which they converse, and a variety of other factors. It refers to the process of abstracting or extracting the meaning of a situation’s use of language. It interprets the given text using the knowledge gathered in the preceding stages. For example, “Switch on the TV”, when used in a sentence, is an order or request to switch the TV on.

Some more examples:

"Thank you for coming so late, we have wrapped up the meeting" (Contains sarcasm)

"Can you share your screen?" (here the context is about computer’s screen share during
a remote meeting)

Challenges in NLP

Ambiguity in computational linguistics is a situation where a word or a sentence may have more than one meaning. That is, a sentence may be interpreted in more than one way. This leads to uncertainty in choosing the right meaning of a sentence, especially while processing natural languages by computer.

• Ambiguity is a challenging problem in natural language understanding (NLU).
• The process of handling ambiguity is called disambiguation.
• Ambiguity is present in almost all the steps of natural language processing (lexical analysis, syntactic analysis, semantic analysis, discourse analysis, and pragmatic analysis).

Consider the following sentence for an example;


“Raj tried to reach his friend on the mobile, but he didn’t attend”
In this sentence, we have the presence of lexical, syntactic, and anaphoric ambiguities.

Lexical ambiguity – The word “tried” means “attempted” not “judged” or “tested”.
Also, the word “reach” means “establish communication” not “gain” or “pass” or
“strive”.

Syntactic ambiguity – The phrase “on the mobile” is attached to “reach” and thus means “using the mobile”. It is not attached to “friend”.

Anaphoric ambiguity – The anaphor “he” refers to the “friend”, not to “Raj”.
i. Lexical ambiguity
This is a class of ambiguity caused by a word and its multiple senses, especially when the word is part of a sentence or phrase. A word can have multiple meanings under different part-of-speech categories, and under each POS category it may have multiple different senses. Lexical ambiguity is about choosing which sense of a particular word, under a particular POS category, is intended.
In a sentence, lexical ambiguity arises while choosing the right sense of a word under the correct POS category.
For example, let us take the sentence “I saw a ship”. Here, the words “saw” and “ship”
would mean multiple things as follows;
Saw = present tense of the verb saw (cut with a saw) OR past tense of the verb see
(perceive by sight) OR a noun saw (blade for cutting) etc. According to WordNet, the
word “saw” is defined under 3 different senses in NOUN category and under 25 different
senses in VERB category.
Ship = present tense of the verb ship (transport commercially) OR present tense of the
verb ship (travel by ship) OR a noun ship (a vessel that carries passengers) etc. As per
WordNet, the word “ship” is defined with 1 sense under NOUN category and 5 senses
under VERB category.
Due to multiple meanings, there arises an ambiguity in choosing the right sense of “saw”
and “ship”.
Handling lexical ambiguity
Lexical ambiguity can be handled using tasks like POS tagging and Word Sense Disambiguation.
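As an illustration of Word Sense Disambiguation, the sketch below applies NLTK's implementation of the simple Lesk algorithm to the "I saw a ship" example (Lesk is only one basic WSD method; the WordNet data is assumed to be downloaded):

# Minimal sketch: WSD with the Lesk algorithm in NLTK.
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

context = word_tokenize("I saw a ship")
sense = lesk(context, "saw")   # picks one WordNet synset for "saw" in this context
if sense is not None:
    print(sense, "-", sense.definition())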

ii. Syntactic ambiguity


It is a type of ambiguity where the doubt is about the syntactic structure of the sentence. That is, there is a possibility that the sentence could be parsed in several syntactic forms (interpreted in more than one way). The doubt is about which one among the different syntactic forms is correct.
For example, the phrase “old men and women” is ambiguous. Here, the doubt is whether the adjective "old" is attached to both "men" and "women" or to "men" alone. Another source of syntactic ambiguity is phrase attachment. In "Mary ate a salad with spinach from California for lunch on Tuesday":
"with spinach" can attach to "salad" or "ate",
"from California" can attach to "spinach", "salad", or "ate",
"for lunch" can attach to "California", "spinach", "salad", or "ate",
and "on Tuesday" can attach to "lunch", "California", "spinach", "salad", or "ate".
(Crossovers are not allowed, so you cannot both attach "on Tuesday" to "spinach" and attach "for lunch" to "salad". Nonetheless there are 42 possible different parse trees.)

iii. Semantic ambiguity


Even after the syntax and the meanings of the individual words are resolved, there can still be more than one way of reading a sentence. For example, the sentence “the dog has been domesticated for more than 1000 years” could be semantically interpreted as follows:
1. A particular dog has been domesticated, or
2. The dog species has been domesticated.
Semantic ambiguity is an uncertainty that occurs when a word, phrase, or sentence has more than one interpretation.

iv. Anaphoric ambiguity


Anaphora in linguistics is about referring backwards (to an entity mentioned earlier or in another context) in a text.
“Suresh kicked the ball. It went out of the stadium”
In this sentence, the pronoun “it” is an anaphor. This anaphor refers to the entity “the ball” in the previous sentence. That entity is described as the antecedent of the anaphor “it”.
This example is simple and does not show any ambiguity. Let us see one more example
sentence;
“London had snow yesterday. It fell to a depth of a meter”
In this sentence, how do we relate the pronoun “it” with the previous sentence? We have
three antecedents namely “London”, “snow” and “yesterday”. We can relate the
anaphor to either “London”, or “snow”, or “yesterday”. We would be able to get the
correct meaning if we relate the anaphor to the antecedent “snow”.
Some words (anaphors) in a sentence have little or no meaning of their own but instead refer to other words in the same or other sentences. Anaphoric ambiguity refers to a situation where an anaphor has more than one possible referent in the same or another sentence.

Some facts about anaphoric ambiguity

• Anaphors are mostly pronouns, or noun phrases in some cases.

Example: “Darshan plays keyboard. He loves music”. In this sentence, the anaphor “He”
is a pronoun.
“A puppy drank the milk. The cute little dog was satisfied”. In this sentence, the anaphor “The cute little dog” is a noun phrase.

• An anaphoric reference may not be explicitly present in the previous sentence. Instead, it may refer to a part of an entity (the antecedent) in the previous sentence.

Example: “I went to the hospital, and they told me to go home and rest”. Here, the
anaphor “they” refers not to the “hospital” directly, instead to the “hospital staff”.

• The antecedent may not be in the immediately preceding sentence. It may be present in a sentence before the previous one, or in the same sentence.

Example: “The horse ran up the hill. It was very steep. It soon got tired”. Here, the anaphor “it” of the third sentence refers to the “horse” in the first sentence.

v. Pragmatic ambiguity
Pragmatics focuses on conversational implicature. Conversational implicature is a
process in which the speaker implies and a listener infers. Simply, it is a study about the
sentences that are not directly spoken. It is the study of how people use language. The
pragmatic level of linguistic processing deals with the use of real-world knowledge and
understanding how this impacts the meaning of what is being communicated. By
analyzing the contextual dimension of the documents and queries, a more detailed
representation is derived.
Pragmatic ambiguity arises when the statement is not specific, and the context does not
provide the information needed to clarify the statement (Walton D. (1996) A Pragmatic
Synthesis. In: Fallacies Arising from Ambiguity. Applied Logic Series, vol 1. Springer,
Dordrecht).
Roughly, a pragmatic ambiguity occurs in a requirement if different readers give
different interpretations to it, depending on the context of the requirement. The context
of a requirement includes the other requirements of the same document, which
influence the understanding of the requirement, and the background knowledge of the
reader, which gives a meaning to the concepts expressed in the requirement.
Examples (direct/semantic meaning vs. other/pragmatic meanings):

• "Do you know what time it is?" – Direct meaning: asking for the current time. Pragmatic meaning: expressing anger at someone who missed the due time.

• "Will you crack open the door? I am getting hot." – Direct meaning: to break the door. Pragmatic meaning: open the door just a little.

• "The chicken is ready to eat." – Direct meaning: the chicken is ready to eat (its breakfast, for example). Pragmatic meaning: the cooked chicken is ready to be served.

Difference between semantics and pragmatics

• Semantics is about literal meaning of the words and their interrelations, whereas
pragmatics focuses on the inferred meaning that the speakers and listeners perceive.
• Semantics is the study of meaning, or more precisely, the study of the relation between
linguistic expressions and their meanings. Pragmatics is the study of context, or more
precisely, a study of the way context can influence our understanding of linguistic
utterances.

Language and Grammar:


Chapter 2
Language Modeling

2.1 Various Grammar Based Language Models


2.2 Statistical Language Model
N-grams.

Study Section 3.1 from Speech and Language Processing:
https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
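As a pointer to what an n-gram statistical language model looks like in practice, here is a minimal sketch of a bigram model estimated with maximum-likelihood counts (the toy corpus is an assumption, in the spirit of Chapter 3 of Jurafsky and Martin):

# Minimal sketch: maximum-likelihood bigram language model over a toy corpus.
from collections import Counter

corpus = ["<s> i like nlp </s>", "<s> i like deep learning </s>", "<s> nlp is fun </s>"]
tokens = [w for sent in corpus for w in sent.split()]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))   # boundary pairs included for simplicity

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("i", "like"))    # 2/2 = 1.0
print(bigram_prob("like", "nlp"))  # 1/2 = 0.5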
Unit-2

Study:
Regular Expressions
Spelling Error Detection and Correction
from Speech and Language Processing.
