AI 3rd Unit - Part 2 - Natural Language Processing



Q1 ) What is NLP?

1. NLP stands for Natural Language Processing, a field at the intersection of Computer Science, Human Language, and Artificial Intelligence.
2. It is the technology used by machines to understand, analyse, manipulate, and interpret human language.
3. It helps developers to organize knowledge for
performing tasks such as translation, automatic
summarization, Named Entity Recognition (NER),
speech recognition, relationship extraction, and topic
segmentation.

Q2 ) What are the advantages of NLP?

1. NLP helps users to ask questions about any subject and get a direct response within seconds.
2. NLP offers exact answers to questions; it does not return unnecessary or unwanted information.
3. NLP helps computers to communicate with humans in their own languages.
4. It is very time efficient.
5. Many companies use NLP to improve the efficiency and accuracy of documentation processes and to identify information in large databases.

Q3 ) What are the disadvantages of NLP?

A list of disadvantages of NLP is given below:

1. NLP may not show context.
2. NLP is unpredictable.
3. NLP may require more keystrokes.
4. NLP systems are hard to adapt to a new domain and have limited functionality; an NLP system is typically built for a single, specific task.


Q4 ) Explain the components of NLP?

There are the following two components of NLP -

1. Natural Language Understanding (NLU)
2. Natural Language Generation (NLG)

1. Natural Language Understanding (NLU)

(i) Natural Language Understanding (NLU) helps the machine to understand and
analyse human language by extracting the metadata from content such as
concepts, entities, keywords, emotion, relations, and semantic roles.
(ii) NLU is mainly used in business applications to understand the customer's problem
in both spoken and written language.
(iii) NLU involves the following tasks -

 It is used to map the given input into a useful representation.
 It is used to analyse different aspects of the language.

2. Natural Language Generation (NLG)

(i) Natural Language Generation (NLG) acts as a translator that converts
computerized data into a natural language representation.
(ii) It mainly involves

 Text planning,
 Sentence planning,
 and Text Realization.

Differences between NLU and NLG:

 NLU reads: it maps natural language input (text or speech) into a structured representation of its meaning.
 NLG writes: it maps structured data into fluent natural language output.
 NLU is about comprehension (machine reading), whereas NLG is about production (machine writing).


Q5 ) Explain the applications of NLP?

1. Question Answering

Question Answering focuses on building systems that automatically answer the
questions asked by humans in a natural language.

2. Spam Detection

Spam detection is used to detect unwanted e-mails before they reach a user's inbox.

3. Sentiment Analysis

 Sentiment Analysis is also known as opinion mining. It is used on the web to
analyse the attitude, behaviour, and emotional state of the sender.
 This application is implemented through a combination of NLP (Natural
Language Processing) and statistics by assigning values to the text (positive,
negative, or neutral) and identifying the mood of the context (happy, sad,
angry, etc.).

4. Machine Translation

Machine translation is used to translate text or speech from one natural language
to another natural language.

Example: Google Translator

5. Spelling correction

Microsoft Corporation provides word-processing software such as MS Word and
PowerPoint with built-in spelling correction.

6. Speech Recognition

Speech recognition is used for converting spoken words into text.

It is used in applications such as mobile devices, home automation, video retrieval,
dictating to Microsoft Word, voice biometrics, voice user interfaces, and so on.

7. Chatbot

Implementing a chatbot is one of the important applications of NLP. It is used
by many companies to provide chat-based customer services.

8. Information extraction

Information extraction is one of the most important applications of NLP. It is used
for extracting structured information from unstructured or semi-structured
machine-readable documents.

9. Natural Language Understanding (NLU)

It converts large sets of text into more formal representations, such as first-order
logic structures, that are easier for computer programs to manipulate.

Q6 ) How to build an NLP Pipeline?

There are the following steps to build an NLP pipeline -

Step 1: Sentence Segmentation
Step 2: Word Tokenization
Step 3: Stemming
Step 4: Lemmatization
Step 5: Identifying Stop Words
Step 6: Dependency Parsing
Step 7: POS Tags
Step 8: Named Entity Recognition (NER)
Step 9: Chunking

Step 1: Sentence Segmentation

1. Sentence Segmentation is the first step in building the NLP pipeline. It breaks the
paragraph into separate sentences.

2. Example: Consider the following paragraph -

Independence Day is one of the important festivals for every Indian citizen. It is
celebrated on the 15th of August each year ever since India got independence from
the British rule. The day celebrates independence in the true sense.

3. Sentence Segmentation produces the following result:

1. "Independence Day is one of the important festivals for every Indian citizen."
2. "It is celebrated on the 15th of August each year ever since India got
independence from the British rule."
3. "The day celebrates independence in the true sense."

Step 2: Word Tokenization

1. Word Tokenizer is used to break the sentence into separate words or tokens.
2. Example:
1. JavaTpoint offers Corporate Training, Summer Training, Online Training, and
Winter Training.
3. Word Tokenizer generates the following result:
1. "JavaTpoint", "offers", "Corporate", "Training", "Summer", "Training", "Online",
"Training", "and", "Winter", "Training", "."

Step 3: Stemming

1. Stemming is used to normalize words into their base or root form.
2. For example, celebrates, celebrated, and celebrating all derive from the single root
word "celebrate."
3. The big problem with stemming is that it sometimes produces a root word that has
no meaning.
4. For example, intelligence, intelligent, and intelligently all reduce to the single root
"intelligen." In English, the word "intelligen" has no meaning.
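A short sketch using NLTK's Porter stemmer (an assumed tool for illustration). The exact stems depend on the algorithm; Porter produces truncated stems such as "celebr" and "intellig", which demonstrates exactly the weakness described in points 3 and 4:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    # Stems need not be real dictionary words.
    for word in ["celebrates", "celebrated", "celebrating",
                 "intelligence", "intelligent", "intelligently"]:
        print(word, "->", stemmer.stem(word))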

Step 4: Lemmatization

1. Lemmatization is quite similar to stemming.
2. It is used to group the different inflected forms of a word into its dictionary form,
called the lemma.
3. The main difference between stemming and lemmatization is that lemmatization
produces a root word that has a meaning.
4. For example, in lemmatization the words intelligence, intelligent, and intelligently
have the root word intelligent, which has a meaning.
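A sketch using NLTK's WordNet lemmatizer (library choice is an assumption; unlike a stemmer, it needs a part-of-speech hint to work well):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # lemma dictionary, needed once

    lemmatizer = WordNetLemmatizer()

    # The lemma is a real dictionary word ("celebrate").
    for word in ["celebrates", "celebrated", "celebrating"]:
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # pos="v" = verb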

Step 5: Identifying Stop Words

1. In English, there are a lot of words that appear very frequently like "is", "and", "the",
and "a".
2. NLP pipelines will flag these words as stop words.
3. Stop words might be filtered out before doing any statistical analysis.
4. Example: In "He is a good boy", the words "he", "is", and "a" are stop words that can
be filtered out (see the sketch below).
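A sketch of stop-word filtering with NLTK's English stop-word list (an assumed toolkit choice):

    import nltk
    from nltk.corpus import stopwords

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    stop_words = set(stopwords.words("english"))

    tokens = nltk.word_tokenize("He is a good boy")
    # Keep only the content words.
    print([t for t in tokens if t.lower() not in stop_words])
    # -> ['good', 'boy']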

Step 6: Dependency Parsing

1. Dependency parsing is used to find how all the words in a sentence are related to
each other.
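A sketch using spaCy's dependency parser (an assumed toolkit; it requires the small English model, installed with: python -m spacy download en_core_web_sm):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English pipeline

    doc = nlp("He is a good boy")
    # Print each word, its dependency relation, and the word it depends on.
    for token in doc:
        print(token.text, token.dep_, token.head.text)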

Step 7: POS tags

1. POS stands for parts of speech, which include noun, verb, adverb, and adjective.
2. It indicates how a word functions, in meaning as well as grammatically, within the
sentence.
3. A word has one or more parts of speech based on the context in which it is used.
4. Example: "Google" something on the Internet.

In the above example, Google is used as a verb, although it is a proper noun.
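A sketch using NLTK's POS tagger (an assumption; note that a statistical tagger may or may not capture the verb reading of "Google" here):

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

    tokens = nltk.word_tokenize("Google something on the Internet")
    # Each token is paired with a Penn Treebank POS tag.
    print(nltk.pos_tag(tokens))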

Step 8: Named Entity Recognition (NER)

1. Named Entity Recognition (NER) is the process of detecting named entities such
as person names, movie names, organization names, or locations.
2. Example: Steve Jobs introduced iPhone at the Macworld Conference in San
Francisco, California.
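A sketch of NER with spaCy (the same assumed model as in Step 6):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Steve Jobs introduced iPhone at the Macworld Conference "
              "in San Francisco, California.")
    # Print each detected entity and its type (PERSON, ORG, GPE, ...).
    for ent in doc.ents:
        print(ent.text, ent.label_)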

Step 9: Chunking

1. Chunking is used to collect individual pieces of information and group them into
bigger pieces (phrases) within sentences.
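A sketch of noun-phrase chunking with NLTK's regular-expression chunker (the chunk grammar below is an illustrative assumption):

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tagged = nltk.pos_tag(nltk.word_tokenize("He is a good boy"))

    # A noun phrase (NP): optional determiner, any adjectives, then a noun.
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    tree = nltk.RegexpParser(grammar).parse(tagged)
    print(tree)  # "a good boy" comes out as one NP chunk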


Q7 ) What are the different phases of NLP?

There are the following five phases of NLP:

1. Lexical and Morphological Analysis

1. The first phase of NLP is lexical analysis.
2. This phase scans the source text as a stream of characters and converts it into
meaningful lexemes.
3. It divides the whole text into paragraphs, sentences, and words.

2. Syntactic Analysis (Parsing)

Syntactic Analysis is used to check grammar and word arrangement, and it shows
the relationships among the words.

Example: "Agra goes to the Poonam"

In the real world, "Agra goes to the Poonam" does not make any sense, so this sentence
is rejected by the syntactic analyzer.

3. Semantic Analysis

1. Semantic analysis is concerned with meaning representation.
2. It mainly focuses on the literal meaning of words, phrases, and sentences.

4. Discourse Integration

Discourse Integration depends upon the sentences that precede it and also invokes the
meaning of the sentences that follow it.

5. Pragmatic Analysis

1. Pragmatic analysis is the fifth and last phase of NLP.
2. It helps you to discover the intended effect by applying a set of rules that
characterize cooperative dialogues.
3. For example, "Open the door" is interpreted as a request instead of an order.


Q8 ) Explain differences between Natural Language and Computer Language?

 Ambiguity: natural languages are highly ambiguous, while computer languages are designed to be unambiguous.
 Vocabulary and grammar: natural languages have a large, open vocabulary and flexible grammar; computer languages have a small, fixed vocabulary and strict, formally defined syntax.
 Evolution: natural languages evolve through everyday use; computer languages change only by explicit specification.

Because of these differences, processing natural language is much harder for a machine than parsing a programming language.

Q9 ) Explain why NLP is difficult?

NLP is difficult because ambiguity and uncertainty exist in language.

Ambiguity
There are the following three types of ambiguity -

Lexical Ambiguity
Lexical ambiguity exists when a single word has two or more possible meanings.
Example:
Manya is looking for a match.
In the above example, the word "match" may mean that Manya is looking for a partner
or for a match (a cricket match or some other game).

Syntactic Ambiguity
Syntactic Ambiguity exists in the presence of two or more possible meanings within the
sentence.
Example:
I saw the girl with the binoculars.
In the above example, did I have the binoculars? Or did the girl have the binoculars?

Referential Ambiguity
Referential ambiguity exists when you refer to something using a pronoun.
Example: Kiran went to Sunita. She said, "I am hungry."
In the above sentence, you do not know who is hungry, Kiran or Sunita.


Q10 ) Explain NLP language models?

1. A language model is the core component of modern Natural Language Processing
(NLP).
2. A language model is a statistical tool that analyzes the patterns of human language
for the prediction of words.
3. NLP-based applications use language models for a variety of tasks, such as:
i. speech recognition (audio-to-text conversion),
ii. sentiment analysis,
iii. summarization,
iv. spell correction, etc.

Let’s understand how language models help in processing these NLP tasks:

Speech Recognition:

1. Smart speakers, such as Alexa, use Automatic Speech Recognition (ASR)
mechanisms for translating speech into text.
2. During this translation, the ASR mechanism analyzes the intent/sentiment of the
user by differentiating between the words.
3. For example, it must distinguish homophone phrases such as "Let her" vs. "Letter",
or "But her" vs. "Butter".

Machine Translation:

1. When translating a Chinese phrase “我在吃” into English, the translator can give several
choices as output:

I eat lunch
I am eating
Me am eating
Eating am I

2. Here, the language model tells us that the translation "I am eating" sounds natural and
will suggest it as the output.

There are primarily two types of language models:

1. Statistical Language Models, for example:
(a) N-gram
(b) Unigram
(c) Bidirectional
(d) Exponential
(e) Continuous Space

2. Neural Language Models

1. Statistical Language Models

1. Statistical models include the development of probabilistic models that are able to
predict the next word in the sequence, given the words that precede it.
2. A number of statistical language models are in use already.
3. Let's take a look at some of those popular models:
(a) Unigram: treats each word as independent, so the probability of a sequence is
the product of individual word probabilities.
(b) N-gram: predicts the next word from the previous N-1 words; a bigram model,
for example, conditions each word on the one before it.
(c) Bidirectional: uses context on both sides of a word rather than only the
preceding words.
(d) Exponential (maximum entropy): scores sequences with a feature-based
log-linear model.
(e) Continuous space: represents words as continuous vectors (embeddings), so
similar words receive similar probabilities.
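To make the n-gram idea concrete, here is a minimal bigram model in plain Python (the tiny corpus is a made-up illustration). It estimates P(word | previous word) from counts and uses the product of those probabilities to score the candidate translations from the machine translation example above:

    from collections import Counter

    # Tiny toy corpus with sentence boundary markers.
    corpus = [
        "<s> i am eating </s>",
        "<s> i am reading </s>",
        "<s> i eat lunch </s>",
    ]

    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        words = sentence.split()
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))

    def score(sentence):
        """Product of bigram probabilities P(w_i | w_{i-1}); 0 if unseen."""
        words = sentence.split()
        p = 1.0
        for prev, cur in zip(words[:-1], words[1:]):
            if unigrams[prev] == 0:
                return 0.0
            p *= bigrams[(prev, cur)] / unigrams[prev]
        return p

    # The natural phrasing gets the higher score.
    for candidate in ["<s> i am eating </s>", "<s> me am eating </s>"]:
        print(candidate, score(candidate))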


Q11 ) Explain the uses/applications of language models in NLP?

Language models supply the word-probability estimates on which most NLP applications
rely. As described in Q10, typical uses include:

1. Speech recognition - choosing the most probable word sequence for a spoken input
(for example, distinguishing "Let her" from "Letter").
2. Machine translation - ranking candidate translations by how natural they sound.
3. Spell correction - flagging and replacing improbable word sequences.
4. Sentiment analysis and text summarization - scoring and generating text based on
learned language patterns.

Q12 ) Explain Text Classification in NLP?

1. Text classification is the process of categorizing text into organized groups.
2. By using NLP, text classification can automatically analyze text and then assign a set of
predefined tags or categories based on its context.
3. NLP is used for sentiment analysis, topic detection, and language detection.
4. There are mainly three text classification approaches:
 Rule-based system,
 Machine-based system,
 Hybrid system.
5. Rule-based Approach to Text Classification

(i) In the rule-based approach, texts are separated into organized groups using a
set of handcrafted linguistic rules.
(ii) With these handcrafted rules, users define lists of words that characterize each
group.
(iii) For example, names like Donald Trump and Boris Johnson would be categorized
into politics, while names like LeBron James and Ronaldo would be categorized
into sports.

6. Machine-based Text Classification

a. A machine-based classifier learns to make classifications based on past
observations from the data sets.
b. User data is pre-labeled as training and test data.
c. It learns the classification strategy from previous inputs and keeps learning
continuously.
d. A machine-based classifier uses a bag of words for feature extraction.
e. In a bag of words, a vector represents the frequency of words from a predefined
dictionary (word list).
f. We can perform NLP text classification using machine learning algorithms such
as Naïve Bayes, Support Vector Machines (SVM), and deep learning (see the
sketch below).

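A minimal machine-based classification sketch with scikit-learn (the library choice and the tiny labeled dataset are assumptions for illustration): a bag-of-words vectorizer feeds a Naïve Bayes classifier, mirroring points (d)-(f) above.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Pre-labeled training data (toy example).
    texts = ["the election results were announced by the minister",
             "the striker scored a goal in the final match",
             "parliament passed the new budget bill",
             "the team won the championship game"]
    labels = ["politics", "sports", "politics", "sports"]

    # Bag-of-words features + Naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)

    print(model.predict(["the minister debated the bill"]))  # expected: ['politics']
    print(model.predict(["the player scored in the game"]))  # expected: ['sports']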

7. Hybrid Approach to Text Classification

The third approach to text classification is the hybrid approach.

(a) The hybrid approach combines a rule-based and a machine-based approach.
(b) It uses the rule-based system to create tags and uses machine learning to train
the system and create rules.
(c) Then the machine-generated rule list is compared with the rule-based rule list.
(d) If something does not match on the tags, humans improve the list manually.
(e) It is the best method to implement text classification.


Examples of Text Classification:

Typical examples include sentiment analysis, topic detection, language detection, and
spam detection.

Q13 ) Explain Information Retrieval in NLP?

1. Information retrieval:

(i) Information retrieval (IR) is a field of study dealing with the representation, storage,
organization of, and access to documents.
(ii) The documents may be books, reports, pictures, videos, web pages or multimedia
files.
(iii) The whole point of an IR system is to provide a user easy access to documents
containing the desired information.
(iv) The best-known example of an IR system is the Google search engine.

2. What Are Information Retrieval Systems?

(i) An information retrieval system searches a collection of natural language documents
with the goal of retrieving exactly the set of documents that matches a user's
question. Such systems have their origin in library systems.
(ii) These systems assist users in finding the information they require, but they do not
attempt to deduce or generate answers.
(iii) An IR system tells the user about the existence and location of documents that
might contain the required information.
(iv) The documents that satisfy the user's requirement are called relevant documents. A
perfect IR system would retrieve only relevant documents.

3. Types Of Data

(i) Structured data -

Structured data is data whose elements are addressable for effective analysis. It has
been organized into a formatted repository, typically a database. It covers all data
that can be stored in an SQL database in tables with rows and columns. Such data
have relational keys and can easily be mapped into pre-designed fields. Today,
structured data is the most processed form of data and the simplest to
manage. Example: relational data.
(ii) Semi-structured data -

Semi-structured data is information that does not reside in a relational database but
has some organizational properties that make it easier to analyze. With some
processing, it can be stored in a relational database (though this can be very hard
for some kinds of semi-structured data). Example: XML data.


(iii) Unstructured data -

Unstructured data is data that is not organized in a predefined manner and does not
have a predefined data model, so it is not a good fit for a mainstream relational
database. There are alternative platforms for storing and managing unstructured
data; it is increasingly prevalent in IT systems and is used by organizations in a
variety of business intelligence and analytics applications. Example: Word, PDF,
text, media logs.

4. Differences Between Information Retrieval Systems And Data Retrieval Systems

 Matching: a data retrieval system uses exact matching, while an information retrieval system uses partial or best matching.
 Query language: data retrieval queries are artificial (e.g., SQL); IR queries are usually natural language.
 Model: data retrieval is deterministic, while information retrieval is probabilistic.
 Errors: in data retrieval a single erroneous item means total failure, whereas IR is tolerant of small errors.

(The block diagram of an IR system is not reproduced here.)

5. Automatic Abstracting Of Documents

Luhn’s idea and conflation algorithm:


(a) Luhn proposed that the frequency of word occurrence in an article furnishes a
useful measurement of word significance.
(b) It is further proposed that the relative position, within a sentence, of words having
given values of significance furnishes a useful measurement for determining the
significance of sentences.
(c) The significance factor of a sentence will therefore be based on a combination of
these two measurements.
(d) Luhn's contribution to automatic text analysis assumes that frequency data can
be used to extract words and sentences to represent a document.

Let f be the frequency of occurrence of the various word types in a given portion of text
and r their rank order (the order of their frequency of occurrence); then a plot relating
f and r yields a curve similar to a hyperbola, i.e. f x r ≈ constant.

1. This is in fact a curve demonstrating Zipf's Law which states that the product of the
frequency of use of words and the rank order is approximately constant.
2. Zipf verified his law on American Newspaper English.
3. Luhn used it as a null hypothesis to enable him to specify two cut-offs, an upper and
a lower (see Figure 2.1.), thus excluding non-significant words.
4. The words exceeding the upper cut-off were considered to be common and those
below the lower cut-off rare, and therefore not contributing significantly to the content
of the article.
5. He thus devised a counting technique for finding significant words.


6. Consistent with this he assumed that the resolving power of significant words, by
which he meant the ability of words to discriminate content, reached a peak at a rank
order position half way between the two cut-offs and from the peak fell off in either
direction reducing to almost zero at the cut-off points.
7. A certain arbitrariness is involved in determining the cut-offs. There is no oracle
which gives their values. They have to be established by trial and error.
8. It is interesting that these ideas are really basic to much of the later work in IR.
9. Luhn himself used them to devise a method of automatic abstracting. He went on to
develop a numerical measure of significance for sentences based on the number of
significant and non-significant words in each portion of the sentence.
10. Sentences were ranked according to their numerical score and the highest ranking
were included in the abstract (extract really).
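A rough sketch of Luhn's counting idea in Python (the cut-off values are arbitrary assumptions, exactly as point 7 above says they must be set by trial and error):

    from collections import Counter

    text = ("the customer service of the company is terrible the service team "
            "never answers and the customer gives up")

    counts = Counter(text.split())

    # Trial-and-error cut-offs: drop very common and very rare words.
    upper_cutoff, lower_cutoff = 3, 2

    significant = {w: c for w, c in counts.items()
                   if lower_cutoff <= c <= upper_cutoff}
    print(significant)  # the common word "the" and the rare words are excluded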

Conflation Algorithm:

The aim of a conflation algorithm is to generate, from the input text (full text, abstract,
or title), a document representative adequate for use in an automatic retrieval system.
Such a system consists of three steps:

1. Removal of high-frequency words
2. Suffix stripping
3. Detecting equivalent words
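A compact sketch of the three steps, reusing the NLTK pieces introduced earlier (a simplified illustration, not a faithful historical implementation):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    def conflate(text):
        tokens = nltk.word_tokenize(text.lower())
        # Step 1: remove high-frequency (stop) words.
        stop = set(stopwords.words("english"))
        tokens = [t for t in tokens if t.isalpha() and t not in stop]
        # Step 2: suffix stripping via a stemmer.
        stems = [PorterStemmer().stem(t) for t in tokens]
        # Step 3: equivalent words now share a stem; keep one of each.
        return sorted(set(stems))

    print(conflate("Retrieval systems retrieve the retrieved documents"))
    # -> ['document', 'retriev', 'system']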

Single-Pass Algorithm For Clustering A Document Repository:

The following four things must be specified in advance:

(i) The number of clusters
(ii) The minimum and maximum size of each cluster
(iii) The criterion for when to group two documents together
(iv) Whether overlap (a document belonging to more than one cluster) is allowed
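A minimal single-pass clustering sketch in plain Python (the Jaccard similarity measure and the threshold are assumptions): each document is compared to the representative of every existing cluster and either joins the closest one or starts a new cluster.

    def similarity(a, b):
        """Jaccard similarity between two sets of words."""
        return len(a & b) / len(a | b)

    def single_pass(docs, threshold=0.3):
        clusters = []  # each cluster is a list of word-sets
        for doc in docs:
            words = set(doc.lower().split())
            # Compare with the representative (first document) of each cluster.
            best, best_sim = None, 0.0
            for cluster in clusters:
                sim = similarity(words, cluster[0])
                if sim > best_sim:
                    best, best_sim = cluster, sim
            if best is not None and best_sim >= threshold:
                best.append(words)        # join the closest cluster
            else:
                clusters.append([words])  # start a new cluster
        return clusters

    docs = ["the cat sat", "the cat ran", "stock prices fell", "stock prices rose"]
    print(len(single_pass(docs)))  # expected: 2 clusters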


Q14 ) Explain Information Extraction in NLP?

1. NLP primarily comprises Natural Language Understanding (NLU) (human to
machine) and Natural Language Generation (NLG) (machine to human).
2. In recent years there has been a surge in unstructured data in the form of text,
videos, audio, and photos.
3. NLU aids in extracting valuable information from text such as social media data,
customer surveys, and complaints.
4. Consider the text snippet below from a customer review of a fictional insurance company
called Rocketz Auto Insurance Company:

The customer service of Rocketz is terrible. I must call the call center multiple times
before I get a decent reply. The call center guys are extremely rude and totally ignorant.
Last month I called with a request to update my correspondence address from Brooklyn
to Manhattan. I spoke with about a dozen representatives – Lucas Hayes, Ethan Gray,
Nora Diaz, Sofia Parker to name a few. Even after writing multiple emails and filling out
numerous forms, the address has still not been updated. Even my agent John is
useless. The policy details he gave me were wrong. The only good thing about the
company is the pricing. The premium is reasonable compared to the other insurance
companies in the United States. There has not been any significant increase in my
premium since 2015.

5. Various methods for text extraction:

1. Named Entity Recognition
2. Sentiment Analysis
3. Text Summarization
4. Aspect Mining
5. Topic Modeling

1. Named Entity Recognition

1. The most basic and useful technique in NLP is Named Entity Recognition (NER):
extracting the entities mentioned in the text.
2. It highlights the fundamental concepts and references in the text.
3. NER identifies entities such as people, locations, organizations, dates, etc. in the
text.
4. NER output for the sample text will typically be:
 Person: Lucas Hayes, Ethan Gray, Nora Diaz, Sofia Parker, John
 Location: Brooklyn, Manhattan, United States
 Date: Last month, 2015
 Organization: Rocketz
5. NER is generally based on grammar rules and supervised models. However,
there are NER platforms, such as OpenNLP, that have pre-trained and built-in NER
models.

2. Sentiment Analysis
1. The most widely used technique in NLP is Sentiment Analysis.
2. Sentiment Analysis is most useful in cases such as customer surveys, reviews and
social media comments where people express their opinions and feedback.
3. The simplest output of sentiment analysis is a 3-point scale:
positive/negative/neutral.
4. In more complex cases the output can be a numeric score that can be bucketed
into as many categories as required.
5. In the case of our text snippet, the customer clearly expresses different sentiments in
various parts of the text, so a single overall score is not very useful. Instead, we can
find the sentiment of each sentence and separate out the negative and positive
parts of the review.
6. Sentiment scores can also help us pick out the most negative and most positive
parts of the review.
7. Most negative comment:
"The call center guys are extremely rude and totally ignorant."
Sentiment score: -1.233288
8. Most positive comment:
"The premium is reasonable compared to the other insurance companies in
the United States."
Sentiment score: 0.2672612
9. Sentiment analysis can be done using supervised as well as unsupervised
techniques. The most popular supervised model for sentiment analysis is Naïve
Bayes.
 It requires a training corpus with sentiment labels; a model is trained on it and
then used to identify the sentiment.
10. Naïve Bayes is not the only tool out there; other machine learning techniques
like random forests or gradient boosting can also be used.
11. The unsupervised techniques, also known as lexicon-based methods, require a
corpus of words with their associated sentiment and polarity. The sentiment score
of a sentence is calculated using the polarities of the words in it. A lexicon-based
sketch follows below.
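A lexicon-based sketch using NLTK's VADER analyzer (an assumed tool; the scores it produces will differ from the ones quoted above):

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # sentiment lexicon, needed once

    sia = SentimentIntensityAnalyzer()

    for sentence in [
        "The call center guys are extremely rude and totally ignorant.",
        "The premium is reasonable compared to the other insurance companies.",
    ]:
        # "compound" is a normalized sentiment score in [-1, 1].
        print(sia.polarity_scores(sentence)["compound"], sentence)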

3. Text Summarization
1. As the name suggests, there are techniques in NLP that help summarize large
chunks of text.
2. Text summarization is mainly used in cases such as news articles and research
articles.
3. Two broad approaches to text summarization are extraction and abstraction.
i. Extraction methods create a summary by extracting parts from the text.
ii. Abstraction methods create a summary by generating fresh text that
conveys the crux of the original text.
4. There are various algorithms that can be used for text summarization like LexRank,
TextRank, and Latent Semantic Analysis.
5. To take the example of LexRank, this algorithm ranks the sentences using
similarity between them.
6. A sentence is ranked higher when it is similar to more sentences, and these
sentences are in turn similar to other sentences.
7. Using LexRank, the sample text is summarized as: I have to call the call center
multiple times before I get a decent reply. The premium is reasonable
compared to the other insurance companies in the United States.
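A sketch using the sumy package's LexRank implementation (the package choice and its exact API are assumptions; install with pip install sumy, which also relies on NLTK's punkt data):

    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.lex_rank import LexRankSummarizer

    text = ("The customer service of Rocketz is terrible. "
            "I must call the call center multiple times before I get a decent reply. "
            "The premium is reasonable compared to the other insurance companies.")

    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()

    # Keep the 2 highest-ranked sentences as the summary.
    for sentence in summarizer(parser.document, 2):
        print(sentence)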

4. Aspect Mining
1. Aspect mining identifies the different aspects in the text.
2. When used in conjunction with sentiment analysis, it extracts complete
information from the text.
3. One of the easiest methods of aspect mining is part-of-speech tagging.
4. When aspect mining is used along with sentiment analysis on the sample text,
the output conveys the complete intent of the text:
Aspects & Sentiments:
i. Customer service – negative
ii. Call center – negative
iii. Agent – negative
iv. Pricing/Premium – positive

5. Topic Modeling
1. Topic modeling is one of the more complicated methods to identify natural topics in
the text.
2. A prime advantage of topic modeling is that it is an unsupervised technique:
model training and a labeled training dataset are not required.
3. There are quite a few algorithms for topic modeling:
1. Latent Semantic Analysis (LSA)
2. Probabilistic Latent Semantic Analysis (PLSA)
3. Latent Dirichlet Allocation (LDA)
4. Correlated Topic Model (CTM).
4. One of the most popular methods is Latent Dirichlet Allocation.
5. The premise of LDA is that each text document comprises several topics and
each topic comprises several words.
6. The input required by LDA is merely the text documents and the expected number of
topics.
7. Using the sample text and assuming two inherent topics, the topic modeling output
will identify the common words across both topics.

For our example:
 The main theme of topic 1 includes words like call, center, and service.
 The main theme of topic 2 includes words like premium, reasonable, and price.
 This implies that topic 1 corresponds to customer service and topic 2
corresponds to pricing.

(The diagram showing the detailed results is not reproduced here.)
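A minimal LDA sketch with scikit-learn (the library choice and the toy corpus are assumptions): the only inputs are the documents and the expected number of topics, matching point 6 above.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["call the call center for service",
            "the call center service is terrible",
            "the premium price is reasonable",
            "reasonable premium compared to other companies"]

    # Bag-of-words counts are the only input LDA needs.
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    words = vec.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[-3:]]
        print(f"Topic {i}:", top)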

