AI 3rd Unit - Part 2 - Natural Language Processing
Q1) What is NLP?
1. NLP helps users to ask questions about any subject and get a direct response within
seconds.
2. NLP offers exact answers to questions, meaning it does not return unnecessary or
unwanted information.
3. NLP helps computers to communicate with humans in their languages.
4. It is very time efficient.
5. Many companies use NLP to improve the efficiency and accuracy of documentation
processes and to identify information in large databases.
(i) Natural Language Understanding (NLU) helps the machine to understand and
analyse human language by extracting metadata from content such as
concepts, entities, keywords, emotion, relations, and semantic roles.
(ii) NLU is mainly used in business applications to understand the customer's problem
in both spoken and written language.
(iii) NLU involves tasks such as mapping the given input into a useful representation
and analysing different aspects of the language.
(i) Natural Language Generation (NLG) acts as a translator that converts
computerized data into a natural language representation.
(ii) It mainly involves text planning, sentence planning, and text realization.
1. Question Answering
2. Spam Detection
3. Sentiment Analysis
4. Machine Translation
Machine translation is used to translate text or speech from one natural language
to another.
5. Spelling Correction
Word processors such as Microsoft Word use NLP for spelling correction.
6. Speech Recognition
7. Chatbot
8. Information extraction
Information extraction converts a large body of text into more formal
representations, such as first-order logic structures, that are easier for
computer programs to manipulate.
The NLP pipeline consists of the following steps:
Step 1: Sentence Segmentation
Step 2: Word Tokenization
Step 3: Stemming
Step 4: Lemmatization
Step 5: Stop Word Removal
Step 6: Dependency Parsing
Step 7: POS Tagging
Step 8: Named Entity Recognition (NER)
Step 9: Chunking
Step 1: Sentence Segmentation
1. Sentence segmentation is the first step in building the NLP pipeline. It breaks the
paragraph into separate sentences. Example:
Independence Day is one of the important festivals for every Indian citizen. It is
celebrated on the 15th of August each year ever since India got independence from
the British rule. The day celebrates independence in the true sense.
Step 2: Word Tokenization
1. The word tokenizer is used to break the sentence into separate words or tokens
(see the sketch below).
2. Example:
1. JavaTpoint offers Corporate Training, Summer Training, Online Training, and
Winter Training.
3. Word Tokenizer generates the following result:
1. "JavaTpoint", "offers", "Corporate", "Training", "Summer", "Training", "Online",
"Training", "and", "Winter", "Training", "."
Step 3: Stemming
1. Stemming is used to normalize words into their base or root form.
2. For example, celebrates, celebrated, and celebrating all originate from the single
root word "celebrate".
3. The big problem with stemming is that it sometimes produces a root word that has
no meaning.
4. For example, intelligence, intelligent, and intelligently all reduce to the root
"intelligen", which is not a meaningful English word (see the sketch below).
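A minimal stemming sketch with NLTK's PorterStemmer. Note that a real stemmer's exact outputs are algorithm-dependent: Porter typically returns truncated roots such as "celebr" or "intellig", which is precisely the "meaningless root" problem described above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["celebrates", "celebrated", "celebrating",
             "intelligence", "intelligent", "intelligently"]:
    # The stem is an algorithmic root, not necessarily a dictionary word.
    print(word, "->", stemmer.stem(word))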
Step 4: Lemmatization
1. Lemmatization is similar to stemming, but it maps each word to a meaningful base
form called the lemma; for example, "is", "are", and "been" all lemmatize to "be".
Step 5: Stop Word Removal
1. In English, many words appear very frequently, such as "is", "and", "the", and "a".
2. NLP pipelines flag these words as stop words.
3. Stop words may be filtered out before doing any statistical analysis.
4. Example: in "He is a good boy", the stop words "is" and "a" can be filtered out
(see the sketch below).
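A minimal stop-word filtering sketch for the example above (assumes one-time downloads of NLTK's 'stopwords' corpus and 'punkt' models):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("He is a good boy.")

# Keep only the tokens that are not stop words.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # expected: ['good', 'boy', '.']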
Step 6: Dependency Parsing
1. Dependency parsing is used to find how all the words in the sentence are related to
each other.
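A minimal dependency-parsing sketch using spaCy (an assumption: spaCy is installed and the small English model has been fetched with `python -m spacy download en_core_web_sm`):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("He is a good boy.")

# Each token is linked to its syntactic head with a dependency label.
for token in doc:
    print(token.text, "--", token.dep_, "-->", token.head.text)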
Step 7: POS Tagging
1. POS stands for parts of speech, which include noun, verb, adverb, and adjective.
2. A POS tag indicates how a word functions, both in meaning and grammatically,
within the sentence.
3. A word can have one or more parts of speech depending on the context in which it
is used.
4. Example: "Google" something on the Internet. Here "Google" functions as a verb,
even though it is also a proper noun (a tagging sketch follows).
Step 8: Named Entity Recognition (NER)
1. Named Entity Recognition (NER) is the process of detecting named entities such
as person names, movie names, organization names, or locations.
2. Example: Steve Jobs introduced iPhone at the Macworld Conference in San
Francisco, California.
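A minimal NER sketch using spaCy (same en_core_web_sm assumption as the dependency-parsing sketch above); typically it detects Steve Jobs as a PERSON and San Francisco and California as locations (GPE):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Steve Jobs introduced iPhone at the Macworld Conference "
          "in San Francisco, California.")

# Print every detected named entity with its predicted type.
for ent in doc.ents:
    print(ent.text, "--", ent.label_)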
Step 9: Chunking
1. Chunking is used to collect individual pieces of information and group them into
bigger, meaningful phrases (see the sketch below).
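A minimal noun-phrase chunking sketch using NLTK's regular-expression chunker (the grammar below is a toy rule, not a general-purpose one):

import nltk

tokens = nltk.word_tokenize("The little yellow dog barked at the cat.")
tagged = nltk.pos_tag(tokens)

# Group optional determiner + any adjectives + a noun into an NP chunk.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))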
2. Syntactic Analysis
Example: "Agra goes to the Poonam."
In the real world, "Agra goes to the Poonam" does not make any sense, so this
sentence is rejected by the syntactic analyzer.
3. Semantic Analysis
4. Discourse Integration
Discourse integration depends upon the sentences that precede a given sentence and
also invokes the meaning of the sentences that follow it.
5. Pragmatic Analysis
Syntactic Ambiguity
Syntactic Ambiguity exists in the presence of two or more possible meanings within the
sentence.
Example:
I saw the girl with the binoculars.
In the above example, did I have the binoculars? Or did the girl have the binoculars?
Referential Ambiguity
Referential ambiguity exists when you refer to something using a pronoun.
Example: Kiran went to Sunita. She said, "I am hungry."
In the above sentence, it is not clear who is hungry: Kiran or Sunita.
Let’s understand how language models help in processing these NLP tasks:
Speech Recognition:
1. In speech recognition, the language model helps the system choose between
acoustically similar transcriptions; for example, "recognise speech" and "wreck a
nice beach" sound alike, but the language model scores the first as far more probable.
Machine Translation:
1. When translating a Chinese phrase “我在吃” into English, the translator can give several
choices as output:
I eat lunch
I am eating
Me am eating
Eating am I
2. Here, the language model determines that the translation "I am eating" sounds the
most natural and suggests it as the output.
1. Statistical language models involve the development of probabilistic models that are
able to predict the next word in a sequence, given the words that precede it.
2. A number of statistical language models are in use already.
3. Let's take a look at some of those popular models:
(a) N-Gram
(b) Unigram
(c) Bidirectional
(d) Exponential
(e) Continuous Space
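A toy bigram model in plain Python, sketching how such a model would score "I am eating" above "Me am eating" in the machine-translation example earlier (the corpus below is made up):

from collections import Counter

corpus = "i am eating lunch . i am reading . you are eating ."
tokens = corpus.split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1).
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def score(sentence):
    words = sentence.lower().split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigram_prob(w1, w2)
    return p

print(score("i am eating"))   # positive: both bigrams were seen
print(score("me am eating"))  # 0.0: the bigram "me am" never occurs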
1. Text classification is the process of categorizing text into organized groups.
2. By using NLP, text classification can automatically analyze text and then assign a set of
predefined tags or categories based on its content.
3. NLP-based text classification is used for sentiment analysis, topic detection, and
language detection.
4. There are mainly three text classification approaches:
Rule-based systems,
Machine learning-based systems,
and Hybrid systems.
5. Rule-based Approach Text Classification
(i) In the rule-based approach, texts are separated into organized groups using a
set of handcrafted linguistic rules.
(ii) With these handcrafted rules, users define lists of words that characterize
each group.
(iii) For example, people like Donald Trump and Boris Johnson would be categorized
into politics, while people like LeBron James and Ronaldo would be categorized
into sports.
Hybrid Approach Text Classification
(a) The hybrid approach combines the rule-based and machine learning-based
approaches.
(b) It uses the rule-based system to create a tag and applies machine learning to
train the system and create a rule.
(c) The machine-generated rule list is then compared with the rule-based rule list.
(d) If something does not match between the tags, humans improve the list manually.
(e) It is considered the best method to implement text classification.
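A minimal sketch of the machine learning-based approach using scikit-learn, with made-up training texts echoing the politics/sports example above (a sketch, not a production classifier):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "Boris Johnson addressed the parliament",
    "The election results were announced today",
    "LeBron James scored forty points last night",
    "Ronaldo signed for a new football club",
]
train_labels = ["politics", "politics", "sports", "sports"]

# TF-IDF features feeding a Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["James dominated the basketball game"]))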
1. Information retrieval:
(i) Information retrieval (IR) is a field of study dealing with the representation, storage,
organization of, and access to documents.
(ii) The documents may be books, reports, pictures, videos, web pages or multimedia
files.
(iii) The whole point of an IR system is to provide a user easy access to documents
containing the desired information.
(iv) The best-known example of an IR system is the Google search engine.
3. Types Of Data
Block Diagram (figure not reproduced)
(a) Luhn proposed that the frequency of word occurrence in an article furnishes a
useful measurement of word significance.
(b) It is further proposed that the relative position within a sentence of words having
given values of significance furnish a useful measurement for determining the
significance of sentences.
(c) The significance factor of a sentence will therefore be based on a combination of
these two measurements.
(d) Luhn's contribution to automatic text analysis assumes that frequency data can
be used to extract words and sentences to represent a document.
Let f be the frequency of occurrence of various word types in a given position of text
and r their rank order, that is, the order of their frequency of occurrence; a plot
relating f and r then yields a curve similar to a hyperbolic curve.
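In symbols (a standard statement of this relationship, with C a corpus-dependent constant):

f \cdot r \approx C \quad\Longrightarrow\quad f(r) \approx \frac{C}{r}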
1. This is in fact a curve demonstrating Zipf's Law which states that the product of the
frequency of use of words and the rank order is approximately constant.
2. Zipf verified his law on American Newspaper English.
3. Luhn used it as a null hypothesis to enable him to specify two cut-offs, an upper and
a lower (see Figure 2.1.), thus excluding non-significant words.
4. The words exceeding the upper cut-off were considered to be common and those
below the lower cut-off rare, and therefore not contributing significantly to the content
of the article.
5. He thus devised a counting technique for finding significant words.
6. Consistent with this he assumed that the resolving power of significant words, by
which he meant the ability of words to discriminate content, reached a peak at a rank
order position half way between the two cut-offs and from the peak fell off in either
direction reducing to almost zero at the cut-off points.
7. A certain arbitrariness is involved in determining the cut-offs. There is no oracle
which gives their values. They have to be established by trial and error.
8. It is interesting that these ideas are really basic to much of the later work in IR.
9. Luhn himself used them to devise a method of automatic abstracting. He went on to
develop a numerical measure of significance for sentences based on the number of
significant and non-significant words in each portion of the sentence.
10. Sentences were ranked according to their numerical score and the highest ranking
were included in the abstract (extract really).
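A compact sketch of this idea: words whose frequencies fall between the two cut-offs count as significant, and sentences are ranked by how many significant words they contain. The cut-off values below are arbitrary placeholders, mirroring the trial-and-error point in item 7:

from collections import Counter

def luhn_rank(sentences, lower_cutoff=1, upper_cutoff=10):
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)
    # Significant words: neither too common nor too rare.
    significant = {w for w, f in freq.items() if lower_cutoff < f < upper_cutoff}
    scored = []
    for s in sentences:
        score = sum(1 for w in s.lower().split() if w in significant)
        scored.append((score, s))
    # The highest-ranking sentences would go into the extract.
    return sorted(scored, reverse=True)

sents = ["the cat sat on the mat", "a cat and a dog", "dogs chase the cat"]
print(luhn_rank(sents))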
Conflation Algorithm:
The aim of a conflation algorithm is to generate, from the input text (full text,
abstract, or title), a document representative adequate for use in an automatic
retrieval system. Such a system typically consists of three steps: (i) removal of
high-frequency (stop) words, (ii) suffix stripping, and (iii) detecting equivalent stems.
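A minimal conflation sketch under the three steps assumed above, using NLTK for stop-word removal and Porter suffix stripping; the counts of equivalent stems serve as the document representative:

from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def document_representative(text):
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    # Step (i): drop high-frequency (stop) words; keep alphabetic tokens.
    tokens = [t.lower() for t in word_tokenize(text)
              if t.isalpha() and t.lower() not in stop]
    # Step (ii): strip suffixes.  Step (iii): group equivalent stems by counting.
    return Counter(stemmer.stem(t) for t in tokens)

print(document_representative("Celebrated celebrations are celebrated widely."))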
When clustering documents, several design decisions must be made:
(i) the number of clusters,
(ii) the minimum/maximum size of each cluster,
(iii) when to group two documents together, and
(iv) whether overlap between clusters is allowed.
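A minimal document-clustering sketch with scikit-learn, where n_clusters is the "number of clusters" decision from the list above (the four toy documents are made up):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the stock market fell today",
    "shares and bonds were down",
    "the team won the football match",
    "a great goal decided the game",
]

# Represent documents as TF-IDF vectors and group them with k-means.
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # e.g. [0 0 1 1] if the two themes separate cleanly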
Consider the following sample customer review, to which the techniques below are applied:
The customer service of Rocketz is terrible. I must call the call center multiple times
before I get a decent reply. The call center guys are extremely rude and totally ignorant.
Last month I called with a request to update my correspondence address from Brooklyn
to Manhattan. I spoke with about a dozen representatives – Lucas Hayes, Ethan Gray,
Nora Diaz, Sofia Parker to name a few. Even after writing multiple emails and filling out
numerous forms, the address has still not been updated. Even my agent John is
useless. The policy details he gave me were wrong. The only good thing about the
company is the pricing. The premium is reasonable compared to the other insurance
companies in the United States. There has not been any significant increase in my
premium since 2015.
The major NLP techniques applied to this text below include:
2. Sentiment Analysis
3. Text Summarization
4. Aspect Mining
5. Topic Modeling
2. Sentiment Analysis
1. The most widely used technique in NLP is Sentiment Analysis.
2. Sentiment Analysis is most useful in cases such as customer surveys, reviews and
social media comments where people express their opinions and feedback.
3. The simplest output of sentiment analysis is a 3-point scale:
positive/negative/neutral.
4. In more complex cases the output can be a numeric score that can be bucketed
into as many categories as required.
6. In the case of our text snippet, the customer clearly expresses different sentiments in
various parts of the text. Because of this, the output is not very useful.
Instead, we can find the sentiment of each sentence and separate out the
negative and positive parts of the review.
7. Sentiment Score can also help us pick out the most negative and positive parts of
the review:
8. Most negative comment:
The call center guys are extremely rude and totally ignorant.
Sentiment Score: -1.233288
9. Most positive comment:
the premium is reasonable compared to the other insurance companies in
the United States.
Sentiment Score: 0.2672612
10. Sentiment Analysis can be done using supervised as well as unsupervised
techniques. The most popular supervised model used for sentiment analysis is Naïve
Bayes.
It requires a training corpus with sentiment labels, upon which a model is
trained which is then used to identify the sentiment.
11. Naive Bayes is not the only tool out there: different machine learning techniques
like Random Forest or Gradient Boosting can also be used.
12. The unsupervised techniques also known as the lexicon-based methods
require a corpus of words with their associated sentiment and polarity. The sentiment
score of the sentence is calculated using the polarities of the words in the sentence.
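A minimal lexicon-based sketch using NLTK's VADER analyzer (assumes a one-time nltk.download('vader_lexicon')). Its compound scores will not match the exact numbers quoted above, which came from a different tool:

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for sentence in [
    "The call center guys are extremely rude and totally ignorant.",
    "The premium is reasonable compared to the other insurance companies.",
]:
    # The compound score is negative, positive, or near zero per sentence.
    print(sia.polarity_scores(sentence)["compound"], "--", sentence)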
3. Text Summarization
1. As the name suggests, there are techniques in NLP that help summarize large
chunks of text.
2. Text summarization is mainly used in cases such as news articles and research
articles.
3. Two broad approaches to text summarization are extraction and abstraction.
i. Extraction methods create a summary by extracting parts from the text.
ii. Abstraction methods create a summary by generating fresh text that
conveys the crux of the original text.
4. There are various algorithms that can be used for text summarization like LexRank,
TextRank, and Latent Semantic Analysis.
5. To take the example of LexRank, this algorithm ranks the sentences using
similarity between them.
6. A sentence is ranked higher when it is similar to more sentences, and these
sentences are in turn similar to other sentences.
7. Using LexRank, the sample text is summarized as: I have to call the call center
multiple times before I get a decent reply. The premium is reasonable
compared to the other insurance companies in the United States.
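A rough sketch of the LexRank idea: score each sentence by its total TF-IDF cosine similarity to the other sentences and keep the top one. Real LexRank runs PageRank on the similarity graph; this is a deliberate simplification:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The customer service of Rocketz is terrible.",
    "I must call the call center multiple times before I get a decent reply.",
    "The premium is reasonable compared to other insurance companies.",
]

X = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(X)

# A sentence that is similar to many other sentences gets a higher score.
scores = sim.sum(axis=1)
print(sentences[scores.argmax()])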
4. Aspect Mining
1. Aspect mining identifies the different aspects in the text.
2. When used in conjunction with sentiment analysis, it extracts complete
information from the text.
3. One of the easiest methods of aspect mining is part-of-speech (POS) tagging
(see the sketch after this list).
4. When aspect mining is used along with sentiment analysis on the
sample text, the output conveys the complete intent of the text:
Aspects & Sentiments:
i. Customer service – negative
ii. Call center – negative
iii. Agent – negative
iv. Pricing/Premium – positive
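A minimal POS-based sketch of that idea: treat the nouns in a review sentence as candidate aspects (a real system would also attach a sentiment to each candidate):

import nltk

sentence = "The customer service is terrible but the premium is reasonable."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Keep noun tokens (tags starting with NN) as aspect candidates.
aspects = [word for word, tag in tagged if tag.startswith("NN")]
print(aspects)  # e.g. ['customer', 'service', 'premium']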
5. Topic Modeling
1. Topic modeling is one of the more complicated methods to identify natural topics in
the text.
2. A prime advantage of topic modeling is that it is an unsupervised technique.
1. It does not require a labeled training dataset.
3. There are quite a few algorithms for topic modeling:
1. Latent Semantic Analysis (LSA)
2. Probabilistic Latent Semantic Analysis (PLSA)
3. Latent Dirichlet Allocation (LDA)
4. Correlated Topic Model (CTM).
4. One of the most popular methods is Latent Dirichlet Allocation.
5. The premise of LDA is that each text document comprises several topics and
each topic comprises several words.
6. The input required by LDA is merely the text documents and the expected number of
topics.
7. Using the sample text and assuming two inherent topics, the topic modeling output
will identify the common words across both topics.
For our example,
The main theme of topic 1 includes words like call, center, and service.
The main theme of topic 2 includes words like premium, reasonable, and price.
This implies that topic 1 corresponds to customer service and topic 2
corresponds to pricing.
[Figure: detailed topic modeling results, not reproduced]
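A minimal LDA sketch with scikit-learn: as described above, the inputs are just the documents and the expected number of topics (the four toy documents stand in for the sample review):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "call center service call reply rude",
    "premium price reasonable insurance premium",
    "call center ignorant service",
    "reasonable premium pricing",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Fit LDA with the expected number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Show the top three words per topic.
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [words[i] for i in topic.argsort()[-3:]])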