
Unit 6
Natural Language Processing (NLP)
What is NLP?
• NLP enables interaction between machines and humans through the natural languages that humans use, allowing communication between machine and human.
• It deals with programming computers to process and analyse large amounts of natural language data.
• It accepts anything spoken or typed, converts it into digital signals, and generates a useful response.
• It helps machines understand, analyse, manipulate and interpret human language.
• It is used in all industries where human interaction is involved, such as enquiry handling, education, counselling, customer support, crime detection and consultation.
Applications of NLP:
• Chatbots
• Virtual Assistants
• Social Media Monitoring
• Speech Recognition
• Machine Translation
• Sentiment Analysis
• Text Classification
• Education & Training
• Text Extraction
• Health Care
• Automatic Summarization
Chatbots:

Script bot
• Script bots are easy to make and less interactive.
• Script bots have limited functionality.
• These chatbots are easy to integrate into a messaging platform.
• E.g. the customer care section (these bots answer some basic queries and connect the user to a human once they are unable to handle the query).

Smart bot
• Smart bots are flexible, powerful and more interactive.
• Smart bots have wide functionality.
• Smart bots learn by themselves with more data.
• E.g. Google Assistant, Cortana, Siri etc. (they can manage to handle a customer's query).
Human Language vs Computer Language:
• How can humans "talk to" (instruct) computers?
Multiple Meanings of Words
• To understand this, consider the following sentences:
1. His face turned red after he found out that he had taken the wrong bag.
   What does this mean? Is he ashamed because he took another person's bag instead of his own? Or is he angry because he did not manage to steal the bag he had been targeting?
2. His face turns red after consuming the medicine.
   Is he having an allergic reaction? Or is he unable to bear the taste of the medicine?
Perfect Syntax but no Meaning
• Sometimes a statement has perfectly correct syntax but no meaning at all.
• Example: Chickens feed extravagantly while the moon drinks tea.
• This statement is grammatically correct, but does it make any sense? In human language, a perfect balance of syntax and semantics is needed for good understanding.
Data Processing
• Having seen some of the complications in human languages above, let us now see how Natural Language Processing makes it possible for machines to understand and speak natural languages just like humans.
• Since the language of computers is numerical, the very first step is to convert our language into numbers. This conversion takes a few steps, the first of which is Text Normalisation.
• Text Normalisation helps clean up the textual data so that its complexity is lower than that of the actual data.
Text Normalisation
1. Sentence Segmentation
2. Tokenisation
3. Removal of Stopwords
4. Converting text to a common case
5. Stemming
6. Lemmatization
Text Normalisation
• A series of steps that normalise the text to a lower level.
• The process of downsizing and simplifying text to make it suitable for machine processing.
• Unnecessary pieces are removed from the text, and the text is broken into simpler tokens that are later converted into numerical form.
• Processing is done on text collected from multiple documents and sources.
• The textual data collected from multiple sources is known as a corpus. (A corpus can be defined as a collection of text documents; it can be thought of as a bunch of text files in a directory.)
https://www.youtube.com/watch?v=2HVe6rYID2I
1. Sentence Segmentation
• The whole corpus is broken down into simple sentences.
• The paragraph is split into small sentences wherever there is sentence-ending punctuation (each sentence is then treated as a separate piece of data to be processed).
Example:
• Before Sentence Segmentation
• “You want to see the dreams with close eyes and achieve them?
They’ll remain dreams, look for AIMs and your eyes have to stay open
for a change to be seen.”
• After Sentence Segmentation
1. You want to see the dreams with close eyes and achieve them?
2. They'll remain dreams, look for AIMs and your eyes have to stay open for a change to be seen.
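The punctuation-based split described above can be sketched in pure Python with a regular expression. This is a minimal illustration, not a library API; `segment_sentences` is a name chosen here for the example.

```python
import re

def segment_sentences(text):
    """Split text into sentences at sentence-ending punctuation (. ! ?)."""
    # Split after ., ! or ? followed by whitespace, keeping the punctuation
    # attached to its sentence; drop any empty pieces.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

corpus = ("You want to see the dreams with close eyes and achieve them? "
          "They'll remain dreams, look for AIMs and your eyes have to stay "
          "open for a change to be seen.")

for i, sentence in enumerate(segment_sentences(corpus), start=1):
    print(i, sentence)
```

Real segmenters also handle abbreviations like "Dr." or "e.g.", which this simple rule would wrongly split on.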
2. Tokenisation
• After segmenting the sentences, each sentence is further divided into individual text pieces called tokens (this is also called word tokenisation or word segmentation).
• A token can be a word, number or special character in a sentence.
• In most languages words are separated by spaces, but splitting on spaces alone is not always enough (punctuation may be attached to words).

Text: The cat sat on the mat.
Tokens: 'The', 'cat', 'sat', 'on', 'the', 'mat', '.'
Under Tokenisation, every word, number, and special character is
considered separately and each of them is now a separate token.
Tokenising "You want to see the dreams with close eyes and achieve them?" gives the tokens:
'You', 'want', 'to', 'see', 'the', 'dreams', 'with', 'close', 'eyes', 'and', 'achieve', 'them', '?'
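A minimal tokeniser along these lines can be written with one regular expression; the function name `tokenise` is chosen for this sketch.

```python
import re

def tokenise(sentence):
    """Split a sentence into word/number tokens and single punctuation tokens."""
    # \w+ matches runs of letters/digits; [^\w\s] matches one punctuation mark,
    # so the '.' at the end of a sentence becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenise("The cat sat on the mat."))
print(tokenise("You want to see the dreams with close eyes and achieve them?"))
```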
3. Removal of Stopwords, Special Characters & Numbers
• Stopwords are grammatical words that occur frequently in a corpus but do not add any value or essence to the information. E.g. a, an, and, are, as, for, it, is, into, in, if, on, such, the, there, to.
• Stopwords are removed to make it easier for the computer to focus on the important, meaningful words.
• Whether special characters and/or numbers are removed depends on the type of corpus we are working on.
• For example, if you are working on a document containing email IDs, you might not want to remove special characters and numbers, whereas in other textual data, if these characters do not make sense, you can remove them along with the stopwords.
Stopwords: Example
1. You want to see the dreams with close eyes and achieve them?
   • The removed words would be: to, the, and, ?
2. The outcome would be: You want see dreams with close eyes achieve them
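Stopword removal is a simple filter over the token list. A sketch using the stopword list given above (`remove_stopwords` is a name chosen for this example):

```python
# Stopword list from the slide above (lowercased for comparison).
STOPWORDS = {"a", "an", "and", "are", "as", "for", "it", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to"}

def remove_stopwords(tokens):
    """Drop stopwords and stray punctuation tokens, keeping word order."""
    return [t for t in tokens if t.lower() not in STOPWORDS and t.isalnum()]

tokens = ["You", "want", "to", "see", "the", "dreams", "with", "close",
          "eyes", "and", "achieve", "them", "?"]
print(remove_stopwords(tokens))
# ['You', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
```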
4. Converting text to a common case
• As the name suggests, the whole text is converted into the same case, preferably lower case. This ensures that the machine's case sensitivity does not treat the same word as two different words just because of a difference in case.
5. Stemming & Lemmatization
Stemming
• Definition: Stemming is a technique used to extract the base form of a word (the stem) by removing affixes from it. It is like cutting a tree's branches down to its stem.
• This is done because a single word may have different forms, e.g. eat, eats, eating.
• The stemmed words (the words we get after removing the affixes) might not be meaningful.
• NOTE: Some stems end up having no meaning; for example, "happiness" after stemming becomes "happi".
Example:

Word    | Affix | Stem
healing | ing   | heal
dreams  | s     | dream
studies | es    | studi
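A toy stemmer that only strips the suffixes shown in the table can be sketched in a few lines. This is a deliberate simplification: real stemmers such as the Porter stemmer apply many more rules (which is how "happiness" ends up as "happi").

```python
def stem(word):
    """Crude stemmer: strip a common suffix without checking that
    the result is a real word (so 'studies' -> 'studi')."""
    for suffix in ("ing", "es", "s"):  # try longer suffixes first
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ("healing", "dreams", "studies"):
    print(w, "->", stem(w))
```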
Lemmatization
• Definition: In lemmatization, affixes are removed and words are converted to their base form (known as the lemma), which is always a meaningful word.
• It takes longer to execute than stemming.
• Example:

Word     | Affix | Lemma
healing  | ing   | heal
dreams   | s     | dream
studies  | es    | study
studying | ing   | study


Difference between Stemming and Lemmatization

Stemming                                     | Lemmatization
1. The stemmed word might not be meaningful. | 1. The lemma is a meaningful word.
2. Caring → Car                              | 2. Caring → Care
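The Caring → Car vs Caring → Care difference can be demonstrated with a sketch: suffix stripping for stemming versus a dictionary lookup for lemmatization. The tiny hand-made `LEMMAS` table stands in for the large dictionaries (plus part-of-speech information) that real lemmatizers use.

```python
# Hand-made lemma lookup for this example only; real lemmatizers
# (e.g. WordNet-based ones) consult full dictionaries.
LEMMAS = {"caring": "care", "studies": "study",
          "studying": "study", "dreams": "dream"}

def lemmatize(word):
    """Return the dictionary lemma, falling back to the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())

def crude_stem(word):
    """Blind suffix stripping: may produce a non-word."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(crude_stem("caring"), "vs", lemmatize("caring"))
# car vs care
```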
Feature Extraction from Text
• After text normalisation, the features of the text have to be extracted before processing can start.
• For this, the text has to be converted into a suitable numeric form using algorithms such as Bag of Words and Term Frequency–Inverse Document Frequency.
Bag of Words (BoW) – an algorithm that transforms tokens into a set of features
• In Bag of Words, we record the occurrences of each word and construct the vocabulary for the corpus.
• Bag of Words creates a set of vectors containing the count of word occurrences in each document (review).
• The algorithm returns the unique words of the corpus and their occurrences in it.
• The BoW algorithm is not concerned with the sequence of words.

The Bag of Words gives us two things:
• Vocabulary – the unique words identified in the corpus.
• Frequency – the number of occurrences of each word.

https://www.youtube.com/watch?v=c0Tk8KEHBFc
Steps of the Bag of Words algorithm
1. Text Normalisation: remove all punctuation and unnecessary symbols, and convert the entire text to lowercase.
2. Create Vocabulary (Dictionary): make a list of all the unique words occurring in the corpus.
3. Text Vectorisation: create document vectors with a separate column for each word, where each row corresponds to one document (review). For each document in the corpus, count how many times each word has occurred.
Step 1: Collecting data and pre-processing it.

Raw Data:
• Document 1: Aman and Anil are stressed
• Document 2: Aman went to a therapist
• Document 3: Anil went to download a health chatbot

Processed Data:
• Document 1: [aman, and, anil, are, stressed]
• Document 2: [aman, went, to, a, therapist]
• Document 3: [anil, went, to, download, a, health, chatbot]

Note: No tokens have been removed in the stopword-removal step. This is because we have very little data, and since the frequency of all the words is almost the same, no word can be said to have less value than another.
Step 2: Create Dictionary
A dictionary in NLP is a list of all the unique words occurring in the corpus. If a word is repeated across different documents, it is written only once when creating the dictionary.

Dictionary:
aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot
Step 3: Create a document vector
How to make a document vector table?
• The vocabulary is written in the top row. Then, for each word in the document, if it matches the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. If the word does not occur in that document, put a 0 under it.

Document vector for Document 1:

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1    | 1   | 1    | 1   | 1        | 0    | 0  | 0 | 0         | 0        | 0      | 0
Step 4: Create the document vector table for all documents

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1    | 1   | 1    | 1   | 1        | 0    | 0  | 0 | 0         | 0        | 0      | 0
1    | 0   | 0    | 0   | 0        | 1    | 1  | 1 | 1         | 0        | 0      | 0
0    | 0   | 1    | 0   | 0        | 1    | 1  | 1 | 0         | 1        | 1      | 1
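The four steps above can be reproduced with a short pure-Python sketch (no NLP library is needed for a corpus this small; variable names are chosen for the example):

```python
docs = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]

# Step 1: normalise — lowercase and split on spaces
# (this corpus contains no punctuation to strip).
tokenised = [d.lower().split() for d in docs]

# Step 2: build the dictionary — unique words in order of first appearance.
vocab = []
for tokens in tokenised:
    for t in tokens:
        if t not in vocab:
            vocab.append(t)

# Steps 3–4: one count vector per document.
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenised]

print(vocab)
for v in vectors:
    print(v)
```

The printed rows match the document vector table above.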
TFIDF
• TFIDF stands for Term Frequency – Inverse Document Frequency.

Term Frequency
• Term frequency is the frequency of a word within one document; it is exactly the document vector table created above.

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1    | 1   | 1    | 1   | 1        | 0    | 0  | 0 | 0         | 0        | 0      | 0
1    | 0   | 0    | 0   | 0        | 1    | 1  | 1 | 1         | 0        | 0      | 0
0    | 0   | 1    | 0   | 0        | 1    | 1  | 1 | 0         | 1        | 1      | 1
Inverse Document Frequency

DF (Document Frequency)
• Document frequency is the number of documents in which a word occurs.

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
2    | 1   | 2    | 1   | 1        | 2    | 2  | 2 | 1         | 1        | 1      | 1

IDF (Inverse Document Frequency)
• Definition: For inverse document frequency, the document frequency goes in the denominator while the total number of documents is the numerator.

aman | and | anil | are | stressed | went | to  | a   | therapist | download | health | chatbot
3/2  | 3/1 | 3/2  | 3/1 | 3/1      | 3/2  | 3/2 | 3/2 | 3/1       | 3/1      | 3/1    | 3/1
Formula of TFIDF
• The formula of TFIDF for any word W is:
• TFIDF(W) = TF(W) * log( IDF(W) )

Applying the formula to each cell (the TF value times the log of the IDF value):

aman       | and      | anil       | are      | stressed | went       | to         | a          | therapist | download | health   | chatbot
1*log(3/2) | 1*log(3) | 1*log(3/2) | 1*log(3) | 1*log(3) | 0*log(3/2) | 0*log(3/2) | 0*log(3/2) | 0*log(3)  | 0*log(3) | 0*log(3) | 0*log(3)
1*log(3/2) | 0*log(3) | 0*log(3/2) | 0*log(3) | 0*log(3) | 1*log(3/2) | 1*log(3/2) | 1*log(3/2) | 1*log(3)  | 0*log(3) | 0*log(3) | 0*log(3)
0*log(3/2) | 0*log(3) | 1*log(3/2) | 0*log(3) | 0*log(3) | 1*log(3/2) | 1*log(3/2) | 1*log(3/2) | 0*log(3)  | 1*log(3) | 1*log(3) | 1*log(3)

Evaluating with log to base 10 (log(3/2) ≈ 0.176, log(3) ≈ 0.477):

aman  | and   | anil  | are   | stressed | went  | to    | a     | therapist | download | health | chatbot
0.176 | 0.477 | 0.176 | 0.477 | 0.477    | 0     | 0     | 0     | 0         | 0        | 0      | 0
0.176 | 0     | 0     | 0     | 0        | 0.176 | 0.176 | 0.176 | 0.477     | 0        | 0      | 0
0     | 0     | 0.176 | 0     | 0        | 0.176 | 0.176 | 0.176 | 0         | 0.477    | 0.477  | 0.477
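The whole calculation can be checked with a short Python sketch: compute DF from the term-frequency table, then apply TF * log10(total documents / DF) to every cell. Variable names are chosen for the example.

```python
import math

vocab = ["aman", "and", "anil", "are", "stressed", "went",
         "to", "a", "therapist", "download", "health", "chatbot"]

# Term-frequency table from the Bag of Words step (one row per document).
tf = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
]
n_docs = len(tf)

# Document frequency: in how many documents each word occurs.
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]

# TFIDF(W) = TF(W) * log10(n_docs / DF(W)), rounded to 3 decimal places.
tfidf = [[round(row[j] * math.log10(n_docs / df[j]), 3)
          for j in range(len(vocab))]
         for row in tf]

for row in tfidf:
    print(row)
```

Using base-10 logarithms reproduces the 0.176 and 0.477 values in the table above.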
