1009 NLP PPT
chapter: 6
Natural Language Processing
(8 marks)
NLP
• Enables machines to understand and process human language
Applications
• Automatic summarization: summarizing the meaning of documents.
• Sentiment analysis: identifying sentiments across several posts.
• Text classification: assigning predefined categories to a document,
e.g. spam filtering.
• Virtual assistants: e.g. Google Assistant, ChatGPT.
AI project cycle
• Problem scoping
• Data acquisition -> surveys, observations, databases from the internet,
interviews
• Data exploration -> text is normalized through various steps and is
reduced to a minimal vocabulary
• Modelling
• Evaluation
chatbots
Script-bot
• These bots are pre-programmed with specific responses to certain
phrases or keywords.
• They are good for simple tasks like answering frequently asked
questions, providing basic information, and processing simple
transactions.
• However, they are limited to a set of predetermined responses and
cannot learn from previous interactions with users.
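A script-bot of this kind can be sketched as a simple keyword lookup. The keywords and replies below are made-up examples for illustration, not taken from any real bot.

```python
# Hypothetical keyword -> reply script; a real bot would ship its own.
SCRIPT = {
    "hello": "Hello! How can I help you?",
    "hours": "We are open from 9 am to 5 pm.",
    "price": "The basic plan is free.",
}
FALLBACK = "Sorry, I did not understand that."


def script_bot_reply(message):
    """Return the scripted reply for the first keyword found in the message."""
    # Lowercase the message and strip common punctuation from each word.
    words = [w.strip(".,!?") for w in message.lower().split()]
    for keyword, reply in SCRIPT.items():
        if keyword in words:
            return reply
    # No keyword matched: the bot can only fall back to a fixed answer.
    return FALLBACK


print(script_bot_reply("Hello there"))
print(script_bot_reply("What is the weather?"))
```

The second message gets the fallback reply because nothing in the script covers it, which is the limitation described above.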
Smart-bot
• These bots use artificial intelligence (AI) to perform their functions.
• These bots are not pre-programmed with responses.
• They can learn from previous interactions with users.
Script-bot | Smart-bot
Easy to make | Flexible and powerful
Work around a script that is programmed into them | Work on bigger databases and other resources directly
Mostly free and easy to integrate into a messaging platform | Learn with more data
Little or no language processing skills | Use NLP to perform their functions
Limited functionality | Wide functionality
Human language vs computer language
Dr . Smith went to the hospital .
Removing stopwords, special characters and numbers
• Tokens which are not necessary are removed from the token list.
• Stopwords are the words which occur very frequently in the corpus
but do not add any value to it.
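As a rough sketch, this removal step could look like the following. The stopword list here is a small hand-picked assumption; real stopword lists are much longer.

```python
import string

# A small hand-picked stopword list (assumption, for illustration only).
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def remove_stopwords(tokens):
    """Drop stopwords, punctuation-only tokens and numbers from a token list."""
    kept = []
    for token in tokens:
        word = token.lower()
        if word in STOPWORDS:
            continue  # very frequent word that adds little value
        if all(ch in string.punctuation for ch in word):
            continue  # special characters such as "." or ","
        if word.isdigit():
            continue  # plain numbers
        kept.append(token)
    return kept


tokens = ["Dr", ".", "Smith", "went", "to", "the", "hospital", "."]
print(remove_stopwords(tokens))  # ['Dr', 'Smith', 'went', 'hospital']
```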
Stemming
• Stemming does not check whether the stemmed word is meaningful.
• It simply removes the affixes, so it is faster than lemmatization.
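A toy stemmer makes this behaviour easy to see. The suffix list and the minimum-length rule below are illustrative assumptions, not a standard stemming algorithm.

```python
# Suffixes to strip, tried in this order (an illustrative choice).
SUFFIXES = ("ing", "es", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix without checking the result is a real word."""
    for suffix in SUFFIXES:
        # Keep at least three letters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


for w in ["healed", "healing", "studies", "caring"]:
    print(w, "->", naive_stem(w))
# healed -> heal
# healing -> heal
# studies -> studi
# caring -> car
```

Note that "studi" is not a meaningful word and "car" has a different meaning from "caring"; the stemmer does not notice either problem.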
Lemmatization
• Stemming and lemmatization are alternative processes to each other.
• Both remove affixes from words.
• In lemmatization, the word we get after affix removal (also known as the
lemma) is a meaningful one.
• It takes longer to execute than stemming.
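A minimal sketch of the idea, assuming a tiny hand-written lemma table. Real lemmatizers rely on a full vocabulary and grammatical analysis, which is one reason they run slower than a stemmer.

```python
# Tiny hand-written lemma table (assumption, for illustration only).
LEMMAS = {
    "studies": "study",
    "caring": "care",
    "healed": "heal",
    "went": "go",
}


def toy_lemmatize(word):
    """Look the word up in the lemma table; return it unchanged if unknown."""
    return LEMMAS.get(word.lower(), word)


for w in ["studies", "caring", "went"]:
    print(w, "->", toy_lemmatize(w))
# studies -> study
# caring -> care
# went -> go
```

Every output here is a meaningful word, unlike the result of blindly stripping affixes.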
Q. Apply text normalization to the corpus below. Write the output of each
step of text normalization.
• Document 1: Aman and Anil are stressed.
• Document 2: Aman went to a therapist.
• Document 3: Anil went to download a health chatbot.
Q. Define
Corpus
Token
Lemma
Syntax
Semantics
Stopwords
Q. Differentiate between stemming and lemmatization with examples.
Bag of Words
• We need to convert the tokens into numbers, since a computer can
understand only numbers.
• For this we use the Bag of Words algorithm.
Bag of Words algorithm
1. Text normalization
2. Create dictionary
3. Create document vectors
4. Create document vectors for all the documents.
Bag of Words algorithm: Step 1: Text normalization
• Document 1: Aman and Anil are stressed
• Document 2: Aman went to a therapist
• Document 3: Anil went to download a health chatbot
Document 1 after normalization:
aman and anil are stressed
Step 2: Create dictionary
aman and anil are stressed went to a therapist download health chatbot
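The dictionary can be built by listing each distinct word once, in order of first appearance. The sketch below assumes normalization here only lowercases the text, since the dictionary still contains words such as "a" and "to"; the normalized forms of Documents 2 and 3 are written on that assumption.

```python
# Normalized corpus (lowercased only; an assumption based on the dictionary).
documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]


def build_dictionary(docs):
    """List each distinct word once, in order of first appearance."""
    dictionary = []
    for doc in docs:
        for word in doc.split():
            if word not in dictionary:
                dictionary.append(word)
    return dictionary


dictionary = build_dictionary(documents)
print(" ".join(dictionary))
# aman and anil are stressed went to a therapist download health chatbot
```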
Step 3: Create document vectors
• For each document in the corpus, find out how many times each word
from the dictionary has occurred.
• Document 1: Aman and Anil are stressed
aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
1     1    1     1    1         0     0   0  0          0         0       0
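Step 4: Create document vectors for all the documents.
Document    aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1  1     1    1     1    1         0     0   0  0          0         0       0
Document 2  1     0    0     0    0         1     1   1  1          0         0       0
Document 3  0     0    1     0    0         1     1   1  0          1         1       1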
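The vectors above can be reproduced with a short sketch that counts, for each document, how often every dictionary word occurs. The function name `document_vector` is chosen here for illustration, and the documents are assumed to be lowercased already.

```python
from collections import Counter

dictionary = ["aman", "and", "anil", "are", "stressed", "went",
              "to", "a", "therapist", "download", "health", "chatbot"]

documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]


def document_vector(doc, dictionary):
    """Count how many times each dictionary word occurs in the document."""
    counts = Counter(doc.split())
    # Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in dictionary]


vectors = [document_vector(doc, dictionary) for doc in documents]
for vector in vectors:
    print(vector)
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
```

Each vector has one entry per dictionary word, so all documents end up with vectors of the same length regardless of how long the document is.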
Q. Apply the steps of the Bag of Words algorithm to the corpus below.
Document 1: Dr. Smith went to the hospital.
Document 2: He arrived on time.
Document 3: The operation started soon after.