
1009 NLP PPT

The document discusses Natural Language Processing (NLP) and its applications, including automatic summarization, sentiment analysis, text classification, and virtual assistants. It outlines the AI project cycle, chatbot types (script-bots and smart-bots), and the differences between human and computer languages, emphasizing text normalization processes. Additionally, it explains the Bag of Words algorithm for converting text into numerical data for machine understanding.

Uploaded by aripandey2

Subject specific skills
Chapter 6: Natural Language Processing (8 marks)
NLP
• Enables machines to understand and process human language
Applications
• Automatic summarization: condensing documents while retaining their meaning.
• Sentiment analysis: identifying the sentiments expressed across several posts.
• Text classification: assigning predefined categories to a document, e.g. spam filtering.
• Virtual assistants: Google Assistant, ChatGPT.
AI project cycle
• Problem scoping
• Data acquisition -> surveys, observations, databases from the internet,
interviews
• Data exploration -> the text is normalised through various steps and is
reduced to a minimal vocabulary
• Modelling
• Evaluation
Chatbots
Script-bots
• These bots are pre-programmed with specific responses to certain
phrases or keywords.
• They are good for simple tasks like answering frequently asked
questions, providing basic information, and processing simple
transactions.
• However, they are limited to a set of predetermined responses and
cannot learn from previous interactions with users.
Smart-bots
• These bots use artificial intelligence (AI) to perform their functions.
• These bots are not pre-programmed with responses.
• They can learn from previous interactions with users.
Chatbots: Script-bot vs Smart-bot

Script-bot | Smart-bot
Easy to make | Flexible and powerful
Works around the script which is programmed in it | Works on bigger databases and other resources directly
Mostly free and easy to integrate into a messaging platform | Learns with more data
No or little language-processing skills | Uses NLP to perform its functions
Limited functionality | Wide functionality
Human language vs computer language
• The computer understands the language of numbers.
Human language
• Nouns, verbs, adverbs, adjectives
• There are rules to provide structure to a language.
• Syntax: the grammatical structure of a sentence.
• Human communication is complex.
• His face turned red after he found out that he took the wrong bag.
• The red car zoomed past his nose.
• His face turns red after consuming the medicine.
• In natural language, a word ("red" above) can have multiple meanings
depending on its context.
• Semantics is the meaning of words, phrases, and sentences in human
language.
• Syntax: the grammatical structure of a sentence.
• Semantics is the meaning of words, phrases, and sentences in human
language.
• Different syntax, same semantics: 2+3 = 3+2
Data processing/Text normalization
• Text normalization helps to lower the complexity of
textual data.
• The entire body of textual data from all the documents taken
together is known as the corpus.
• When developing models like ChatGPT, large
amounts of text data (referred to as a corpus) are
used as input during the training process.
• The corpus can include books, articles, websites,
conversations, and other text sources.
• This helps the AI learn patterns of language,
grammar, and knowledge.
Steps of Text Normalisation
i. Sentence Segmentation
ii. Tokenisation
iii. Removing stopwords, special characters and
numbers
iv. Converting text to a common case
v. Stemming or Lemmatization
Sentence Segmentation
• The whole corpus is divided into sentences.
• Example: Corpus: Dr. Smith went to the hospital. He arrived on time.
The operation started soon after.
• After sentence segmentation:
Sentence 1: Dr. Smith went to the hospital.
Sentence 2: He arrived on time.
Sentence 3: The operation started soon after.
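The segmentation above can be sketched in Python. The small abbreviation set below is an assumption made for this example; real segmenters (such as NLTK's sentence tokenizer) use much larger lists or trained models.

```python
# Sentence segmentation sketch: split on sentence-ending punctuation,
# skipping a small set of known abbreviations (an assumption; real
# segmenters handle many more cases).
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms."}

def segment_sentences(corpus):
    sentences, current = [], []
    for token in corpus.split():
        current.append(token)
        # A token ending in . ? or ! closes a sentence, unless it is
        # a known abbreviation such as "Dr."
        if token[-1] in ".?!" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

corpus = ("Dr. Smith went to the hospital. He arrived on time. "
          "The operation started soon after.")
for sentence in segment_sentences(corpus):
    print(sentence)
```

Without the abbreviation check, "Dr." would wrongly end the first sentence, which is exactly why segmentation is harder than splitting on full stops.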
Tokenisation
• Tokens: any word, number, or special character occurring in a
sentence.
• Example: Sentence: Dr. Smith went to the hospital.
• When divided into tokens:

Dr | . | Smith | went | to | the | hospital | .
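Tokenisation as shown above can be sketched with a regular expression; this is one simple approach, not the only one.

```python
import re

# Tokenisation sketch: a regular expression that captures runs of word
# characters as one token and each punctuation mark as its own token.
def tokenise(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenise("Dr. Smith went to the hospital."))
# ['Dr', '.', 'Smith', 'went', 'to', 'the', 'hospital', '.']
```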
Removing stopwords, special characters and
numbers
• Tokens which are not necessary are removed from the token list.
• Stopwords are the words which occur very frequently in the corpus
but do not add any value to it.

Note: if you are working on a document containing email IDs, then do


not remove the special characters and numbers from that document.
Removing stopwords, special characters and numbers

Dr | . | Smith | went | to | the | hospital | .

After removing stopwords, special characters and numbers:

Dr | Smith | went | hospital
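This step can be sketched as a list filter. The stopword set below is an assumption made for this example; libraries such as NLTK ship much larger stopword lists.

```python
# Stopword removal sketch: drop frequent function words, punctuation
# and digits from the token list. STOPWORDS here is a toy assumption.
STOPWORDS = {"a", "an", "and", "are", "he", "on", "the", "to"}

def remove_stopwords(tokens):
    return [t for t in tokens
            if t.lower() not in STOPWORDS  # drop frequent function words
            and t.isalpha()]               # drop punctuation and numbers

tokens = ["Dr", ".", "Smith", "went", "to", "the", "hospital", "."]
print(remove_stopwords(tokens))  # ['Dr', 'Smith', 'went', 'hospital']
```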


Converting text to a common case
• This ensures that the machine's case-sensitivity does not treat the
same word as two different words just because of a difference in case.
Converting text to a common case

Dr | Smith | went | hospital

• After converting the text to lower case:

dr | smith | went | hospital
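In Python this step is a single pass of `str.lower()` over the token list:

```python
# Converting text to a common case: lower-case every token so that
# "Dr" and "dr" are counted as the same word.
tokens = ["Dr", "Smith", "went", "hospital"]
lowered = [t.lower() for t in tokens]
print(lowered)  # ['dr', 'smith', 'went', 'hospital']
```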


Stemming
• Stemming is the process in which the affixes of words are removed
and the words are converted to their base form.

• Stemming does not take into account whether the stemmed word is meaningful or not.
• It just removes the affixes, and is therefore faster.
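A deliberately naive stemmer (a sketch, not the Porter algorithm used by real stemmers) shows how affix removal can produce non-words:

```python
# Naive stemmer sketch: strip common suffixes without checking whether
# the remainder is a real word. The suffix list is an assumption.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        # Only strip if enough of the word remains to be a stem
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

for word in ["studies", "studying", "caring", "stressed"]:
    print(word, "->", stem(word))
# "studies" -> "studi": the stem need not be a meaningful word.
```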
Lemmatization
• Stemming and lemmatization are alternative processes to each other.
• Both remove affixes from words.
• In lemmatization, the word we get after affix removal (also known as
the lemma) is a meaningful one.
• It takes longer to execute than stemming.
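The contrast with stemming can be sketched with a tiny lookup table. The table below is a toy assumption; real lemmatisers use a full vocabulary and morphological analysis.

```python
# Lemmatisation sketch: the output (the lemma) is always a meaningful
# dictionary word. LEMMAS here is a toy lookup table, an assumption
# standing in for a real lemmatiser's vocabulary.
LEMMAS = {"studies": "study", "caring": "care",
          "went": "go", "stressed": "stress"}

def lemmatise(word):
    return LEMMAS.get(word, word)

print(lemmatise("studies"), lemmatise("caring"))  # study care
```

Compare "caring": a naive stemmer may give "car", while the lemma is the meaningful word "care".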
Q. Document1: Aman and Anil are stressed.
• Document2: Aman went to a therapist.
• Document3: Anil went to download a health chatbot.
Apply text normalization on the above corpus. Write the output of each
step of text normalization.
Q. Define
Corpus
Token
Lemma
Syntax
Semantics
Stopwords
Q. Differentiate between stemming and lemmatization with examples.
Bag of Words
• We need to convert the tokens into numbers, since computers can
understand only numbers.
• For this we would use the Bag of Words algorithm.
Bag of Words algorithm
1. Text normalization
2. Create dictionary
3. Create document vectors
4. Create document vectors for all the documents.
Bag of Words algorithm: Step 1: Text normalization
• Document 1: Aman and Anil are stressed
• Document 2: Aman went to a therapist
• Document 3: Anil went to download a health chatbot

After normalization:
aman and anil are stressed
aman went to a therapist
anil went to download a health chatbot

Note: no tokens have been removed in the stopwords removal step. It is
because we have very little data.
Step 2: Create dictionary
• Make a list of all unique words occurring in the corpus.

aman and anil are stressed
aman went to a therapist
anil went to download a health chatbot

Dictionary: aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
Step3: Create document vectors
• For each document in the corpus, find out how many times the word
from the dictionary has occurred.
• Document 1: Aman and Anil are stressed

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Step 4: Create document vectors for all the documents.

• Document 1: Aman and Anil are stressed


• Document 2: Aman went to a therapist
• Document 3: Anil went to download a health chatbot
aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
Document 1: 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Document 2: 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0
Document 3: 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1
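The four steps above can be sketched in a few lines of Python over the normalised corpus from Step 1:

```python
# Bag of Words sketch: build the dictionary of unique words, then one
# count vector per document.
docs = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

# Step 2: dictionary of unique words, in order of first appearance
dictionary = []
for doc in docs:
    for word in doc.split():
        if word not in dictionary:
            dictionary.append(word)

# Steps 3 and 4: for each document, count each dictionary word
vectors = [[doc.split().count(word) for word in dictionary] for doc in docs]

print(dictionary)
for vector in vectors:
    print(vector)
```

The printed vectors match the table above; libraries such as scikit-learn provide the same idea as a ready-made `CountVectorizer`.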
Q. Apply the steps of the Bag of Words algorithm to the corpus below.
Document 1: Dr. Smith went to the hospital.
Document 2: He arrived on time.
Document 3: The operation started soon after.
