
Applications of NLP

Tushar B. Kute,
http://tusharkute.com
NLP Applications
NLP Applications

• Text Communication and Interaction:


• Machine Translation: Automatically translating text
between languages, breaking down language barriers.
• Chatbots and Virtual Assistants: Developing chatbots that
can understand and respond to user queries in a
conversational way, or virtual assistants that can perform
tasks based on spoken instructions.
• Text Classification and Spam Filtering: Categorizing text
data like emails or social media posts (e.g., spam detection,
sentiment analysis).
• Text Summarization: Automatically generating concise
summaries of lengthy documents or articles.
NLP Applications

• Language Understanding and Analysis:


– Sentiment Analysis: Extracting sentiment or opinion
(positive, negative, neutral) from text data like reviews,
social media posts, or surveys.
– Named Entity Recognition (NER): Identifying and
classifying named entities in text, such as people,
organizations, locations, dates, monetary values, etc.
– Topic Modeling: Discovering underlying thematic
structures in large collections of documents.
– Part-of-Speech (POS) Tagging: Assigning grammatical
labels (e.g., noun, verb, adjective) to each word in a
sentence for deeper syntactic analysis.
NLP Applications

• Content Creation and Text Generation:


– Machine Writing: Generating different creative text
formats like poems, code, scripts, emails, or
marketing copy, based on specific styles or
instructions.
– Text Paraphrasing and Re-writing: Rewriting
sentences or passages while preserving the meaning
but using different wording or sentence structures.
– Automatic Text Summarization: Creating shorter
versions of lengthy documents or articles that
capture the main points.
NLP Applications

• Additional Applications:
• Speech Recognition: Converting spoken language into text
format, enabling voice-enabled applications.
• Text-to-Speech (TTS): Converting written text into spoken
language for applications like audiobooks or assistive
technologies.
• Optical Character Recognition (OCR): Extracting text from
images or scanned documents.
• Author Identification: Identifying the author of a text based
on stylistic patterns.
• Information Retrieval: Finding relevant documents or
information from large collections of text data.
Information retrieval

• Information retrieval (IR) is the field of computer


science concerned with finding and accessing
information from large collections of data, typically
text-based.
• It's the foundation for many technologies you use
every day, like search engines and library catalogs.
– The primary goal of IR is to identify and deliver
information items (documents, web pages, etc.)
that are relevant to a user's information need.
This need is often expressed as a search query.
Information retrieval

• Process:
– User Query: The user submits a query that
specifies their information need.
– This query can be a simple keyword search or a
more complex phrase expressing a specific topic
or question.
Information retrieval

• Retrieval Process:
• The IR system retrieves a set of documents or data items that
are potentially relevant to the query. This might involve
techniques like:
– Indexing: Preprocessing and storing information about
documents in a structured way to facilitate efficient retrieval.
– Matching: Comparing the user's query with the indexed
information to identify documents with a high degree of
relevance. Different matching algorithms can be used based
on keywords, phrases, or semantic similarity.
– Ranking: Ranking the retrieved documents based on their
estimated relevance to the user's query. This ranking helps
users prioritize which documents to examine first.
Information retrieval

• Evaluation:
– The effectiveness of an IR system is often
evaluated by metrics like precision (proportion
of retrieved documents that are relevant) and
recall (proportion of relevant documents that
are retrieved).
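
A minimal sketch of these two metrics in Python; the document IDs are illustrative, not from the slides:

    def precision(retrieved, relevant):
        # Proportion of retrieved documents that are relevant.
        return len(set(retrieved) & set(relevant)) / len(retrieved)

    def recall(retrieved, relevant):
        # Proportion of relevant documents that were retrieved.
        return len(set(retrieved) & set(relevant)) / len(relevant)

    retrieved = ["d1", "d2", "d3", "d4"]   # what the system returned
    relevant = ["d1", "d3", "d7"]          # ground-truth relevant set
    print(precision(retrieved, relevant))  # 2/4 = 0.5
    print(recall(retrieved, relevant))     # 2/3 ≈ 0.67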
Information retrieval

• Applications:
• Web Search Engines: Tools like Google, Bing, and DuckDuckGo
use IR techniques to crawl and index the web, enabling users to
find relevant information through search queries.
• Library Catalogs: Online library catalogs utilize IR to help users
search for books, articles, and other library resources based on
keywords, titles, authors, or other criteria.
• Email Search: Search functionalities within email applications
rely on IR techniques to find specific emails based on keywords
or senders/recipients.
• E-commerce Product Search: Product search on e-commerce
websites uses IR to match user queries with product
descriptions, specifications, and attributes.
Vector Space Model

• The Vector Space Model (VSM) is a fundamental and


widely used technique in information retrieval (IR)
for representing documents and queries as vectors
in a high-dimensional space.
• VSM represents documents and queries as vectors
in a multi-dimensional space where each dimension
corresponds to a unique term in the vocabulary of
all documents in the collection.
• The weight or value associated with each term in a
document's vector reflects the term's importance or
relevance to that document.
Vector Space Model

• Documents and queries are represented as rows


in a term-document matrix.
• Each column represents a unique term.
• The value at a specific row (document) and
column (term) intersection indicates the weight
of that term within that document.
Vector Space Model

• Term Frequency (TF): The raw frequency of a term's


occurrence within a document. Higher frequency suggests
more relevance.
• Inverse Document Frequency (IDF): Considers the term's
overall importance across the entire document collection.
Terms appearing in many documents have lower IDF
weights, while terms specific to a few documents have
higher weights.
• TF-IDF: Combines TF and IDF, giving more weight to terms
that are frequent within a document but rare across the
entire collection. This helps focus on terms that are
distinctive and informative for that specific document.
Bag of Words
Bag of words

• Bag of words is a Natural Language Processing technique for text modelling.
• In technical terms, we can say that it is a method
of feature extraction with text data.
• This approach is a simple and flexible way of
extracting features from documents.
Bag of words

• A bag of words is a representation of text that


describes the occurrence of words within a document.
• We just keep track of word counts and disregard the
grammatical details and the word order.
• It is called a “bag” of words because any information
about the order or structure of words in the
document is discarded.
• The model is only concerned with whether known
words occur in the document, not where in the
document.
Bag of words: Why?

• One of the biggest problems with text is that it is messy and unstructured, while machine learning algorithms prefer structured, well-defined, fixed-length inputs. By using the Bag-of-Words technique we can convert variable-length texts into fixed-length vectors.
• Also, at a more granular level, machine learning models work with numerical rather than textual data.
• So, to be more specific, by using the bag-of-words (BoW) technique, we convert a text into its equivalent vector of numbers.
Bag of words: Example

• Sentences:
– The quick brown fox jumps over the lazy dog.
– The cat chases the mouse and it squeaks
loudly.
Bag of words: Example
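
A minimal sketch reproducing the word-count vectors for the two sentences above, assuming scikit-learn is available:

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [
        "The quick brown fox jumps over the lazy dog.",
        "The cat chases the mouse and it squeaks loudly.",
    ]
    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(sentences)
    print(vectorizer.get_feature_names_out())  # the combined vocabulary
    print(bow.toarray())                       # one fixed-length count vector per sentence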
N-grams

• What are n-grams, and why do we use them? Let us understand this with the example below.

• Sentence 1: “This is a good job. I will not miss it for


anything”

• Sentence 2: “This is not good at all”


N-grams

• For this example, let us take a vocabulary of only five words:
– good
– job
– miss
– not
– all
• So, the respective vectors for these sentences are:
“This is a good job. I will not miss it for anything” = [1,1,1,1,0]
“This is not good at all” = [1,0,0,1,1]
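
These two vectors can be reproduced with a fixed vocabulary and binary (presence/absence) counting; a small sketch with scikit-learn:

    from sklearn.feature_extraction.text import CountVectorizer

    vocabulary = ["good", "job", "miss", "not", "all"]
    vectorizer = CountVectorizer(vocabulary=vocabulary, binary=True)
    vectors = vectorizer.fit_transform([
        "This is a good job. I will not miss it for anything",
        "This is not good at all",
    ])
    print(vectors.toarray())  # [[1 1 1 1 0]
                              #  [1 0 0 1 1]]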
N-grams

• Can you guess what the problem is here? Sentence 2 is a negative sentence and sentence 1 is a positive sentence. Does this reflect in any way in the vectors above? Not at all.
• So how can we solve this problem? Here come the N-
grams to our rescue.
• An N-gram is an N-token sequence of words: a 2-gram
(more commonly called a bigram) is a two-word
sequence of words like “really good”, “not good”, or
“your homework”, and a 3-gram (more commonly called a
trigram) is a three-word sequence of words like “not at
all”, or “turn off light”.
N-grams

• For example, the bigrams in sentence 2 from the previous section, “This is not good at all”, are as follows:
– “This is”
– “is not”
– “not good”
– “good at”
– “at all”
• Now, if instead of using just words in the above example we use bigrams (a bag of bigrams) as shown above, the model can differentiate between sentence 1 and sentence 2.
• So, using bigrams makes tokens more informative (for example, “HSR Layout” in Bengaluru is more informative than “HSR” and “layout” separately).
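
A short sketch producing both the bigram list and bag-of-bigram features, assuming NLTK and scikit-learn are installed:

    from nltk import bigrams, word_tokenize  # requires nltk.download('punkt')

    tokens = word_tokenize("This is not good at all")
    print(list(bigrams(tokens)))
    # [('This', 'is'), ('is', 'not'), ('not', 'good'), ('good', 'at'), ('at', 'all')]

    from sklearn.feature_extraction.text import CountVectorizer

    vec = CountVectorizer(ngram_range=(2, 2))  # a bag of bigrams
    vec.fit(["This is not good at all"])
    print(vec.get_feature_names_out())
    # ['at all' 'good at' 'is not' 'not good' 'this is']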
The TF-IDF Vectorizer

• The TF*IDF algorithm is used to weigh a keyword in


any document and assign the importance to that
keyword based on the number of times it appears
in the document.
• Put simply, the higher the TF*IDF score (weight) of a term in a document, the more distinctive and important that term is for that document, and vice versa.
• Each word or term has its respective TF and IDF
score. The product of the TF and IDF scores of a
term is called the TF*IDF weight of that term.
The TF-IDF Vectorizer

• The TF (term frequency) of a word is the number of times


it appears in a document. When you know it, you’re able
to see if you’re using a term too often or too
infrequently.
– TF(t) = (Number of times term t appears in a
document) / (Total number of terms in the document).
• The IDF (inverse document frequency) of a word is the
measure of how significant that term is in the whole
corpus.
– IDF(t) = log_e(Total number of documents / Number
of documents with term t in it).
The TF-IDF Vectorizer
Example:

• 1. It was a beautiful rainy day that made my whole day awesome.
• 2. We made it awesome by adding more flavors on
that day.
Example:
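
A sketch computing the TF-IDF weights for these two sentences with scikit-learn. Note that scikit-learn uses a smoothed IDF variant (log((1+N)/(1+df)) + 1) rather than the plain log(N/df) given above, so the exact numbers differ, but the ranking intuition is the same:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "It was a beautiful rainy day that made my whole day awesome.",
        "We made it awesome by adding more flavors on that day.",
    ]
    vectorizer = TfidfVectorizer()
    weights = vectorizer.fit_transform(docs)
    # Weights for document 1: words unique to it (e.g. "beautiful", "rainy")
    # outweigh words shared with document 2 (e.g. "made", "awesome").
    for term, score in zip(vectorizer.get_feature_names_out(), weights.toarray()[0]):
        if score > 0:
            print(f"{term}: {score:.3f}")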
Vector Space Model

• Document Similarity:
– Once documents and queries are represented as
vectors, VSM calculates the similarity between
them.
– Common similarity measures include cosine
similarity, which considers the angle between
the two vectors in the high-dimensional space. A
higher cosine similarity score indicates a closer
semantic relationship between the document
and the query.
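
A minimal sketch of this matching step: vectorize documents and query with TF-IDF, then score the query against each document with cosine similarity (the texts are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "Machine translation breaks down language barriers.",
        "Search engines retrieve relevant documents for a query.",
    ]
    query = ["retrieve documents relevant to a search query"]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform(query)
    print(cosine_similarity(query_vector, doc_vectors))
    # The second document scores higher: it lies closer to the query in vector space.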
Information Extraction using sequence labelling

• Information extraction (IE) using sequence labeling


is a powerful technique for automatically extracting
specific pieces of information from text data.
• Sequence labeling models process text data one
element (word or character) at a time, predicting a
label for each element that indicates its role in the
information you want to extract.
Information Extraction using sequence labelling

• Process:
– Data Preparation:
• Define the information you want to extract
(e.g., names of people, locations,
organizations, dates).
• Annotate a training dataset where each word
or character in a sentence is labeled with its
corresponding role (e.g., "B-PER" for the
beginning of a person's name, "I-PER" for the
middle of a person's name).
Information Extraction using sequence labelling

• Sequence Labeling Model:


– Popular choices include:
• Bidirectional Long Short-Term Memory (BiLSTM): A
recurrent neural network (RNN) architecture that can
capture contextual information from both directions
of the sentence.
• Conditional Random Fields (CRFs): Probabilistic
graphical models that consider dependencies between
labels for consecutive elements in the sequence.
– The model is trained on the annotated data, learning to
predict the correct label for each element in a new
unseen sentence.
Information Extraction using sequence labelling

• Example:
– Sentence: "Barack Obama, the former president
of the United States, visited Paris."
– Labels: "B-PER Barack I-PER Obama B-TITLE
president I-TITLE of the I-ORG United I-ORG
States I-LOC Paris."
– Extracted Information: Person: Barack Obama,
Title: president of the United States, Location:
Paris
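
In practice, a pretrained model can produce this kind of output without training your own sequence labeller. A sketch using NLTK's built-in chunker; note its label set (PERSON, GPE, ...) differs from the BIO tags above, and it has no TITLE category:

    import nltk
    # Requires: nltk.download() of 'punkt', 'averaged_perceptron_tagger',
    # 'maxent_ne_chunker', and 'words'.

    sentence = "Barack Obama, the former president of the United States, visited Paris."
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)   # POS tags feed the chunker
    tree = nltk.ne_chunk(tagged)    # named-entity chunk tree

    for subtree in tree:
        if hasattr(subtree, "label"):   # entity chunks are subtrees
            entity = " ".join(token for token, pos in subtree.leaves())
            print(subtree.label(), "->", entity)
    # Typical output along the lines of:
    # PERSON -> Barack Obama, GPE -> United States, GPE -> Paris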
Information Extraction using sequence labelling

• Applications:
– Named Entity Recognition (NER): Identifying and
classifying named entities like people,
organizations, locations, dates, etc.
– Relation Extraction: Extracting relationships
between entities (e.g., "works at", "located in").
– Event Extraction: Identifying and classifying events
described in text (e.g., "protest", "financial
transaction").
– Question Answering: Extracting answers to specific
questions from factual text data.
Question answering system

• Question answering (QA) systems are computer


programs designed to automatically answer
questions posed in natural language.
• They aim to bridge the gap between humans
and information retrieval by providing more
direct and user-friendly access to knowledge.
Question answering system

• Users submit questions in natural language


(e.g., "What is the capital of France?").
• The system understands the question's intent
and retrieves relevant information from a
knowledge base or vast collection of text data.
• The system processes the information and
generates an answer that directly addresses the
user's query.
QA System : Approaches

• Retrieval-Based QA: Focuses on finding documents or


passages containing the answer to the question. This
might involve information retrieval techniques and
keyword matching.
• Knowledge-Based QA: Leverages a structured knowledge
base containing information about entities, relationships,
and facts. The system queries the knowledge base to find
answers based on the question's meaning.
• Generative QA: Utilizes natural language generation
techniques to formulate an answer directly, even if the
answer isn't explicitly stated in the knowledge base. This
often involves deep learning models.
QA System : Applications

• Search Engines: Many search engines incorporate QA


functionalities to provide more user-friendly and
informative answers to search queries.
• Virtual Assistants: Chatbots and virtual assistants leverage
QA systems to answer user questions and complete tasks
based on natural language instructions.
• Education and Training: Educational platforms can utilize
QA systems to provide immediate feedback and answer
student questions on various topics.
• Customer Service: QA systems can be integrated into
customer service chatbots to answer frequently asked
questions and provide support.
Text Classification / Categorization

• Text Classification is the process of labeling or organizing text data into groups.
• It forms a fundamental part of Natural Language Processing. In the digital age we live in, we are surrounded by text: on our social media accounts, in commercials, on websites, in e-books, etc.
• The majority of this text data is unstructured, so classifying it can be extremely useful.
Text Classification
Text Classification: Applications

• Spam detection in emails


• Sentiment analysis of online reviews
• Topic labeling documents like research papers
• Language detection like in Google Translate
• Age/gender identification of anonymous users
• Tagging online content
• Speech recognition used in virtual assistants like
Siri and Alexa
Rule Based Approach

• These approaches make use of handcrafted linguistic


rules to classify text.
• One way to group text is to create a list of words related to a certain category and then judge the text based on the occurrences of these words.
• For example, words like “fur”, “feathers”, “claws”, and
“scales” could help a zoologist identify texts talking
about animals online.
• These approaches require a lot of domain knowledge
to be extensive, take a lot of time to compile, and are
difficult to scale.
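
The zoologist example boils down to a keyword-set lookup; a toy sketch (the threshold is a free choice, not from the original):

    # Keyword list for the "animals" category, from the slide above.
    ANIMAL_WORDS = {"fur", "feathers", "claws", "scales"}

    def is_about_animals(text, threshold=1):
        # Judge the text by how many category keywords it contains.
        tokens = set(text.lower().split())
        return len(tokens & ANIMAL_WORDS) >= threshold

    print(is_about_animals("Its claws gripped the bark while its fur bristled"))  # True
    print(is_about_animals("The stock market closed higher today"))               # False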
Machine Learning Approach
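
With the machine learning approach, a classifier learns word weights from labelled examples instead of relying on handcrafted rules. A minimal sketch using a scikit-learn pipeline; the tiny spam/ham dataset is invented for illustration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["win a free prize now", "meeting rescheduled to monday",
             "free cash offer", "project report attached"]
    labels = ["spam", "ham", "spam", "ham"]

    # Bag-of-words features feeding a Naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    print(model.predict(["claim your free prize"]))  # likely ['spam']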
Text Summarization

• Text summarization is the process of generating a short, fluent, and, most importantly, accurate summary of a longer text document.
• The main idea behind automatic text summarization
is to be able to find a short subset of the most
essential information from the entire set and
present it in a human-readable format.
• As online textual data grows, automatic text
summarization methods have the potential to be
very helpful because more useful information can be
read in a short time.
Text Summarization
Why Auto Text Summarization?

• Summaries reduce reading time.


• When researching documents, summaries make the
selection process easier.
• Automatic summarization improves the effectiveness of
indexing.
• Automatic summarization algorithms are less biased than
human summarization.
• Personalized summaries are useful in question-answering
systems as they provide personalized information.
• Using automatic or semi-automatic summarization systems
enables commercial abstract services to increase the number
of text documents they are able to process.
Text Summarization Types
Text Summarization

• Based on input type:


– Single Document, where the input length is
short. Many of the early summarization
systems dealt with single-document
summarization.
– Multi-Document, where the input can be
arbitrarily long.
Text Summarization

• Based on the purpose:


– Generic, where the model makes no assumptions about the
domain or content of the text to be summarized and treats
all inputs as homogeneous. The majority of the work that
has been done revolves around generic summarization.
– Domain-specific, where the model uses domain-specific
knowledge to form a more accurate summary. For example,
summarizing research papers of a specific domain,
biomedical documents, etc.
– Query-based, where the summary only contains information
that answers natural language questions about the input
text.
Text Summarization

• Based on output type:


– Extractive, where important sentences are selected
from the input text to form a summary. Most
summarization approaches today are extractive in
nature.
– Abstractive, where the model forms its own phrases
and sentences to offer a more coherent summary,
like what a human would generate. This approach is
definitely more appealing, but much more difficult
than extractive summarization.
TextRank Algorithm

• TextRank is an extractive summarization technique.
• It is a graph-based ranking algorithm inspired by PageRank: each sentence is a node, and edges between sentences are weighted by how similar the sentences are (for example, by the words they share).
• Based on this graph, the algorithm assigns a score to each sentence in the text. The top-ranked sentences make it into the summary.
TextRank Algorithm
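
A simplified sketch of this idea, assuming scikit-learn, NumPy, and NLTK's punkt tokenizer are installed: build a sentence-similarity graph over TF-IDF vectors and rank sentences with a PageRank-style power iteration.

    import numpy as np
    from nltk.tokenize import sent_tokenize  # requires nltk.download('punkt')
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def textrank_summary(text, n_sentences=2, damping=0.85, iterations=50):
        sentences = sent_tokenize(text)
        tfidf = TfidfVectorizer().fit_transform(sentences)
        sim = cosine_similarity(tfidf)   # sentence-similarity graph
        np.fill_diagonal(sim, 0.0)       # drop self-loops
        rows = sim.sum(axis=1, keepdims=True)
        rows[rows == 0] = 1.0            # avoid division by zero
        transition = sim / rows          # row-stochastic transition matrix
        scores = np.ones(len(sentences)) / len(sentences)
        for _ in range(iterations):      # PageRank-style power iteration
            scores = (1 - damping) / len(sentences) + damping * transition.T @ scores
        # Keep the top-ranked sentences, preserving their original order.
        top = sorted(np.argsort(scores)[-n_sentences:])
        return " ".join(sentences[i] for i in top)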
Sentiment Analysis

• Sentiment analysis, also known as opinion mining, is


a technique in natural language processing (NLP)
that aims to understand the emotional tone or
opinion expressed in a piece of text.
• It analyzes text data to classify the sentiment as
positive, negative, or neutral.
Sentiment Analysis
Sentiment Analysis: Approaches

• Lexicon-Based Approach: Relies on sentiment lexicons,


which are large dictionaries containing words with
predefined sentiment scores (positive, negative, or
neutral). The sentiment score of a text is calculated
based on the sentiment scores of the words it contains.
• Machine Learning Approach: Trains machine learning
models on labeled data sets of text with known
sentiment. These models can then be used to classify
the sentiment of new, unseen text data. This approach
can be more nuanced than lexicon-based methods.
Sentiment Analysis: Approaches
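
A small lexicon-based sketch using VADER, a sentiment lexicon bundled with NLTK (requires nltk.download('vader_lexicon')); note how it handles the negation that tripped up the plain bag-of-words example earlier:

    from nltk.sentiment import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores("This is a good job. I will not miss it for anything"))
    print(sia.polarity_scores("This is not good at all"))
    # The 'compound' score is positive for the first sentence
    # and negative for the second.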
Sentiment Analysis: Applications

• Understanding Customer Reviews: Businesses can analyze


customer reviews of products or services to gauge overall
sentiment and identify areas for improvement.
• Social Media Monitoring: Brands can track sentiment on
social media platforms to understand public perception
and respond to negative feedback.
• Market Research: Analyzing online opinions can help
understand customer preferences and inform marketing
strategies.
• News Analysis: Sentiment analysis can be used to
understand the overall tone of news articles or social
media discussions about current events.
Sentiment Analysis: Challenges

• Sarcasm and Irony: Text can be subjective and


contain sarcasm or irony, which can be difficult for
sentiment analysis tools to detect.
• Context and Nuance: Sentiment analysis might not
always capture the full context of a situation or the
subtle nuances of human language.
• Multilingual Sentiment Analysis: Analyzing
sentiment across different languages presents
additional challenges due to cultural and linguistic
variations.
Named Entity Recognition

• Named entity recognition (NER) — sometimes referred


to as entity chunking, extraction, or identification — is
the task of identifying and categorizing key information
(entities) in text.
• An entity can be any word or series of words that
consistently refers to the same thing. Every detected
entity is classified into a predetermined category.
• For example, an NER machine learning (ML) model
might detect the word “MITU Skillologies” in a text and
classify it as a “Company”.
Named Entity Recognition

• NER is a form of natural language processing


(NLP), a subfield of artificial intelligence.
• NLP is concerned with computers processing
and analyzing natural language, i.e., any
language that has developed naturally, rather
than artificially, such as with computer coding
languages.
Named Entity Recognition
Named Entity Recognition

• Person
– E.g., Elvis Presley, Audrey Hepburn, David Beckham
• Organization
– E.g., Google, Mastercard, University of Oxford
• Time
– E.g., 2006, 16:34, 2am
• Location
– E.g., Trafalgar Square, MoMA, Machu Picchu
• Work of art
– E.g., Hamlet, Guernica, Exile on Main St.
How is NER used?

• NER is suited to any situation in which a high-


level overview of a large quantity of text is
helpful.
• With NER, you can, at a glance, understand the
subject or theme of a body of text and quickly
group texts based on their relevancy or
similarity.
How is NER used?

• Human resources
– Speed up the hiring process by summarizing
applicants’ CVs; improve internal workflows by
categorizing employee complaints and questions
• Customer support
– Improve response times by categorizing user
requests, complaints and questions and filtering
by priority keywords
How is NER used?

• Search and recommendation engines


– Improve the speed and relevance of search
results and recommendations by summarizing
descriptive text, reviews, and discussions
– Booking.com is a notable success story here
• Content classification
– Surface content more easily and gain insights
into trends by identifying the subjects and
themes of blog posts and news articles
How is NER used?

• Health care
– Improve patient care standards and reduce workloads by
extracting essential information from lab reports
– Roche is doing this with pathology and radiology reports
• Academia
– Enable students and researchers to find relevant material
faster by summarizing papers and archive material and
highlighting key terms, topics, and themes
– The EU’s digital platform for cultural heritage,
Europeana, is using NER to make historical newspapers
searchable
Pre-processing

• To prepare text data for model building, we perform text preprocessing. It is the very first step of any NLP project. Some of the preprocessing steps are:
– Removing punctuation marks like . , ! $ ( ) * % @
– Removing URLs
– Removing Stop words
– Lower casing
– Tokenization
– Stemming
– Lemmatization
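
A sketch chaining these steps with NLTK; it assumes the punkt, stopwords, and wordnet resources have been downloaded:

    import re
    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    def preprocess(text):
        text = re.sub(r"https?://\S+", "", text)  # remove URLs first
        text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
        text = text.lower()                       # lower casing
        tokens = nltk.word_tokenize(text)         # tokenization
        stop = set(stopwords.words("english"))
        tokens = [t for t in tokens if t not in stop]  # remove stop words
        stems = [PorterStemmer().stem(t) for t in tokens]            # stemming
        lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # lemmatization
        return stems, lemmas

    print(preprocess("The striped bats are hanging on their feet: https://example.com"))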
Why Pre-processing?

• Text preprocessing has a significant effect on model performance.
• Data preprocessing is an essential step in building a machine learning model, and the quality of the results depends on how well the data has been preprocessed.
• In NLP, text preprocessing is the first step in the process of building a model.
NLTK

• The Natural Language Toolkit, or more commonly


NLTK, is a suite of libraries and programs for symbolic
and statistical natural language processing (NLP) for
English written in the Python programming language.
• It was developed by Steven Bird and Edward Loper in
the Department of Computer and Information
Science at the University of Pennsylvania.
• NLTK includes graphical demonstrations and sample
data. It is accompanied by a book that explains the
underlying concepts behind the language processing
tasks supported by the toolkit, plus a cookbook.
NLTK

• NLTK is intended to support research and teaching in


NLP or closely related areas, including empirical
linguistics, cognitive science, artificial intelligence,
information retrieval, and machine learning.
• NLTK has been used successfully as a teaching tool, as
an individual study tool, and as a platform for
prototyping and building research systems.
• NLTK is used in courses at 32 universities in the US and in 25 countries.
• NLTK supports classification, tokenization, stemming,
tagging, parsing, and semantic reasoning functionalities
nltk.org
Install nltk

• !pip install nltk -U

• Installing nltk packages


– import nltk
– nltk.download('package-name')
Using Python Scripts
Using Python Scripts
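
For example, a small stand-alone script that tokenizes, POS-tags, and stems a sentence with NLTK might look like this (illustrative; any sentence works):

    import nltk
    from nltk.stem import PorterStemmer

    text = "NLP bridges the language gap between humans and computers."
    tokens = nltk.word_tokenize(text)   # requires nltk.download('punkt')
    print(tokens)
    print(nltk.pos_tag(tokens))         # requires 'averaged_perceptron_tagger'
    print([PorterStemmer().stem(t) for t in tokens])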
Chatbot

• Chatbots are software applications that use artificial intelligence and natural language processing to understand what a human wants, and guide them to their desired outcome with as little work for the end user as possible.
• Think of them as a virtual assistant for your customer experience touchpoints.
Chatbot

• A well designed & built chatbot will:


– Use existing conversation data (if available)
to understand the type of questions people
ask.
– Analyze correct answers to those questions
through a ‘training’ period.
– Use machine learning & NLP to learn context,
and continually get better at answering those
questions in the future.
Chatbot
Chatbot

• One of the most interesting parts of the chatbot


software space is the variety of ways you can build a
chatbot.
• The underlying technology can vary quite a bit, but it
really all comes down to what your goals are. At the
highest level, there are three types of chatbots most
consumers see today:
– Rules-Based Chatbots – These chatbots follow pre-
designed rules, often built using a graphical user
interface where a bot builder will design paths using
a decision tree.
Chatbot

• Continued...
– AI Chatbots – AI chatbots will automatically learn
after an initial training period by a bot developer.
– Live Chat – These bots are primarily used by Sales & Sales Development teams. They can also be used by Customer Support organizations, as live chat is a simpler chat option for answering questions in real time.
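
A toy sketch of the rules-based variety: keyword patterns mapped to canned responses, roughly the decision-tree logic a GUI bot builder generates (rules and replies are invented for illustration):

    import re

    # Each rule: a keyword set and the canned reply it triggers.
    RULES = [
        ({"price", "cost", "pricing"}, "Our plans start at $10/month."),
        ({"hours", "open", "opening"}, "We are open 9am-5pm, Monday to Friday."),
        ({"hello", "hi", "hey"}, "Hello! How can I help you today?"),
    ]

    def reply(message):
        words = set(re.findall(r"[a-z']+", message.lower()))
        for keywords, response in RULES:
            if words & keywords:   # first matching rule wins
                return response
        return "Sorry, I didn't catch that. Could you rephrase?"

    print(reply("Hi there!"))            # greeting rule
    print(reply("What does it cost?"))   # pricing rule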
Chatbot
Dialogflow Chatbot

• Dialogflow, a Google Cloud Platform service,


allows you to build conversational interfaces
(chatbots) for various applications like websites,
mobile apps, or messaging platforms.
• It utilizes machine learning to understand user
intents and generate appropriate responses.
Dialogflow Chatbot
Summary

• NLP bridges the language gap, allowing computers to


understand and process human language.
• It powers machine translation, transforming text from one
language to another.
• NLP also fuels chatbots and virtual assistants, enabling
them to respond to our questions and requests in a
conversational manner.
• Furthermore, it empowers sentiment analysis, revealing the
emotions and opinions hidden within text.
• By unlocking the secrets of human language, NLP is
revolutionizing the way we interact with machines and
information in the digital world.
Thank you
This presentation was created using LibreOffice Impress 7.4.1.2 and can be used freely under the GNU General Public License.

@mITuSkillologies @mitu_group @mitu-skillologies @MITUSkillologies

Web Resources
https://mitu.co.in
@mituskillologies http://tusharkute.com @mituskillologies

[email protected]
[email protected]
