
G. H. Patel College of Engineering and Technology
Text Analysis, Summarization and Extraction
Text Classification:
Introduction:

• Unstructured data accounts for over 80% of all data, and text is one of the most common categories. Because its messy nature makes text data difficult and time-consuming to analyze, comprehend, organize, and sift through, most businesses do not exploit it to its full potential despite all the benefits it could bring.
Introduction:

• This is where Machine Learning and text classification come into play. Companies can use text classifiers to quickly and cost-effectively organize all types of relevant content, including emails, legal documents, social media, chatbots, surveys, and more.
Introduction:

• This chapter covers some of the essential models you need to know, how to evaluate those models, and the potential alternatives to developing your own algorithms.
What is a Text Classifier:

• Text classification is a core Machine Learning technique used in Natural Language Processing (NLP) applications such as sentiment analysis, spam detection, and intent detection.
• A text classifier labels unstructured texts into predefined text
categories. Instead of users having to review and analyze vast
amounts of information to understand the context, text
classification helps derive relevant insight.
• The goal of text classification is to categorize or predict a class of
unseen text documents, often with the help of supervised machine
learning.
Text Classification pipeline:
Text classification use cases and application:

• A spam filter is a common application that uses text classification to sort emails into spam and non-spam categories.
Spam Classification:
Classifying news articles and blogs:

• A supervised machine learning model is trained on labeled data,


which includes both the raw text and the target. Once a model is
trained, it is then used in production to obtain a category (label)
on the new and unseen data (articles/blogs written in the future).
Classifying news articles and blogs:
Types of Text classification System:

• There are two types of text classification:

• 1. Rule-based text classification
• 2. Machine learning-based text classification
Rule-based text classification:

• Rule-based techniques use a set of manually constructed language


rules to categorize text into categories or groups.

• For example, imagine you have tons of news articles, and your goal is to assign them to relevant categories such as Sports, Politics, Economy, etc.
Rule based text classification:

• With a rule-based classification system, you will do a human


review of a couple of documents to come up with linguistic rules
like this one:

• If the document contains words such as money, dollar, GDP, or inflation, it belongs to the Economy group (class).
Rule-based text classification:

• To begin with, these systems demand in-depth expertise in the field. They also take a lot of time, since creating rules for a complicated system is difficult and frequently necessitates extensive study and testing.
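A rule-based classifier of the kind described above can be sketched in a few lines of Python. The categories and keyword lists here are illustrative assumptions, not a real production rule set:

```python
# Minimal rule-based text classifier: each category is defined by a
# hand-written keyword set; a document is assigned to the category
# whose keywords it matches most often.
RULES = {
    "Sports": {"match", "goal", "tournament", "player"},
    "Economy": {"money", "dollar", "gdp", "inflation"},
    "Politics": {"election", "parliament", "minister", "vote"},
}

def classify(document: str) -> str:
    tokens = set(document.lower().split())
    # Score each category by the number of keyword hits in the document.
    scores = {cat: len(tokens & keywords) for cat, keywords in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unknown"

print(classify("Inflation and GDP figures pushed the dollar higher"))  # Economy
```

The example also shows the weakness noted above: every new domain requires a human to study documents and extend the keyword sets by hand.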
Machine learning Based text classification:

• Machine learning-based text classification is a supervised machine


learning problem.

• It learns the mapping of input data (raw text) with the labels (also
known as target variables).
Machine learning Based text classification:

• As a supervised machine learning problem, text classification has two phases: training and prediction.
Machine learning Based text classification:
Training Phase:

• A supervised machine learning algorithm is trained on the input-


labeled dataset during the training phase. At the end of this
process, we get a trained model that we can use to obtain
predictions (labels) on new and unseen data.
Training Phase and Prediction phase
Prediction phase

• Once a machine learning model is trained, it can be used to


predict labels on new and unseen data. This is usually done by
deploying the best model from an earlier phase as an API on the
server.
Prediction phase
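The two phases can be illustrated with a minimal, self-contained sketch. A tiny Naive Bayes classifier (one common choice for text, though any supervised algorithm would do) stands in for the trained model; the texts and labels are made up for illustration:

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, texts, labels):
        # Training phase: count words per label from the labeled dataset.
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for tok in text.lower().split():
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        # Prediction phase: pick the label with the highest log-probability.
        tokens = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_logp = None, float("-inf")
        for label in self.label_counts:
            logp = math.log(self.label_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                logp += math.log((self.word_counts[label][tok] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Training phase: learn from labeled examples.
model = TinyNaiveBayes().fit(
    ["free prize click now", "win money now",
     "meeting agenda attached", "see you at lunch"],
    ["spam", "spam", "ham", "ham"],
)
# Prediction phase: label new, unseen text.
print(model.predict("claim your free money"))  # spam
```

In production, the trained model would be serialized and deployed behind an API, as described above; only the `predict` step runs at serving time.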
Text Preprocessing Pipeline:
Feature Extraction

• The two most common methods for extracting features from text, in other words converting text data (strings) into numeric features so a machine learning model can be trained, are: Bag of Words (a.k.a. CountVectorizer) and TF-IDF.
Bag of Words:

• A bag of words (BoW) model is a simple way of representing text


data as numeric features. It involves creating a vocabulary of
known words in the corpus and then creating a vector for each
document that contains counts of how often each word appears.
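The BoW idea can be sketched with the standard library alone (this reimplements, in miniature, what scikit-learn's CountVectorizer does):

```python
from collections import Counter

def bag_of_words(corpus):
    # Build the vocabulary of known words across the whole corpus.
    vocab = sorted({tok for doc in corpus for tok in doc.lower().split()})
    # One count vector per document, aligned to the vocabulary.
    vectors = []
    for doc in corpus:
        counts = Counter(doc.lower().split())
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat sat on the mat"])
print(vocab)    # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Note that word order is discarded, which is exactly why the model is called a "bag" of words.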
Bag of Words:
TF-IDF

• The TF-IDF model is different from the bag of words model in that
it takes into account the frequency of the words in the document,
as well as the inverse document frequency. This means that the
TF-IDF model is more likely to identify the important words in a
document than the bag of words model.
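A minimal TF-IDF sketch, using the textbook weighting tf x log(N/df) (scikit-learn's TfidfVectorizer uses a smoothed, normalized variant):

```python
import math
from collections import Counter

def tf_idf(corpus):
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        # tf = term count / document length; idf = log(N / df).
        weights.append({tok: (counts[tok] / len(doc)) * math.log(n / df[tok])
                        for tok in counts})
    return weights

weights = tf_idf(["the cat sat", "the dog ran", "the cat ran"])
# "the" appears in every document, so its weight is log(3/3) = 0.
print(round(weights[0]["the"], 4))  # 0.0
```

This shows the key property claimed above: words common to every document are down-weighted to zero, while rarer words ("sat") score higher than more widespread ones ("cat").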
What is Text summarization:

• Text summarization condenses one or more texts into shorter


summaries for enhanced information extraction.

• Text summarization is the creation of a short, accurate, and fluent


summary of a longer text document.
Types of automatic text summarization:

• 1. Extractive summarization
• 2. Abstractive summarization
Extractive summarization:

• Extractive summarization extracts unmodified sentences from the


original text documents. A key difference between extractive
algorithms is how they score sentence importance while reducing
topical redundancy.
Extractive summarization:

• As with other NLP tasks, text summarization requires text data first
undergo preprocessing. This includes tokenization, stopword removal,
and stemming or lemmatization in order to make the dataset readable
by a machine learning model. After preprocessing, all extractive text
summarization methods follow three general, independent steps:
representation, sentence scoring, and sentence selection.
Extractive summarization(representation)

• Representation models text segments, such as words or sentences, as data points in a vector space, typically with the bag of words model. Large, multi-document datasets often use term frequency-inverse document frequency (TF-IDF), a variant of bag of words that weights each term to reflect its importance within a text set.
Extractive summarization(Sentence Scoring)

• Sentence scoring, per its name, scores each sentence in a text according to its importance to that text.

• There are different methods for sentence scoring, such as the TF-IDF method.
Extractive summarization(Sentence
Selection)

• Having weighted sentences by importance, algorithms select the n most important sentences for a document or collection thereof. These sentences comprise the generated summary.
Extractive summarization(Sentence
Selection)

• The sentence selection step aims to reduce redundancy in the final summaries. Maximal marginal relevance methods employ an iterative approach.
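The scoring and selection steps can be sketched as follows. Scoring here uses summed word frequency for simplicity, and the overlap threshold is a crude illustrative stand-in for a full maximal-marginal-relevance computation:

```python
from collections import Counter

def select_sentences(sentences, n=2, max_overlap=0.5):
    # Sentence scoring: sum the corpus-wide frequency of each word.
    freq = Counter(tok for s in sentences for tok in s.lower().split())
    scored = sorted(sentences,
                    key=lambda s: sum(freq[t] for t in s.lower().split()),
                    reverse=True)
    # Sentence selection: greedily take the top sentences, skipping any
    # that overlap too much with sentences already chosen (redundancy).
    chosen = []
    for sent in scored:
        toks = set(sent.lower().split())
        if any(len(toks & set(c.lower().split())) / len(toks) > max_overlap
               for c in chosen):
            continue  # too similar to an already-selected sentence
        chosen.append(sent)
        if len(chosen) == n:
            break
    return chosen

sents = ["the cat sat on the mat",
         "the cat sat on the mat today",
         "dogs bark loudly"]
print(select_sentences(sents, n=2))
# ['the cat sat on the mat today', 'dogs bark loudly']
```

Without the overlap check, the two nearly identical cat sentences would both be selected; the redundancy step forces the summary to cover the second topic instead.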
Abstractive summarization:

• Abstractive summarization generates original summaries using


sentences not found in the original text documents. Such
generation requires neural networks and large language models
(LLMs) to produce semantically meaningful text sequences.

• Abstractive text summarization is more computationally expensive than extractive summarization.
Abstractive summarization:

• Which methods are used in abstractive summarization?

1. Sentence Compression: humans summarize longer texts by shortening their sentences. There are two general approaches to sentence compression: rule-based and statistical methods.
Abstractive summarization:

2. Information Fusion: summarizes documents by concatenating information from multiple passages into a single sentence or phrase.
What are the Benefits of Text Summarization:
1. Scalable and Quick

• Manually summarizing a short document is fairly easy, but what if you have an article or paper that is hundreds or thousands of pages long?

• Summarization software will analyze all your input text and source documents and provide you with a summary.
Leverage Existing Tools

• Text summarization algorithms are easy to use and available to


make your research and business decision-making process more
efficient and actionable.
Understand Your Customers Better:

• NLP can extract insight from text data, which makes it a perfect tool for keeping track of customer feedback and determining sentiment, whether positive or negative, and to what degree.

• It can monitor reviews in real time and flag the most important or time-sensitive comments, provide timely feedback, and ignore irrelevant information.
Summarize a Text in Different Formats:

• Natural language processing helps you obtain summarized text extracted


from your competitor’s web pages, market research documents,
industry-related articles, etc. Having a clear idea of the market and your
competitors helps you determine actionable steps for presenting your
product or refining your business strategy. This helps you stand out
amongst the competitors and maintain a competitive advantage in the
market.
Summarize a Text in Different Formats:

• NLP platform can provide you with the most relevant sentences
that you can use to communicate your product, important points
to focus on and give you a deep understanding of your
environment.
Ensure all Critical Information is Covered:

• The automated text summarizing approach makes it easy for the


user to read all the most important sentences in a document.
What are the Use cases of Text
Summarization:
• Financial Research with NLP
• Media Monitoring with NLP
Why automatic Text summarization:

• Summaries reduce reading time.

• While researching using various documents, summaries make the selection process easier.

• Automatic summarization improves the effectiveness of indexing.

• Automatic summarization algorithms are less biased than human summarizers.

• Personalized summaries are useful in question-answering systems as they provide personalized


information.

• Using automatic or semi-automatic summarization systems enables commercial abstract services to increase the number of text documents they are able to process.
Types of Summarization:

• An Extractive summarization method consists of selecting important sentences, paragraphs


etc. from the original document and concatenating them into shorter form.

• An Abstractive summarization builds an understanding of the main concepts in a document and then expresses those concepts in clear natural language.

• The Domain-specific summarization techniques utilize the available knowledge specific to


the domain of text. For example, automatic summarization research on medical text
generally attempts to utilize the various sources of codified medical knowledge and
ontologies.
Types of Text Summarization:

• The Generic summarization focuses on obtaining a generic summary or abstract of the collection of
documents, or sets of images, or videos, news stories etc.

• The Query-based summarization, sometimes called query-relevant summarization, summarizes


objects specific to a query.

• The Multi-document summarization is an automatic procedure aimed at extraction of information


from multiple texts written about the same topic. Resulting summary report allows individual users,
such as professional information consumers, to quickly familiarize themselves with information
contained in a large cluster of documents.

• The Single-document summarization generates a summary from a single source document.


How to do text summarization:

• Text cleaning
• Sentence Tokenization
• Word tokenization
• Word-frequency table
• Summarization
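The five steps above can be sketched end to end with the standard library. Real pipelines would use NLTK or spaCy for tokenization and stopword lists; the stopword set here is a tiny illustrative subset:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "it"}

def summarize(text, n=1):
    text = re.sub(r"\s+", " ", text).strip()              # 1. text cleaning
    sentences = re.split(r"(?<=[.!?])\s+", text)           # 2. sentence tokenization
    freq = Counter()                                       # 4. word-frequency table
    for sent in sentences:
        words = re.findall(r"[a-z']+", sent.lower())       # 3. word tokenization
        freq.update(w for w in words if w not in STOPWORDS)

    def score(sent):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sent.lower()))

    ranked = sorted(sentences, key=score, reverse=True)    # 5. summarization
    return " ".join(ranked[:n])

print(summarize("Cats sleep all day. Cats chase mice. Dogs bark.", n=1))
# Cats sleep all day.
```

Sentences whose words recur most often across the document score highest, which is the frequency-table intuition behind this extractive recipe.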
Named Entity Recognition:

• Named Entity Recognition (NER) is a technique in natural language processing (NLP) that focuses on identifying and classifying entities. The purpose of NER is to automatically extract structured information from unstructured text, enabling machines to understand and categorize entities in a meaningful manner for various applications like text summarization, question answering, and knowledge graph construction. This article explores the fundamentals, methods, and implementation of the NER model.
What is Named Entity Recognition (NER)?

• Named entity recognition (NER) is also referred to as entity identification, entity chunking, and entity extraction. NER is the component of information extraction that aims to identify and categorize named entities within unstructured text.

• NER involves the identification of key information in the text and its classification into a set of predefined categories.

• Common categories include person names, organizations, locations, time expressions, quantities, and percentages.
How Named Entity Recognition Works:

• The NER system analyses the entire input text to identify and locate the named entities.

• NER can be trained to classify entire documents into different types, such as invoices,
receipts, or passports. Document classification enhances the versatility of NER, allowing it to
adapt its entity recognition based on the specific characteristics and context of different
document types.

• NER employs machine learning algorithms, including supervised learning, to analyze labeled
datasets. These datasets contain examples of annotated entities, guiding the model in
recognizing similar entities in new, unseen data.
How Named Entity Recognition Works:

• Over multiple training iterations, the model refines its understanding of contextual features, syntactic structures, and entity patterns, continuously improving its accuracy over time.
Named Entity Recognition Methods:

• Lexicon Based Method

• The NER uses a dictionary with a list of words or terms. The process involves
checking if any of these words are present in a given text. However, this approach
isn’t commonly used because it requires constant updating and careful maintenance
of the dictionary to stay accurate and effective.
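A lexicon-based (gazetteer) lookup of this kind can be sketched as follows; the dictionary entries are illustrative, and the point is exactly the maintenance burden noted above — anything not in the dictionary is missed:

```python
# Hand-built gazetteer mapping surface forms to entity types (illustrative).
GAZETTEER = {
    "london": "LOCATION",
    "paris": "LOCATION",
    "google": "ORGANIZATION",
    "alan turing": "PERSON",
}

def lexicon_ner(text):
    tokens = text.lower().replace(".", "").split()
    entities = []
    i = 0
    while i < len(tokens):
        two = " ".join(tokens[i:i + 2])
        # Prefer the longest match: try two-token entries before one-token.
        if i + 1 < len(tokens) and two in GAZETTEER:
            entities.append((two, GAZETTEER[two]))
            i += 2
        elif tokens[i] in GAZETTEER:
            entities.append((tokens[i], GAZETTEER[tokens[i]]))
            i += 1
        else:
            i += 1
    return entities

print(lexicon_ner("Alan Turing worked in London."))
# [('alan turing', 'PERSON'), ('london', 'LOCATION')]
```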
Named Entity Recognition Methods:
• Rule-Based Method

• The rule-based NER method uses a set of predefined rules to guide the extraction of information. These rules are based on patterns and context. Pattern-based rules focus on the structure and form of words, looking at their morphological patterns. Context-based rules, on the other hand, consider the surrounding words or the context in which a word appears within the text document. This combination of pattern-based and context-based rules enhances the precision of information extraction in Named Entity Recognition (NER).
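The two kinds of rules can be sketched with regular expressions: word-shape patterns for percentages and dates, plus one context rule (an honorific before a capitalized word signals a person). The rules are illustrative, not a complete grammar:

```python
import re

def rule_based_ner(text):
    entities = []
    # Pattern rule: word shape of a percentage, e.g. "12%" or "3.5%".
    for m in re.finditer(r"\b\d+(?:\.\d+)?%", text):
        entities.append((m.group(), "PERCENT"))
    # Pattern rule: "<day> <Month> <year>" date expressions.
    for m in re.finditer(r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
                         r"August|September|October|November|December) \d{4}\b", text):
        entities.append((m.group(), "DATE"))
    # Context rule: an honorific immediately before a capitalized word.
    for m in re.finditer(r"(?:Mr\.|Dr\.|Ms\.) ([A-Z][a-z]+)", text):
        entities.append((m.group(1), "PERSON"))
    return entities

print(rule_based_ner("Dr. Patel reported 12% growth on 5 March 2024."))
# [('12%', 'PERCENT'), ('5 March 2024', 'DATE'), ('Patel', 'PERSON')]
```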
Named Entity Recognition Methods:
• Machine learning based method
• Multi-Class Classification with Machine Learning Algorithms
• One way is to train the model for multi-class classification using different machine
learning algorithms, but it requires a lot of labelling. In addition to labelling the
model also requires a deep understanding of context to deal with the ambiguity of
the sentences.
Named Entity Recognition Methods:
• Machine learning based method
• Conditional Random Field (CRF)

• The conditional random field is implemented by both the NLP Speech Tagger and NLTK. It is a probabilistic model that can be used to model sequential data such as sequences of words.
What is Information Extraction:

• Information extraction is the process of extracting information from


unstructured textual sources to enable finding entities as well as
classifying and storing them in a database. Semantically enhanced
information extraction (also known as semantic annotation) couples
those entities with their semantic descriptions and connections from a
knowledge graph. By adding metadata to the extracted concepts, this
technology solves many challenges in enterprise content management
and knowledge discovery.
Information Extraction:

• Textual sources from which information extraction can distill


structured information are legal acts, medical records, social
media interactions and streams, online news, government
documents, corporate reports and more.
How does Information Extraction Work:
What are the Key components of Information
Extraction:
• Named Entity Recognition (NER)
• NER identifies and classifies entities within a text into predefined
categories such as the names of persons, organizations, locations,
dates, etc.

• Relationship Extraction

• This involves identifying and categorizing the relationships


between entities within a text, helping to build a network of
connections and insights.
What are the Key components of Information
Extraction:

• Event Extraction
• Event extraction identifies specific occurrences described in the
text and their attributes, such as what happened, who was
involved, and where and when it occurred.
Information Extraction Techniques in NLP:

• 1. Named Entity Recognition (NER)

• Definition: Identifying and classifying named entities (e.g., persons, organizations,


locations, dates) in text.

• Techniques:

• Rule-based approaches: Utilize predefined rules and patterns.

• Statistical models: Use probabilistic models like Hidden Markov Models (HMM) and
Conditional Random Fields (CRF).

• Deep learning: Leverage neural networks such as BiLSTM-CRF and transformers like BERT.
Information Extraction Techniques in NLP:
• 2. Relation Extraction

• Definition: Identifying and categorizing relationships between entities within a text.

• Techniques:

• Pattern-based: Uses patterns and linguistic rules.

• Supervised learning: Employs labeled data to train classifiers.

• Distant supervision: Uses a large amount of noisy labeled data from knowledge
bases.

• Neural networks: Utilizes CNNs, RNNs, and transformers for relation classification.
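The pattern-based approach can be sketched with a few hand-written "X relation Y" templates over capitalized mentions. The patterns and relation names are illustrative; real systems match against parsed sentence structure rather than raw strings:

```python
import re

# Each rule pairs a surface pattern with the relation label it signals.
PATTERNS = [
    (r"([A-Z][a-z]+) works at ([A-Z][a-z]+)", "EMPLOYED_BY"),
    (r"([A-Z][a-z]+) is located in ([A-Z][a-z]+)", "LOCATED_IN"),
    (r"([A-Z][a-z]+) founded ([A-Z][a-z]+)", "FOUNDER_OF"),
]

def extract_relations(text):
    triples = []
    for pattern, relation in PATTERNS:
        for m in re.finditer(pattern, text):
            # Emit a (subject, relation, object) triple per match.
            triples.append((m.group(1), relation, m.group(2)))
    return triples

print(extract_relations("Alice works at Acme. Acme is located in Pune."))
# [('Alice', 'EMPLOYED_BY', 'Acme'), ('Acme', 'LOCATED_IN', 'Pune')]
```

The extracted triples are exactly the "network of connections" described under relationship extraction: nodes are entities, edges are relations.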
Information Extraction Techniques in NLP:

• 3. Event Extraction

• Definition: Detecting events and their participants, attributes, and temporal information.

• Techniques:

• Template-based: Matches text with pre-defined event templates.

• Machine learning: Uses classifiers and sequence labeling methods.

• Deep learning: Applies RNNs, CNNs, and attention mechanisms to capture event
structures.
Information Extraction Techniques in NLP:

• 4. Coreference Resolution

• Definition: Determining when different expressions in a text refer to the same entity.

• Techniques:

• Rule-based: Employs heuristic rules.

• Machine learning: Trains classifiers using features like gender, number, and syntactic
role.

• Neural networks: Uses deep learning models like BiLSTM and transformers for
coreference chains.
Information Extraction Techniques in NLP:

• 5. Template Filling

• Definition: Extracting specific pieces of information to populate predefined


templates.

• Techniques:

• Rule-based: Matches text to slots based on rules.

• Machine learning: Uses classifiers to fill template slots.

• Hybrid methods: Combine rules and machine learning for better accuracy.
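A rule-based slot filler of the kind listed above can be sketched as follows; the "job posting" template, its slot names, and the regex rules are all illustrative assumptions:

```python
import re

# One extraction rule per slot of the predefined template.
SLOT_RULES = {
    "title":    r"hiring an? ([A-Za-z ]+?) in",
    "location": r"\bin ([A-Z][a-z]+)",
    "salary":   r"salary of \$([\d,]+)",
}

def fill_template(text):
    template = {}
    for slot, rule in SLOT_RULES.items():
        m = re.search(rule, text)
        # Leave the slot empty (None) when no rule matches.
        template[slot] = m.group(1) if m else None
    return template

print(fill_template("We are hiring a Data Engineer in Mumbai with a salary of $90,000."))
# {'title': 'Data Engineer', 'location': 'Mumbai', 'salary': '90,000'}
```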
Information Extraction Techniques in NLP:

• 6. Open Information Extraction (OpenIE)

• Definition: Extracting tuples of arbitrary relations and arguments from text.

• Techniques:

• Pattern-based: Utilizes linguistic patterns to identify relational triples.

• Statistical: Uses probabilistic models to determine the confidence of extracted


relations.

• Neural OpenIE: Leverages deep learning models to improve the extraction process.
What are the challenges in Information
Extraction:
• Ambiguity and Variability of Language: Human language is inherently ambiguous and
varies greatly in structure and style, making accurate extraction challenging.

• Domain-Specific Adaptation: IE systems need to be tailored to specific domains to


achieve high accuracy, requiring substantial effort in training and customization.

• Data Quality and Annotation: The quality of the extracted information heavily
depends on the quality of the training data and the annotations used to train IE
models.
