CH4
H Patel College of Engineering and Technology
Text Analysis, Summarization and Extraction
Text Classification:
Introduction:
• Unstructured data accounts for over 80% of all data, and text is
one of its most common forms. Because text is messy, analyzing,
understanding, organizing, and sorting through it is difficult and
time-consuming, so most businesses do not exploit it to its full
potential despite the benefits it would bring.
• For example, imagine you have a large collection of news articles,
and your goal is to assign each one to a relevant category such as
Sports, Politics, or Economy.
Rule-based text classification:
• A machine learning classifier, by contrast, learns the mapping from
input data (raw text) to labels (also known as target variables).
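For comparison with a learned mapping, a rule-based classifier can be sketched as a set of handcrafted keyword lists. The `RULES` table and the `classify` function below are hypothetical names invented for illustration, and the keyword lists are far too small for real use.

```python
import re

# Hypothetical keyword rules; a real system would need much larger lists.
RULES = {
    "Sports": {"match", "goal", "tournament", "team"},
    "Politics": {"election", "parliament", "minister", "vote"},
    "Economy": {"inflation", "market", "bank", "trade"},
}

def classify(text, rules=RULES, default="Unknown"):
    """Return the category whose keyword set overlaps the text the most."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {label: len(tokens & keywords) for label, keywords in rules.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default label when no rule fires at all.
    return best if scores[best] > 0 else default

print(classify("The team scored a late goal to win the match"))
print(classify("The central bank raised rates to fight inflation"))
```

Nothing is learned here: every new category or new vocabulary item has to be added to the rules by hand, which is the main maintenance cost of this approach.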
Machine learning based text classification:
• The two most common methods for extracting features from text,
in other words for converting text data (strings) into numeric
features so that a machine learning model can be trained, are Bag of
Words (implemented in scikit-learn as CountVectorizer) and TF-IDF.
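The conversion from strings to numeric features can be sketched in plain Python. This is a minimal illustration of the bag-of-words idea, not scikit-learn's implementation; `tokenize` and `bag_of_words` are hypothetical helper names, and the tokenizer is a deliberately naive regular expression.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters or digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # One column per vocabulary term; absent terms count as 0.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Word order is discarded: each document becomes a fixed-length vector of counts that a classifier can consume.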
Bag of Words:
• The TF-IDF model differs from the bag of words model in that it
weights a word's frequency in the document by its inverse document
frequency, that is, by how rare the word is across the whole
collection. This means that the TF-IDF model is more likely to
identify the important words in a document than the bag of words model.
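A small worked example makes the weighting concrete. The sketch below assumes the textbook definitions, tf = count / document length and idf = log(N / df), without smoothing; libraries typically use smoothed variants, so their numbers will differ. The function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters or digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_idf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
for term, w in sorted(weights[0].items()):
    print(f"{term}: {w:.3f}")
```

In the first document, "the" occurs in every document and receives a weight of zero, while "sat" (unique to that document) outranks "cat" (shared with another document), even though all three have the same raw count.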
What is Text summarization:
• 1. Extractive summarization
• 2. Abstractive summarization
Extractive summarization:
• As with other NLP tasks, text summarization requires that the text data
first undergo preprocessing. This includes tokenization, stopword removal,
and stemming or lemmatization, which put the text into a form that a
machine learning model can use. After preprocessing, all extractive text
summarization methods follow three general, independent steps:
representation, sentence scoring, and sentence selection.
Extractive summarization (representation):
• TF-IDF method.
Extractive summarization (sentence selection):
• Summarization software analyzes all your input text and source documents
and provides you with a summary text.
Leverage Existing Tools
• NLP can extract insight from text data, which makes it a good
tool for keeping track of customer feedback and for determining
sentiment: whether it is positive or negative, and to what degree.
• An NLP platform can provide you with the most relevant sentences
to use when communicating about your product, highlight the
important points to focus on, and give you a deeper understanding
of your environment.
Ensure all Critical Information is Covered:
• While researching using various documents, summaries make the selection process easier.
• Generic summarization focuses on obtaining a generic summary or abstract of a collection of
documents, images, videos, news stories, etc.
• Text cleaning
• Sentence Tokenization
• Word tokenization
• Word-frequency table
• Summarization
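The five steps listed above can be sketched as a small frequency-based extractive summarizer. This is a minimal illustration under simplifying assumptions: the stopword list is a tiny hypothetical sample, sentences are split naively on end punctuation, and a sentence's score is the sum of its word frequencies (which favors longer sentences).

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "on", "for"}

def content_words(sentence):
    """Word tokenization: lowercase alphabetic tokens minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def summarize(text, n_sentences=1):
    # 1. Text cleaning: collapse runs of whitespace.
    cleaned = re.sub(r"\s+", " ", text).strip()
    # 2. Sentence tokenization: naive split after '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]
    # 3-4. Word tokenization and word-frequency table.
    freq = Counter(w for s in sentences for w in content_words(s))
    # 5. Summarization: score each sentence, keep the top ones in original order.
    scores = [sum(freq[w] for w in content_words(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)

text = "Cats sleep a lot. Cats and dogs are popular pets. Birds sing."
print(summarize(text))  # Cats and dogs are popular pets.
```

Normalizing each score by sentence length is a common variation when long sentences dominate the summary.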
Named Entity Recognition:
• NER involves the identification of key information in the text and classification into a
set of predefined categories.
• Typical categories include person names, organizations, locations,
time expressions, quantities, and percentages.
How Named Entity Recognition Works:
• The NER system analyses the entire input text to identify and locate the named entities.
• NER can be trained to classify entire documents into different types, such as invoices,
receipts, or passports. Document classification enhances the versatility of NER, allowing it to
adapt its entity recognition based on the specific characteristics and context of different
document types.
• NER employs machine learning algorithms, including supervised learning, to analyze labeled
datasets. These datasets contain examples of annotated entities, guiding the model in
recognizing similar entities in new, unseen data.
• Dictionary-based NER uses a dictionary containing a list of words or terms. The
process involves checking whether any of these words are present in a given text.
However, this approach isn’t commonly used because the dictionary requires constant
updating and careful maintenance to stay accurate and effective.
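The dictionary lookup can be sketched as follows. The `GAZETTEER` entries and the `dictionary_ner` function are hypothetical and exist only for illustration.

```python
import re

# Hypothetical dictionary mapping known surface forms to entity types.
GAZETTEER = {
    "new york": "LOCATION",
    "google": "ORGANIZATION",
    "ada lovelace": "PERSON",
}

def dictionary_ner(text, gazetteer=GAZETTEER):
    """Return (matched text, entity type, start offset) for every dictionary hit."""
    entities = []
    for term, label in gazetteer.items():
        # Whole-word, case-insensitive match of the dictionary term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            entities.append((m.group(0), label, m.start()))
    return sorted(entities, key=lambda e: e[2])

found = dictionary_ner("Ada Lovelace never visited Google in New York.")
print(found)
```

Any name missing from the dictionary is silently skipped, which illustrates why the approach depends on constant maintenance.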
Named Entity Recognition Methods:
• Rule-based method
• The rule-based NER method uses a set of predefined rules to guide the extraction of
information. These rules are based on patterns and context. Pattern-based rules
focus on the structure and form of words, looking at their morphological patterns. On
the other hand, context-based rules consider the surrounding words or the context in
which a word appears within the text document. This combination of pattern-based
and context-based rules enhances the precision of information extraction in Named
Entity Recognition (NER).
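The two rule types can be sketched with regular expressions. The rules below are hypothetical examples chosen for illustration: the pattern-based rules look only at the form of the token, while the context-based rule uses a preceding title to decide that the following capitalized words are a person name.

```python
import re

# Pattern-based rules: the shape of the token itself identifies the entity.
PATTERN_RULES = [
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]

# Context-based rules: a neighbouring word (here, a title) identifies the entity.
CONTEXT_RULES = [
    ("PERSON", re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)")),
]

def rule_based_ner(text):
    """Apply pattern rules, then context rules, and collect (text, label) pairs."""
    entities = []
    for label, rx in PATTERN_RULES:
        entities += [(m.group(0), label) for m in rx.finditer(text)]
    for label, rx in CONTEXT_RULES:
        # Group 1 is the entity; the title is only the triggering context.
        entities += [(m.group(1), label) for m in rx.finditer(text)]
    return entities

text = "Dr. Meera Shah said revenue grew 12.5% on 2024-03-31."
print(rule_based_ner(text))
```

A production rule set would contain many more rules and would need a policy for resolving overlapping matches.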
• Machine learning based method
• Multi-Class Classification with Machine Learning Algorithms
• One way is to train a model for multi-class classification using different machine
learning algorithms, but this requires a lot of labelled data. In addition to
labelling, the model also needs a deep understanding of context to deal with the
ambiguity of sentences.
• Conditional Random Field (CRF)
• Conditional random field is implemented by both NLP Speech Tagger and NLTK. It is
a probabilistic model that can be used to model sequential data such as words.
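How a linear-chain model of this kind picks a tag for each word in a sequence can be sketched with Viterbi decoding. The emission and transition scores below are hand-set, hypothetical values; a trained CRF would learn them as feature weights from labelled data. The sketch shows only the decoding step, not training.

```python
TAGS = ["O", "PER"]

def emission(word, tag):
    """Hypothetical score for assigning `tag` to `word` (capitalization only)."""
    capitalized = word[0].isupper()
    if tag == "PER":
        return 1.0 if capitalized else -1.0
    return -0.5 if capitalized else 0.5

# Hypothetical transition scores: staying in the same tag is slightly preferred.
TRANSITION = {
    ("O", "O"): 0.5, ("O", "PER"): 0.0,
    ("PER", "PER"): 0.5, ("PER", "O"): 0.0,
}

def viterbi(words):
    """Return the highest-scoring tag sequence for `words`."""
    if not words:
        return []
    # best[i][tag] = (score of the best path ending in `tag` at position i, previous tag)
    best = [{tag: (emission(words[0], tag), None) for tag in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p][0] + TRANSITION[(p, tag)])
            score = best[-1][prev][0] + TRANSITION[(prev, tag)] + emission(word, tag)
            column[tag] = (score, prev)
        best.append(column)
    # Backtrack from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

print(viterbi(["yesterday", "Ada", "Lovelace", "spoke"]))  # ['O', 'PER', 'PER', 'O']
```

Because the score of each tag depends on the previous tag, the whole sequence is decoded jointly instead of word by word.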
What is Information Extraction:
• Relationship Extraction
• Event Extraction
• Event extraction identifies specific occurrences described in the
text and their attributes, such as what happened, who was
involved, and where and when it occurred.
Information Extraction Techniques in NLP:
• 1. Named Entity Recognition
• Techniques:
• Statistical models: Use probabilistic models like Hidden Markov Models (HMM) and
Conditional Random Fields (CRF).
• Deep learning: Leverage neural networks such as BiLSTM-CRF and transformers like BERT.
Information Extraction Techniques in NLP:
• 2. Relation Extraction
• Techniques:
• Distant supervision: Uses a large amount of noisy labeled data from knowledge
bases.
• Neural networks: Utilizes CNNs, RNNs, and transformers for relation classification.
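The distant supervision idea can be sketched in a few lines: any sentence that mentions both entities of a knowledge-base pair is labelled with that pair's relation. The `KNOWLEDGE_BASE` contents and the `distant_labels` function are hypothetical, and plain substring matching stands in for real entity linking.

```python
# Hypothetical knowledge base: (head entity, tail entity) -> relation.
KNOWLEDGE_BASE = {
    ("Marie Curie", "Warsaw"): "born_in",
    ("Apple", "Cupertino"): "headquartered_in",
}

def distant_labels(sentences, kb=KNOWLEDGE_BASE):
    """Label a sentence with a relation whenever it mentions both entities of a KB pair."""
    labeled = []
    for sentence in sentences:
        for (head, tail), relation in kb.items():
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled

sentences = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie gave a lecture in Warsaw.",  # co-occurrence without the relation
    "Apple released a new phone.",
]
labeled = distant_labels(sentences)
for row in labeled:
    print(row)
```

The second sentence is labelled `born_in` even though it does not express that relation, which is the label noise that relation classifiers trained this way have to tolerate.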
Information Extraction Techniques in NLP:
• 3. Event Extraction
• Techniques:
• Deep learning: Applies RNNs, CNNs, and attention mechanisms to capture event
structures.
Information Extraction Techniques in NLP:
• 4. Coreference Resolution
• Definition: Determining when different expressions in a text refer to the same entity.
• Techniques:
• Machine learning: Trains classifiers using features like gender, number, and syntactic
role.
• Neural networks: Uses deep learning models like BiLSTM and transformers for
coreference chains.
Information Extraction Techniques in NLP:
• 5. Template Filling
• Techniques:
• Hybrid methods: Combine rules and machine learning for better accuracy.
Information Extraction Techniques in NLP:
• 6. Open Information Extraction
• Techniques:
• Neural OpenIE: Leverages deep learning models to improve the extraction process.
What are the challenges in Information Extraction:
• Ambiguity and Variability of Language: Human language is inherently ambiguous and
varies greatly in structure and style, making accurate extraction challenging.
• Data Quality and Annotation: The quality of the extracted information heavily
depends on the quality of the training data and the annotations used to train IE
models.