IMTC634 - Data Science - Chapter 7

Uploaded by

msmakkar.chief19
Chapter 7: Text Mining and Analytics

Chapter Index
1. Learning Objectives
2. Topic 1: Differences between Text Mining and Text Analytics
3. Topic 2: Text Mining Techniques
4. Topic 3: Text Mining Technologies
5. Topic 4: Methods and Approaches in Text Analytics
6. Topic 5: Applications of Text Analytics
7. Let’s Sum Up

Learning Objectives

• Understand text mining and analytics
• Describe the text mining techniques
• Explain the text mining technologies
• Elucidate the methods and approaches in text analytics
• Describe the applications of text analytics

1. Differences between Text Mining and Text Analytics

• Text mining is the first step before analysing text data. It involves cleaning the data so that it is ready for text analytics.
• The steps involved in the text mining process are shown in the following figure:
    1. Identification of a corpus
    2. Preprocessing the text
    3. Bag of words in R
    4. Verify words in data frame
• Text analytics uses techniques to infer, prescribe or predict information from the mined data.
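
The four steps above can be sketched in a few lines. The slides name R for the bag-of-words step; this sketch uses Python with only the standard library instead, and the three-document corpus and the stop-word list are made up for illustration.

```python
import re
from collections import Counter

# Step 1: identify a corpus (three made-up documents, for illustration only).
corpus = [
    "Text mining cleans the text data.",
    "Text analytics predicts from the mined data!",
    "Cleaning the data comes first.",
]

# A tiny, hypothetical stop-word list; real projects use a much longer one.
STOP_WORDS = {"the", "from", "a", "an", "of"}

def preprocess(document):
    """Step 2: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Step 3: bag of words, one word-count mapping per document.
bags = [Counter(preprocess(doc)) for doc in corpus]

# Step 4: verify the words in a table with one row per document
# and one column per vocabulary word.
vocabulary = sorted(set().union(*bags))
table = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for row in table:
    print(row)
```

Each row of `table` plays the role of one row of the data frame mentioned in the last step: word order is discarded and only the counts remain.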


2. Text Mining Techniques

• Various techniques have been developed in recent years to make computers analyze, understand and generate text. Some of these techniques are as follows:
    • Sentiment analysis
    • Topic modeling
    • Term frequency
    • Named entity recognition
    • Event extraction
Sentiment analysis

• Sentiment analysis is one of the most significant and popular techniques for describing and drawing inferences from textual data.
• It is used to derive emotions from text such as tweets, Facebook posts or YouTube comments.
• Sentiments such as good, bad, anger, neutral, anxiety, etc. are inferred from the given text.
• For example, what people think about a movie, a topic or a government decision can be analyzed using a sentiment analysis tool.
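
One simple way to infer a sentiment is to sum word scores from a lexicon. The sketch below assumes a tiny, hypothetical lexicon (`LEXICON`) and made-up comments; real sentiment analysis tools use far larger resources or trained models.

```python
import re

# Hypothetical mini-lexicon: word -> polarity score. Not a real resource.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "boring": -1, "terrible": -2}

def sentiment(text):
    """Return (score, label) by summing the lexicon scores of the words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(w, 0) for w in words)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label

for comment in ["I love this movie, great acting!",
                "Terrible plot and boring songs.",
                "The movie was released on Friday."]:
    print(comment, "->", sentiment(comment))
```

A lexicon approach like this ignores negation and sarcasm ("not good" still scores as positive), which is one reason practical tools go beyond it.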
Topic modeling
• Topic modeling is a statistical approach for discovering the topic(s) in a collection of text documents, based on the statistics of the words they contain.
• Latent Dirichlet Allocation (LDA) is one of the most common algorithms for topic modeling.
• The LDA algorithm groups the corpus into topics automatically: without labelled examples, it learns to assign a probability to every term in the corpus under each topic.
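
The probabilities that LDA learns can be made concrete by writing out its generative model. The notation below (θ, φ, z, w, α, β) is the commonly used one and is an assumption here; it does not appear in the slides.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word in document } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the observed word}
\end{align*}
```

Fitting LDA means estimating the hidden quantities θ and φ from the observed words w alone.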
Term Frequency
• Term frequency indicates the importance of a word relative to the total number of terms in the document.
• ‘Term Frequency (TF)’ is usually measured along with ‘Inverse Document Frequency (IDF)’ as ‘TF-IDF’.
• ‘TF-IDF’ is the abbreviation for ‘Term Frequency-Inverse Document Frequency’. It is a statistical measure of how important a word is to a given document within a corpus.
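
A minimal sketch of the computation, assuming one common definition (TF as count divided by document length, IDF as the natural log of the number of documents over the number containing the term). Other weighting variants exist, and the toy documents are made up.

```python
import math
from collections import Counter

# Made-up, already tokenized documents.
docs = [
    ["text", "mining", "cleans", "text"],
    ["text", "analytics", "predicts"],
    ["data", "mining", "tools"],
]

def tf(term, doc):
    """Term frequency: occurrences of term divided by the total terms in doc."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf(term, doc, docs):
    """TF-IDF weight of term in doc, relative to the whole collection."""
    return tf(term, doc) * idf(term, docs)

for term in ["text", "cleans"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In the first document "text" occurs twice and "cleans" only once, yet "cleans" receives the higher weight because it appears in no other document.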
Named Entity Recognition


• A named entity is a real-world object denoted by a proper noun, such as a place, person, product or organization, or by an expression of quantity, percentage, time, etc.
• Named Entity Recognition is a tool used in text analytics that classifies the named entities in a given corpus into predefined classes, such as place, person, product, organization, quantity, percentage, time, etc.
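
As a rough illustration of classifying entities into predefined classes, the sketch below combines a hand-made lookup table with two regular expressions. The names in `GAZETTEER` and the example sentence are hypothetical; practical NER systems rely on trained models rather than fixed lists.

```python
import re

# Hypothetical gazetteer: a hand-made lookup of known entity names.
GAZETTEER = {
    "Delhi": "PLACE",
    "Asha Rao": "PERSON",
    "Acme Corp": "ORGANIZATION",
}

# Simple patterns for entity classes that follow a regular shape.
PATTERNS = {
    "PERCENTAGE": re.compile(r"\b\d+(?:\.\d+)?%"),
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),
}

def recognize(text):
    """Return (entity text, class) pairs, ordered by position in text."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(entity, label) for _, entity, label in sorted(found)]

sentence = "Asha Rao of Acme Corp said in Delhi at 10:30 that sales rose 12.5%."
print(recognize(sentence))
```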
Event Extraction
• Suppose we want information about an event that has happened, and online news sources have published this information as long passages of text. Deriving detailed and structured information about the event from this text is called event extraction.
• Through event extraction, we identify the ‘W’ questions, i.e., Who, When, Where, to Whom and Why, as well as How.
• In other words, event extraction identifies the relationships between entities.
• Suppose you are analyzing information on a joint venture. You would then extract the partners, products, place, capital and profits of that joint venture.
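
The joint venture example can be sketched as filling a fixed template from one sentence. The pattern, the field names and the news sentence below are all hypothetical and cover a single sentence shape only; real event extraction has to handle far more varied wording.

```python
import re

# Hypothetical template for one sentence shape; real news text needs many more patterns.
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"
JOINT_VENTURE = re.compile(
    rf"(?P<partner_1>{NAME}) and (?P<partner_2>{NAME}) formed a joint venture "
    rf"in (?P<place>{NAME}) with capital of (?P<capital>\d+(?:\.\d+)? \w+)"
)

def extract_event(sentence):
    """Return the event fields as a dict, or None if the template does not match."""
    match = JOINT_VENTURE.search(sentence)
    return match.groupdict() if match else None

news = "Acme Corp and Beta Ltd formed a joint venture in Pune with capital of 50 million."
print(extract_event(news))
```

The result is a structured record (who, where, how much) derived from unstructured text, which is the essence of the technique.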
3. Text Mining Technologies

• Text mining is used to retrieve potentially useful information from the available data.
• Different technologies are required to extract this information, some of which are as follows:
    • Information Retrieval
    • Information Extraction
    • Clustering
    • Categorization
    • Summarization
Information Retrieval
• Information Retrieval (IR) is finding, within large collections, the documents that satisfy an information need. These documents may be unstructured or semi-structured and are usually in text format. They are classified or clustered according to their content or the similarity of their content.
• It is a very broad term, and the data retrieved from different sources is further processed as required for decision-making.
• In simple terms, information retrieval fetches the set of relevant documents from the corpora.
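
A minimal sketch of retrieving relevant documents, assuming a boolean AND query over an inverted index. The corpus and document ids are made up, and ranking by relevance is deliberately left out.

```python
import re
from collections import defaultdict

# A made-up corpus: document id -> text.
corpus = {
    "d1": "Text mining prepares text for analysis",
    "d2": "Clustering groups similar documents",
    "d3": "Text clustering groups documents by topic",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())

# Inverted index: term -> set of ids of the documents that contain it.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in tokenize(text):
        index[term].add(doc_id)

def retrieve(query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(result)

print(retrieve("text clustering"))
print(retrieve("documents"))
```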
Information Extraction
• Extraction of structured information from unstructured and/or semi-structured documents is known as information extraction.
• In most cases, this activity involves processing human language texts by means of Natural Language Processing (NLP).
• Information extraction also covers processing documents through automatic annotation and extraction of content from images, audio and video.
• For example, the Internet Movie Database (IMDb) is an online database of information about films, TV programs, home videos and video games from around the world.
Clustering
• When you search for something on a web search engine, you get a huge number of documents in response to the search phrase you entered. It becomes difficult to browse them or to identify the relevant information.
• Clustering helps to group the retrieved documents into meaningful categories. This grouping is based on the descriptors (sets of words) in each document. It is an unsupervised knowledge discovery technique.
• One common example of clustering is hierarchical clustering.
• In hierarchical clustering, each data point starts as its own cluster and is then repeatedly merged with its nearest cluster.
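
The merging procedure can be sketched as follows. This is a minimal agglomerative version that assumes single linkage and Jaccard distance between word sets; both are illustrative choices rather than anything prescribed in the slides, and the four "documents" are made up and assumed non-empty.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two non-empty sets of words."""
    return 1 - len(a & b) / len(a | b)

def single_linkage(c1, c2, docs):
    """Cluster distance = distance of the closest pair of documents."""
    return min(jaccard_distance(docs[i], docs[j]) for i in c1 for j in c2)

def agglomerate(docs, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    # Start with every document in its own cluster.
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > num_clusters:
        # Find the closest pair of clusters (a < b) and merge b into a.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda p: single_linkage(clusters[p[0]], clusters[p[1]], docs),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Made-up documents, each reduced to its set of descriptor words.
docs = [
    {"cricket", "match", "score"},
    {"cricket", "score", "team"},
    {"election", "vote", "party"},
    {"election", "party", "result"},
]
print(agglomerate(docs, 2))
```

No labels are supplied at any point, which is what makes the technique unsupervised.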
Categorization
• ‘Categorization’ refers to assigning a given document to a specific category. A common example is segregating application forms on the basis of age, discipline, class, etc.
• Categorization can be done on the basis of a document's topic or its attributes, such as type of document, author, year of printing, subject, etc.
• Categorization is also called ‘classification’ when you assign instances to the appropriate class among a set of known types. If you use Gmail for handling emails, you will find category tabs named Primary, Promotions, Social, Updates and Forums. Your emails are categorized into these categories.
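
A minimal sketch of assigning messages to predefined categories, assuming a simple keyword-overlap rule. The keyword lists are hypothetical and are not how any real mail service categorizes email; production classifiers are usually trained on labelled examples.

```python
import re

# Hypothetical keyword lists per category; not how any real mail service works.
CATEGORY_KEYWORDS = {
    "Promotions": {"sale", "discount", "offer", "coupon"},
    "Social": {"friend", "follow", "liked", "tagged"},
    "Updates": {"invoice", "receipt", "statement", "shipped"},
}
DEFAULT_CATEGORY = "Primary"

def categorize(message):
    """Assign the category whose keywords overlap most with the message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default category when no keyword matches at all.
    return best if scores[best] > 0 else DEFAULT_CATEGORY

for mail in ["Big sale: 50% discount coupon inside",
             "Your invoice and receipt are attached",
             "Lunch tomorrow?"]:
    print(mail, "->", categorize(mail))
```

Unlike clustering, the set of categories is fixed in advance, which is what makes this a classification task.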
Summarization
• A summary is a shorter form of text, derived from one or more texts, that conveys the important knowledge from the original document(s).
• The most important advantage of using a summary is that it reduces reading time.
• Text summarization methods can be classified into the following types:
    • Extractive summarization
    • Abstractive summarization
    • Indicative summarization
    • Informative summarization
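
Of the types listed, extractive summarization is the simplest to sketch: it selects existing sentences rather than writing new ones. The version below assumes a word-frequency scoring rule and a small made-up stop-word list; it is one possible heuristic, not a standard algorithm from the slides.

```python
import re
from collections import Counter

# A small, hypothetical stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "from"}

def content_words(text):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize(text, num_sentences=1):
    """Extractive summary: keep the sentences whose words are most frequent."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    # Pick the top-scoring sentences, then restore their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

text = ("Text mining cleans text data. "
        "The weather was pleasant. "
        "Text analytics predicts outcomes from mined text data.")
print(summarize(text, 2))
```

The off-topic sentence about the weather is dropped because its words are rare in the text as a whole.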
4. Methods and Approaches in Text Analytics

• In text mining, there are two mining approaches: one based on keywords and another based on intelligent technologies.
• The keyword-based approach works with different elements in the text by identifying repetitive patterns and establishing relationships between these elements using statistical techniques.
• Text analytics is based on retrieval according to user requirements. For information retrieval, the following methods are used in text analytics:
    • Term-based method
    • Phrase-based method
    • Concept-based method
    • Pattern taxonomy method
Content Analysis
• Content analysis is a method for summarizing any form of content by counting various aspects of the content.
• Content analysis is a quantitative method: although it analyzes terms, its results are expressed as numbers and percentages. Content analysis has six main stages, which are as follows:
    1. Selecting content for analysis
    2. Units of content
    3. Preparing content for coding
    4. Coding the content
    5. Counting and weighing
    6. Drawing conclusions
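
The coding and counting stages can be illustrated with a short sketch. The customer comments and the keyword-to-code mapping (`CODEBOOK`) are hypothetical; in practice, coding is often done by trained human coders following a codebook.

```python
from collections import Counter

# Stages 1-2: selected content, split into units (here, one comment = one unit).
units = [
    "The delivery was late",
    "Great price and fast delivery",
    "Price is too high",
    "Support staff were helpful",
]

# Stages 3-4: a hypothetical coding scheme mapping keywords to codes.
CODEBOOK = {
    "delivery": "DELIVERY",
    "price": "PRICE",
    "support": "SERVICE",
}

def code_unit(unit):
    """Return the set of codes whose keyword appears in the unit."""
    words = unit.lower().split()
    return {code for keyword, code in CODEBOOK.items() if keyword in words}

# Stage 5: count the units that received each code and convert to percentages.
counts = Counter(code for unit in units for code in code_unit(unit))
percentages = {code: 100 * n / len(units) for code, n in counts.items()}

# Stage 6: the numbers are the basis for drawing conclusions.
for code, share in sorted(percentages.items()):
    print(f"{code}: {counts[code]} units ({share:.0f}%)")
```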
Natural Language Processing
• Programming computers to process and analyze natural language is called Natural Language Processing (NLP).
• The NLP process is broken down into three parts. The first task of NLP is to understand the natural language received by the computer.
• The next task is called part-of-speech (POS) tagging or word-category disambiguation.
• The third step taken by an NLP system is text-to-speech conversion. At this stage, the computer's internal representation of the language is converted into an audible or textual format for the user.
Simple Predictive Modeling
• Predictive modeling is a statistical technique for making predictions based on past occurrences or data.
• Predictive modeling involves creating, testing and validating a model to best predict the outcome. It is done by running one or more algorithms on the data set for which the prediction is to be made.
• The seven steps involved in predictive modeling are:
    1. Data mining: The relevant data is mined from the available data.
    2. Understanding the data: The data is studied and understood in order to prepare the model.
    3. Preprocessing the data: The data is preprocessed to prepare it for modeling.
    4. Modeling the data: The model is created after preprocessing.
    5. Evaluating the model and selecting the best fit: The candidate models are evaluated and the best-fit model is selected for deployment.
    6. Deploying the model: The best-fit model is deployed in the business.
    7. Monitoring and improving: The deployed model is monitored and improved on a regular basis.
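
Steps 4 and 5 (creating candidate models and selecting the best fit) can be sketched with two very simple models. The data points, the train/test split and the choice of candidates (a mean baseline and a least-squares line) are all assumptions made for illustration.

```python
# Made-up past data: (x, y) pairs that roughly follow y = 2x + 1.
data = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 9.0), (5, 10.8), (6, 13.1)]
train, test = data[:4], data[4:]

def fit_mean(points):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y

def fit_line(points):
    """Least-squares straight line y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_squared_error(model, points):
    """Average squared difference between predictions and actual values."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

# Step 4: create candidate models. Step 5: evaluate on held-out data, pick the best.
candidates = {"mean": fit_mean(train), "line": fit_line(train)}
errors = {name: mean_squared_error(model, test) for name, model in candidates.items()}
best_name = min(errors, key=errors.get)
print(errors, "-> best:", best_name)
```

Evaluating on data the models did not see during fitting is what makes the comparison a fair test of predictive ability.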
5. Applications of Text Analytics

• Text analytics is used to analyze unstructured text, extract the important information from it and transform it into useful information.
• Because of this benefit, text analytics finds applications in various fields, some of which are as follows:
    • Sentiment Analysis
    • Emotion Detection
    • Scholarly Communication
    • Health
    • Visualization
Let’s Sum Up

• Sentiment analysis is one of the most significant and popular techniques for describing and drawing inferences from textual data.
• Topic modeling is a statistical approach for discovering the topic(s) in a collection of text documents, based on the statistics of the words they contain.
• The LDA algorithm groups the corpus into topics automatically by learning to assign probabilities to all terms in the corpus.
• Term frequency indicates the importance of a word relative to the total number of terms in the document.
• Deriving detailed and structured information about an event from text is called event extraction.
THANK YOU
