IMTC634 - Data Science - Chapter 7
7 Let’s Sum Up
Learning Objectives
Text mining is the first step before analysing text data. It involves
cleaning the data so that it is ready for text analytics.
The various steps involved in the text mining process are shown in the
following figure:
Identification of a corpus
Bag of words in R
Sentiment analysis
Topic modeling
• Topic modeling is a statistical approach for discovering
topics in a collection of text documents, based on the
statistics of the words in each document.
• Latent Dirichlet Allocation (LDA) is one of the most common
algorithms for topic modeling.
• The LDA algorithm groups the corpus into topics
automatically, learning without labels a probability for
every term in the corpus under each topic.
2. Text Mining Techniques
Term Frequency
• Term Frequency (TF) measures how often a word occurs
relative to the total number of terms in the document.
• TF is usually measured along with ‘Inverse Document
Frequency (IDF)’, which down-weights words that occur in
many documents. The combination is ‘TF-IDF’.
• ‘TF-IDF’ is the abbreviation for ‘Term Frequency-Inverse
Document Frequency’. It is a statistical measure of how
important a word is to a given document in a collection.
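The TF-IDF idea above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes TF = count / document length and IDF = log(N / number of documents containing the term), which is one common variant among several. The function name `tf_idf` and the sample documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per document.

    docs is a list of token lists. Assumed variant:
    TF = count / document length, IDF = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical three-document corpus, already tokenized.
docs = [
    "text mining cleans text data".split(),
    "topic modeling finds topics in text".split(),
    "sentiment analysis scores opinions".split(),
]
weights = tf_idf(docs)
```

In the first document, "mining" ends up with a higher weight than "text" even though "text" occurs twice, because "text" also appears in another document and is therefore less distinctive.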
Event Extraction
Suppose we want information about an event that has happened
and online news sites have published it as long text. Deriving
detailed and structured information about the event from
this text is called event extraction.
With event extraction, we identify the Ws, i.e., Who, When,
Where, to Whom, Why and How.
In other words, event extraction identifies the relationships
between entities.
Suppose you are analyzing information on a joint venture.
You would then extract the partners, products, place, capital
and profits of that joint venture.
3. Text Mining Technologies
Information Retrieval
Information Extraction
Clustering
Categorization
Summarization
Information Retrieval
Information Retrieval (IR) is finding documents that
satisfy an information need from within large collections.
These documents may be unstructured or semi-structured
and are usually in text format. They are classified or
clustered according to their content or its similarity.
IR is a very broad term; the data retrieved from different
sources is processed further as required for
decision-making.
In simple terms, information retrieval gets sets of
relevant documents from the corpora or the masses.
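One simple way to picture "getting relevant documents from a corpus" is an inverted index, which maps each word to the documents that contain it. The sketch below is only an illustration under simplifying assumptions: whitespace tokenization, exact word matching, and a query that requires every word to be present. The names `build_index` and `search` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical mini-corpus keyed by document id.
docs = {
    1: "text mining prepares text for analytics",
    2: "clustering groups similar documents",
    3: "text clustering groups documents by content",
}
index = build_index(docs)
hits = search(index, "text clustering")
```

Here `hits` contains only document 3, the single document that mentions both query words.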
Information Extraction
Extraction of structured information from unstructured and/or
semi-structured documents is known as information extraction.
In most cases, this activity concerns processing human
language texts by means of Natural Language Processing (NLP).
Information extraction also covers processing documents with
automatic annotation and extraction of content from images,
audio and video.
The Internet Movie Database (IMDb) is an online database of
information about films, TV programs, home videos and video
games from around the world.
Clustering
When you search for something on a web search engine, you get a
huge number of documents in response to the search phrase you
entered. It becomes difficult to browse them or to identify the
relevant information.
Clustering helps to group the retrieved documents into meaningful
categories. This grouping is based on the descriptors (sets of
words) in each document. It is an unsupervised knowledge discovery
technique.
A common example of clustering is hierarchical clustering.
In (agglomerative) hierarchical clustering, each data point starts
as its own cluster, and the two closest clusters are then merged
step by step.
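The merge-the-closest-pair idea can be sketched as follows. This is a minimal illustration that assumes one-dimensional numeric points and single linkage (the distance between two clusters is the distance between their closest members); clustering real documents would need a document distance instead. The function name `single_linkage` and the sample values are hypothetical.

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering of 1-D points with single linkage.

    Starts with one cluster per point and repeatedly merges the two
    clusters whose closest members are nearest, until n_clusters remain.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i and drop the now-empty slot.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

groups = single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], n_clusters=3)
```

With these sample values the two nearby pairs are merged first and 9.0 stays on its own. Stopping at a chosen number of clusters is a design choice for the example; recording every merge instead would give the full hierarchy.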
Categorization
‘Categorization’ refers to assigning a given document to a specific
category. A common example is segregating application forms on
the basis of age, discipline, class, etc.
Categorization can be done on the basis of the document’s topics or its
attributes, such as type of document, author, year of printing,
subject, etc.
Categorization is also called ‘classification’ when you want to assign
instances to the appropriate class among a set of known types. If you
use Gmail for handling emails, you will find tabs named
Primary, Promotions, Social, Updates and Forums. Your emails are
categorized into these categories.
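As a toy illustration of assigning a document to one of several known categories, the sketch below counts keyword matches per category. It does not describe how Gmail or any real mail service works; the keyword lists, the function name `categorize` and the default category are all assumptions made for the example.

```python
def categorize(text, keywords_by_category, default="Primary"):
    """Return the category whose keyword set overlaps the text the most.

    Falls back to `default` when no keyword matches at all.
    """
    words = set(text.lower().split())
    best_category, best_hits = default, 0
    for category, keywords in keywords_by_category.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category

# Hypothetical keyword rules, invented for illustration only.
RULES = {
    "Promotions": {"sale", "discount", "offer"},
    "Social": {"friend", "follow", "liked"},
}
label = categorize("Big sale today with a discount offer", RULES)
```

A classifier trained on labelled examples would normally replace such hand-written rules, but the input and output have the same shape: a document goes in and one known category comes out.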
Summarization
A summary is a shorter form of text derived from one or
more texts that conveys the important information of the
original document(s).
The most important advantage of using a summary is that it
reduces reading time.
Text summarization methods can be classified into the
following types:
Extractive summarization
Abstractive summarization
Indicative summarization
Informative summarization
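Of the types listed above, extractive summarization is the easiest to sketch, because it only selects sentences that already exist in the text. The example below is a minimal illustration under stated assumptions: sentences are split on end punctuation, each sentence is scored by the average corpus frequency of its words, and no stop-word removal is done. The function name `summarize` and the sample text are hypothetical.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

summary = summarize("Text mining cleans text. Cats sleep.")
```

The first sentence wins here because "text" occurs twice in the input, which raises its average word frequency. An abstractive method would instead generate new wording, which is beyond a short sketch like this.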
4. Methods and Approaches in Text
Analytics
2. Units of content
6. Drawing conclusions
Natural Language Processing
Programming computers to process and analyze natural
language is called Natural Language Processing (NLP).
The NLP process is broken down into three parts. The first
task of NLP is to understand the natural language received
by the computer.
The next task is part-of-speech (POS) tagging, also called
word-category disambiguation.
The third step in NLP is text-to-speech conversion.
At this stage, the computer programming language is
converted into an audible or textual format for the user.
Simple Predictive Modeling
Predictive modeling is a statistical technique for making
predictions based on past occurrences/data.
It involves creating, testing and validating a model to best
predict the outcome. This is done by running one or more
algorithms on the data set on which the prediction is to be
carried out.
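The create-then-test cycle can be illustrated with a very small model. The sketch below assumes a straight-line model fitted by ordinary least squares and a mean absolute error measured on held-out data; the slides do not prescribe a particular algorithm, and the function names and data values are made up for the example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_abs_error(model, xs, ys):
    """Average absolute gap between predictions and actual values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical past data following y = 2x + 1, split into two parts.
train_x, train_y = [1, 2, 3, 4], [3, 5, 7, 9]   # used to create the model
test_x, test_y = [5, 6], [11, 13]               # held out to test it

model = fit_line(train_x, train_y)
error = mean_abs_error(model, test_x, test_y)
```

Keeping the test data separate from the data used to fit the model is what makes the error a fair estimate of how the model behaves on new cases.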
The seven steps involved in predictive modeling are:
information.
Sentiment Analysis
Emotion Detection
Scholarly Communication
Health
Visualization
Let’s Sum Up