Chapter 7-Text Analytics, Text Mining
CHAPTER OVERVIEW
This chapter provides a rather comprehensive overview of text mining and one of its most
popular applications, sentiment analysis, as they both relate to business analytics and
decision support systems. Generally speaking, sentiment analysis is a derivative of text
mining, and text mining is essentially a derivative of data mining. Because textual data is
increasing in volume more than the data in structured databases, it is important to know
some of the techniques used to extract actionable information from the large quantity of
unstructured data.
CHAPTER OUTLINE
Copyright © 2014 Pearson Education, Inc.
7.3 NATURAL LANGUAGE PROCESSING
Application Case 7.2: Text Mining Improves Hong Kong
Government’s Ability to Anticipate and Address Public
Complaints
Section 7.3 Review Questions
5. Financial Markets
6. Politics
7. Government Intelligence
8. Other Interesting Areas
Section 7.8 Review Questions
Text analytics, text mining, and sentiment analysis are topics that students find
especially interesting and (relatively) easier than data mining. They are able to
relate to text mining applications. Students are familiar with social media outlets
such as Facebook and Twitter, and often contribute sentiments of their own. The
opening vignette discusses IBM’s Watson, which utilized text analytics (along
with many other AI-related technologies) to win at Jeopardy!. Students can come
up with examples of value-added text mining and sentiment analysis applications.
It will be useful to distinguish between text mining and the broader topic of text
analytics. This will be a good chapter to engage students in class discussions, and
especially to relate the concepts of the technologies to the various application
cases in the chapter. Students should also be able to discuss other possible
applications, perhaps even within the university or college that they are attending.
For example, how can text analytics be used to improve the student experience?
2. What technologies were used in building Watson (both hardware and software)?
Watson is built on the DeepQA framework. The hardware for this system involves
a massively parallel processing architecture. In terms of software, Watson uses a
variety of AI-related QA technologies, including text mining, natural language
processing, question classification and decomposition, automatic source
acquisition and evaluation, entity and relation detection, logical form generation,
and knowledge representation and reasoning.
3. What are the innovative characteristics of DeepQA architecture that made Watson
superior?
scoring evidence, and merging and ranking hypotheses. More important than any
particular technique is the combination of overlapping approaches that can bring
their strengths to bear and contribute to improvements in accuracy, confidence,
and speed.
4. Why did IBM spend all that time and money to build Watson? Where is the ROI?
IBM’s goal was to advance computer science by exploring new ways for
computer technology to affect science, business, and society. The techniques IBM
developed with DeepQA and Watson are relevant in a wide variety of domains
central to IBM’s mission. For example, IBM is currently working on a version of
Watson to take on surmountable problems in healthcare and medicine. If
successful, this could give IBM a distinct competitive advantage in this important
technological application area.
The most obvious competitive smart machine, also developed by IBM, is Deep
Blue, which in 1997 defeated the world chess champion, Garry Kasparov. The
technology involved mostly brute force computing power, involving massive
parallelism. This enabled Deep Blue to evaluate 200 million positions per second,
with search heuristics enabling Deep Blue to search from six to twenty moves
ahead, far more than any previous chess playing program. Deep Blue was inspired
by and evolved from another chess playing system called Deep Thought, which
was developed at Carnegie Mellon and IBM.
Text analytics is a concept that includes information retrieval (e.g., searching and
identifying relevant documents for a given set of key terms) as well as
information extraction, data mining, and Web mining. By contrast, text mining is
primarily focused on discovering new and useful knowledge from textual data
sources. The overarching goal for both text analytics and text mining is to turn
unstructured textual data into actionable information through the application of
natural language processing (NLP) and analytics. However, text analytics is a
broader term because of its inclusion of information retrieval. You can think of
text analytics as a combination of information retrieval plus text mining.
The use of text mining as a BI tool is increasing because of the rapid growth in text
data and the availability of sophisticated BI tools. The benefits of text mining are obvious in
the areas where very large amounts of textual data are being generated, such as
law (court orders), academic research (research articles), finance (quarterly
reports), medicine (discharge summaries), biology (molecular interactions),
technology (patent files), and marketing (customer comments).
formal representations (in the form of numeric and symbolic data) that are easier
for computer programs to manipulate.
Text mining uses natural language processing to induce structure into the text
collection and then uses data mining algorithms such as classification, clustering,
association, and sequence discovery to extract knowledge from it.
NLP moves beyond syntax-driven text manipulation (which is often called “word
counting”) to a true understanding and processing of natural language that
considers grammatical and semantic constraints as well as the context. The
challenges include:
• Speech recognition.
• Text-to-speech.
• Text proofing.
• Optical character recognition.
1. List and briefly discuss some of the text mining applications in marketing.
Text mining can be used to increase cross-selling and up-selling by analyzing the
unstructured data generated by call centers.
See Figure 7.6 (p. 309). Text mining entails three tasks:
Establish the Corpus: Collect and organize the domain-specific
unstructured data
Create the Term–Document Matrix: Introduce structure to the corpus
Extract Knowledge: Discover novel patterns from the T-D matrix
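As a hedged illustration of the three tasks, the following Python sketch builds a small term-document matrix from a toy corpus. The documents, the stop-word list, and the function names (tokenize, build_tdm) are assumptions made for this example and do not come from the textbook. A real project would typically also apply stemming and a fuller stop-word list.

```python
import re
from collections import Counter

# Task 1: establish the corpus (three made-up documents, for illustration only).
corpus = [
    "Customers love the new camera",
    "The camera battery fails too often",
    "Battery life matters to customers",
]

# A tiny, hypothetical stop-word list.
STOP_WORDS = {"the", "to", "too", "a", "of"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_tdm(docs):
    """Task 2: return (terms, matrix), where matrix[d][t] is a raw term count."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for term in terms] for c in counts]
    return terms, matrix


terms, tdm = build_tdm(corpus)

# Task 3 (a very simple stand-in for knowledge extraction):
# list the terms that occur in more than one document.
shared = [t for j, t in enumerate(terms) if sum(1 for row in tdm if row[j] > 0) > 1]
```

Here each row of tdm is a document and each column is a term, so the matrix can be handed to ordinary data mining routines once the counts are normalized.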
2. What is the reason for normalizing word frequencies? What are the common methods
for normalizing word frequencies?
The raw indices need to be normalized in order to have a more consistent TDM
for further analysis. Common methods are log frequencies, binary frequencies,
and inverse document frequencies.
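The three normalization methods can be sketched in Python as follows. The formulas shown are one commonly cited formulation and are an assumption here, since the answer above names the methods without defining them. The raw counts and function names are hypothetical.

```python
import math


def log_frequency(wf):
    """Dampen raw counts: 1 + log(wf) for wf > 0, else 0."""
    return 1 + math.log(wf) if wf > 0 else 0.0


def binary_frequency(wf):
    """Record only whether the term occurs in the document."""
    return 1 if wf > 0 else 0


def inverse_document_frequency(wf, n_docs, doc_freq):
    """Weight the dampened count by how rare the term is across documents."""
    if wf == 0:
        return 0.0
    return (1 + math.log(wf)) * math.log(n_docs / doc_freq)


# Hypothetical raw counts: rows are documents, columns are terms.
raw = [
    [3, 0, 1],
    [1, 2, 1],
    [0, 0, 4],
]
n_docs = len(raw)
# Document frequency: number of documents in which each term appears.
doc_freq = [sum(1 for row in raw if row[j] > 0) for j in range(len(raw[0]))]

normalized = [
    [inverse_document_frequency(wf, n_docs, doc_freq[j]) for j, wf in enumerate(row)]
    for row in raw
]
```

Under this formulation, a term that appears in every document (the third column above) receives a weight of zero, because it does nothing to distinguish one document from another.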
1. What are some of the most popular text mining software tools?
2. Why do you think most of the text mining tools are offered by statistics
companies?
Students should mention that many of the capabilities of data mining apply to text
mining. Since statistics companies offer data mining tools, offering text mining is
a natural business extension.
3. What do you think are the pros and cons of choosing a free text mining tool over a
commercial tool?
Free tools often have fewer features, less polished user interfaces, limited support,
and slower or reduced processing capabilities. The main advantage of free tools is
the cost.
Sentiment analysis tries to answer the question, “What do people feel about a
certain topic?” by digging into opinions of many using a variety of automated
tools. It is also known as opinion mining, subjectivity analysis, and appraisal
extraction.
Sentiment analysis shares many characteristics and techniques with text mining.
However, unlike text mining, which categorizes text by conceptual taxonomies of
topics, sentiment classification generally deals with two classes (positive versus
negative), a range of polarity (e.g., star ratings for movies), or a range in strength
of opinion.
3. What are the common challenges that sentiment analysis has to deal with?
Sentiment that appears in text comes in two flavors: explicit, where the subjective
sentence directly expresses an opinion (“It’s a wonderful day”), and implicit,
where the text implies an opinion (“The handle breaks too easily”). Implicit
sentiment is harder to analyze because it may not include words that are
obviously evaluations or judgments. Another challenge involves the timeliness of
collection/analysis of textual data coming from a wide variety of data sources. A
third challenge is the difficulty of identifying whether a piece of text involves
sentiment or not, especially with implicit sentiment analysis. The same sorts of
issues involving text mining in natural language settings also apply to sentiment
analysis.
1. What are the most popular application areas for sentiment analysis? Why?
Various areas related to brand management can benefit from sentiment analysis.
These span many public and private sectors, including financial markets,
politics, and government intelligence. E-commerce sites, e-mail
filtration/prioritization, and citation analysis are just some of the application areas
that can benefit from the information derived from sentiment analysis.
Many financial analysts believe that the stock market is mostly sentiment driven,
so use of sentiment analysis has much relevance for financial markets. Automated
analysis of market sentiments using social media, news, blogs, and discussion
groups can help with predicting market movements. If done correctly, sentiment
analysis can identify short-term stock movements based on the buzz in the
market, potentially impacting liquidity and trading.
Section 7.9 Review Questions
1. What are the main steps in carrying out sentiment analysis projects?
The first step when performing sentiment analysis of a text document is called
sentiment detection, during which text data is differentiated between fact and
opinion (objective vs. subjective). This is followed by negative-positive (N-P)
polarity classification, where a subjective text item is classified on a bipolar
range. Following this comes target identification (identifying the person, product,
event, etc. that the sentiment is about). Finally come collection and aggregation,
in which the overall sentiment for the document is calculated based on the
calculations of sentiments of individual phrases and words from the first three
steps.
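The four steps above can be sketched in Python as a minimal pipeline. The lexicon, the target list, and the function names are hypothetical and chosen only for illustration. Real systems use far richer lexicons or trained classifiers, and this sketch handles only explicit sentiment.

```python
from collections import defaultdict

# Hypothetical opinion lexicon and target list, for illustration only.
LEXICON = {"wonderful": 1.0, "great": 1.0, "poor": -1.0, "terrible": -1.0}
TARGETS = {"battery", "screen"}


def words(sentence):
    """Split on whitespace, strip basic punctuation, and lowercase."""
    return [w.strip(".,!?").lower() for w in sentence.split()]


def detect_sentiment(sentence):
    """Step 1: treat a sentence as subjective if it contains a lexicon word."""
    return any(w in LEXICON for w in words(sentence))


def classify_polarity(sentence):
    """Step 2: sum lexicon scores; a positive total means positive polarity."""
    return sum(LEXICON.get(w, 0.0) for w in words(sentence))


def identify_target(sentence):
    """Step 3: return the first known target mentioned, or None."""
    return next((w for w in words(sentence) if w in TARGETS), None)


def aggregate(sentences):
    """Step 4: average the polarity of subjective sentences per target."""
    scores = defaultdict(list)
    for s in sentences:
        if detect_sentiment(s):
            target = identify_target(s)
            if target is not None:
                scores[target].append(classify_polarity(s))
    return {t: sum(v) / len(v) for t, v in scores.items()}


review = [
    "The screen is wonderful.",
    "The battery is terrible.",
    "The battery lasts four hours.",
    "Great screen!",
]
summary = aggregate(review)
```

The third sentence is factual, so step 1 filters it out before polarity is computed. An implicit opinion such as “The handle breaks too easily” would also be filtered out by this simple detector, which illustrates why implicit sentiment is harder to handle.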
2. What are the two common methods for polarity identification? What is the main
difference between the two?
Speech analytics is a growing field of science that allows users to analyze and
extract information from both live and recorded conversations. This technology
can deliver meaningful and quantitative business intelligence through the analysis
of the millions of recorded calls that occur in customer contact centers around the
world. With respect to sentiment analysis, speech analytics can help to assess the
emotional states expressed in a conversation and to measure the presence and
strength of positive and negative feelings that are exhibited by the participants.
This can be a valuable tool for customer relationship management and agent
training.
Using dedicated analysts and state-of-the-art software tools (including specialized
text mining tools from ClearForest Corp.), Kodak continuously digs deep into
various data sources (patent databases, news release archives, and product
announcements) in order to develop a holistic view of the competitive landscape.
3. What were the challenges, the proposed solution, and the obtained results?
Kodak’s challenges are to apply more than a century’s worth of knowledge about
imaging science and technology to new uses and to secure those new uses with
patents. The problem is that it is nearly impossible to efficiently process such
enormous amounts of semistructured data (patent documents usually contain
partially structured and partially textual data). But through the use of text mining
tools, Kodak is able to obtain a wide range of benefits. These include enabling
competitive intelligence, making critical business decisions, identifying and
recruiting new talent, identifying unauthorized use of Kodak’s patents, identifying
complementary inventions for forming symbiotic partnerships, and preventing
competitors from creating similar products.
Application Case 7.2: Text Mining Improves Hong Kong Government’s Ability to
Anticipate and Address Public Complaints
1. How did the Hong Kong government use text mining to better serve its
constituents?
2. What were the challenges, the proposed solution, and the obtained results?
The major challenge facing the Hong Kong government was analyzing and
responding to public complaints made to its call center. This involves handling
about 2.65 million calls and 98,000 e-mails per year. Originally, they
attempted to compile reports on complaint statistics for reference by government
departments manually. But through ‘eyeball’ observations, it was impossible to
effectively reveal new or more complex potential public issues and identify their
root causes, as most of the complaints were recorded in unstructured textual
format. So, the government decided to utilize text processing and mining
approaches to uncover trends, patterns, and relationships in the text data. This
allowed the government to better understand the voice of the people, improve
service delivery, make informed decisions, and develop smart strategies. The
result was a boost in public satisfaction.
Application Case 7.3: Mining for Lies
3. What do you think are the main challenges for such an automated system?
One challenge is that training the system depends on humans to ascertain the
truthfulness of statements in the training data itself. You can’t know for sure
whether these statements are true or false, so you may be using incorrect training
samples when “teaching” the machine learning system to predict lies in new text
data. (This answer will vary by student.)
Application Case 7.4: Text Mining and Sentiment Analysis Help Improve Customer
Service Performance
1. How did the financial services firm use text mining and text analytics to improve
its customer service performance?
The company used PolyAnalyst’s text analysis tools for extracting complex word
patterns, grammatical and semantic relationships, and expressions of sentiment.
The results were classified into context-specific themes to identify actionable
issues. The relationships between structured fields and text analysis results were
established to identify patterns and interactions. Results were presented via
graphical, interactive, web-based reports. Actionable issues were assigned to
relevant individuals responsible for their resolution, and criteria were established
to capture and detect compliance with the company’s Quality Standards.
2. What were the challenges, the proposed solution, and the obtained results?
Continually monitoring service levels is essential for service quality control.
Customer surveys and associate-customer interactions are the best way to obtain
data for such monitoring. But manually evaluating this data is subjective, error-
prone, and time/labor-intensive. Therefore, the company needed a system for (1)
automatically evaluating associate-customer interactions for compliance with
quality standards and (2) analyzing survey responses to extract positive and
negative feedback, while allowing for the diversity of natural language
expression. The solution was to use PolyAnalyst. (See answer to Question #1 for
the remainder of the answer.)
1. How can text mining be used to ease the task of literature review?
2. What are the common outcomes of a text mining project on a specific collection
of journal articles? Can you think of other potential outcomes not mentioned in
this case?
1. What do you think are the common characteristics of the kind of challenges these
five companies were facing?
In all cases, the companies face the challenge of analyzing large amounts of
unstructured data in text form. Time is also a critical element; this analysis must
be done quickly and accurately. In most cases, the business challenge had to do
with customer satisfaction; in one it had to do with fraud detection.
2. What are the types of solution methods and tools proposed in these case
synopses?
SAS Text Miner was the primary tool used in all cases. Additional tools (all from
SAS) included Enterprise Miner and BI Server. So, the primary method used for
solving these companies’ problems involved text and data mining and analytics.
3. What do you think are the key benefits of using text mining and advanced
analytics (compared to the traditional way to do the same)?
The key benefits include increased accuracy of analysis results, speed of response
to potential problems, cost savings, enhanced customer relationships, improved
product/service quality, and greater competitiveness.
Application Case 7.7: Whirlpool Achieves Customer Loyalty and Product Success
with Text Analytics
1. How did Whirlpool use capabilities of text analytics to better understand their
customers and improve product offerings?
Whirlpool uses Attensity products for deep text analytics of their multi-channel
customer data, which includes e-mails, CRM notes, repair notes, warranty data,
and social media. The company uses text analytics solutions every day to get to
the root cause of product issues and receive alerts on emerging issues. Users of
Attensity’s analytics products at Whirlpool include product/service managers,
corporate/product safety staff, consumer advocates, service quality staff,
innovation managers, the Category Insights team, and all of Whirlpool’s
manufacturing divisions (across five countries).
2. What were the challenges, the proposed solution, and the obtained results?
Customer satisfaction and feedback are at the center of how Whirlpool drives its
overarching business strategy, so gaining insight into customer satisfaction and
product feedback is paramount. Whirlpool needed to more effectively understand
and react to customer and product feedback data, originating from blogs, e-mails,
reviews, forums, repair notes, and other data sources. Managers needed to report
on longitudinal data, and be able to compare issues by brand over time. The
solution was to use Attensity’s Text Analytics application. This enabled the
company to more proactively identify and mitigate quality issues before issues
escalated and claims were filed. Since then, Whirlpool has also been able to avoid
recalls, which increased customer loyalty and reduced costs (realizing 80%
savings on their costs of recalls due to early detection).
Application Case 7.8: Cutting Through the Confusion: Blue Cross Blue Shield of
North Carolina Uses Nexidia’s Speech Analytics to Ease Member Experience in
Healthcare
1. For a large company like BCBSNC with a lot of customers, what does “listening
to customers” mean?
2. What were the challenges, the proposed solution, and the obtained results for
BCBSNC?
Contact center calls are costly and time-consuming, especially when dealing with
members’ “confusion calls.” This causes reductions in customer satisfaction.
Asking customer service professionals to more thoroughly document the nature of
the calls within the contact center desktop application is not a viable option
because this is largely a manual effort (using a desktop application), which adds
significantly to labor costs. BCBSNC’s solution was to leverage its partnership
with Nexidia, a leading provider of customer interaction analytics, and use speech
analytics to better understand the cause and depth of member confusion. The
company used sentiment analysis (utilizing the linguistic approach) to better
understand how members perceived the value they received from BCBSNC and
their overall opinion of the company. Based on what BCBSNC learned, the
company implemented strategies to improve member communication and
customer experience. This included development of more reader-friendly
literature and website redesigns to support easier navigation and education. As a
result, BCBSNC projects a 10 to 25 percent drop in “confusion calls,” resulting in
a better customer service experience and a lower cost to serve.
1. Explain the relationship among data mining, text mining, and sentiment analysis.
Technically speaking, data mining is a process that uses statistical, mathematical, and
artificial intelligence techniques to extract and identify useful information and
subsequent knowledge (or patterns) from large sets of data. Data mining is the
general concept. Text mining is a specific application of data mining: applying it to
unstructured text files. Sentiment analysis is a specialized form of text and data
mining that identifies and classifies terms in text sources according to sentiment (e.g.,
judgment, opinion, and emotional content).
Before making a decision to purchase any mining software, organizations should
consider the standard criteria to use when investing in any major software:
cost/benefit analysis, people with the expertise to use the software and perform
the analyses, availability of data/information, and a business need for the software
and capabilities.
3. Discuss the differences and commonalities between text mining and sentiment
analysis.
4. In your own words, define text mining and discuss its most popular applications.
5. Discuss the similarities and differences between the data mining process (e.g.,
CRISP-DM) and the three-step, high-level text mining process explained in this
chapter.
See Figure 7.6 (p. 309). Text mining entails three tasks:
Establish the Corpus: Collect and organize the domain-specific
unstructured data
Create the Term–Document Matrix: Introduce structure to the corpus
Extract Knowledge: Discover novel patterns from the T-D matrix
6. What does it mean to introduce structure into the text-based data? Discuss the
alternative ways of introducing structure into text-based data.
Text mining, like other data mining approaches, is an inductive approach for
finding patterns and trends in data. One difference between text mining and other
data mining approaches is the use of natural language processing.
Four possible approaches for inducing structure in text in order to extract
knowledge are (a) classification (grouping terms into predefined categories), (b)
clustering (coming up with “natural” groupings), (c) association rule learning
(finding frequent combinations of terms), and (d) trend analysis (recognizing
concept distributions based on specific collections of documents).
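As a hedged illustration of approach (c), the following Python sketch counts how often pairs of terms co-occur across documents and keeps the pairs that meet a minimum support. The documents, the support threshold, and the function name frequent_pairs are assumptions for this example. This is simple co-occurrence counting, not a full association rule algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical documents already reduced to sets of terms.
docs = [
    {"battery", "camera", "lens"},
    {"battery", "camera"},
    {"camera", "lens"},
    {"battery", "charger"},
]


def frequent_pairs(term_sets, min_support):
    """Return term pairs that co-occur in at least min_support documents."""
    pair_counts = Counter()
    for terms in term_sets:
        # Sort so that each pair is always counted under the same key.
        pair_counts.update(combinations(sorted(terms), 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


pairs = frequent_pairs(docs, min_support=2)
```

With these four documents, only the pairs (battery, camera) and (camera, lens) appear together at least twice, so they are the only candidates for association rules.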
7. What is the role of natural language processing in text mining? Discuss the
capabilities and limitations of NLP in the context of text mining.
8. List and discuss three prominent application areas for text mining. What is the
common theme among the three application areas you chose?
Sentiment analysis tries to answer the question, “What do people feel about a
certain topic?” by digging into opinions of many using a variety of automated
tools. It is also known as opinion mining, subjectivity analysis, and appraisal
extraction.
Sentiment analysis shares many characteristics and techniques with text mining.
However, unlike text mining, which categorizes text by conceptual taxonomies of
topics, sentiment classification generally deals with two classes (positive versus
negative), a range of polarity (e.g., star ratings for movies), or a range in strength
of opinion.
11. What are the common challenges that sentiment analysis has to deal with?
Sentiment that appears in text comes in two flavors: explicit, where the subjective
sentence directly expresses an opinion (“It’s a wonderful day”), and implicit,
where the text implies an opinion (“The handle breaks too easily”). Implicit
sentiment is harder to analyze because it may not include words that are
obviously evaluations or judgments. Another challenge involves the timeliness of
collection/analysis of textual data coming from a wide variety of data sources. A
third challenge is the difficulty of identifying whether a piece of text involves
sentiment or not, especially with implicit sentiment analysis. The same sorts of
issues involving text mining in natural language settings also apply to sentiment
analysis.
12. What are the most popular application areas for sentiment analysis? Why?
Various areas related to brand management can benefit from sentiment analysis.
These span many public and private sectors, including financial markets,
politics, and government intelligence. E-commerce sites, e-mail
filtration/prioritization, and citation analysis are just some of the application areas
that can benefit from the information derived from sentiment analysis.
14. What would be the expected benefits and beneficiaries of sentiment analysis in
politics?
predict who is more likely to win or lose. Sentiment analysis can help understand
what voters are thinking and can clarify a candidate’s position on issues.
Sentiment analysis can help political organizations, campaigns, and news analysts
to better understand which issues and positions matter the most to voters. The
technology was successfully applied by both parties in the 2008 and 2012
American presidential election campaigns.
Many financial analysts believe that the stock market is mostly sentiment driven,
so use of sentiment analysis has much relevance for financial markets. Automated
analysis of market sentiments using social media, news, blogs, and discussion
groups can help with predicting market movements. If done correctly, sentiment
analysis can identify short-term stock movements based on the buzz in the
market, potentially impacting liquidity and trading.
16. What are the main steps in carrying out sentiment analysis projects?
The first step when performing sentiment analysis of a text document is called
sentiment detection, during which text data is differentiated between fact and
opinion (objective vs. subjective). This is followed by negative-positive (N-P)
polarity classification, where a subjective text item is classified on a bipolar
range. Following this comes target identification (identifying the person, product,
event, etc. that the sentiment is about). Finally come collection and aggregation,
in which the overall sentiment for the document is calculated based on the
calculations of sentiments of individual phrases and words from the first three
steps.
17. What are the two common methods for polarity identification? What is the main
difference between the two?
18. Describe how special lexicons are used in identification of sentiment polarity.
for each term (and its set of synonyms, or synset) in the lexicon. An additional
approach is to include other affective labels including emotion, cognitive state,
attitude, feeling, etc.
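A lexicon-based polarity lookup can be sketched in Python as follows. This is a minimal sketch that assumes the lexicon stores a positivity and a negativity score for each synset. The synsets, the scores, and the function name polarity are invented for illustration and are not taken from any published lexicon.

```python
# Hypothetical lexicon: each entry pairs a synset (a set of synonyms)
# with made-up positivity and negativity scores between 0 and 1.
SYNSETS = [
    ({"good", "fine", "decent"}, 0.75, 0.0),
    ({"bad", "poor", "awful"}, 0.0, 0.875),
    ({"table", "desk"}, 0.0, 0.0),
]

# Index every synonym to the scores of its synset for fast lookup.
WORD_SCORES = {
    word: (pos, neg) for synonyms, pos, neg in SYNSETS for word in synonyms
}


def polarity(text):
    """Return 'positive', 'negative', or 'neutral' from summed synset scores."""
    pos_total = neg_total = 0.0
    for word in text.lower().split():
        pos, neg = WORD_SCORES.get(word.strip(".,!?"), (0.0, 0.0))
        pos_total += pos
        neg_total += neg
    if pos_total > neg_total:
        return "positive"
    if neg_total > pos_total:
        return "negative"
    return "neutral"
```

Because every synonym in a synset shares the same scores, “decent” and “good” are treated identically. Words missing from the lexicon contribute nothing, which is one reason lexicon coverage matters in practice.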
Speech analytics is a growing field of science that allows users to analyze and
extract information from both live and recorded conversations. This technology
can deliver meaningful and quantitative business intelligence through the analysis
of the millions of recorded calls that occur in customer contact centers around the
world. With respect to sentiment analysis, speech analytics can help to assess the
emotional states expressed in a conversation and to measure the presence and
strength of positive and negative feelings that are exhibited by the participants.
This can be a valuable tool for customer relationship management and agent
training.
BBVA used text mining and sentiment analysis to help the company understand
what existing and potential clients say about it through social media. This enabled
the company to monitor its reputation and make improvements when needed.
BBVA used an IBM social media research asset called Corporate Brand
Reputation Analysis (COBRA). The company followed up by implementing IBM
Cognos Consumer Insight to unify all its branches worldwide.
2. What were BBVA’s challenges? How did BBVA overcome them with text mining and
social media analysis?
BBVA’s challenge was to better manage its reputation in the face of myriads of
comments (both positive and negative) about BBVA that were being posted on
social media sites, blogs, and other commentaries worldwide. This challenge was
overcome by partnering with IBM to implement a global corporate-wide system
for reputation analysis. The results were a one percent increase in positive
comments and a 1.5 percent reduction in negative comments. In addition, BBVA
was better able to monitor reputations, respond quickly to “reputation risk”
indicators, unify the online measuring of its business strategies, and enable more
detailed, structured, and controlled online data analysis.
3. In what other areas, in your opinion, can BBVA use text mining?
Another possible way to use text mining is for analyzing customer queries
regarding BBVA’s products and services. By analyzing e-mails and client
phone conversations, BBVA can identify common customer issues and
thereby modify its customer service, as well as improve the quality of its
products and services.
Another possible way to use text mining is for analyzing news and financial
literature in order to better predict financial trends.