Chapter 07 - in Class

Chapter 7 of the document discusses text mining, sentiment analysis, and social analytics, highlighting their importance in extracting knowledge from unstructured data. It covers the processes involved in text mining, the differentiation between text and data mining, and various applications across sectors such as marketing, medicine, and security. Additionally, it outlines the methods and challenges of natural language processing (NLP) and sentiment analysis, including the steps for detecting sentiment and identifying targets.

Analytics, Data Science and AI:

Systems for Decision Support


Eleventh Edition

Chapter 7
Text Mining, Sentiment Analysis,
and Social Analytics

Copyright © 2020, 2015, 2011 Pearson Education, Inc. All Rights Reserved
Learning Objectives (1 of 2)
7.1 Describe text mining and understand the need for
text mining
7.2 Differentiate among text analytics, text mining and
data mining
7.3 Understand the different application areas for text
mining
7.4 Know the process of carrying out a text mining
project
7.5 Appreciate the different methods to introduce
structure to text-based data
7.6 Describe sentiment analysis

Learning Objectives (2 of 2)
7.7 Develop familiarity with popular applications of
sentiment analysis
7.8 Learn the common methods for sentiment analysis
7.9 Become familiar with speech analytics as it relates to
sentiment analysis
7.10 Learn three facets of Web analytics—content,
structure, and usage mining
7.11 Know social analytics including social media and social
network analyses

Text Analytics and Text Mining (1 of 2)
Concepts In Text Analytics:
• Information Retrieval
• Information Extraction
• Text Mining = Information Extraction + Data Mining + Web
Mining
• Text Analytics =
Information Retrieval + Information Extraction + Data
Mining + Web Mining
• or simply
– Text Analytics = Information Retrieval + Text Mining

Text Analytics and Text Mining (2 of 2)
Figure 7.2 Text Analytics, Related Application Areas, and
Enabling Disciplines.

Text Mining Concepts (1 of 2)
• 85-90 percent of all corporate data is in unstructured form
• Unstructured corporate data is doubling in size every 18
months
– Tapping into these information sources is necessary to stay
competitive
• Answer: text mining
– A semi-automated process of extracting knowledge
from unstructured data sources
– a.k.a. text data mining or knowledge discovery in
textual databases

Text Mining Concepts (2 of 2)
• Benefits of text mining in text-rich data environments
– e.g., law (court orders), academic research (research
articles), finance (quarterly reports), medicine
(discharge summaries), biology (molecular
interactions), technology (patent files), marketing
(customer comments), etc.
• Electronic communication records
– Spam filtering
– Email prioritization and categorization
– Automatic response generation

Data Mining versus Text Mining
• Both seek novel and useful patterns
• Both are semi-automated processes
• Difference is the nature of the data:
– Structured versus unstructured data
– Structured data: in databases
– Unstructured data: Word documents, PDF files, text
excerpts, XML files…
• To perform text mining
– First, impose structure on the data
– Second, mine the structured data

Text Mining Application Areas
• Information extraction
• Topic tracking
• Summarization
• Categorization
• Clustering
• Concept linking
• Question answering

Text Mining Terminology (1 of 2)
• Unstructured or semi-structured data
• Corpus (plural: corpora) – a large, structured set of texts
• Terms – a word or multiword phrase
• Concepts – higher-level features generated from text
• Stemming – reducing inflected words to their root (stem)
• Stop words – words filtered out before analysis
• Synonyms and polysemes (homonyms)
• Tokenizing – breaking text into tokens (categorized blocks of text)

Text Mining Terminology (2 of 2)
• Term dictionary
• Word frequency
• Part-of-speech tagging
• Morphology - internal structure of words
• Term-by-document matrix
– Occurrence matrix
• Singular value decomposition
– Dimensionality reduction of the term-by-document
matrix
RapidMiner Introduction

Natural Language Processing (NLP)
(1 of 4)
• Structuring a collection of text
– Old approach: bag-of-words
– New approach: natural language processing (NLP)
• NLP
– a subfield of artificial intelligence and computational
linguistics
– the study of "understanding" natural human
language
– move beyond syntax-based to semantics-based text
mining

Natural Language Processing (NLP)
(2 of 4)
• “Understanding”
– Humans understand; what about computers?
– Natural language is vague and context driven
– True understanding requires extensive knowledge of a
topic
– Can (or will) computers ever understand natural language
as accurately as we do?

Brief Steps of NLP Process

Natural Language Processing (NLP)
(3 of 4)
• Challenges in NLP
– Part-of-speech tagging
– Text segmentation
– Word sense disambiguation
– Syntax ambiguity
– Imperfect or irregular input
– Speech acts

Natural Language Processing (NLP)
(4 of 4)
• Dream of the AI community
– Capability of automatically reading and obtaining
knowledge from text
– Large Language Models such as ChatGPT!

NLP Task Categories
• Question answering
• Automatic summarization
• Natural language generation & understanding
• Machine translation
• Foreign language reading & writing
• Speech recognition
• Text to speech (e.g., https://elevenlabs.io/ for voice cloning and
voice dubbing)
• Text proofing
• Optical character recognition

Text Mining Applications
• Marketing applications
– Enables better CRM
• Security applications
– Deception detection
– Tracking organized crime
• Medicine and biology
– Literature-based gene identification
• Academic applications
– Research stream analysis

Application Case 7.3 (1 of 4)
Mining for Lies
• Deception detection
– A difficult problem
– Detection is limited to text only
• The study
– Analyzed text-based testimonies of persons of interest
at military bases
– Used only text-based features (cues)

Application Case 7.3 (2 of 4)
Mining for Lies
Figure 7.3 Text-Based Deception-Detection Process.

Source: Fuller, C. M., D. Biros, & D. Delen. (2008, January). Exploration of Feature Selection and Advanced
Classification Models for High-Stakes Deception Detection. Proceedings of the Forty-First Annual Hawaii International
Conference on System Sciences (HICSS), Big Island, HI: IEEE Press, pp. 80–99.
Application Case 7.3 (3 of 4)
Mining for Lies
Table 7.1 Categories and Examples of Linguistic Features Used
in Deception Detection.
Number  Construct (Category)  Example Cues
1       Quantity              Verb count, noun phrase count, etc.
2       Complexity            Average number of clauses, average sentence length, etc.
3       Uncertainty           Modifiers, modal verbs, etc.
4       Nonimmediacy          Passive voice, objectification, etc.
5       Expressivity          Emotiveness
6       Diversity             Lexical diversity, redundancy, etc.
7       Informality           Typographical error ratio
8       Specificity           Spatiotemporal information, perceptual information, etc.
9       Affect                Positive affect, negative affect, etc.

Application Case 7.3 (4 of 4)
Mining for Lies
• 371 usable statements are generated
• 31 features are used
• Different feature selection methods used
• 10-fold cross validation is used
• Results (overall % accuracy)
– Logistic regression 67.28
– Decision trees 71.60
– Neural networks 73.46
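
The 10-fold cross-validation protocol mentioned above can be sketched in plain Python. This is only an illustration of the evaluation loop: the labels are synthetic, and the "model" is a placeholder that always predicts the most common training label, not any of the three classifiers from the study.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and split them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_label(labels):
    """Placeholder 'model': always predict the most common training label."""
    return max(sorted(set(labels)), key=labels.count)

def cross_validated_accuracy(labels, k=10):
    """Hold out each fold once, train on the rest, and pool the hits."""
    folds = k_fold_indices(len(labels), k)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        prediction = majority_label(train)
        correct += sum(labels[i] == prediction for i in held_out)
    return correct / len(labels)

# Synthetic labels only (hypothetical data, not the study's statements).
labels = ["truthful"] * 60 + ["deceptive"] * 40
folds = k_fold_indices(len(labels), 10)
accuracy = cross_validated_accuracy(labels)
print(f"overall accuracy: {accuracy:.2%}")
```

Every statement is used for testing exactly once, so the pooled accuracy reflects performance on data the model did not see during training.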

Text Mining Process (1 of 7)
• A Context Diagram for Text Mining Process

Text Mining Process (2 of 7)
Figure 7.6 The Three-Step/Task Text Mining Process.

Text Mining Process (3 of 7)
• Task 1: Establish the corpus
– Collect all relevant unstructured data
(e.g., textual documents, XML files, emails, Web
pages, short notes, voice recordings…)
– Digitize, standardize the collection
(e.g., all in ASCII text files)
– Place the collection in a common place
(e.g., in a flat file, or in a directory as separate files)
– Preprocess the documents
▪ Segmentation: break the text into sentences
▪ Tokenizing: break each sentence into words (tokens)
▪ Stop words: filter out common terms that carry little meaning
▪ Stemming: reduce words that differ only by a prefix or suffix to a common stem
▪ Lemmatization: map inflected forms of a word to its dictionary form (lemma), e.g., "better" to "good"
▪ Part-of-speech tagging
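
The preprocessing steps listed above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the stop-word list is a tiny made-up sample, and `crude_stem` is a toy suffix stripper standing in for a real stemmer. Lemmatization and part-of-speech tagging need linguistic resources and are left out.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "and", "of", "to", "in"}

def segment(text):
    """Split raw text into sentences at ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run segmentation, tokenizing, stop-word removal, and stemming."""
    return [
        [crude_stem(t) for t in remove_stop_words(tokenize(s))]
        for s in segment(text)
    ]

sample = "The customers were complaining. Refunds are processed in days!"
result = preprocess(sample)
print(result)
```

The output is one list of stemmed terms per sentence, which is the kind of structured input the term-by-document matrix is built from.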
Text Mining Process (4 of 7)
• Task 2: Create the Term–by–Document Matrix

Text Mining Process (5 of 7)
• Task 2: Create the Term–by–Document Matrix (TDM)
– Should all terms be included?
– Stop words, include words
– Synonyms, homonyms
– Stemming
– What is the best representation of the indices (values
in cells)?
▪ Raw counts
▪ Binary frequencies
▪ Log frequencies
▪ Inverse document frequencies
▪ TF-IDF
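
The cell representations listed above can be compared on a tiny example. The sketch below uses three made-up documents and one common choice of formula for each weighting (1 + log(tf) for log frequencies, log(N / df) for inverse document frequency). Other variants exist, so these definitions are assumptions, not the only option.

```python
import math
from collections import Counter

# Hypothetical, already-preprocessed documents (lists of terms).
docs = {
    "d1": ["price", "good", "price"],
    "d2": ["service", "good"],
    "d3": ["price", "service", "slow", "slow"],
}

terms = sorted({t for tokens in docs.values() for t in tokens})
counts = {name: Counter(tokens) for name, tokens in docs.items()}

# Raw counts: rows are terms, columns are documents.
raw = {t: {d: counts[d][t] for d in docs} for t in terms}

# Binary frequencies: 1 if the term occurs in the document, else 0.
binary = {t: {d: int(raw[t][d] > 0) for d in docs} for t in terms}

# Log frequencies: 1 + log(tf) for tf > 0, which dampens large counts.
log_tf = {
    t: {d: (1 + math.log(raw[t][d])) if raw[t][d] else 0.0 for d in docs}
    for t in terms
}

# Inverse document frequency: log(N / number of documents containing t).
n_docs = len(docs)
idf = {t: math.log(n_docs / sum(binary[t].values())) for t in terms}

# TF-IDF: raw term frequency weighted by IDF.
tf_idf = {t: {d: raw[t][d] * idf[t] for d in docs} for t in terms}

for t in terms:
    print(t, raw[t], round(idf[t], 3))
```

A term such as "slow", which appears in only one document, gets the highest IDF and therefore the most weight in that document's column.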
Text Mining Process (6 of 7)
• Task 2: Create the Term–by–Document Matrix (TDM)
– TDM is a sparse matrix.
– How can we reduce the dimensionality of the TDM?
▪ Manual - a domain expert goes through it
▪ Eliminate terms with very few occurrences in very
few documents
▪ Transform the matrix using singular value
decomposition (SVD)
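
Of the three reduction options above, eliminating rare terms is the simplest to sketch without a linear algebra library. The matrix, the term names, and both thresholds below are hypothetical; SVD is not shown here.

```python
# Hypothetical term-by-document matrix of raw counts (rows: terms).
tdm = {
    "price":   {"d1": 2, "d2": 0, "d3": 1, "d4": 3},
    "service": {"d1": 0, "d2": 1, "d3": 1, "d4": 0},
    "refund":  {"d1": 0, "d2": 0, "d3": 1, "d4": 0},
    "zxq":     {"d1": 1, "d2": 0, "d3": 0, "d4": 0},
}

def sparsity(matrix):
    """Fraction of cells that are zero."""
    cells = [v for row in matrix.values() for v in row.values()]
    return sum(v == 0 for v in cells) / len(cells)

def prune_rare_terms(matrix, min_docs=2, min_total=2):
    """Drop terms seen in fewer than min_docs documents AND fewer than
    min_total times overall (thresholds are illustrative assumptions)."""
    kept = {}
    for term, row in matrix.items():
        doc_freq = sum(v > 0 for v in row.values())
        total = sum(row.values())
        if doc_freq >= min_docs or total >= min_total:
            kept[term] = row
    return kept

reduced = prune_rare_terms(tdm)
print(sorted(reduced), sparsity(tdm), sparsity(reduced))
```

Pruning removes rows that are almost entirely zero, so the reduced matrix is both smaller and less sparse.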

Text Mining Process (7 of 7)
• Task 3: Extract patterns/knowledge
– Classification (text categorization)
– Clustering (natural groupings of text)
– Association
– Trend Analysis

Example: RapidMiner Document Classification Demo

Sentiment Analysis
• Sentiment: belief, view, opinion, and conviction
• Sentiment analysis tries to answer the question “What
do people feel about a certain topic?”
• It does so by analyzing data related to the opinions of many, using a
variety of automated tools
• Used in a variety of domains, but its applications in CRM are
especially noteworthy

Sentiment Analysis Applications
• Voice of the Customer (VOC)
• Voice of the Market (VOM)
• Voice of the Employee (VOE)
• Brand Management
• Financial Markets
• Politics
• Government Intelligence
• Other Interesting Areas: website design, ad placement,
review-oriented search engines

Sentiment Analysis Process (1 of 3)

Sentiment Analysis Process (2 of 3)
• Step 1: Sentiment Detection
– Also called detection of objectivity
▪ Fact [= objectivity] versus Opinion [= subjectivity]
• Step 2: N-P Polarity Classification
– Given an opinionated piece of text, the goal is to
classify the opinion as falling under one of two
opposing sentiment polarities
▪ N [= negative] versus P [= positive]
– Using a lexicon (e.g., VADER, others)
– Using a collection of labeled training documents

Sentiment Analysis Process (3 of 3)
• Step 3: Target Identification
– Accurately identify the target of the expressed
sentiment (e.g., a person, a product, an event, etc.)
▪ Level of difficulty depends on the application domain
• Step 4: Collection and Aggregation
– Once the sentiments of all text data points in the
document are identified and calculated, they are
aggregated
▪ Word → Statement → Paragraph → Document
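
The aggregation chain in Step 4 (word, statement, paragraph, document) can be sketched as nested averages. Both the word scores and the choice of a plain mean at every level are assumptions made for illustration. Real systems may weight or combine levels differently.

```python
# Hypothetical word-level polarity scores (not taken from any published lexicon).
WORD_SCORES = {"great": 2, "love": 3, "slow": -1, "broken": -3, "fine": 1}

def mean(values):
    """Arithmetic mean; an empty input counts as neutral (0.0)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def statement_score(statement):
    """Average the scores of the opinion words found in one statement."""
    words = statement.lower().replace(".", "").split()
    return mean(WORD_SCORES[w] for w in words if w in WORD_SCORES)

def paragraph_score(paragraph):
    """Average statement scores; statements are separated by periods."""
    statements = [s for s in paragraph.split(".") if s.strip()]
    return mean(statement_score(s) for s in statements)

def document_score(document):
    """Average paragraph scores; paragraphs are separated by blank lines."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return mean(paragraph_score(p) for p in paragraphs)

review = (
    "I love the screen. The battery is great.\n\n"
    "Shipping was slow. The case arrived broken."
)
print(document_score(review))
```

Here the positive first paragraph and the negative second paragraph nearly cancel, leaving a slightly positive document score.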

P-N Polarity and S-O Polarity

Sentiment Analysis Example using RapidMiner

Methods for Polarity Identification (1)
• Using a lexicon as a reference library
▪ AFINN: a list of English words rated for valence with an integer between -5
(negative) and +5 (positive).
▪ SentiWordNet: a lexical resource for sentiment analysis that assigns each WordNet
synset three scores (positivity, negativity, and objectivity), each between 0 and 1
and summing to 1.
▪ VADER (Valence Aware Dictionary and sEntiment Reasoner): a lexicon and
rule-based sentiment analysis tool that is specifically tuned to work well on social media
text.
▪ NRC Emotion Lexicon: a list of English words and their
associations with eight basic emotions (anger, fear, anticipation, trust, surprise,
sadness, joy, and disgust) and two sentiments (negative and positive).
▪ Hu and Liu Lexicon: a list of English words that have been
manually labeled as positive or negative by Hu and Liu in 2004.
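
A lexicon-based polarity check can be sketched as a dictionary lookup. The entries below are made-up values on an AFINN-style scale of -5 to +5, not copied from any of the lexicons above, and the negation rule (flip the sign of a word that directly follows a negator) is a simplifying assumption rather than how any particular tool works.

```python
import re

# Toy AFINN-style lexicon: integer valence from -5 to +5.
# These entries are illustrative assumptions, not copied from AFINN.
LEXICON = {"good": 3, "excellent": 4, "bad": -3, "terrible": -4, "problem": -2}
NEGATORS = {"not", "no", "never"}

def polarity(text):
    """Sum word valences, flipping the sign of a word that follows a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        valence = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            valence = -valence
        score += valence
    if score > 0:
        return score, "P"
    if score < 0:
        return score, "N"
    return score, "neutral"

print(polarity("The food was excellent, no problem at all"))
print(polarity("Not good. The service was terrible."))
```

Words missing from the lexicon contribute nothing, which is the main weakness of this approach on domain-specific vocabulary.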

Methods for Polarity Identification (2)
• Using a collection of Training Documents
– Text Retrieval Conference
– Technology Insights 7.2
▪ Congressional Floor-Debate Transcripts
▪ Economining
▪ Cornell Movie Review Data Sets
▪ Stanford AI
▪ MPQA Corpus
▪ Multiple-Aspect Restaurant Review
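
Learning polarity from labeled training documents can be sketched with a small multinomial naive Bayes classifier. Naive Bayes is one possible learner chosen here for brevity, not a method named in the slides. The four training sentences are invented and far smaller than any of the collections listed above.

```python
import math
from collections import Counter

# Hypothetical labeled training documents (P = positive, N = negative).
TRAINING = [
    ("great acting and a great story", "P"),
    ("loved the wonderful soundtrack", "P"),
    ("boring plot and terrible acting", "N"),
    ("terrible pacing in a boring film", "N"),
]

def train(examples):
    """Count words per class for a multinomial naive Bayes model."""
    word_counts = {"P": Counter(), "N": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        word_counts[label].update(text.split())
        doc_counts[label] += 1
    vocab = set(word_counts["P"]) | set(word_counts["N"])
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("P", "N"):
        score = math.log(doc_counts[label] / total_docs)
        class_total = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word] + 1
                score += math.log(count / (class_total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING)
print(classify("a great story", model))
print(classify("boring and terrible", model))
```

Unlike the lexicon approach, the word weights here come entirely from the labeled examples, so the quality of the training collection determines the quality of the classifier.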

Web Mining Overview
• The Web is the largest repository of data
• Data is in HTML, XML, and text formats
• Challenges (of processing Web data)
– Too much content
– Too complex
– Too dynamic
– Not specific to a domain
– The Web has everything

Web Mining
Web mining (or Web data mining) is the process of
discovering intrinsic relationships from Web data (textual,
linkage, or usage)

Web Content/Structure Mining
• Web Content Mining
– Mining the textual content on the Web
– Data collection via Web crawlers
• Web Structure Mining
– Web pages include hyperlinks
▪ Authoritative pages
▪ Hubs
▪ Hyperlink-induced topic search (HITS) algorithm
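
The hub and authority idea behind HITS can be sketched as a short iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The link graph and the fixed iteration count below are assumptions for illustration.

```python
import math

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "portal": ["news", "docs", "blog"],
    "directory": ["news", "docs"],
    "news": [],
    "docs": [],
    "blog": ["news"],
}

def hits(links, iterations=50):
    """Iteratively update authority and hub scores with L2 normalization."""
    pages = sorted(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

auth, hub = hits(LINKS)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
print(top_authority, top_hub)
```

In this toy graph the page with the most incoming links from good hubs becomes the top authority, and the page that links to the most authorities becomes the top hub.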

Web Usage Mining (1 of 2)
• Extraction of information from data generated through
Web page visits and transactions…
– data stored in server access logs, referrer logs, agent
logs, and client-side cookies
– user characteristics and usage profiles
– metadata, such as page attributes, content attributes,
and usage data
• Clickstream data
• Clickstream analysis
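
A first step in clickstream analysis is usually to group raw log records into visitor sessions. The sketch below assumes a simplified log of (visitor, timestamp, page) tuples and a 30-minute inactivity timeout. Both the record layout and the timeout are illustrative assumptions, not a standard log format.

```python
from collections import Counter

# Hypothetical server-log records: (visitor id, timestamp in seconds, page).
LOG = [
    ("v1", 0, "/home"), ("v1", 40, "/products"), ("v1", 95, "/cart"),
    ("v2", 10, "/home"), ("v2", 30, "/products"),
    ("v1", 4000, "/home"), ("v1", 4020, "/support"),
]
SESSION_GAP = 1800  # assumed inactivity timeout (30 minutes)

def sessionize(log, gap=SESSION_GAP):
    """Group each visitor's clicks into sessions split by long idle gaps."""
    sessions = []
    last_seen = {}  # visitor -> (timestamp of last click, session index)
    for visitor, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(visitor)
        if previous is None or ts - previous[0] > gap:
            sessions.append({"visitor": visitor, "clicks": []})
            index = len(sessions) - 1
        else:
            index = previous[1]
        sessions[index]["clicks"].append((ts, page))
        last_seen[visitor] = (ts, index)
    return sessions

sessions = sessionize(LOG)
page_views = Counter(page for _, _, page in LOG)
click_paths = [" > ".join(p for _, p in s["clicks"]) for s in sessions]
time_on_site = [s["clicks"][-1][0] - s["clicks"][0][0] for s in sessions]
print(page_views.most_common(1), click_paths, time_on_site)
```

Page views, click paths, and time on site all fall out of the sessionized data, which is why sessionization typically comes before any usage mining.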

Web Usage Mining (2 of 2)
• Web usage mining applications
– Determine the lifetime value of clients
– Design cross-marketing strategies across products
– Evaluate promotional campaigns
– Target electronic ads and coupons at user groups
based on user access patterns
– Predict user behavior based on previously learned
rules and users’ profiles
– Present dynamic information to users based on their
interests and profiles

Web Usage Mining
(Clickstream Analysis)

Web Analytics Metrics
Web Site Usability
• Page views
• Time on site
• Downloads
• Click map
• Click paths
Traffic Source
• Referral Web sites
• Search engines
• Direct
• Offline campaigns
• Online campaigns

Web Analytics Metrics
Visitor Profiles
• Keywords
• Content groupings
• Geography
• Time of day
• Landing page profiles
Conversion Statistics
• New visitors
• Returning visitors
• Leads
• Sales/conversions
• Abandonment/exit rate

A Sample Web Analytics Dashboard

Social Analytics –
Social Network Analysis
• A social network is a social structure composed of
individuals linked to each other
• Social network analysis helps study relationships between
individuals, groups, organizations, and societies
– Self-organizing
– Emergent
– Complex
• Typical social network types
– Communication networks, community networks,
criminal networks, innovation networks, …

Social Network Analysis Metrics
• Connections
• Distribution
– Bridge
– Centrality
– Density
– Distance
– Structural holes
• Segmentation
– Cliques and social circles
– Clustering coefficient
– Cohesion
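
Three of the metrics above (density, degree centrality as one form of centrality, and the clustering coefficient) can be computed directly from an adjacency structure. The five-person network below is hypothetical, and the formulas are the standard ones for an undirected graph.

```python
from itertools import combinations

# Hypothetical undirected friendship network (adjacency sets).
NETWORK = {
    "ann": {"bob", "cara", "dan"},
    "bob": {"ann", "cara"},
    "cara": {"ann", "bob"},
    "dan": {"ann", "eve"},
    "eve": {"dan"},
}

def density(graph):
    """Actual ties divided by the number of possible ties."""
    n = len(graph)
    edges = sum(len(nbrs) for nbrs in graph.values()) / 2
    return edges / (n * (n - 1) / 2)

def degree_centrality(graph):
    """Share of the other nodes each node is directly tied to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

def clustering_coefficient(graph, node):
    """Fraction of a node's neighbor pairs that are themselves tied."""
    nbrs = sorted(graph[node])
    pairs = list(combinations(nbrs, 2))
    if not pairs:
        return 0.0
    return sum(b in graph[a] for a, b in pairs) / len(pairs)

print(density(NETWORK))
print(degree_centrality(NETWORK))
print(clustering_coefficient(NETWORK, "ann"))
```

In this network "ann" has the highest degree centrality, while the tie between "dan" and "eve" acts as a bridge: removing it would disconnect "eve" from everyone else.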
Social Analytics –
Social Media Analytics
• Systematic and scientific ways to consume the vast
amount of social media content
• Tools and techniques for improving an
organization’s competitiveness
• Tools to measure social media impact:
– Descriptive analytics: simple statistics on activity
characteristics and trends
– Social network analysis: links, influences, etc.
– Advanced analytics: predictive and text analytics on
contents, etc.