TEXT ANALYTICS
C Sudhakar
CEO
Raskey Software Solutions Ltd
Email: [email protected]
Web: www.raskeysoft.com
SMART ANALYTICS
Start with strategy
Measure Metrics and Data
Apply analytics
Report results
Transform Business
TYPES OF ANALYTICS
Data analytics
Compete on Analytics
Text analytics
Video analytics
Social networking analytics
Web analytics
Speech analytics
TEXT ANALYTICS
Text analytics is the process of analyzing
unstructured text, extracting relevant
information, and transforming it into useful
business intelligence.
Text analysis is now capable of telling us things
we did not already know and, perhaps more
importantly, had no way of knowing before.
Access to huge text data sets and improved
technical capability mean we can now mine the
text for patterns and trends that can be
incredibly useful in business.
TEXT ANALYTICS TASKS INCLUDE
Text categorization
Text clustering
Concept extraction
Sentiment analysis
Document summarization
TEXT CATEGORIZATION
Text categorization applies some structure to
the text, which can then be used for analysis
or query.
Text categorization assigns a document to one or
more classes or categories according to the
subject or according to other attributes such
as document type, author, creation date, etc.
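A minimal sketch of subject-based categorization, assuming simple keyword rules. The category names and keyword lists are hypothetical and are not taken from the system described in these slides.

```python
import re

# Hypothetical categories and keywords, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "networking": {"network", "wifi", "firewall", "router"},
    "storage": {"storage", "backup", "disk"},
}

def categorize(text):
    """Return every category whose keyword set overlaps the words of the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, kws in CATEGORY_KEYWORDS.items() if words & kws)

print(categorize("Upgrade the network firewall and backup storage"))
```

A document can land in more than one category, or in none when no keyword matches.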
TEXT CLUSTERING
As the name suggests, text clustering
automatically groups huge
repositories of text into meaningful topics or
categories for fast information retrieval or
filtering.
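One way to picture clustering is a single pass that puts each document into the first cluster whose representative shares enough words with it. This is only a sketch: the Jaccard measure and the 0.3 threshold are assumptions, and production systems typically use richer features.

```python
import re

def tokens(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.3):
    """Group document indices; the first index in each cluster is its representative."""
    clusters = []
    for i, doc in enumerate(docs):
        doc_tokens = tokens(doc)
        for members in clusters:
            if jaccard(doc_tokens, tokens(docs[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([i])
    return clusters
```

Unlike categorization, no category list is supplied up front; the groups emerge from the documents themselves.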
CONCEPT EXTRACTION
Concept extraction allows you to extract concepts
from text.
Meaning varies with concept.
SENTIMENT ANALYSIS
Sentiment analysis (also known as opinion mining) refers to the use of
natural language processing, text analysis and computational linguistics
to identify and extract subjective information in source materials.
An important part of our information-gathering behavior has always
been to find out what other people think. With the growing availability
and popularity of opinion-rich resources such as online review sites and
personal blogs, new opportunities and challenges arise as people now
can, and do, actively use information technologies to seek out and
understand the opinions of others. The sudden eruption of activity in the
area of opinion mining and sentiment analysis, which deals with the
computational treatment of opinion, sentiment, and subjectivity in text,
has thus occurred at least in part as a direct response to the surge of
interest in new systems that deal directly with opinions as a first-class
object.
The basic purpose of sentiment analysis is to classify the polarity of any
given text as positive, negative or neutral. The result can also be expressed
as a star classification or a scale classification.
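Polarity classification can be sketched with a small word lexicon. The word lists below are hypothetical, and counting words ignores negation and context, so treat this as an illustration rather than a working classifier.

```python
import re

# Hypothetical mini-lexicons; real sentiment lexicons are far larger.
POSITIVE = {"nice", "cool", "clear", "good"}
NEGATIVE = {"mad", "expensive", "bad"}

def polarity(text):
    """Classify text as positive, negative or neutral by counting lexicon words."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("It was such a nice phone."))
```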
EXAMPLE
(1) I bought an iPhone 2 days ago.
(2) It was such a nice phone.
(3) The touch screen was really cool.
(4) The voice quality was clear too.
(5) However, my mother was mad with me as I did
not tell her before I bought it.
(6) She also thought the phone was too expensive,
and wanted me to return it to the shop.
The first thing that we may notice is that there are
several opinions in this review.
ANALYSIS
Sentences (2), (3) and (4) express three positive opinions, while
sentences (5) and (6) express negative opinions.
Then we also notice that the opinions all have some targets on which they are
expressed.
The opinion in sentence (2) is on iPhone as a whole,
the opinions in sentences (3) and (4) are on the touch screen and voice
quality features of iPhone respectively.
The opinion in sentence (6) is on the price of iPhone, but the opinion/emotion in
sentence (5) is on me, not iPhone.
This is an important point.
In an application, the user may be interested in opinions on certain targets, but
not on all (e.g., unlikely on me).
Finally, we may also notice the sources or holders of opinions.
The source or holder of the opinions in sentences (2), (3) and (4) is the author of
the review (I), but in sentences (5) and (6) it is my mother.
With this example in mind, we can define sentiment
OBJECT AND FEATURE
In general, opinions can be expressed on
any target entity, e.g., a product, a service,
an individual, an organization, or an event.
We use the term object to denote the target
entity that has been commented on.
An object can have a set of components (or
parts) and a set of attributes (or properties)
[1, 4], which we collectively call the features
of the object.
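The object, feature, holder and polarity of each opinion in the iPhone review can be written down as records. The `Opinion` class and its field names are assumptions made for illustration; the values follow the analysis of the review given in these slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Opinion:
    sentence: int   # sentence number in the review
    obj: str        # the object (target entity) commented on
    feature: str    # component or attribute; "" means the object as a whole
    holder: str     # who holds the opinion
    polarity: str   # "positive" or "negative"

REVIEW = [
    Opinion(2, "iPhone", "", "author", "positive"),
    Opinion(3, "iPhone", "touch screen", "author", "positive"),
    Opinion(4, "iPhone", "voice quality", "author", "positive"),
    Opinion(5, "me", "", "mother", "negative"),
    Opinion(6, "iPhone", "price", "mother", "negative"),
]

def opinions_on(opinions, obj):
    """Keep only the opinions whose target object matches obj."""
    return [o for o in opinions if o.obj == obj]

for o in opinions_on(REVIEW, "iPhone"):
    print(o.sentence, o.feature or "(whole object)", o.holder, o.polarity)
```

Filtering by object drops sentence (5), whose target is the reviewer and not the phone.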
TECHNICAL CHALLENGES
Object Identification
Feature grouping and synonym grouping
Opinion orientation classification
Integration
Identification of spam reviews/ documents
CLASSIFICATION
Document-level sentiment analysis;
Sentence-level sentiment analysis;
Aspect-based sentiment analysis;
Comparative sentiment analysis; and,
Sentiment lexicon acquisition.
DOCUMENT SUMMARIZATION
Again, as the name suggests, this text analytics
tool allows you to automatically summarize
documents while retaining the most important
points from the original document.
Extraction
Abstraction
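Of the two approaches, extraction is the simpler to sketch: score each sentence and keep the top ones unchanged. The word-frequency scoring below is an assumption and only one of many possible scoring schemes; abstraction, which generates new sentences, is not shown.

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # A sentence scores the summed document frequency of its words.
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in top]
```

Summing frequencies favors long sentences; dividing the score by sentence length is a common adjustment.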
SUMMARY
Text Analytics is particularly useful for
information retrieval, pattern recognition,
tagging and annotation, information
extraction, sentiment assessment and
predictive analytics.
A REAL TIME PROCESS
SMALL EXAMPLE IN AI
THIS APPROACH WORKS IN CASE OF BOUNDED GROUND
CURATOR ENGINE INTELLIGENCE ENGINES
Domain Intelligence
Extraction Engine
Context Intelligence
Keyword Intelligence
Intent Analysis Engine
Lead Validity Intelligence
Positive Opportunity
DOMAIN INTELLIGENCE
[Diagram: a document's Url & Name is checked against Url / Name patterns and Dmoz / Jigsaw data, and the document is marked Positive, Negative or Unsure. Unsure covers documents that are both positive and negative, or neither positive nor negative.]
Challenges
Insufficient domain knowledge: More elimination can be achieved with
more domain knowledge from source.
Solution
Insufficient domain knowledge: SLED crawler and domain classification
should provide more knowledge.
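The Url / Name pattern check can be sketched as below. The patterns are hypothetical placeholders (the actual patterns are not given in these slides), and the Dmoz / Jigsaw lookup is omitted.

```python
import re

# Hypothetical patterns, for illustration only.
POSITIVE_PATTERNS = [re.compile(r"\.gov\b"), re.compile(r"\.edu\b")]
NEGATIVE_PATTERNS = [re.compile(r"jobs"), re.compile(r"news")]

def classify_domain(url_or_name):
    """Mark a document positive, negative or unsure from its URL or name."""
    s = url_or_name.lower()
    pos = any(p.search(s) for p in POSITIVE_PATTERNS)
    neg = any(p.search(s) for p in NEGATIVE_PATTERNS)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    # Both kinds matched, or neither did.
    return "unsure"

print(classify_domain("www.example.edu/it-plan.pdf"))
```

Documents marked unsure would be left for the later engines to decide.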
EXTRACTION ENGINE
[Diagram labels: Document; Old Document; New Document; Parser (Tika and Pdf2Xml); PdfMiner; Text, Xml and Metadata]
Challenges
Non visible characters: Raise exceptions or misinterpretation (2%).
Example: the word "schools" is extracted in an altered form, which changes the meaning.
Parser failures: PdfMiner is an accurate parser but fails at times (10%).
Solutions
Parser failures: Using Tika and Pdf2Xml as a combination reduces context
leakage.
CONTEXT INTELLIGENCE
[Diagram: Parser -> Document Titles and Headers -> Positive, Unsure or Negative]
Challenges
Ambiguous Context: Misleads decision.
Example: job posting inside an agenda.
Insufficient Context: Context away from keyword location or missing.
Solutions
Insufficient Context: Extract context from various locations:
information from source, directory information, domain intelligence,
etc.
KEYWORD INTELLIGENCE
[Diagram: Parser -> Context Around Keyword: Paragraphs, Bullet Points, Tables]
Challenges
Identification of keyword phrases: Reduces data leakage.
Keyword specific intelligence: Negative extensions, support words, etc.
Examples: free wifi, wireless mouse, network security policy.
Solutions
Keyword specific intelligence: Manually collected for popular keywords.
Use statistical bigram approach for other keywords.
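A statistical bigram approach could start by counting the word pairs in which a keyword appears, so that frequent phrases such as "free wifi" surface as candidates. This is a sketch under that assumption; the slides do not say how the bigram statistics are actually used.

```python
import re
from collections import Counter

def keyword_bigrams(docs, keyword):
    """Count adjacent word pairs that contain the keyword, across all docs."""
    counts = Counter()
    for doc in docs:
        words = re.findall(r"[a-z]+", doc.lower())
        for left, right in zip(words, words[1:]):
            if keyword in (left, right):
                counts[(left, right)] += 1
    return counts

docs = ["Free wifi in the lobby", "free wifi for guests", "upgrade wifi network"]
print(keyword_bigrams(docs, "wifi").most_common(1))
```

Frequent pairs can then be reviewed and added as negative extensions or support words for that keyword.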
INTENT ANALYSIS
[Diagram: Context Around Keyword
Paragraph: Direct Relation, Indirect Relation
Bullet Point: Header Analysis, Bullet Point Analysis
Table: Row Analysis, Header Analysis]
INTENT ANALYSIS CHALLENGES
Human Ambiguity
Improved productivity and streamlined IT infrastructure through file
storage capabilities
The plan includes providing sufficient network capacity (This sentence
is present in an analysis document from a writer)
Machine Ambiguity
Authorize a purchase of storage area network equipment - keyword
is network equipment
The technology director shall enhance awareness regarding network
security
Solution
Experimenting by building probabilistic language models.
INTENT ANALYSIS CHALLENGES
Stanford Mistakes
The Stanford software we are using sometimes builds wrong relations.
Ex: In "IT Infrastructure", IT is identified as "it".
Solution
Replace the keyword with a generic keyword before parsing it with
Stanford. The generic keyword should not spoil the relations.
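The replacement step might look like the sketch below. The placeholder word "product" is a hypothetical choice, and the call to the parser itself is not shown.

```python
import re

def mask_keyword(sentence, keyword, placeholder="product"):
    """Swap the keyword for a generic noun before the sentence is parsed."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return pattern.sub(placeholder, sentence)

print(mask_keyword("Upgrade the IT Infrastructure this year", "IT Infrastructure"))
```

After parsing, the placeholder is mapped back to the original keyword when reading off the relations.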
Indirect buying decision
Information security is recognized as a top management challenge
for the department
OTHER CHALLENGES
Noisy Keywords
Keywords like vmware, firewall and gis contribute a lot of
noise.
Unavoidable: these keywords also contribute towards
positives.
Noisy Domains
Domains like itdashboard.gov contribute a lot of noise.
Contributed 22% noise to Tegile leads in June.
Duplicates
Documents from the same domain appear multiple times, contributing
to duplicate documents.
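One possible way to drop such duplicates is to keep the first document for each combination of domain and normalized text. The key used here is an assumption, not the method described in these slides.

```python
import hashlib
from urllib.parse import urlparse

def drop_duplicates(docs):
    """Keep the first (url, text) pair per domain and normalized text."""
    seen = set()
    unique = []
    for url, text in docs:
        # Collapse whitespace and case so trivial differences do not matter.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        key = (urlparse(url).netloc.lower(), digest)
        if key not in seen:
            seen.add(key)
            unique.append((url, text))
    return unique
```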
POSITIVE MARKED DOCUMENTS
[Bar chart: positive marked documents in May and June, by outcome: Lost Business, Wrong Context, Rejected by reviewer, Approved. Data labels: 40%, 37%, 32%, 28%, 19%, 13%, 12%, 8%.]
LEAD VALIDITY INTELLIGENCE
False Positives: Lost Business, Low Budget, Wrong Industry, Too Early, Others
Challenges
Duplicates
Company specific constraints: Campus Management requires only
higher education leads.
Identifying budget constraints: e.g. < $10k
Solution
Implemented patterns to identify lost business.
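Pattern-based checks of this kind can be sketched with regular expressions. The lost-business phrases below are hypothetical (the actual patterns are not given in these slides); the $10k threshold follows the budget example above.

```python
import re

# Hypothetical phrases suggesting the purchase has already been made.
LOST_BUSINESS_PATTERNS = [
    re.compile(r"\bcontract (was )?awarded to\b", re.IGNORECASE),
    re.compile(r"\bpurchase (was )?completed\b", re.IGNORECASE),
]
# Dollar amounts such as $8,500 or $25k.
AMOUNT = re.compile(r"\$\s?(\d[\d,]*)(k)?", re.IGNORECASE)

def is_lost_business(text):
    """True if any lost-business phrase occurs in the text."""
    return any(p.search(text) for p in LOST_BUSINESS_PATTERNS)

def is_low_budget(text, minimum=10_000):
    """True if the text states amounts and the largest is below the minimum."""
    amounts = []
    for digits, k in AMOUNT.findall(text):
        value = int(digits.replace(",", ""))
        amounts.append(value * 1000 if k else value)
    return bool(amounts) and max(amounts) < minimum
```

A lead is flagged only when an amount is actually stated; text with no dollar figure is not treated as low budget.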
AFTER APPLYING LOST BUSINESS PATTERNS
[Bar chart: Identified L.B and Not identified L.B (lost business) for Juniper (55/120), Google (30/150) and Tegile (19/55). Data labels: 79%, 65%, 57%, 43%, 35%, 21%.]
CURRENTLY COMPANIES ARE WORKING ON
Probabilistic Language Models
Build semi supervised language models to handle machine
ambiguity.
Develop a diversified language based dataset for training.
Driver Based Patterns
Develop patterns specific to driver word.
Eg:
Provide: Specifies intent of an action
Provides: Specifies intent of solution/service
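The two driver words above can be held in a lookup table that maps each word form to the intent it signals. The table layout and function name are assumptions for illustration.

```python
import re

# Driver word forms and the intent each one signals, from the example above.
DRIVER_INTENT = {
    "provide": "action",
    "provides": "solution/service",
}

def driver_intents(sentence):
    """List (driver word, intent) pairs found in the sentence, in order."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [(w, DRIVER_INTENT[w]) for w in words if w in DRIVER_INTENT]

print(driver_intents("The district will provide network capacity"))
```

Because "provide" and "provides" carry different intents, the lookup uses the exact word form instead of a stemmed one.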
Keyword Intelligence
Methodologies to derive and handle keyword phrases.
Start with manually adding keyword phrases and slowly
move towards an automated system.
THANK YOU