
10 - Session 10 - Text Analytics, Text Mining and Sentiment Analysis


Text Analytics, Text Mining and Sentiment Analysis
Dealing with Text
• Data are represented in ways natural to problems from which they were
derived

• A vast amount of data exists as text

• If we want to apply the many data mining tools that we have at our disposal,
we must
• either engineer the data representation to match the tools
(representation engineering), or
• build new tools to match the data
Text is “unstructured”
• Linguistic structure is intended for human communication, not for computers

• Word order matters sometimes

• Text can be dirty
• People write ungrammatically, misspell words, abbreviate unpredictably, and punctuate randomly
• Synonyms, homographs, abbreviations, etc.

• Context matters
Text Representation
• Goal: Take a set of documents (each of which is a relatively free-form sequence of words) and turn it into our familiar feature-vector form

• A collection of documents is called a corpus

• A document is composed of individual tokens or terms

• Each document is one instance, but we don’t know in advance what the features will be
“Bag of Words”
• Treat every document as just a collection of individual words
• Ignore grammar, word order, sentence structure, and (usually)
punctuation
• Treat every word in a document as a potentially important keyword of the
document

• What will be the feature’s value in a given document?


• Each token is a feature whose value is one (if the token is present in the document) or zero (if the token is not present in the document)

• Straightforward representation

• Inexpensive to generate

• Tends to work well for many tasks
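
The binary representation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two-document corpus and the helper names `tokenize` and `binary_bag_of_words` are invented for the example, and tokenization is assumed to be a simple split into lowercase alphabetic words.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def binary_bag_of_words(documents):
    """Return (vocabulary, rows); each row holds a 1/0 flag per vocabulary term."""
    token_sets = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*token_sets))
    rows = [[1 if term in tokens else 0 for term in vocabulary]
            for tokens in token_sets]
    return vocabulary, rows

# Hypothetical corpus, invented for illustration.
corpus = ["Jazz is played on the saxophone", "The saxophone is loud"]
vocabulary, rows = binary_bag_of_words(corpus)
for row in rows:
    print(dict(zip(vocabulary, row)))
```

Note that the features (the vocabulary) are only known after the whole corpus has been scanned, which matches the point that the features are not known in advance.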


Pre-processing of Text
The following steps should be performed:

• The case should be normalized
• Every term is converted to lowercase

• Words should be stemmed
• Suffixes are removed
• E.g., noun plurals are transformed to singular forms

• Stop-words should be removed
• A stop-word is a very common word in English (or whatever language is being parsed)
• Typical stop-words such as "the", "and", "of", and "on" are removed
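
The three pre-processing steps can be sketched as follows. This is a rough illustration under stated assumptions: `STOP_WORDS` is a tiny made-up list, and `naive_stem` is a crude plural-stripping stand-in invented for this example, not a real stemming algorithm such as Porter's. For simplicity the sketch drops stop-words before stemming.

```python
import re

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "and", "of", "on", "a", "in", "is"}

def naive_stem(word):
    """Crude plural stripping; a stand-in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # biographies -> biography
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # musicians -> musician
    return word

def preprocess(text):
    """Normalize case, drop stop-words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The Musicians and the Biographies of Jazz"))
```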
Term Frequency

• Use the word count (frequency) in the document instead of just a zero or one
• Differentiates between how many times a word is used
Normalized Term Frequency
• Documents of various lengths

• Words of different frequencies


• Words should not be too common or too rare
• Both upper and lower limit on the number (or fraction) of documents in
which a word may occur
• Feature selection is often employed

• The raw term frequencies are normalized in some way,
• such as by dividing each by the total number of words in the document
• or by the frequency of the specific term in the corpus
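
A minimal sketch of two ideas from this slide: normalizing term counts by document length, and keeping only terms whose document frequency lies between a lower and an upper limit. The function names, the parameters `min_docs` and `max_docs`, and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def normalized_tf(tokens):
    """Term counts divided by the total number of tokens in the document."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}

def filter_by_document_frequency(tokenized_docs, min_docs, max_docs):
    """Keep terms that occur in at least min_docs and at most max_docs documents."""
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term for term, df in doc_freq.items() if min_docs <= df <= max_docs}

# Hypothetical, already pre-processed documents.
docs = [
    ["jazz", "saxophone", "jazz", "bebop"],
    ["jazz", "piano"],
    ["jazz", "saxophone", "latin"],
]
tf_first = normalized_tf(docs[0])
kept_terms = filter_by_document_frequency(docs, min_docs=2, max_docs=2)
print(tf_first)
print(kept_terms)
```

Here "jazz" is dropped as too common (it occurs in every document) and "bebop", "piano", and "latin" as too rare, so only "saxophone" survives the filter.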
TF-IDF

$$\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

• Inverse Document Frequency (IDF) of a term

$$\mathrm{IDF}(t) = 1 + \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)$$
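
The two formulas can be computed directly, as in the sketch below. It assumes the natural logarithm (the slide does not state the base) and uses the raw count as TF(t, d); the function names and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def idf(term, tokenized_docs):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    containing = sum(1 for doc in tokenized_docs if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(tokenized_docs) / containing)

def tfidf(term, doc, tokenized_docs):
    """TFIDF(t, d) = TF(t, d) x IDF(t), with the raw count as TF."""
    return Counter(doc)[term] * idf(term, tokenized_docs)

# Hypothetical, already pre-processed documents.
docs = [
    ["jazz", "saxophone", "jazz", "bebop"],
    ["jazz", "piano"],
    ["jazz", "saxophone", "latin"],
]
print(tfidf("jazz", docs[0], docs))   # frequent in the document, but present everywhere
print(tfidf("bebop", docs[0], docs))  # occurs once, but only in this document
```

In this toy corpus the single occurrence of "bebop" outweighs the two occurrences of "jazz", because "jazz" appears in every document and so gets the minimum IDF of 1.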
TFIDF

Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.
Example: Jazz Musicians
• 15 prominent jazz musicians and excerpts of their biographies from
Wikipedia

• Nearly 2,000 features after stemming and stop-word removal!

• Consider the sample phrase “Famous jazz saxophonist born in Kansas who played bebop and latin”
Example: Jazz Musicians

Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.
Beyond “Bag of Words”
• N-gram Sequences

• Named Entity Extraction

• Topic Models
N-gram Sequences
• In some cases, word order is important and you want to preserve
some information about it in the representation

• A next step up in complexity is to include sequences of adjacent words as terms

• Adjacent pairs are commonly called bi-grams

• Example: “The quick brown fox jumps”
• It would be transformed into {quick, brown, fox, jumps, quick_brown, brown_fox, fox_jumps}

• N-grams greatly increase the size of the feature set
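
A small sketch that reproduces the bi-gram example. It assumes the stop-word "the" has already been removed and the text lowercased; the function names are illustrative.

```python
def ngrams(tokens, n):
    """Join each run of n adjacent tokens with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unigrams_and_bigrams(tokens):
    """Single words plus adjacent pairs, as in the slide's example."""
    return tokens + ngrams(tokens, 2)

tokens = ["quick", "brown", "fox", "jumps"]  # "the" already removed as a stop-word
features = unigrams_and_bigrams(tokens)
print(features)
```

Four words already yield seven features, which illustrates how quickly N-grams enlarge the feature set.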


Topic Models

Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.
Text Mining Example
Task: predict the stock market based on the stories that appear on the
news wires
Mining News Stories to Predict Stock Price Movement

Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.
Text Mining
In general, the difference between text mining and text analytics is that text analytics is the broader concept: it includes information retrieval (for example, searching for and identifying the documents relevant to a given set of key terms) as well as information extraction, data mining, and Web mining.
Text Analytics = Information Retrieval + Information Extraction + Data Mining + Web Mining
Or
Text Analytics = Information Retrieval + Text Mining

Text mining is the semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources. Text mining is similar to data mining in that it has the same purpose and uses the same processes, but with text mining the input to the process is a collection of unstructured (or less structured) data files such as Word documents, PDF files, text excerpts, XML files, and so on.
Text Mining
Text mining is in high demand and pays off in fields that produce very large amounts of textual data, such as law (court orders), academic research (research articles), and finance (quarterly reports).

For example, free-form text-based interactions with customers in the form of complaints (or praise) and warranty claims can be used to objectively identify product and service characteristics that are considered less than perfect, and can serve as input for better product development and service allocation.

• Information extraction. Identify key phrases and relationships in text by searching for predefined objects and sequences in text through pattern matching. The most common form of information extraction
• Topic tracking. Based on a user's profile and the documents the user has viewed, text mining can predict other documents that are of interest to the user.
• Summarization. Summarize documents to save readers time.
• Categorization. Identify the main themes of a document and then place the document into a set of predetermined categories based on these themes.
• Clustering. Group similar documents without having a predefined set of categories.
• Concept linking. Link related documents by identifying similar concepts.
• Question answering. Find the best answer to a given question through knowledge-based pattern matching.
NATURAL LANGUAGE PROCESSING
Natural Language Processing (NLP) is a subfield of artificial intelligence and computational
linguistics that studies the problem of "understanding" natural human language. The
goal is to convert human language descriptions (such as text documents) into more formal
representations (in the form of numerical and symbolic data) that are easier to manipulate
by computer programs.

NLP is closely related to text mining because NLP allows feature extraction
from unstructured text so that various data mining techniques can be used
to extract knowledge (new and useful patterns and relationships) from the
text. In simple terms, text mining is a combination of NLP and data mining,
where NLP provides the foundation for understanding and analyzing text in
depth.
NATURAL LANGUAGE PROCESSING
The benefits of NLP include the ability to generate automatic summaries of text,
translate text from one language to another, recognize sentiment in text, and
more. However, NLP is also faced with several challenges, such as:

• Text Segmentation: Written languages such as Chinese, Japanese, and Thai do not mark word boundaries, which makes identifying individual words difficult.
• Interpreting Word Meanings: Many words have more than one meaning, so choosing the correct meaning requires consideration of context.
• Syntactic Ambiguity: Grammar in natural languages is often ambiguous, requiring the incorporation of semantic and contextual information to select the appropriate sentence structure.
• Imperfect Input: Foreign accents and speech errors in spoken language, or typographical errors in text, make language processing more difficult.
• Speech Acts: A sentence can often be thought of as an action by the speaker, and that action cannot always be determined from sentence structure alone.
TEXT MINING APPLICATIONS
Some applications of text mining in marketing include:
1. Sentiment Analysis: Analyzing customer sentiment towards a product or service through unstructured data such as user reviews.
2. Customer Relationship Management (CRM): Using text data to predict customer behavior and improve customer retention.
3. Product Development: Analyzing product attributes to optimize assortment, product recommendations, and supplier selection.
Text mining can be used in security and counter-terrorism through:
1. Surveillance Systems: ECHELON, for example, is able to identify the content of phone calls and emails in order to track suspicious communications.
2. Intelligence Analysis: EUROPOL, the FBI, and the CIA use text mining to analyze data to track organized crime activities.
3. Fraud Detection: Developing predictive models that distinguish deceptive statements from truthful ones based on text data and voice recordings.
The figure shows a high-level context diagram of a typical text mining process. The diagram shows the scope of the process, emphasizing its interfaces with the larger environment. In essence, it draws a boundary around the process to clearly identify what is (and is not) included in text mining.
01 Establish the Corpus
Collect all related documents such as text, XML files, emails, web pages, short notes, and voice recordings. Next, convert them into a uniform format, for example ASCII text files, so that they can be processed by the computer.

02 Create the Term-Document Matrix
The term-document matrix (TDM) is created from the documents that have been digitized and organized (the corpus). In the TDM, each row represents a document, while each column represents a term.
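
A minimal sketch of building a TDM with one row per document and one column per term, as described above. The function name and the two-document corpus are assumptions made for illustration; the cells hold raw term counts.

```python
from collections import Counter

def term_document_matrix(tokenized_docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical corpus that has already been tokenized.
corpus = [["text", "mining", "text"], ["data", "mining"]]
vocabulary, tdm = term_document_matrix(corpus)
print(vocabulary)
for row in tdm:
    print(row)
```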
To obtain a more consistent term-document matrix (TDM) for subsequent analysis, the indices need to be normalized. Some commonly used normalization methods are as follows:

Log frequencies: Raw frequencies can be transformed using a logarithmic function. This dampens the raw frequencies and their impact on subsequent analysis results.
Binary frequencies: This simple transformation indicates whether a term is present in a document or not. The result is a TDM containing only 1s and 0s.
Inverse document frequencies: This transformation takes into account how many documents a term occurs in. It gives higher weight to terms that occur in fewer documents and may therefore be more specific in the context of the analysis.
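
The three normalizations could be applied to a count matrix roughly as follows. This is a sketch under stated assumptions: the log transform is taken to be 1 + log(count) for non-zero counts, and the inverse-document-frequency weighting reuses the IDF formula given earlier in these slides (count times 1 + log(N / document frequency)); other textbooks define these weights slightly differently.

```python
import math

def log_frequencies(matrix):
    """Replace each non-zero count c with 1 + log(c); zeros stay zero."""
    return [[1 + math.log(c) if c > 0 else 0.0 for c in row] for row in matrix]

def binary_frequencies(matrix):
    """1 if the term occurs in the document, else 0."""
    return [[1 if c > 0 else 0 for c in row] for row in matrix]

def inverse_document_frequencies(matrix):
    """Weight each count by 1 + log(N / number of documents containing the term)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * (1 + math.log(n_docs / doc_freq[j])) if row[j] > 0 else 0.0
             for j in range(n_terms)]
            for row in matrix]

# Hypothetical TDM: rows are documents, columns are terms.
tdm = [[0, 1, 2], [1, 1, 0]]
print(log_frequencies(tdm))
print(binary_frequencies(tdm))
print(inverse_document_frequencies(tdm))
```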
Continued
03 Extract the Knowledge
Extract knowledge from the well-structured TDM, coupled with other structured data elements, to discover new patterns in the context of the specific problem at hand.

The main methods for extracting knowledge include:
1. Classification: A common knowledge discovery task when analyzing complex data. The goal is to assign data instances to predefined categories. In the context of text mining, this is known as text classification, where documents are assigned labels according to their topic.

2. Clustering: The most popular clustering methods are scatter/gather clustering and query-specific clustering.

3. Association: Refers to direct relationships between concepts or sets of concepts. This involves finding interesting relationships between variables in large databases.

4. Trend Analysis: Based on the idea that the distribution of concepts is a function of the document collection. This makes it possible to compare the distribution of concepts from two different document collections to identify trends or changes over time.
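
As a rough illustration of the trend analysis idea, the sketch below compares the relative frequency of each concept in two document collections. Treating individual tokens as "concepts", the function names, and the two tiny collections are all simplifying assumptions made for this example.

```python
from collections import Counter

def concept_distribution(tokenized_docs):
    """Relative frequency of each concept across a document collection."""
    counts = Counter(term for doc in tokenized_docs for term in doc)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}

def distribution_shift(old_docs, new_docs):
    """Change in relative frequency per concept (new minus old)."""
    old = concept_distribution(old_docs)
    new = concept_distribution(new_docs)
    return {term: new.get(term, 0.0) - old.get(term, 0.0)
            for term in set(old) | set(new)}

# Hypothetical complaint collections from two periods.
last_year = [["price", "delay"], ["price", "quality"]]
this_year = [["delay", "delay"], ["price", "delay"]]
shift = distribution_shift(last_year, this_year)
for term, change in sorted(shift.items(), key=lambda item: -item[1]):
    print(f"{term}: {change:+.2f}")
```

A positive value means the concept takes up a larger share of the newer collection.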
Sentiment Analysis Overview
Sentiment analysis is an effort to understand what people feel and think about a particular topic by exploring the opinions of many people with the help of automated tools. It answers the question "How do other people feel about a topic?" and brings together researchers and practitioners who work on opinion-related problems, with the aim of creating opinion-oriented information systems.

Sentiment Analysis In Business
In marketing and customer relationship management, sentiment analysis is carried out to find out what customers think about the products and services offered and to detect which opinions about them are favorable or unfavorable.

Sources for sentiment collection:
• Customer call center transcripts
• Social media posts
• Online communities and forums
• Customer surveys, etc.

Sentiments that appear in opinions are usually of two types:
• Explicit sentiment (subjective sentences that directly express an opinion)
• Implicit sentiment (sentences that imply an opinion without stating it directly)

Sentiment analysis has many other names, such as:
• opinion mining
• subjectivity analysis
• appraisal extraction
SENTIMENT ANALYSIS

Sentiment analysis, now a popular application of text analytics, has a broad impact in various fields. Compared to expensive and time-consuming traditional sentiment analysis methods, approaches based on text analytics technology can automate data collection, filtering, classification, and clustering on a large scale. These applications draw on a variety of data sources such as social media, product reviews, service center call records, and more.
Some key applications of sentiment analysis include:
• Voice of the Customer (VOC): Using sentiment analysis to understand and manage customer complaints and compliments, helping companies improve their products and services.
• Voice of the Market (VOM): Understanding aggregate market opinions and trends, assisting companies in developing product strategies and positioning themselves in competitive markets.
• Voice of the Employee (VOE): Using sentiment analysis to assess employee satisfaction, which can influence efforts to improve customer satisfaction.
• Brand Management: Using sentiment analysis to monitor opinions on social media to maintain or improve brand reputation.
• Financial Markets: Applying sentiment analysis to predict financial market movements, using data from social media, news, and online discussions.
• Politics: Analyzing sentiment in political discussions to predict election outcomes and understand the issues that matter to voters.
• Government Intelligence: Using sentiment analysis to monitor public opinion regarding government policies and identify potential threats based on negative communications.

In addition, sentiment analysis can also be used in e-commerce site design, ad placement, search engine management, and email filtering and analysis. With this wide range of applications, sentiment analysis helps organizations understand and respond to opinions and trends in various contexts.
SENTIMENT ANALYSIS PROCESS

STEP 1: SENTIMENT DETECTION

Sentiment detection aims to distinguish between facts and opinions in a text, which can be thought of as classifying the text as objective or subjective (O-S polarity). Opinion detection is usually based on examining the adjectives in the text. For example, the sentence "This film is amazing!" is considered opinionated because it contains the adjective "amazing". Texts deemed to contain opinions are forwarded to the next stage.

STEP 2: N-P POLARITY CLASSIFICATION

This step aims to classify the polarity of the opinion as negative or positive (N-P), or as neutral. For example, product reviews can be considered positive or negative depending on the words used, such as “good” or “bad.” It is also important to identify the strength of the sentiment (mild, moderate, or strong).
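
A minimal word-based sketch of Step 2. The mini-lexicon `POLARITY_LEXICON` and its scores are invented for illustration (the sign gives the direction and the magnitude a rough strength); a real system would rely on a much larger resource and would also handle negation and context.

```python
import re

# Hypothetical mini-lexicon: word -> polarity score (sign = direction, size = strength).
POLARITY_LEXICON = {
    "good": 1, "great": 2, "amazing": 3,
    "bad": -1, "poor": -2, "terrible": -3,
}

def polarity_score(text):
    """Sum the lexicon scores of all words in the text (unknown words count as 0)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(POLARITY_LEXICON.get(token, 0) for token in tokens)

def classify_polarity(text):
    """Map the summed score to a positive, negative, or neutral label."""
    score = polarity_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in ["This film is amazing!",
                 "The service was bad and the food was terrible",
                 "The package arrived on Monday"]:
    print(sentence, "->", classify_polarity(sentence), polarity_score(sentence))
```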
SENTIMENT ANALYSIS PROCESS
STEP 3: TARGET IDENTIFICATION

This step aims to identify the object discussed in the opinion. Target identification is important because it helps in understanding the context of the sentiment and provides more specific information about the object being evaluated.

Determining targets in sentiment analysis can be easy in some situations, such as restaurant reviews. However, in news texts or blogs that mention many objects, determining targets can be difficult. Sometimes there is more than one target, as in comparative text. For example, in the sentence "Smartphone A is better than smartphone B", the two objects can be ranked by merit according to the context of the text.

STEP 4: COLLECTION AND AGGREGATION

Once the sentiment of all the texts has been analyzed, the last step is to aggregate the results and combine them into a single measure for the whole document. This can be done by summing the polarity and strength of all the texts, or by using semantic aggregation techniques from natural language processing to create a final sentiment.
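
The simple summing option mentioned in Step 4 might look like the sketch below. The (polarity, strength) encoding, with polarity in {-1, 0, +1} and a non-negative numeric strength, is an assumption made for this example, as are the function name and the sample scores.

```python
def aggregate_sentiment(scored_texts):
    """Sum polarity * strength over all texts and report the overall label.

    scored_texts: list of (polarity, strength) pairs, where polarity is
    -1, 0, or +1 and strength is a non-negative number (assumed scale).
    """
    total = sum(polarity * strength for polarity, strength in scored_texts)
    if total > 0:
        label = "positive"
    elif total < 0:
        label = "negative"
    else:
        label = "neutral"
    return total, label

# Hypothetical per-text results from the earlier steps.
reviews = [(1, 3), (1, 1), (-1, 2), (0, 0)]
total, label = aggregate_sentiment(reviews)
print(total, label)
```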
METHODS FOR POLARITY IDENTIFICATION
Text polarity can be identified at the word, term, sentence, or document level. The most detailed identification is carried out at the word level. Once polarity has been identified at the word level, it can be aggregated to the next higher level, and so on until the desired level of aggregation for the sentiment analysis is reached.

Lexicon
A lexicon is a catalog of words, synonyms, and their meanings for a particular language. A commonly used lexicon for sentiment analysis is WordNet. WordNet is a large lexical database of the English language, which groups words into sets of cognitive synonyms (i.e., synsets) that each express a different concept. Examples of extensions are SentiWordNet (provides positive, negative, and objectivity scores for each synset) and WordNet-Affect (labels WordNet synsets with affective categories such as emotions, feelings, attitudes, and so on).

Using a Collection of Training Documents
This method uses statistical analysis and machine learning, drawing on labeled documents (labeled either manually by an annotator or through a rating system such as a star/point system). Once a labeled text dataset is available, various machine learning algorithms can be used to train a sentiment classifier. Some popular algorithms for this task include artificial neural networks, support vector machines, k-nearest neighbors, Naive Bayes, decision trees, and expectation-maximization-based clustering.
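
To make the training-documents approach concrete, here is a small multinomial Naive Bayes classifier written from scratch. It is a sketch, not a production implementation: the four labeled "reviews" are invented, add-one (Laplace) smoothing is an assumption, and words unseen during training are simply ignored at prediction time.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns the fitted model as a dict."""
    class_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)
    vocabulary = {token for tokens, _ in labeled_docs for token in tokens}
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocabulary": vocabulary, "n_docs": len(labeled_docs)}

def predict(model, tokens):
    """Return the label with the highest log-posterior under add-one smoothing."""
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, n_label in model["class_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_label / model["n_docs"])  # log prior
        for token in tokens:
            if token in model["vocabulary"]:         # skip unseen words
                score += math.log((counts[token] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training documents (already tokenized).
training = [
    (["good", "great", "sound"], "positive"),
    (["great", "battery"], "positive"),
    (["bad", "screen"], "negative"),
    (["poor", "bad", "battery"], "negative"),
]
model = train_naive_bayes(training)
print(predict(model, ["great", "sound"]))
print(predict(model, ["bad", "screen"]))
```

In practice the labels would come from annotators or from star ratings, as described above, and the token lists from the pre-processing steps covered earlier.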

SENTIMENT ANALYSIS AND SPEECH ANALYTICS
Speech Analytics is a science that enables the analysis and extraction of information from live and recorded
conversations. This analysis is used to gather information for security, improve media applications, and
provide business intelligence through worldwide customer call analysis.
In speech analytics, sentiment analysis focuses on assessing the emotional state of a conversation and
measuring the presence and strength of positive or negative feelings. The essence of automated sentiment
analysis involves building models to describe the relationship between features and content in audio and
perceived and expressed sentiment.

The Acoustic Approach
The acoustic approach in sentiment analysis focuses on measuring audio features such as voice pitch, volume, intensity, and rate of speech to understand a speaker's sentiment. In developing acoustic analysis tools, the system must be built on a model that defines the sentiment being measured. This model is based on a database of audio features and how the presence of these features can indicate each measured sentiment.

The Linguistic Approach
This approach focuses on explicit indications of the sentiment and context of the conversational content in the audio.

The simplest method in linguistic analysis is to capture keywords in the audio that indicate a particular sentiment. However, this approach is less popular due to its limitations and lack of predictive accuracy.

Another approach involves building models based on linguistic elements to predict specific sentiments in audio. The challenge is to collect linguistic information from each audio corpus.
References

❑ Provost, F., Fawcett, T. (2013). Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking. O'Reilly.
❑ Sharda, R., Delen, D., Turban, E. (2018). Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition. Pearson.
Thank You
