AFM_Module 4

Uploaded by Maham Masroor

The document discusses text mining, emphasizing its importance in extracting knowledge from unstructured data sources such as social media and documents. It differentiates text mining from data and web mining, outlines the text mining process, and highlights various applications and tools available for text analytics. Additionally, it covers key concepts like term-document matrices, TF-IDF, and challenges in natural language processing.

Text analytics: Text mining

Software/tools
Week 4
Learning Objectives
• Describe text mining and understand the need for text mining
• Differentiate between text mining, Web mining and data mining
• Understand the different application areas for text mining
• Know the process of carrying out a text mining project
• Describe Web mining, its objectives, and its benefits
• Understand the three different branches of Web mining
• Web content mining
• Web structure mining
• Web usage mining
Text Analytics
Applying data analytics to derive knowledge from text

Huge amounts of textual data are available in the form of:
• Social media posts
• Tweets
• Question answer forums
• Blogs
• YouTube video comments
• SMS
• Product reviews
• News articles
How much textual data

• ~90 percent of corporate data is in some kind of unstructured form (e.g., text)
• Tapping into these information sources is not an option but a necessity to stay competitive
• Solution: text mining, a semi-automated process of extracting knowledge from unstructured data sources
Data Mining versus Text Mining
• Both seek novel and useful patterns
• Both are semi-automated processes
• Difference is the nature of the data:
• Structured versus unstructured data
• Structured data: in databases
• Unstructured data: Word documents, PDF files, text extracts, XML files,
and so on
• Text mining – first, impose structure to the data, then mine the
structured data
Stakeholders of text analytics
• Government
• What is the response of people towards a particular policy?
• Advertisers
• What is trending that could be used for advertisement?
• Some promo codes are available
• Movie Makers
• What did people dislike about a movie?
• This information is used to deliver what people want in the future
• Brand Managers
• What value-added services do people want in a brand?
• How do people respond to a brand's social responsibility campaigns?
• Academia
• Is this document plagiarized?
• Retrieve similar documents
Text analytics
• Benefits of text mining are obvious especially in text-rich data
environments
• e.g., law (court orders), academic research (research articles), finance
(quarterly reports), medicine (discharge summaries), biology
(molecular interactions), technology (patent files), marketing
(customer comments), etc.
• Electronic communication records (e.g., email)
• Spam filtering
• Email prioritization and categorization
• Automatic response generation
Text Mining
• Information extraction
• Identification of key phrases and relationships within text by looking for
predefined objects and sequences in text by way of pattern matching
• Topic tracking
• Based on a user profile and documents that a user views, text mining can
predict other documents of interest to the user
• Summarization
• Summarizing a document to save time on the part of the reader
Text Mining
• Categorization
• Identifying the main themes of a document and then placing the document
into a predefined set of categories based on those themes
• Clustering
• Grouping similar documents without having a predefined set of categories
• Concept linking
• Connects related documents by identifying their shared concepts
• Question answering
• Finding the best answer to a given question through knowledge-driven
pattern matching
Text Mining Terminology
• Unstructured or semistructured data
• Corpus
• Terms
• Concepts
• Stemming
• Stop words
• Synonyms
• Word frequency
• Part-of-speech tagging (assigning parts of speech to each word, such as noun, verb, adjective)
• Morphology/word structure
• Term-by-document matrix
• Singular value decomposition

A corpus (plural: corpora) refers to a large collection of written or spoken text. This text can come from various sources, including books, articles, websites, speech transcripts, social media posts, and email.

Morphology/Word Structure:
Morphology is the study of the internal structure of words, including prefixes, suffixes, and root words. In topic modeling, considering morphology can help identify different forms of the same word (e.g., "play," "playing," "played") as representing the same underlying topic. This can lead to a more accurate understanding of the thematic content.

Term-by-document Matrix:
This is a fundamental data structure used in topic modeling. Imagine a table where rows represent documents and columns represent unique words encountered across all documents. Each cell contains a value that represents the weight or importance of a specific word in a particular document. This weight can be simply the word count (term frequency) or a more sophisticated measure like TF-IDF.

Singular Value Decomposition (SVD):
SVD is a mathematical technique used for dimensionality reduction. In topic modeling, the term-by-document matrix can be very large, with many words and documents. SVD helps decompose this matrix into a more manageable form by identifying the most significant underlying themes (topics) within the data.
Text analytics: Tasks
Text analytics: Text normalization

Applying stemming or lemmatization. Stemming reduces words to their root form (e.g., "running" → "run"), while lemmatization considers the context and reduces words to their dictionary form (e.g., "better" → "good", "ran" → "run").

Lemmatization:

Dictionary Based: Lemmatization uses a dictionary to map words to their dictionary form, also called a lemma. It considers the grammatical context of
the word to ensure the resulting lemma is an actual word.
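The distinction above can be sketched in Python. The suffix rules and the lemma dictionary below are illustrative toys, not a real stemmer or lemmatizer; a practical system would use something like Porter stemming or a WordNet-backed lemmatizer.

```python
def crude_stem(word):
    """Naive suffix-stripping stemmer (illustrative rules only, not Porter)."""
    for suffix in ("ning", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Dictionary-based lemmatization: map inflected forms to a dictionary
# headword (lemma); the entries here are a tiny illustrative sample.
LEMMAS = {"running": "run", "ran": "run", "studies": "study", "better": "good"}

def lemmatize(word):
    return LEMMAS.get(word, word)

print(crude_stem("running"))  # suffix stripping handles regular forms
print(lemmatize("ran"))       # dictionary lookup also handles irregular forms
```

Note that `crude_stem("ran")` returns "ran" unchanged: suffix stripping cannot recover irregular forms, which is exactly where dictionary-based lemmatization helps.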
Text analytics: more basics
Text Mining Process

[Context diagram for the text mining process: unstructured data (text) and structured data (databases) are the inputs; the activity "Extract knowledge from available data sources" (A0) produces context-specific knowledge as the output. Domain expertise and tools and techniques are the mechanisms; software/hardware limitations, privacy issues, and linguistic limitations are the constraints.]
Text Mining Process

Task 1 (Establish the Corpus): Collect and organize the domain-specific unstructured data.
Task 2 (Create the Term-Document Matrix): Introduce structure to the corpus.
Task 3 (Extract Knowledge): Discover novel patterns from the T-D matrix.
Feedback flows from each task back to the previous one.

The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc. The output of Task 1 is a collection of documents in some digitized format for computer processing. The output of Task 2 is a flat file called a term-document matrix, where the cells are populated with term frequencies. The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations.

The three-step text mining process

Text Mining Process
• Step 1: Establish the corpus
• Collect all relevant unstructured data
• (e.g., textual documents, XML files, emails, Web pages, short notes)
• Digitize, standardize the collection
• (e.g., all in ASCII text files)
• Place the collection in a common place
• (e.g., in a flat file, or in a directory as separate files)
Text Mining Process
• Step 2: Create the Term–by–Document Matrix

[Example term-by-document matrix: rows are Documents 1–6 and columns are terms such as "investment risk," "project management," "software engineering," "development," and "SAP." Each cell holds the frequency of the term in that document (e.g., a value of 3 means the term occurs three times in that document), and empty cells mean the term does not occur.]
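A minimal sketch of building such a term-by-document matrix in plain Python (the three toy documents are illustrative):

```python
from collections import Counter

def term_document_matrix(docs):
    """Build a term-by-document count matrix: one row per document,
    one column per vocabulary term, cells holding term frequencies."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = [[Counter(tokens)[term] for term in vocab] for tokens in tokenized]
    return vocab, matrix

docs = [
    "investment risk risk",
    "project management",
    "software development",
]
vocab, tdm = term_document_matrix(docs)
print(vocab)   # sorted vocabulary across all documents
print(tdm[0])  # term frequencies for the first document
```

A real pipeline would tokenize more carefully (punctuation, stop words, stemming) before counting; splitting on whitespace keeps the sketch short.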
Text Mining Process
• Step 2: Create the Term–by–Document Matrix (TDM)
• Should all terms be included?
• Stop words, include words
• Synonyms, homonyms
• Stemming
• What is the best representation of the indices (values in cells)?
• Row counts; binary frequencies; log frequencies;
• Inverse document frequency

Inverse document frequency (IDF) is a metric used in text analysis, particularly in conjunction with term frequency (TF) to calculate a word's importance within a document relative to a collection of documents (corpus).

Here's how IDF works:

Focuses on Rare Words: IDF emphasizes words that are uncommon across the entire document collection. These uncommon words are likely more informative and specific to the document's content compared to
frequent words.
Text Mining Process
• Step 2: Create the Term–by–Document Matrix (TDM)
• TDM is a sparse matrix. How can we reduce the dimensionality of the
TDM?
• Manual - a domain expert goes through it
• Eliminate terms with very few occurrences in very few documents (?)
• Transform the matrix using SVD
• SVD is similar to principal component analysis
Domain expert review: While possible, manually going through a TDM to identify relevant terms is impractical for large datasets. It's time-consuming, subjective, and prone to human bias.
Automatic Approaches:

Thresholding: This method involves eliminating terms that occur in fewer than a certain number of documents or have a very low frequency within the corpus. This can be effective for removing noisy
terms with little information, but choosing the threshold value requires careful consideration to avoid eliminating potentially valuable terms.

Singular Value Decomposition (SVD): This is a powerful and widely used dimensionality reduction technique for the TDM. It decomposes the matrix into three components:

U: A matrix representing the documents in a lower-dimensional space.
Σ: A diagonal matrix containing the singular values, which capture the variance in the data.
V^T: A matrix representing the terms in a lower-dimensional space (transposed).
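The thresholding approach described above can be sketched in a few lines; the `min_df` value and toy documents are illustrative assumptions:

```python
def terms_meeting_threshold(tokenized_docs, min_df=2):
    """Keep only terms whose document frequency (number of documents
    containing the term) is at least min_df; the rest are dropped as
    candidate noise before building the reduced TDM."""
    df = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each document at most once
            df[term] = df.get(term, 0) + 1
    return {term for term, count in df.items() if count >= min_df}

docs = [
    ["investment", "risk"],
    ["risk", "software"],
    ["software", "typo-term"],
]
kept = terms_meeting_threshold(docs, min_df=2)
print(kept)  # terms occurring in at least two documents
```

As the slide warns, the threshold must be chosen with care: raising `min_df` shrinks the matrix but risks discarding rare yet informative terms.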
Text Mining Process
• Step 3: Extract patterns/knowledge
• Classification (text categorization)
• Clustering (natural groupings of text)
• Improve search recall
• Improve search precision
• Scatter/gather
• Query-specific clustering
• Association rules
• Trend Analysis
Exploratory data analysis
• Gives insight about the data such as:
• Class distribution
• Top occurring words in the dataset
• Distribution of words per document
• These insights help in formulating solution strategies for the
task
• What preprocessing should be used?
• What classifier should be used?

• Sentiment Polarity Detection Dataset
• Clothing product review text, reviewer info, rating, and sentiment
Exploratory data analysis
Sentiment Polarity
Detection Dataset
• Sentiment labels
(-1,0,1)
• Rating distribution
• Distribution of age of
the reviewers
• Distribution of the text
length of the reviews
• Reviews per
department
Exploratory data analysis
• Frequency of top unigrams
before removing stopwords
• Frequency of top unigrams
after removing stopwords
• Frequency of top bigrams
before removing stopwords
• Frequency of top bigrams
after removing stopwords
• Frequency of top trigrams
before removing stopwords
• Frequency of top trigrams
after removing stopwords
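The before/after comparison above can be sketched as follows; the stopword list and the sample review are illustrative:

```python
from collections import Counter

STOPWORDS = {"the", "is", "a", "an", "and", "of", "to"}  # illustrative subset

def top_bigrams(text, remove_stopwords=False, k=5):
    """Return the k most frequent bigrams, optionally after stopword removal."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    bigrams = list(zip(tokens, tokens[1:]))
    return Counter(bigrams).most_common(k)

review = "the dress is lovely and the fit is perfect"
print(top_bigrams(review))                         # dominated by stopword pairs
print(top_bigrams(review, remove_stopwords=True))  # content-word pairs only
```

The same pattern extends to unigrams and trigrams by changing how the token tuples are formed.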
Exploratory data analysis

An integral tool for text EDA is the word cloud
Mining Text for Security and Counterterrorism

Cluster 1: (L) Kampala, (L) Uganda, (P) Yoweri Museveni, (L) Sudan, (L) Khartoum, (L) Southern Sudan
Cluster 2: (P) Timothy McVeigh, (P) Oklahoma City, (P) Terry Nichols
Cluster 3: (E) election, (P) Norodom Ranariddh, (P) Norodom Sihanouk, (L) Bangkok, (L) Cambodia, (L) Phnom Penh, (L) Thailand, (P) Hun Sen, (O) Khmer Rouge, (P) Pol Pot
Text Mining Application
(research trend identification in literature)

• Mining the published Information systems literature


• MIS Quarterly (MISQ)
• Journal of MIS (JMIS)
• Information Systems Research (ISR)
• Covers 12-year period (1994-2005)
• 901 papers are included in the study
• Only the paper abstracts are used
• 9 clusters may be generated for further analysis
Discussion with paper
https://www.emerald.com/insight/content/doi/10.1108/ITP-03-2021-
0188/full/html#sec004
Text Mining Tools
• Commercial Software Tools
• SPSS PASW Text Miner
• Statistica Data Miner
• ATLAS.ti (https://atlasti.com/)
• Free Software Tools
• Netlytics (https://netlytic.org/)---Practice conducted in the class
• Voyant tool (https://voyant-tools.org/)---Practice conducted in the class
• Topic modeling tool from Google code
(https://code.google.com/archive/p/topic-modeling-tool/)
Vector space modeling

Set of Words:
This is the most basic approach. It simply represents text as a collection of unique words, without considering order or frequency.
Imagine a bag where you throw in all the unique words from a sentence or document.
For computers, this isn't very informative because it doesn't capture the relationships
between words or their importance within the text.
Bag-of-Words (BoW):
This is a more sophisticated approach that builds upon the idea of a set of words.
In BoW, we still represent text as a collection of words, but we also consider the
frequency of each word's occurrence.
Each word becomes a feature, and its value is typically the number of times it
appears in the document (term frequency).
This allows us to capture some basic information about the content based on how
often words appear.
Word Embedding:
This is a more advanced technique that represents words as numerical vectors.
These vectors capture the semantic meaning and relationships between words.
Words with similar meanings will have similar vector representations, even if they are
not the same word.
Word embeddings are created using machine learning techniques that analyze large
amounts of text data.
Vector space modeling
• Set-of-Words: Documents represented by vectors ∈ {0, 1}^|Σ|
• Bag-of-Words: Documents represented by term-frequency vectors ∈ ℕ^|Σ|

Issues with Sets and Bags of Words
• The representation has high associated computational complexity
• Dimensionality blow-up: |Σ| could be very large
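A minimal sketch of the two representations (the vocabulary and document below are illustrative):

```python
def set_of_words_vector(tokens, vocab):
    """Vector in {0,1}^|Σ|: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]

def bag_of_words_vector(tokens, vocab):
    """Term-frequency vector in N^|Σ|: how often each term occurs."""
    return [tokens.count(term) for term in vocab]

vocab = ["ball", "bat", "pitch", "run"]
doc = ["run", "run", "bat", "run"]
print(set_of_words_vector(doc, vocab))
print(bag_of_words_vector(doc, vocab))
```

The dimensionality problem is visible even here: the vector length is |Σ|, the size of the whole vocabulary, regardless of how short the document is.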
Vector space modeling
TF-IDF (Term Frequency-Inverse Document Frequency) is a more refined model for selecting features to represent texts: it considers both a term's frequency and how rare the term is across the entire document collection.

• Key idea: find the special words that characterize a document
• Frequency:
• The most frequent words would seem to be the most significant in a document
• But the most frequent words ("the", "are", "and") help structure English and build ideas; they are not significant in characterizing documents
• Rarity: rare words are indicators of topics
• Words that are rare overall but concentrated in a few documents, e.g., "batsman", "prime minister"
• ball, bat, pitch, catch, run ⇒ cricket-related document
TF-IDF

▪ df_i = document frequency of term i
  = number of documents containing term i
▪ IDF_i = inverse document frequency of term i,
  IDF_i = log2(N / df_i)
  (N: total number of documents)

TF-IDF Example
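The formula above, with IDF_i = log2(N / df_i), can be sketched directly. Documents are represented here as term-count dictionaries; the counts are illustrative:

```python
import math

def tf_idf_weights(doc_term_counts, term):
    """TF-IDF weight of `term` in each document, with TF as the raw
    count and IDF_i = log2(N / df_i)."""
    N = len(doc_term_counts)
    df = sum(1 for counts in doc_term_counts if counts.get(term, 0) > 0)
    idf = math.log2(N / df)
    return [counts.get(term, 0) * idf for counts in doc_term_counts]

docs = [
    {"the": 10, "batsman": 3},
    {"the": 8},
    {"the": 5, "pitch": 2},
    {"the": 7},
]
print(tf_idf_weights(docs, "batsman"))  # rare term: 3 * log2(4/1) = 6.0 in doc 1
print(tf_idf_weights(docs, "the"))      # appears everywhere: IDF = log2(4/4) = 0
```

Note how the ubiquitous word "the" is weighted to zero despite its high raw frequency, while the rare, concentrated term "batsman" gets a high weight, which is exactly the frequency/rarity trade-off described above.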
Natural Language Processing (NLP)
• Structuring a collection of text
• Old approach: bag-of-words
• New approach: natural language processing
• NLP is
• a very important concept in text mining
• a subfield of artificial intelligence and computational linguistics
• the study of "understanding" the natural human language
• Syntax versus semantics-based text mining
Natural Language Processing (NLP)
• What is “Understanding” ?
• Human understands, what about computers?
• Natural language is vague, context driven
• True understanding requires extensive knowledge of a topic
Natural Language Processing (NLP)
• Challenges in NLP
• Part-of-speech tagging
• Text segmentation
• Word sense disambiguation
• Syntax ambiguity
• Imperfect or irregular input
• Speech acts

• Dream of AI community
• to have algorithms that are capable of automatically reading and
obtaining knowledge from text
NLP Task Categories
• Information retrieval/recovery
• Information extraction
• Named-entity recognition
• Question answering
• Automatic summarization
• Natural language generation and understanding
• Machine translation
• Foreign language reading and writing
• Text proofing
Web Mining Overview
• Web is the largest repository of data
• Data is in HTML, XML, text format
• Challenges (of processing Web data)
• The Web is too big for effective data mining
• The Web is too complex
• The Web is too dynamic
• The Web is not specific to a domain
• The Web has everything

• Opportunities and challenges are great!


Web Mining
• Web mining (or Web data mining) is the process of discovering
intrinsic relationships from Web data (textual, linkage, or usage)

Web Mining

Web Content Mining. Source: the unstructured textual content of the Web pages (usually in HTML format)
Web Structure Mining. Source: the uniform resource locator (URL) links contained in the Web pages
Web Usage Mining. Source: the detailed description of a Web site's visits (sequence of clicks by sessions)
Web Content/Structure Mining
• Mining of the textual content on the Web
• Data collection via Web crawlers

• Web pages include hyperlinks
• Hubs and authorities
• Hyperlink-induced topic search (HITS)
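A minimal HITS sketch on a toy link graph (the graph itself is illustrative): pages pointed to by many good hubs get high authority scores, and pages linking to many good authorities get high hub scores.

```python
def hits(links, iterations=20):
    """Hyperlink-induced topic search: iteratively update authority
    scores (sum of hub scores of in-linking pages) and hub scores
    (sum of authority scores of out-linked pages), normalizing each."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q, targets in links.items() if p in targets)
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# A and B both link to C, so C is the authority and A, B are equal hubs.
links = {"A": ["C"], "B": ["C"], "C": []}
hub, auth = hits(links)
```

A fixed iteration count stands in for a convergence test to keep the sketch short.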
Web Usage Mining
• Extraction of information from data generated through Web
page visits and transactions
• data stored in server access logs, referrer logs, agent logs, and client-
side cookies
• user characteristics and usage profiles
• metadata, such as page attributes, content attributes, and usage data
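As a small illustration of the server access logs mentioned above, a Common Log Format line can be parsed like this (the sample line is fabricated for illustration):

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

def parse_log_line(line):
    """Extract the fields Web usage mining starts from; None if malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = '192.168.1.5 - - [10/Oct/2023:13:55:36 +0000] "GET /products/42 HTTP/1.1" 200 2326'
record = parse_log_line(line)
print(record["host"], record["path"], record["status"])
```

Grouping such records by host and time gap is the usual first step toward the sessions and visit profiles discussed below.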
Web Usage Mining
• Web usage mining applications
• Determine the lifetime value of clients
• Design cross-marketing strategies across products
• Evaluate promotional campaigns
• Target electronic ads and coupons at user groups based on user access
patterns
• Predict user behavior based on previously learned rules and users'
profiles
• Present dynamic information to users based on their interests and
profile
Web Usage Mining

[Process diagram: a user/customer interacts with the Website, generating Weblogs. Pre-process the data by collecting, merging, cleaning, and structuring it (identify users, sessions, page views, and visits). Extract knowledge in the form of usage patterns, user profiles, page profiles, visit profiles, and customer value. The results inform how to better the data, how to improve the Web site, and how to increase the customer value.]
Web Mining Success Stories
• Amazon.com, Ask.com
• Website optimization

[Diagram: customer interaction on the Web → analysis of interactions (Web analytics, voice of customer, customer experience management) → knowledge about the holistic view of the customer]