Text Mining

The document discusses text mining techniques and applications. It defines key concepts in text mining such as tokenization, stemming, stop words, and topic modeling. Various text mining tasks and methods are also explained such as sentiment analysis, document similarity, and latent semantic analysis.

Uploaded by

pojemonoy

June 6, 2024

TEXT MINING

Text Mining

 Application of data mining to non-structured or less structured text files.
 Find the “hidden” content of documents, including additional useful relationships
 Relate documents across previously unnoticed divisions
 Group documents by common themes

Need for Text Mining

Where is it used?

What is Text Data Mining?

 People’s first thought:
• Make it easier to find things on the Web.
• But this is information retrieval!
 Information Extraction (IE)
• Extract facts about pre-specified entities, events or relationships from unrestricted text sources.
• No novelty: only information already present is extracted.
 The metaphor of extracting ore from rock:
• Does make sense for extracting documents of interest from a huge pile.
• But does not reflect notions of DM in practice.

Data Mining Vs Text Mining

 Quite similar to data mining, except that DM finds patterns in data stored in structured databases
 But for text mining the input is a collection of unstructured files: Word documents, PDFs, etc.
 So text mining can be thought of as a 2-step process
• Imposing structure on the text-based sources
• Extracting relevant information from the structured text-based data using DM tools and techniques

NLP
 Natural Language Processing (NLP) is the set of techniques computers and smartphones use to understand human language, both spoken and typed
 Draws on concepts from both computer science and AI
 Text mining is the process of deriving high-quality information from text
 We want to turn text into data for analysis via the application of NLP

Applications of NLP

Text Mining Lingo

 Unstructured data Vs structured data

 Corpus: a large and structured set of texts prepared for the purpose of conducting knowledge discovery.
 Tokenization: breaking text into tokens, typically splitting a sentence into words.
 Terms: a single word or multiword phrase extracted directly from the corpus of a specific domain by means of NLP methods.
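
As a rough illustration of tokenization, the sketch below splits a sentence into lowercase word tokens with a regular expression. The function name `tokenize` and the pattern are assumptions chosen for this example; real NLP toolkits handle many more cases (hyphens, numbers with units, non-Latin scripts).

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens, dropping punctuation."""
    # A token is a run of letters/digits, optionally followed by an
    # apostrophe part so that contractions such as "isn't" stay whole.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

tokens = tokenize("Text mining isn't magic: it turns documents into data!")
print(tokens)
```

The resulting token list is the raw material for the later steps on these slides, such as counting word frequencies or building a bag of words.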

Text Mining Lingo

 Word frequency: the number of times a word is found in a specific document.
 Stemming: reducing inflected words to their stem form. E.g., stemming terms such as argue, argued, argues and arguing with a typical rule-based stemmer would result in the stem argu (a lemmatizer would instead return the dictionary form argue).
 Stop words / noise words / exclusion list: words filtered out prior to processing text. Most NLP tools use a list that includes articles (a, an, the), prepositions (of, in) and auxiliary verbs (is, are, was, were, etc.).
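
The three ideas on this slide can be sketched together: remove stop words, stem what remains, then count frequencies. The stop list and the suffix-stripping function `naive_stem` below are toy assumptions for illustration, not the algorithm of any particular library. Like many rule-based stemmers, this one produces truncated stems such as `argu` rather than dictionary words.

```python
import re
from collections import Counter

# Illustrative stop list (an assumption): articles, a preposition, auxiliary verbs.
STOP_WORDS = {"a", "an", "the", "of", "is", "are", "was", "were"}

def naive_stem(word):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_frequencies(text):
    """Count stems in the text after dropping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(naive_stem(t) for t in kept)

freq = stem_frequencies("The lawyer argued and argues; arguing is the job of a lawyer.")
print(freq.most_common(2))
```

Here argued, argues and arguing are all counted under one stem, which is the point of stemming: inflected variants stop fragmenting the frequency counts.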

Text Mining Lingo

 Synonyms: syntactically different words (i.e., spelled differently) with identical or at least similar meanings (e.g., movie, film, and motion picture).
 Document-term matrix (the transpose of the term-by-document matrix): describes the frequency of terms occurring in the corpus. Rows correspond to documents and columns to terms.
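
A minimal sketch of such a matrix, with one row per document and one column per term, is shown below. The three example documents and the helper name `document_term_matrix` are made up for illustration, and tokenization is reduced to whitespace splitting.

```python
def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives every term a stable column index.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = [[doc.count(term) for term in vocabulary] for doc in tokenized]
    return vocabulary, matrix

docs = ["the movie was good", "the film was bad", "good film good cast"]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Note that "movie" and "film" get separate columns even though they are synonyms; the matrix only records surface forms.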

Text Mining Lingo

 Bag of Words: an NLP technique of feature extraction from text data. It records the occurrence of words within a document, disregarding grammatical details and word order. Using it, we convert variable-length texts into fixed-length vectors, i.e. each text into its equivalent vector of numbers.
 Document Embedding: a single vector that represents a whole document. In a hierarchical attention model it is the result of the second (sentence-level) attention layer, which aggregates all the sentences that appear in the document after each has been processed at the word level.

What is Sentiment Analysis (SA) & Opinion Mining (OM)?
 Identify the orientation of opinion in a piece of text
• “The movie was fabulous!”
• “The movie stars Mr. X”
• “The movie was horrible!”
 Can be generalized to a wider set of emotions
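
One very simple way to assign an orientation is to count words from a positive and a negative word list. The sketch below does that for the three movie sentences. The tiny lexicon and the function name `orientation` are hypothetical and exist only for this example. Practical systems rely on much larger lexicons or trained classifiers.

```python
import re

# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"fabulous", "good", "great", "excellent"}
NEGATIVE = {"horrible", "bad", "awful", "poor"}

def orientation(text):
    """Label text as positive, negative or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in ("The movie was fabulous!", "The movie stars Mr. X", "The movie was horrible!"):
    print(sentence, "->", orientation(sentence))
```

A word-counting approach like this ignores negation ("not good" still scores as positive), which is one reason machine learning methods are commonly used instead.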

Motivation
 Knowing sentiment is a very natural ability of a human being. Can a machine be trained to do it?
 SA aims at extracting sentiment-related knowledge, especially from the huge amount of information on the internet
 Can be generally used to understand opinion in a set of documents

Tripod of Sentiment Analysis
 Sentiment Analysis rests on three disciplines:
• Cognitive Science
• Machine Learning
• Natural Language Processing

 Sentence 1: “This is a good job. I will not miss it for anything”
 Sentence 2: “This is not good at all”
 For this example, the vocabulary is of 5 words only:
• good
• job
• miss
• not
• all
 So, the respective bag-of-words vectors for these sentences are:
• “This is a good job. I will not miss it for anything” = [1,1,1,1,0]
• “This is not good at all” = [1,0,0,1,1]
 N-Gram Model: an N-gram is a sequence of N tokens: a 2-gram (bigram) is a 2-word sequence and a 3-gram (trigram) is a 3-word sequence.
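
The two vectors above, and the idea of an n-gram, can be reproduced with a short sketch. The five-word vocabulary is taken from the slide. The function names `bag_of_words` and `ngrams` are assumptions for this example, and the vectors record presence (1) or absence (0) rather than counts, matching the slide.

```python
import re

# Vocabulary from the slide, in the same order as the vector positions.
VOCABULARY = ["good", "job", "miss", "not", "all"]

def bag_of_words(sentence, vocabulary=VOCABULARY):
    """Return a 0/1 vector marking which vocabulary words occur in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return [1 if term in tokens else 0 for term in vocabulary]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(bag_of_words("This is a good job. I will not miss it for anything"))
print(bag_of_words("This is not good at all"))
print(ngrams(["this", "is", "not", "good"], 2))
```

The bigram ("not", "good") keeps the negation next to the word it modifies, information that the plain bag-of-words vector discards.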

 TF-IDF: term frequency–inverse document frequency, a numerical statistic reflecting the importance of a word to a document in a corpus.
 The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
 TF-IDF is one of the most popular term-weighting schemes today; a 2015 survey found that 83% of text-based recommender systems in digital libraries use tf–idf.
 A high TF-IDF value indicates that the term doesn’t occur frequently in the collection of documents taken as a whole, but appears quite frequently in a specified document.
 A TF-IDF value close to 0 indicates that the term appears frequently across the collection, or only rarely in the specific document.
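
To make the behaviour described above concrete, here is a minimal sketch of one common TF-IDF variant: term frequency as a share of the document's tokens, multiplied by the natural log of (number of documents / documents containing the term). The slides do not specify a formula, so this choice and the toy corpus are assumptions. Libraries often apply smoothing or normalization on top.

```python
import math

def tf(term, doc):
    """Share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Toy corpus of already-tokenized documents.
corpus = [["text", "mining", "text"], ["data", "mining"], ["web", "mining"]]
print(tf_idf("mining", corpus[0], corpus))  # term occurs in every document
print(tf_idf("text", corpus[0], corpus))    # term occurs in one document only
```

"mining" appears in every document, so its inverse document frequency is log(1) = 0 and its score vanishes, while "text" is frequent in the first document and absent elsewhere, so it scores high there.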

Documents Semantic Analysis

 Explore document content quickly and efficiently
 Compare subgroup to main group
 Extract keywords
 What are the documents talking about?
 Explore document maps

Text Mining Lingo

 Concordance finds the queried word in a text and displays the context in which this word is used, giving the analyst the opportunity to view different perspectives on a text.
 It is a generated list of every occurrence of a given word in a digital corpus, together with its context (a certain number of words before and after the keyword).
 The search term and its co-text are arranged so that the textual environment can be assessed and patterns surrounding the search term can be identified visually.
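
A concordance of this kind (often called keyword-in-context) can be sketched in a few lines. The function name `concordance`, the `width` parameter and the sample text are assumptions for this example.

```python
import re

def concordance(text, keyword, width=3):
    """List every occurrence of keyword with up to `width` words on each side."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

text = "Text mining finds patterns in text. Data mining finds patterns in tables."
for left, key, right in concordance(text, "mining", width=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>12} [{key}] {right}")
```

Aligning the keyword in a single column is what lets an analyst scan the surrounding words for recurring patterns.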

Text Mining Lingo

 Similarity Hashing computes similarity hashes for the given corpus, allowing the user to find duplicates, plagiarism, textual borrowing or paraphrasing in documents for legal or academic use.
 It numerically scores and indicates how similar two texts are based on their content, structure, or style.
 Text similarity measures can be utilized to perform various tasks which require comparing, matching, or grouping texts based on their similarity or difference.

 Additionally, you can find relevant or similar documents, articles, or products based on a query or reference text.
 Texts can also be clustered or categorized into topics, themes, or genres according to their content.
 Furthermore, texts can be summarized by identifying the most important sentences or simplified by measuring their consistency and readability.
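
The slides do not say which similarity hash is meant. As one well-known example, the sketch below implements a basic SimHash over single-word features: each word votes on every bit of a 64-bit fingerprint, and texts are compared by the Hamming distance between fingerprints. The use of MD5 as the word hash and the function names are assumptions made for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint length

def simhash(text):
    """Return a 64-bit SimHash fingerprint built from the text's words."""
    votes = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the word hash
        for bit in range(BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(BITS):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def similarity(text_a, text_b):
    """Score in [0, 1]; 1.0 means identical fingerprints."""
    return 1 - hamming_distance(simhash(text_a), simhash(text_b)) / BITS

print(similarity("the quick brown fox", "the quick brown fox"))
print(similarity("the quick brown fox", "the quick brown dog"))
```

Because fingerprints are small integers, near-duplicate detection over a large corpus reduces to comparing bit patterns instead of full texts. With single-word features the fingerprint ignores word order, so detecting reordered paraphrases versus true duplicates would need word n-grams as features.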

Text Mining Lingo

 According to some sources, the average person generates in excess of 2.7MB of digital data per second, of which 80-90% is unstructured.
 Consider a scenario where a business employs a single individual to review each piece of unstructured data and segment it based on the underlying topic. It would be an impossible task.
 The solution is topic modeling.
 Topic Modeling: a frequently used approach to discover hidden semantic patterns portrayed by a text corpus and automatically identify the topics that exist inside it.

Text Mining Lingo

 Topic modeling is a type of statistical modeling that leverages unsupervised machine learning to analyze and identify clusters or groups of similar words within a corpus.
 For example, a topic modeling algorithm may be deployed to determine whether the contents of a document imply it’s an invoice, complaint, or contract.
 Topic modeling aids businesses in:
• Performing real-time analysis on unstructured textual data
• Learning from unstructured data at scale
• Building a consistent understanding of data, regardless of its format.

 Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.
 Latent Dirichlet allocation (LDA) is a Bayesian network (and, therefore, a generative statistical model) for modeling automatically extracted topics in textual corpora. In this model, observations (e.g., words) are collected into documents, and each word's presence is attributable to one of the document's topics. Each document will contain a small number of topics.
 Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset.
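
To give a feel for how LSI uses the SVD, the sketch below extracts only the first (strongest) singular direction of a small document-term matrix. Full LSI keeps the top k singular vectors and normally calls a library SVD routine; here plain power iteration is swapped in so the example needs nothing beyond the standard library. The matrix, the term names and the function name are made up for illustration.

```python
import math

def top_singular_triplet(matrix, iterations=100):
    """Power iteration on A^T A: returns (sigma, doc_scores, term_loadings)."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms  # start from a uniform unit vector
    for _ in range(iterations):
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]  # u = A v
        w = [sum(matrix[d][t] * u[d] for d in range(n_docs)) for t in range(n_terms)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, u, v

terms = ["movie", "film", "invoice", "payment"]
# Rows are documents, columns are terms: two reviews, then two billing notes.
matrix = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
sigma, doc_scores, term_loadings = top_singular_triplet(matrix)
print(round(sigma, 6))
print([round(x, 3) for x in term_loadings])
```

In this toy matrix the strongest latent dimension loads equally on "movie" and "film" and not at all on the billing terms, which is the sense in which LSI groups terms that occur in the same documents into a shared concept.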