Text Mining

The document discusses text mining techniques and applications. It defines key concepts in text mining such as tokenization, stemming, stop words, and topic modeling. Various text mining tasks and methods are also explained such as sentiment analysis, document similarity, and latent semantic analysis.

Uploaded by

pojemonoy

June 6, 2024

TEXT MINING

Text Mining

• Application of data mining to non-structured or less structured text files.
• Find the “hidden” content of documents, including additional useful relationships
• Relate documents across previously unnoticed divisions
• Group documents by common themes

Need for Text Mining



Where is it used?

What is Text Data Mining?


• People’s first thought:
  • Make it easier to find things on the Web.
  • But this is information retrieval!
• Information Extraction (IE)
  • Extract facts about pre-specified entities, events or relationships from unrestricted text sources.
  • No novelty: only information already present is extracted.
• The metaphor of extracting ore from rock:
  • Does make sense for extracting documents of interest from a huge pile.
  • But does not reflect notions of DM in practice.

Data Mining Vs Text Mining

• Quite similar to data mining, except that DM finds patterns in data stored in structured databases
• But for text mining the input is a collection of unstructured files: Word documents, PDFs, etc.
• So text mining can be thought of as a 2-step process
  • Imposing structure on the text-based sources
  • Extracting relevant info from the structured text-based data using DM tools and techniques

NLP
• NLP (Natural Language Processing) is the technology computers and smartphones use to understand our language, both spoken and typed
• Draws on concepts from both computer science and AI
• Text mining is the process of deriving high-quality information from text
• We want to turn text into data for analysis via the application of Natural Language Processing

Applications of NLP

Text Mining Lingo

• Unstructured data vs. structured data
• Corpus: a large and structured set of texts prepared for the purpose of conducting knowledge discovery.
• Tokenization: breaking a sentence into words
• Terms: a single word or multiword phrase extracted directly from the corpus of a specific domain by means of NLP methods.
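
Tokenization as defined above can be sketched in a few lines. This is a minimal illustration, assuming a simple regular-expression rule (lowercase, keep runs of letters, digits and apostrophes); the function name `tokenize` and the sample sentence are made up for the example, and real NLP toolkits use more careful rules.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    # Keep runs of letters, digits and apostrophes; punctuation and spaces are dropped.
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("Text mining turns text into data.")
print(tokens)
```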

Text Mining Lingo


• Word frequency: the number of times a word is found in a specific document.
• Stemming: reducing inflected words to their stem form. E.g., stemming terms such as argue, argued, argues and arguing would result in the stem argue.
• Stop words / noise words / exclusion list: words filtered out prior to processing text. Most NLP tools use a list that includes articles (a, an, the), prepositions (of, etc.) and auxiliary verbs (am, is, are, was, were, etc.).
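
The three ideas above (stop-word filtering, stemming, word frequency) can be combined in a short sketch. Everything here is an assumption for illustration: the stop list is a tiny hand-picked set, and `toy_stem` is a naive suffix stripper, not a real stemming algorithm. It conflates the four forms of "argue" to the truncated stem `argu` rather than the readable form "argue".

```python
import re
from collections import Counter

# Assumed stop list for illustration; real NLP tools ship much longer lists.
STOP_WORDS = {"a", "an", "the", "of", "am", "is", "are", "was", "were"}

# Toy suffix rules, tried in this order. Not a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_frequencies(text: str) -> Counter:
    """Tokenize, drop stop words, stem, and count the resulting stems."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(toy_stem(t) for t in tokens if t not in STOP_WORDS)


freq = stem_frequencies("The critics argue. She argued, he argues, and they are arguing.")
print(freq.most_common(3))
```

All four inflected forms end up counted under one stem, which is the point of stemming before counting frequencies.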

Text Mining Lingo


• Synonyms: syntactically different words (i.e., spelled differently) with identical or at least similar meanings (e.g., movie, film, and motion picture).
• Term-by-document matrix: describes the frequency of terms occurring in the corpus. Rows correspond to documents and columns to terms.
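
A term-by-document matrix with the layout described above (rows are documents, columns are terms) could be built as follows. This is a minimal sketch; the function name and the two sample documents are invented for the example.

```python
import re
from collections import Counter


def term_document_matrix(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return (terms, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [re.findall(r"[a-z']+", d.lower()) for d in docs]
    # Columns: every distinct term in the corpus, in alphabetical order.
    terms = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[t] for t in terms])
    return terms, matrix


docs = ["the movie was good", "the film was good good"]
terms, matrix = term_document_matrix(docs)
print(terms)
for row in matrix:
    print(row)
```

Note that "movie" and "film" land in separate columns: a plain count matrix knows nothing about synonyms.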

Text Mining Lingo


• Bag of Words: an NLP technique for feature extraction from text data. It records the occurrence of words within a document, disregarding grammatical details and word order. Using it, we convert variable-length texts into fixed-length vectors, i.e. each text into its equivalent vector of numbers.
• Document Embedding: the output of the second (sentence-level) attention layer. It aggregates all the sentences that appear in the document, each of which has previously been processed at the word level.

What is Sentiment Analysis (SA) & Opinion Mining (OM)?
• Identify the orientation of opinion in a piece of text
  • “The movie was fabulous!”
  • “The movie stars Mr. X”
  • “The movie was horrible!”
• Can be generalized to a wider set of emotions
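
One very simple way to assign an orientation to the three movie sentences is a word-list lookup. This is only a sketch under loud assumptions: the two word sets are a hypothetical mini-lexicon invented here, and the deck does not say which method it has in mind.

```python
# Hypothetical mini-lexicon; real systems use far larger lexicons or trained models.
POSITIVE = {"fabulous", "good", "great"}
NEGATIVE = {"horrible", "bad", "awful"}


def polarity(text: str) -> str:
    """Label text as positive, negative or neutral by counting lexicon hits."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in ["The movie was fabulous!", "The movie stars Mr. X", "The movie was horrible!"]:
    print(sentence, "->", polarity(sentence))
```

A lookup like this ignores negation and context, so a sentence such as "not good" would be scored as positive.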


Motivation
• Knowing sentiment is a very natural ability of a human being. Can a machine be trained to do it?
• SA aims at getting sentiment-related knowledge, especially from the huge amount of information on the internet
• Can be generally used to understand opinion in a set of documents

Tripod of Sentiment Analysis

• Sentiment Analysis rests on three supporting fields:
  • Cognitive Science
  • Machine Learning
  • Natural Language Processing

• Sentence 1: “This is a good job. I will not miss it for anything”
• Sentence 2: “This is not good at all”
• For this example, the vocabulary consists of only 5 words:
  • good
  • job
  • miss
  • not
  • all
• So, the respective vectors for these sentences are:
  • “This is a good job. I will not miss it for anything” = [1,1,1,1,0]
  • “This is not good at all” = [1,0,0,1,1]
• N-Gram Model: an N-gram is a sequence of N tokens (words): a 2-gram (bigram) is a 2-word sequence and a 3-gram (trigram) is a 3-word sequence.
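
The two vectors and the N-gram definition above can be reproduced with a short sketch. The five-word vocabulary and both sentences come from the example; the function names and the regex tokenizer are assumptions made for illustration.

```python
import re

# Vocabulary from the example, in the same order as the vector positions.
VOCABULARY = ["good", "job", "miss", "not", "all"]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text: str, vocabulary: list[str] = VOCABULARY) -> list[int]:
    """Count how often each vocabulary word occurs; word order is ignored."""
    counts: dict[str, int] = {}
    for token in tokenize(text):
        counts[token] = counts.get(token, 0) + 1
    return [counts.get(term, 0) for term in vocabulary]


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


v1 = bag_of_words("This is a good job. I will not miss it for anything")
v2 = bag_of_words("This is not good at all")
bigrams = ngrams(tokenize("This is not good at all"), 2)
trigrams = ngrams(tokenize("This is not good at all"), 3)
print(v1, v2)
print(bigrams)
```

Both sentences map to vectors of length 5 regardless of their own length. The bigram ("not", "good") keeps the negation that the bag-of-words vector loses.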

• TF-IDF: term frequency–inverse document frequency, a numerical statistic reflecting the importance of a word to a document in a corpus.
• The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
• TF-IDF is one of the most popular term-weighting schemes; a 2015 survey found that 83% of text-based recommender systems in digital libraries used tf–idf.
• A high TF-IDF value indicates that the term doesn’t occur frequently in the collection of documents taken as a whole, but appears quite frequently in a specified document.
• A TF-IDF value close to 0 indicates that the term appears in most documents of the collection, or only rarely in the specific document.
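
The behaviour described above can be checked with one common TF-IDF variant. This sketch assumes term frequency normalised by document length and an unsmoothed natural-log inverse document frequency, log(N / df). Libraries often use smoothed or otherwise different formulas, and the three-document corpus is invented for the example.

```python
import math
import re


def tf_idf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF of `term` in one document, using tf = count / length and idf = ln(N / df)."""
    df = sum(1 for tokens in corpus if term in tokens)  # documents containing the term
    if not doc_tokens or df == 0:
        return 0.0
    tf = doc_tokens.count(term) / len(doc_tokens)
    idf = math.log(len(corpus) / df)
    return tf * idf


documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
corpus = [re.findall(r"[a-z']+", d.lower()) for d in documents]

for term in ["the", "cat", "mat"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In the first document, "the" appears in every document and scores 0, "cat" appears in two of three documents and scores low, and "mat" appears only here and scores highest.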

Documents Semantic Analysis


• Explore document content quickly and efficiently
• Compare subgroup to main group
• Extract keywords
• What are the documents talking about?
• Explore document maps

Text Mining Lingo


• Concordance: finds the queried word in a text and displays the context in which this word is used, giving the analyst the opportunity to view different perspectives on a text.
• It is a generated list of every occurrence of a given word in a digital corpus, together with its context (a certain number of words before and after the keyword).
• The search term and its co-text are arranged so that the textual environment can be assessed and patterns surrounding the search term can be identified visually.
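
A concordance of this kind (keyword plus a fixed number of words on each side) can be sketched as follows. The function name, the bracket notation around the keyword, and the sample text are assumptions made for illustration.

```python
import re


def concordance(text: str, keyword: str, width: int = 3) -> list[str]:
    """List each occurrence of `keyword` with up to `width` words of context per side."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Brackets mark the search term so the surrounding co-text is easy to scan.
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


text = "Text mining finds patterns in text. Data mining finds patterns in tables."
for line in concordance(text, "mining", width=2):
    print(line)
```

This simple version ignores sentence boundaries, so the left context of the second hit spills over from the previous sentence.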

Text Mining Lingo

• Similarity Hashing: computes similarity hashes for the given corpus, allowing the user to find duplicates, plagiarism, textual borrowing or paraphrasing in documents for legal or academic use.
• It numerically scores and indicates how similar two texts are based on their content, structure, or style.
• Text similarity measures can be utilized to perform various tasks which require comparing, matching, or grouping texts based on their similarity or difference.
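
The deck does not name a specific hashing scheme. As one possible illustration, the sketch below uses SimHash, a well-known similarity hash in which similar token sets tend to produce fingerprints that differ in few bits. The 64-bit size, the use of MD5 as the per-token hash, and all function names are assumptions.

```python
import hashlib

BITS = 64  # fingerprint size assumed for this sketch


def _hash64(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(text: str) -> int:
    """Combine per-token hashes into one fingerprint by majority vote on each bit."""
    weights = [0] * BITS
    for token in text.lower().split():
        h = _hash64(token)
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; smaller means more similar fingerprints."""
    return bin(a ^ b).count("1")


original = "text mining finds hidden patterns in large document collections"
near_copy = "text mining finds hidden patterns in big document collections"
print(hamming_distance(simhash(original), simhash(near_copy)))
```

Identical texts always give distance 0. A threshold for calling two documents near-duplicates has to be chosen empirically for the corpus at hand.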

• Additionally, you can find relevant or similar documents, articles, or products based on a query or reference text.
• Texts can also be clustered or categorized into topics, themes, or genres according to their content.
• Furthermore, texts can be summarized by identifying the most important sentences or simplified by measuring their consistency and readability.

Text Mining Lingo

• According to some sources, the average person generates in excess of 2.7 MB of digital data per second, of which 80-90% is unstructured.
• Consider a scenario where a business employs a single individual to review each piece of unstructured data and segment it based on the underlying topic. It would be an impossible task.
• The solution is topic modeling.
• Topic Modeling: a frequently used approach to discover hidden semantic patterns portrayed by a text corpus and automatically identify topics that exist inside it.

Text Mining Lingo


• Topic modeling is a form of statistical modeling that leverages unsupervised machine learning to analyze and identify clusters or groups of similar words within a corpus.
• For example, a topic modeling algorithm may be deployed to determine whether the contents of a document imply it’s an invoice, complaint, or contract.
• Topic modeling aids businesses in:
  • Performing real-time analysis of unstructured textual data
  • Learning from unstructured data at scale
  • Building a consistent understanding of data, regardless of its format.

• Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.
• Latent Dirichlet allocation (LDA) is a Bayesian network (and, therefore, a generative statistical model) for modeling automatically extracted topics in textual corpora. In this model, observations (e.g., words) are collected into documents, and each word's presence is attributable to one of the document's topics. Each document will contain a small number of topics.
• Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset.
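
The SVD step behind LSI can be illustrated at rank 1. The sketch below is a simplification, not a full LSI implementation: it uses plain power iteration to approximate only the largest singular value and its vectors, where a practical system would call a numerical library's SVD and keep several components. The documents-by-terms matrix and its term labels are invented for the example.

```python
import math


def leading_singular_triplet(matrix: list[list[float]], iterations: int = 200):
    """Approximate the largest singular value and its document/term vectors."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms  # start vector over terms
    for _ in range(iterations):
        # u = A v: one coordinate per document
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        # v = A^T u: one coordinate per term, then normalise to unit length
        v = [sum(matrix[i][j] * u[i] for i in range(n_docs)) for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


# Rows = documents, columns = terms (movie, film, stock, market); made-up counts.
A = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
sigma, doc_weights, term_weights = leading_singular_triplet(A)
print(round(sigma, 4), [round(x, 4) for x in doc_weights], [round(x, 4) for x in term_weights])
```

The strongest latent "concept" groups the first two documents with the terms movie and film, while the third document and its terms get weight close to zero.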