Week 12

The document provides an overview of text mining, emphasizing the transformation of unstructured data into structured formats for analysis. It covers foundational concepts in linguistics, natural language processing, and the evolution of large language models, alongside the text mining process, including corpus preparation and sentiment analysis. Additionally, it discusses various analytical techniques such as topic modeling, web mining, and the importance of word relationships in understanding text data.

BUSINESS INTELLIGENCE & ANALYTICS

Introduction to Text Mining
Saji K Mathew, PhD
Professor, Department of Management Studies
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
Overview
} Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
} Text mining: first impose structure on the data, then mine the structured data
} Related disciplines: NLP (computer science), linguistics, cognitive psychology
Linguistics foundations
} Philosophy of language
} Mental representation of language and its expression in written form
} Generative capacity of the mind
1. Rationalist approach (innateness)
} Language is formed in the mind not by the senses but is fixed in advance, presumably by genetic inheritance (Chomsky, 1986)
2. Empiricist approach (behaviourism)
} Language learning is dominated by sensory inputs

Statistical NLP belongs to the second school

Natural Language Processing (NLP) foundations
} A text without context is a pretext: the meaning of a word is defined by the circumstances of its use (Wittgenstein, 1968)

} NLP analyses
} Lexical (word-level meaning): bag of words
} Syntactic (structure connecting words)
} Semantic (meaning as a whole, theme)

“The vodka was good, but the meat was rotten”: reputedly a machine back-translation of “The spirit is willing, but the flesh is weak”, illustrating how word-level processing without semantics loses meaning

The cusp: Large Language Models (LLM)
} Foundation of generative AI, which generates new content
} ChatGPT, DALL-E
} Stochastic parrots? (criticised as lacking connection and composition)
Evolution of LLM
} Bag of words to semantic meaning through Word2Vec (Mikolov et al. 2013); a minimal sketch follows below
} Transformer architecture and deep learning (Vaswani et al. 2017; Wei et al. 2022)
} Overcomes a limitation of the backpropagation (BP) algorithm for ANN training (the vanishing gradient problem)
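Below is a minimal Word2Vec sketch using gensim (an assumed library choice; the slides do not prescribe one). A corpus this small yields noisy vectors; the point is only to show words becoming dense vectors whose semantic similarity can be measured.

# Word2Vec sketch (assumes gensim is installed; toy corpus for illustration)
from gensim.models import Word2Vec

sentences = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
    "the dog chased the ball in the park".split(),
]

# Train tiny embeddings: each word becomes a dense 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)

print(model.wv["king"][:5])                  # first few vector components
print(model.wv.similarity("king", "queen"))  # cosine similarity of embeddings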
Text mining process
Corpus preparation
} Token: a string of contiguous alphanumeric characters with space on either side
} Words, punctuation, numbers
} Tokenization: identification of tokens in a text
} Document: a sequence of N words denoted by w = (w1, w2, …, wN), where wn is the nth word in the sequence
} Corpus: a collection of M documents denoted by D = {w1, w2, …, wM}, where wm is the mth document
} Stemming (cf. lemmatization): strips different forms of a word down to one stem (normalization); see the sketch below
} Go, gone, going → go
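A minimal corpus-preparation sketch, assuming Python with NLTK for stemming (the slides do not prescribe a library); the regex below is a simplification of the token definition above.

# Tokenization and stemming (assumes NLTK; the Porter stemmer needs no extra data)
import re
from nltk.stem import PorterStemmer

def tokenize(text):
    # Simplified token rule: maximal runs of alphanumeric characters
    return re.findall(r"[a-z0-9]+", text.lower())

stemmer = PorterStemmer()

doc = "Going, gone, go: the analysts keep going over the documents."
tokens = tokenize(doc)
stems = [stemmer.stem(t) for t in tokens]
print(tokens)
print(stems)  # inflected forms are stripped toward a common stem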
Term–by–Document Matrix (TDM)

[Matrix of term counts: rows are Documents 1–6, columns are terms such as “investment risk”, “project management”, “software engineering”, “development”, “SAP”, ...; each cell holds how often the term occurs in the document (e.g. one term occurs 3 times in Document 3), and most cells are zero.]

Terms: words, n-grams (a sequence of n words considered together)

Features: groups of terms
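A minimal sketch of building a TDM, assuming scikit-learn (not prescribed by the slides); ngram_range counts bigrams such as "investment risk" as terms alongside single words.

# Build a document x term count matrix (assumes scikit-learn)
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "investment risk and project management",
    "software engineering for development",
    "investment risk, investment risk, investment risk and SAP",
]

# Count unigrams and bigrams; bigrams capture terms like "investment risk"
vectorizer = CountVectorizer(ngram_range=(1, 2))
tdm = vectorizer.fit_transform(docs)  # sparse matrix: rows=documents, cols=terms

print(vectorizer.get_feature_names_out())
print(tdm.toarray())  # term counts per document; most cells are zero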
Descriptive analysis
} Word counts, term frequency (tf), inverse document frequency (idf)
} Similarity between documents
} Correlation between words/n-grams
} Graphical analysis of text structures
} Word counts, word clouds, associations, sentiments…
Importance of words
Two approaches: (a) inclusion/exclusion using stop-words; (b) quantification using tf-idf measures
} Stop-words: words that are not useful for an analysis, typically very common words such as “the”, “of”, “to”, and so forth in English. Stop-words can be added to or removed from source lexicons; a filtering sketch follows below
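A minimal stop-word filtering sketch; the stop list below is a tiny illustrative sample, not a full lexicon.

# Inclusion/exclusion approach: drop very common words before analysis
stop_words = {"the", "of", "to", "a", "and", "was", "but"}  # tiny sample list

tokens = "the vodka was good but the meat was rotten".lower().split()
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)  # ['vodka', 'good', 'meat', 'rotten']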
} Quantifying what a document is about: tf × idf
Term frequency (tf):
tf = (count of the term in a document) / (total number of words in the document)
Inverse document frequency (idf):
idf = ln(n_documents / n_documents containing the term)

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents
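A minimal tf-idf computation in plain Python, following the formulas above; note how a word appearing in every document gets idf = ln(1) = 0.

# tf-idf by hand, matching the definitions above
import math

docs = [
    "the market rewards patient investment".split(),
    "investment risk rises when the market falls".split(),
    "the committee reviewed the budget".split(),
]

def tf(term, doc):
    # term count in the document / total words in the document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # ln(number of documents / number of documents containing the term)
    n_containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / n_containing)

for term in ("the", "investment", "budget"):
    scores = [round(tf(term, d) * idf(term, docs), 3) for d in docs]
    print(term, scores)  # "the" scores 0 everywhere: idf = ln(3/3) = 0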
Relationship between words

} Co-occurrence of words
} n-grams: n successive items; bigrams are commonly used (counts)
} Pair-wise correlation (phi coefficient): how often two words appear together in a section relative to how often they appear individually

The phi coefficient measures how much more likely it is that both word X and word Y appear, or neither does, than that one appears without the other (ranges from -1 to +1); see the sketch below
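A minimal phi-coefficient sketch over toy "sections" (token lists); the 2x2 contingency-table formula is the standard one, the corpus is invented for illustration.

# Phi coefficient from a 2x2 contingency table of section-level co-occurrence
import math

def phi_coefficient(sections, word_x, word_y):
    n11 = n10 = n01 = n00 = 0  # both / only x / only y / neither
    for tokens in sections:
        has_x, has_y = word_x in tokens, word_y in tokens
        if has_x and has_y:
            n11 += 1
        elif has_x:
            n10 += 1
        elif has_y:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

sections = [
    "dark night sky".split(),
    "dark stormy night".split(),
    "bright sunny day".split(),
    "sunny day outside".split(),
]
print(phi_coefficient(sections, "dark", "night"))  # always together: +1.0
print(phi_coefficient(sections, "dark", "sunny"))  # never together: -1.0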
Cosine similarity
} Term frequency vectors are usually sparse (mostly zero elements)
} Zero values shared between two vectors (documents) are not informative; shared non-zero values matter
} Cosine similarity measures the similarity between documents as the cosine of the angle between their term vectors
} 0 means 90º (orthogonal, no similarity); 1 means 0º (full similarity)
[The same term–by–document matrix as above: each document row (Documents 1–6) is a sparse term-frequency vector, and cosine similarity compares these row vectors.]
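A minimal cosine-similarity sketch over term-count vectors such as the rows above (the vectors below are invented for illustration).

# Cosine similarity between two document term-count vectors
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc_a = [1, 1, 0, 0, 0]  # term counts for one document
doc_b = [3, 0, 0, 0, 1]
doc_c = [0, 0, 1, 0, 0]

print(cosine_similarity(doc_a, doc_b))  # share a non-zero term: > 0
print(cosine_similarity(doc_a, doc_c))  # no shared non-zero terms: 0 (90 degrees)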
Sentiment analysis
} aka opinion mining
} Words carry emotions
} Use sentiment datasets (lexicons) to score/classify words
} E.g., AFINN (-5 to +5), Bing (positive/negative), NRC (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust)
} Approaches:
} Word by word, bigram, POS tagging; a word-by-word sketch follows below
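A minimal word-by-word scoring sketch in the AFINN style (-5 to +5); the lexicon below is an illustrative sample, not the real AFINN list.

# Word-by-word sentiment scoring with a tiny AFINN-style lexicon (sample only)
afinn_like = {"good": 3, "great": 4, "bad": -3, "rotten": -4, "joy": 3}

def sentiment_score(text):
    tokens = text.lower().split()
    return sum(afinn_like.get(t, 0) for t in tokens)

print(sentiment_score("The vodka was good but the meat was rotten"))  # 3 - 4 = -1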
The case of “Slumdog Millionaire”
} http://www.youtube.com/watch?v=AIzbwV7on6Q
} http://www.youtube.com/watch?v=LenAIw95L-s
Topic modeling
} Unsupervised document classification technique, similar to clustering
} Latent Dirichlet Allocation (LDA) is a probabilistic approach to topic modelling; a minimal sketch follows below
} Treats each document as a mixture of topics, and each topic as a mixture of words
} Each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
} A two-topic model of news, with one topic for “politics” (PM, parliament, budget) and one for “entertainment” (movies, dance, music). Here words can be shared between topics
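A minimal LDA sketch, assuming scikit-learn as the implementation (not prescribed by the slides); with a corpus this small the fitted topics are only illustrative.

# LDA: documents as mixtures of topics, topics as mixtures of words
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "parliament debates the budget",
    "the pm presents the budget to parliament",
    "a new dance movie with great music",
    "music and dance dominate the movie awards",
]

vectorizer = CountVectorizer(stop_words="english")
tdm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(tdm)  # per-document topic proportions

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
print(doc_topics.round(2))  # rows sum to 1: each document is a topic mixture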
Parts of speech parse tree
Language comprehension: cognitive psychology (Hunt & Ellis, 2004)
} Functions of language
} Speech act
} Question, command, request, etc.
} Propositional content
} Ideas, thoughts, etc. in one sentence
} Thematic structure
} Theme of a speech in a context
} Language structure
} Phonemes (basic units of sound, e.g. vowels) and morphemes (smallest units of meaning: words/word parts)
} Linguistic analyses
} Lexical (word-level meaning)
} Syntactic (structure connecting words)
} Semantic (meaning as a whole, theme)
Web mining
} Data mining efforts on the web (web mining) fall into three categories:
} Content mining
} Mining the actual content of web pages: text, graphics, and video
} Structure mining
} Intra-page structure (tags) and inter-page structure (hyperlinks)
} Usage mining
} Web logs that describe patterns of web use: IP addresses, page references, time stamps
} User profiling
} Users' demographic information
