Week 12
ANALYTICS
Introduction to
Text mining
Saji K Mathew, PhD
Professor, Department of Management Studies
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
Overview
} Unstructured data: Word documents, PDF files, text
excerpts, XML files, and so on
} Text mining – first impose structure on the data, then
mine the structured data
} Related disciplines: NLP (Computer science), Linguistics,
Cognitive psychology
Linguistics foundations
} Philosophy of language
} Mental representation of language, its expression in written
form
} Generative capacity of the mind
1. Rationalist approach (innateness)
} Language is not formed in the mind by the senses but is fixed in
advance, presumably by genetic inheritance (Chomsky, 1986)
2. Empiricist approach (behaviourism)
} Language learning dominated by sensory inputs
} NLP analyses
} Lexical (word level meaning): Bag of words
} Syntactic (structure connecting words)
} Semantic (meaning as a whole, theme)
[Table: word counts per document (Documents 2 to 6); column headings not recoverable]
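A minimal sketch of the bag-of-words idea named above, using only the Python standard library. The three sentences, the `tokenize` helper and the names `bags` and `dtm` are hypothetical and chosen for illustration; they are not taken from the lecture. Each document is reduced to word counts (word order is discarded), and the counts are arranged as one row per document and one column per word.

```python
from collections import Counter
import re

# Hypothetical toy corpus; these sentences are not from the lecture.
documents = {
    "Document 1": "the movie was great great fun",
    "Document 2": "the budget speech was long",
    "Document 3": "great music and great dance in the movie",
}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# One Counter per document: word order is discarded, only counts remain.
bags = {name: Counter(tokenize(text)) for name, text in documents.items()}

# Shared vocabulary, sorted so that the columns have a stable order.
vocabulary = sorted(set().union(*bags.values()))

# Document-term matrix: one row per document, one column per word.
# A Counter returns 0 for words that a document does not contain.
dtm = {name: [bag[word] for word in vocabulary] for name, bag in bags.items()}

print("vocabulary:", vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Most cells in such a matrix are zero, because any one document uses only a small part of the vocabulary.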
} Co-occurrence of words
} n-grams: sequences of n successive items (e.g. words); bigrams
(n = 2) are commonly used, usually as counts
} Pair-wise correlation (phi coefficient): how often two words appear
together in a section relative to how often each appears
without the other
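The two co-occurrence measures above can be sketched as follows. The tokenized sections and the helper names `bigrams` and `phi` are hypothetical. The phi coefficient is computed from the 2x2 table of sections that contain both words, only one of them, or neither.

```python
from collections import Counter
from math import sqrt

# Hypothetical sections (already tokenized); not from the lecture.
sections = [
    ["text", "mining", "finds", "patterns"],
    ["text", "mining", "needs", "structure"],
    ["data", "mining", "finds", "patterns"],
    ["web", "logs", "record", "usage"],
]

def bigrams(tokens):
    """Return the adjacent word pairs in a token sequence."""
    return list(zip(tokens, tokens[1:]))

# Count every bigram across all sections.
bigram_counts = Counter(bg for sec in sections for bg in bigrams(sec))

def phi(word_x, word_y, sections):
    """Phi coefficient of two words over sections (presence or absence)."""
    n11 = n10 = n01 = n00 = 0
    for sec in sections:
        has_x, has_y = word_x in sec, word_y in sec
        if has_x and has_y:
            n11 += 1      # both words present
        elif has_x:
            n10 += 1      # only word_x present
        elif has_y:
            n01 += 1      # only word_y present
        else:
            n00 += 1      # neither word present
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If a word occurs in every section or in none, phi is undefined; return 0.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

print(bigram_counts.most_common(3))
print(phi("finds", "patterns", sections))
print(phi("text", "mining", sections))
```

Bigram counts only capture adjacent words, whereas the phi coefficient also picks up words that share a section without being next to each other.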
Sentiment analysis
} aka opinion mining
} Words carry emotions
} Use sentiment datasets to score/classify words
} E.g. AFINN (-5 to +5), Bing (positive, negative), NRC (positive, negative,
anger, anticipation, disgust, fear, joy, sadness, surprise, and trust)
} Approaches:
} Word by word, bigram, POS tagging
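A minimal sketch of the word-by-word approach, assuming an AFINN-style lexicon with integer scores from -5 to +5. The lexicon entries, the review sentences and the function names are hypothetical; the scores are illustrative and are not copied from the AFINN list.

```python
import re

# Hypothetical AFINN-style lexicon (integer scores from -5 to +5).
# These entries are illustrative and not copied from the AFINN list.
lexicon = {"good": 3, "great": 4, "bad": -3, "terrible": -4, "boring": -2}

def sentiment_score(text, lexicon):
    """Sum the lexicon scores of the words in the text (unknown words score 0)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)

def classify(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical review sentences.
reviews = [
    "A great film with a good cast",
    "Terrible plot and boring songs",
    "The film is not good",
]

for review in reviews:
    score = sentiment_score(review, lexicon)
    print(review, "->", score, classify(score))
```

The third review is scored as positive because the negation "not" is ignored when words are scored one at a time. This is one reason to look at bigrams such as "not good" as well.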
The case of “Slumdog Millionaire”
} http://www.youtube.com/watch?v=AIzbwV7on6Q
} http://www.youtube.com/watch?v=LenAIw95L-s
Topic modelling
} Unsupervised document classification technique, similar
to clustering
} Latent Dirichlet Allocation (LDA) is a probabilistic
approach to topic modelling
} Treats each document as a mixture of topics, and each topic as
a mixture of words
} Each document may contain words from several topics in
particular proportions. For example, in a two-topic model we
could say “Document 1 is 90% topic A and 10% topic B, while
Document 2 is 30% topic A and 70% topic B.”
} For example, a two-topic model of American news could have one
topic for “politics” (PM, parliament, budget) and one for
“entertainment” (movies, dance, music). Words can be shared
between topics
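The mixture idea above can be illustrated with a small calculation. This sketch shows only the mixing assumption behind LDA, not the estimation of topics from data. The topic proportions for the two documents come from the example above; the per-topic word probabilities and all names in the code are hypothetical. The word "budget" is given a non-zero probability under both topics to show a shared word.

```python
# Hypothetical word probabilities for each topic (each topic sums to 1).
# "budget" has non-zero probability in both topics, i.e. it is shared.
topic_words = {
    "A": {"parliament": 0.6, "budget": 0.4, "movies": 0.0, "music": 0.0},
    "B": {"parliament": 0.0, "budget": 0.2, "movies": 0.5, "music": 0.3},
}

# Topic proportions per document, as in the two-topic example above.
doc_topics = {
    "Document 1": {"A": 0.9, "B": 0.1},
    "Document 2": {"A": 0.3, "B": 0.7},
}

def word_distribution(mixture, topic_words):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    vocabulary = next(iter(topic_words.values())).keys()
    return {
        word: sum(weight * topic_words[topic][word] for topic, weight in mixture.items())
        for word in vocabulary
    }

for name, mixture in doc_topics.items():
    dist = word_distribution(mixture, topic_words)
    print(name, {word: round(p, 2) for word, p in dist.items()})
```

Fitting an LDA model works in the opposite direction: given only the observed word counts, it estimates both the topic proportions of each document and the word probabilities of each topic.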
Parts of speech parse tree
Language comprehension: cognitive
psychology (Hunt & Ellis, 2004)
} Functions of language
} Speech act
} Question, command, request etc.
} Propositional content
} Ideas, thoughts etc. in one sentence
} Thematic structure
} Theme of a speech in a context
} Language structure
} Phonemes (basic units of sound, e.g. a vowel) and morphemes (smallest units of meaning, e.g. a word stem or affix)
} Linguistic analyses
} Lexical (word level meaning)
} Syntactic (structure connecting words)
} Semantic (meaning as a whole, theme)
Web mining
} Data mining efforts on the web (web mining) fall into three
categories:
} Content mining
} Mining the real content of web pages covering text, graphics and
videos
} Structure mining
} Intra-page (tags) and inter-page (hyperlinks)
} Usage mining
} Web logs that describe patterns of web use: IP addresses, page
references, time stamps
} User profiling
} User’s demographic information
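A minimal sketch of usage mining on the three log fields named above (IP address, time stamp, page reference). The log format, the sample lines and the names `parse`, `page_hits` and `paths` are hypothetical; real web server logs use other formats and need a matching parser. The IP addresses are taken from ranges reserved for documentation.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines in a simplified "IP timestamp page" format.
log_lines = [
    "192.0.2.1 2024-01-05T10:00:00 /home",
    "192.0.2.1 2024-01-05T10:02:30 /courses",
    "198.51.100.7 2024-01-05T10:05:00 /home",
    "192.0.2.1 2024-01-05T10:06:00 /courses/analytics",
]

def parse(line):
    """Split one log line into (ip, timestamp, page)."""
    ip, stamp, page = line.split()
    return ip, datetime.fromisoformat(stamp), page

records = [parse(line) for line in log_lines]

# How often each page was requested.
page_hits = Counter(page for _, _, page in records)

# Sequence of pages visited by each IP address, in time order.
paths = {}
for ip, stamp, page in sorted(records, key=lambda record: record[1]):
    paths.setdefault(ip, []).append(page)

print(page_hits.most_common())
for ip, pages in paths.items():
    print(ip, " -> ".join(pages))
```

Page counts and per-visitor page sequences are typical starting points for describing how a site is used.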