Text Mining
Sachin Joshi
8/27/16
Logistics
• Course outcomes:
• Python programming proficiency (to the extent of writing stand-alone programs);
• Linguistics basics;
• Mining algorithm in real world applications;
• The ability to tackle more challenging NLP and text mining problems;
• Awareness of the values locked up in text data;
• Both computation- and data-driven thinking (good for big data and
analytics jobs).
• Some running and cool projects to brag about!
CSE 398/498 7
Text data               Non-text data
• Tweets                • Images
• Reviews               • Videos
• Government reports    • Temperatures
• Scientific papers     • Time series
• News                  • Location
• Books                 • Graph/networks
• Messages
What is text mining
[Figure: real world → human beings → text data → real world. Example: opinions (positive, negative, neutral, etc.) for customer relationship management.]
Introduction
Outline
• Sentence segmentation.
• Word tokenization.
• Word normalization:
§ case folding
§ lemmatization
§ stemming.
• Text representations.
A big picture
• Decide the tokens/vocabulary of your corpus.
• Further tasks:
§ word collocation (phrase level)
§ classification, clustering, topic models (document level)
§ syntax parsing (sentence level)
§ semantic analysis: entity resolution, relation detection (all levels)
§ sentiment analysis (all levels).
[Figure: text processing pipeline with the components “… & normalization”, indexing and IR, topic modeling, clustering, classification, sequential model.]
Sentence segmentation
• Why: we want to study properties of sentences sometimes.
• What are the boundaries between sentences? Punctuation marks, but many are ambiguous:
§ Question mark (?)
§ Exclamation mark (!)
§ Semi-colon (;)
§ Period (.): 500.00 dollars, Ph.D.
§ Comma (,): 10,500 dollars
§ Quotation mark (“): “Bye”, I said
§ Ampersand (&): AT&T, Barnes & Noble
• More advanced methods are based on machine learning: check
the surrounding of a punctuation to decide whether it is a
boundary or not.
• NLTK uses Punkt sentence segmenter.
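A minimal rule-based sketch of the boundary decision described above, using only the standard library. It is not the Punkt algorithm: the abbreviation list and the "next character is upper-case" rule are assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list; a real segmenter would learn these from data.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ph.d.", "e.g.", "i.e."}

def split_sentences(text):
    """Split after . ? ! when followed by whitespace and an upper-case letter."""
    sentences, start = [], 0
    for match in re.finditer(r'[.?!]\s+(?=[A-Z"“])', text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

example = "Dr. Smith paid 500.00 dollars. Was it worth it? He thinks so!"
for sentence in split_sentences(example):
    print(sentence)
```

The period in "500.00" is never a candidate because no whitespace follows it, and "Dr." is skipped through the abbreviation list. A learned model replaces both hand-written rules with features of the surrounding context.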
Word tokenization
• A sequence of characters -> sequence of meaningful tokens.
• Example: [Figure: tokenization examples from IIR and from FSNLP.] Notice their difference?
Tokenization
• How to define a valid token is task-dependent.
• A simple space separator is not enough: “San” and “Francisco”? “Mar 2015”?
• Do we care about phrases? “San Francisco”? “New York”?
• Special characters like “$10”? Or hashtag on Twitter “#LU”
• Apostrophes (’): “doesn’t” or “does” and “n’t”? In sentiment analysis, negation is
informative. “rock ‘n’ roll”, “Tom’s place”
§ Commas: “100,000 dollars” or (“100” “000” “dollars”)
§ Hyphens: “soon-to-be” or (“soon” “to” “be”), “Hewlett-Packard”
§ Email addresses, dates, URLs (usually treated separately).
§ In practice, no tokenization is perfect.
§ Instead, it is usually implemented as fast automata programmed via regular
expressions.
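One way to sketch a regular-expression tokenizer for the cases listed above. The pattern and the decision to keep “doesn’t”, “soon-to-be”, “$10” and “100,000” as single tokens are illustrative assumptions; a different task may want different choices.

```python
import re

# Alternatives are tried left to right, so special cases come first.
TOKEN_PATTERN = re.compile(r"""
    https?://\S+                      # URLs
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email addresses
  | [#@]\w+                           # hashtags and mentions
  | \$?\d+(?:,\d{3})*(?:\.\d+)?       # numbers and prices: $10, 100,000, 500.00
  | \w+(?:['’-]\w+)*                  # words, keeping internal apostrophes/hyphens
  | [^\w\s]                           # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

sample = "Email bob@example.com about #LU: it doesn't cost $10, it costs 100,000 dollars."
print(tokenize(sample))
```

Phrases such as “San Francisco” are still split into two tokens here; grouping them is a collocation problem rather than a tokenization rule.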
Case folding
• Usually we want to lower-case the capital letters at the beginning
of a sentence.
• Example: “He went to church” -> “he”, “went”, “to”, “church”
• Counter-examples: “Kennedy was shot”? “USA” -> “usa”?
• Case can be informative.
• “US” -> “us”: country name vs. a pronoun, big loss of information.
• “C.A.T” -> “cat”: company name vs. an animal
• Rule-based: Only lower case the first letter of a sentence and all
words in titles, leaving other things un-touched.
• Machine learning: sequence model with rich features.
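A small sketch of the rule-based option, assuming the sentence is already tokenized. The acronym check (an all-upper-case first token is left alone) is an added assumption, and the last call shows the kind of error that motivates a learned model.

```python
def fold_sentence_start(tokens):
    """Lower-case only the first token, unless it looks like an acronym."""
    if not tokens:
        return []
    first = tokens[0]
    if len(first) > 1 and first.isupper():
        return list(tokens)  # keep acronyms such as "US" or "USA"
    return [first[0].lower() + first[1:]] + list(tokens[1:])

print(fold_sentence_start(["He", "went", "to", "church"]))
print(fold_sentence_start(["US", "troops", "left"]))
print(fold_sentence_start(["Kennedy", "was", "shot"]))  # wrongly folds a name
```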
Lemmatization
• Lemma -> lemmatization
• A lemma is a major entry or base form in an English dictionary.
• Examples:
§ “is”, “are”, “were” share the lemma “be”.
§ “dinner” and “dinners” share the lemma “dinner”
Some information is lost.
“He is reading detective stories.”
Lemmatization
• More formally, we want to break a word into a few parts to
recover the most basic component in the word (morphology).
• A word consists of morphemes /ˈmɔːrfiːmz/
§ stem morphemes: the basic meaning of the word;
§ affix morphemes: added meanings
§ Example: “dog” -> “dog”, “cats” -> “cat” + “s”
§ “organization” -> “organize” -> “organ”
• Morphological parsing is the technical term for this word-breaking
process
Stemming (Porter Stemmer)
• A simple but crude rule-based lemmatization method.
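To make "simple but crude" concrete, here is a toy suffix-stripping stemmer. It is not the Porter stemmer: the rule list and the minimum stem length of three characters are assumptions chosen for illustration, although the plural rules resemble the first step of Porter's algorithm.

```python
# (suffix, replacement) pairs, checked in order; the first match wins.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
    ("ing", ""),
    ("ed", ""),
]

def crude_stem(word):
    """Strip one suffix if the remaining stem has at least three characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word

for w in ["cats", "dinners", "stories", "reading", "class", "is"]:
    print(w, "->", crude_stem(w))
```

The output for “stories” is “stori”, which is not a dictionary word. That is the main difference from lemmatization: a stemmer only needs to map related forms to the same string.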
Text representation
• Vector space (or bag-of-words) models:
§ Word order does not matter (or is lost).
§ Boolean.
§ Term-frequency.
§ Term-frequency inverse-document-frequency.
§ All linear algebra concepts and operations hold in this vector space.
• Sequences of tokens (after text pre-processing).
§ Word order matters and shall be modeled.
Tokens      Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      1                      1               0             0        0         1
Brutus      1                      1               0             1        0         0
Caesar      1                      1               0             1        1         1
Calpurnia   0                      1               0             0        0         0
Cleopatra   1                      0               0             0        0         0
mercy       1                      0               1             1        1         1
worser      1                      0               1             1        1         0
Tokens      Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0
Relevance information is now better preserved: try to find docs containing “Antony”.
Usually we take the log of the frequencies to avoid scaling problems.
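The three weightings can be sketched on a toy collection. The documents below are hypothetical (they are not the Shakespeare counts from the tables), and the log weighting 1 + log10(tf) is one common choice, assumed here.

```python
import math
from collections import Counter

# Hypothetical toy documents used only for illustration.
docs = {
    "d1": "caesar praised brutus and caesar praised antony",
    "d2": "brutus killed caesar",
    "d3": "mercy mercy said the queen",
}

# The vocabulary fixes the order of the vector dimensions.
vocabulary = sorted({term for text in docs.values() for term in text.split()})

def vectorize(text, weighting="tf"):
    """Map a document to a vector with boolean, raw tf, or log-scaled tf weights."""
    counts = Counter(text.split())
    if weighting == "boolean":
        return [int(counts[t] > 0) for t in vocabulary]
    if weighting == "log":
        return [1 + math.log10(counts[t]) if counts[t] else 0.0 for t in vocabulary]
    return [counts[t] for t in vocabulary]

print(vocabulary)
for scheme in ("boolean", "tf", "log"):
    print(scheme, vectorize(docs["d1"], scheme))
```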
Bag of words model
• Vector representation doesn’t consider the ordering of words in
a document
• John is quicker than Mary and Mary is quicker than John have
the same vectors
• This is called the bag of words model.
• In a sense, this is a step back: The positional index was able to
distinguish these two documents.
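The John/Mary example can be checked directly. This sketch uses a `Counter` as the bag of words and a plain token list as the order-preserving view.

```python
from collections import Counter

s1 = "john is quicker than mary".split()
s2 = "mary is quicker than john".split()

# Bag-of-words view: only the counts matter, so the two sentences coincide.
print(Counter(s1) == Counter(s2))

# Sequence view: the token order is kept, so they differ.
print(s1 == s2)
```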
Document frequency
• Rare terms are more informative than frequent terms
• Recall stop words
• Consider a term (e.g., “Calpurnia”) in the query that is rare in
the collection.
Tokens      Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0
Think about finding docs relevant to the query “Caesar” & “Calpurnia”.
Document frequency
• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection (e.g., high,
increase, line)
• A document containing such a term is more likely to be relevant than
a document that doesn’t
• But it’s not a sure indicator of relevance.
• → For frequent terms like high, increase, and line, we want positive weights,
but lower weights than for rare terms.
• Given equal term frequencies, we want higher weights for rare terms.
• We will use document frequency (df) to capture this.
idf weight
• $\mathrm{df}_t$ is the document frequency of t: the number of documents
that contain t
• $\mathrm{df}_t$ is an inverse measure of the informativeness of t
• $\mathrm{df}_t \le N$
• We define the idf (inverse document frequency) of t by
$\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$
• We use $\log(N / \mathrm{df}_t)$ instead of $N / \mathrm{df}_t$ to “dampen” the effect of idf. (Think
about N = 1M, df = 100 and 10.)
idf example, suppose N = 1 million
• There is one idf value for each term t in a collection.
term        df_t        idf_t
calpurnia   1
animal      100
sunday      1,000
fly         10,000
under       100,000
the         1,000,000
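The idf column follows from the definition idf_t = log10(N / df_t). A short sketch that computes it for the document frequencies listed above (the function name `idf` is chosen here for illustration):

```python
import math

N = 1_000_000  # collection size from the example

document_frequency = {
    "calpurnia": 1,
    "animal": 100,
    "sunday": 1_000,
    "fly": 10_000,
    "under": 100_000,
    "the": 1_000_000,
}

def idf(df, n_docs=N):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / df)

for term, df in document_frequency.items():
    print(f"{term:<10} {df:>9,} {idf(df):.1f}")
```

Each factor of 10 in document frequency lowers the idf by exactly 1, which is the dampening effect of the logarithm.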
tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its
idf weight.
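A minimal sketch of the product. The log-scaled tf weight (1 + log10 tf) is an assumption; raw counts or other tf weightings can be substituted. The example numbers combine the count of “Calpurnia” in Julius Caesar (10) with the df and N of the idf example, purely for illustration.

```python
import math

def tf_idf(tf, df, n_docs):
    """tf-idf weight: (1 + log10 tf) * log10(N / df), or 0 if the term is absent."""
    if tf == 0:
        return 0.0
    tf_weight = 1 + math.log10(tf)
    idf_weight = math.log10(n_docs / df)
    return tf_weight * idf_weight

# A rare term that occurs 10 times in the document gets a high weight.
print(tf_idf(tf=10, df=1, n_docs=1_000_000))
# A term that occurs in every document gets weight 0 however frequent it is.
print(tf_idf(tf=500, df=1_000_000, n_docs=1_000_000))
```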
Documents as vectors
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when you
apply this to a web search engine
• These are very sparse vectors - most entries are zero.
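Because most entries are zero, a document vector can be stored as a dictionary of its non-zero terms only. The sketch below uses that representation to compute cosine similarity, one of the similarity measures for this space; the example weights are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

d1 = {"finance": 1.0, "great": 1.0}
d2 = {"finance": 2.0, "great": 2.0}   # same direction, twice the length
d3 = {"movie": 3.0}                   # no terms shared with d1

print(cosine_similarity(d1, d2))
print(cosine_similarity(d1, d3))
```

Only the terms present in both documents contribute to the dot product, so the cost depends on document length rather than on |V|.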
Documents in vector space with two terms
[Figure: documents plotted against two term axes, “Finance” and “Great”.]
Summary
• Low level text processing.
• Bag-of-words or vector space text representations.
• Distance/similarity measures.
• Coming up:
• Classification based on the vector space of terms.
Text Mining
Text Classification
Text classification
• Why text classification
§ Spam detection;
§ Finding relevant documents;
§ Sentiment analysis
Formulation (Supervised learning)
§ Given:
§ A document d (usually in a vector space).
§ A fixed set of classes:
C = {c1, c2,…, cJ}
§ A training set D of documents each with a label in C
§ Determine:
§ A learning method or algorithm which will enable us to learn a classifier f
§ For a test document d, we assign it the class f(d) ∈ C
Classifiers
§ Supervised learning
§ Naive Bayes (simple, common).
§ k-Nearest Neighbors (simple, powerful)
§ Support-vector machines (new, generally more powerful)
§ … plus many other methods
§ No free lunch: requires hand-classified training data
§ But data can be built up (and refined) by amateurs
§ Many commercial systems use a mixture of methods
Classification using bag-of-words
f( “I love this movie! It's sweet, but with satirical humor. The
dialogue is great and the adventure scenes are fun… It
manages to be whimsical and romantic while laughing at the
conventions of the fairy tale genre. I would recommend it to
just about anyone. I've seen it several times, and I'm always
happy to see it again whenever I have a friend who hasn't seen
it yet.” ) = c

f( bag of words ) = c, where the bag of words is:
great      2
love       2
recommend  1
laugh      1
happy      1
...        ...
Features
§ Features = axes in the vector space.
§ Supervised learning classifiers can use any sort of feature
§ URL, email address, punctuation, capitalization, dictionaries, network
features
§ In the bag of words view of documents
§ We use only word features
§ We use all of the words (vocabulary) in the text (not a subset)
Evaluating
§ Evaluation must be done on test data that are independent of
the training data
§ Sometimes use cross-validation (averaging results over multiple
training and test splits of the overall data)
§ Easy to get good performance on a test set that was available
to the learner during training (e.g., just memorize the test set)
§ Measures: precision, recall, F1, classification accuracy
§ Classification accuracy: r/n where n is the total number of test docs and r
is the number of test docs correctly classified
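The measures can be computed from gold and predicted labels as follows. This is a sketch for one class treated as "positive"; the labels in the example are made up.

```python
def evaluate(gold, predicted, positive):
    """Precision, recall and F1 for one class, plus overall accuracy r/n."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": correct / len(pairs)}

gold      = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "spam"]
print(evaluate(gold, predicted, positive="spam"))
```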
A running example
§ Classify webpages from CS departments into:
§ student, faculty, course, project
§ Train on ~5,000 hand-labeled web pages
§ Cornell, Washington, U.Texas, Wisconsin
§ Crawl and classify a new site (CMU) using Naïve Bayes
§ Results
Classification Using Vector Spaces
Sec.14.1
[Figure: documents in a vector space grouped into three classes: Government, Science, Arts.]
Is this similarity hypothesis true in general?
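• Training: represent each class c by a prototype vector computed from its training
documents, e.g. the centroid $\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$,
where $D_c$ is the set of all documents that belong to class c and $\vec{v}(d)$ is the vector
space representation of d.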
• Testing: assign test documents to the category with the closest prototype
vector based on cosine similarity.
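A minimal sketch of prototype-based (Rocchio) classification, assuming the prototype is the centroid of the class's training vectors and that documents are small dense lists. The two-dimensional training data is hypothetical.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train_rocchio(labeled_docs):
    """Return one prototype vector per class from (vector, label) pairs."""
    by_class = defaultdict(list)
    for vector, label in labeled_docs:
        by_class[label].append(vector)
    return {label: centroid(vectors) for label, vectors in by_class.items()}

def classify(prototypes, vector):
    """Assign the class whose prototype has the highest cosine similarity."""
    return max(prototypes, key=lambda label: cosine(prototypes[label], vector))

training = [
    ([2.0, 0.0], "government"), ([3.0, 1.0], "government"),
    ([0.0, 2.0], "arts"), ([1.0, 3.0], "arts"),
]
prototypes = train_rocchio(training)
print(prototypes)
print(classify(prototypes, [4.0, 1.0]))
```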
Rocchio Properties
• Forms a simple generalization of the examples in each class (a
prototype), which may be problematic.
• Prototype vector does not need to be averaged or otherwise
normalized for length since cosine similarity is insensitive to
vector length.
• Classification is based on similarity to class prototypes.
• Does not guarantee classifications are consistent with the given
training data. Why not?
Rocchio Anomaly
• Prototype models have problems with polymorphic (disjunctive)
categories.
k Nearest Neighbor Classification
• kNN = k Nearest Neighbor
• It is a supervised learning method.
• Training: storing the representations of the training examples in D.
• Testing: classify a document d into class c:
§ Define k-neighborhood N as k nearest neighbors of d
§ Count number of documents i in N that belong to c
§ Estimate P(c|d) as i/k
§ Choose as class $\arg\max_c P(c|d)$ [= majority class]
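The testing steps above can be sketched as follows, assuming dense vectors and cosine similarity as the neighbor criterion. The training points are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(training, query, k=3):
    """Return (majority class among the k nearest neighbors, estimated P(c|d))."""
    neighbors = sorted(training, key=lambda item: cosine(item[0], query),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / k

training = [
    ([1.0, 0.0], "science"), ([0.9, 0.1], "science"),
    ([0.0, 1.0], "arts"), ([0.1, 0.9], "arts"),
    ([0.5, 0.5], "government"),
]
print(knn_classify(training, [0.8, 0.2], k=3))
```

"Training" is just storing the list; all the work happens at test time, when the query is compared with every stored example.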
[Figure: a test document among training documents of the classes Government, Science, and Arts. What is P(science | test document)?]
Properties of kNN
• kNN performance sensitive to k.
• When k=1? The result is sensitive to noise (i.e., an error) in the category
label of a single training example.
• A more robust alternative is to find the k most-similar examples
and return the majority category of these k examples.
• When k=the number of all documents?
• Value of k is typically odd to avoid ties; 3 and 5 are most
common.
• Time complexity when testing: O(n|V|), where n is the number of
training documents and |V| is the vocabulary size.
Properties of kNN
• Nearest neighbor method depends on a similarity (or distance)
metric.
• Simplest for a continuous m-dimensional instance space is
Euclidean distance.
• Simplest for an m-dimensional binary instance space is Hamming
distance (number of feature values that differ).
• For text, cosine similarity of tf.idf weighted vectors is typically
most effective.
9/20/16
Optimization for logistic regression
Gradient of log-likelihood
Linear classification
• Many common text classifiers are linear classifiers
• Naïve Bayes
• Perceptron
• Rocchio
• Logistic regression
• Support vector machines (with linear kernel)
• Linear regression with threshold
• Despite this similarity, noticeable performance differences
• For separable problems, there is an infinite number of separating hyperplanes. Which
one do you choose?
• What to do for non-separable problems?
• Different training methods pick different hyperplanes
• Classifiers more powerful than linear often don’t perform better on text problems.
Linear classification
• Can find separating hyperplane by linear programming
(or can iteratively fit solution via perceptron):
• separator can be expressed as ax + by = c
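A sketch of the perceptron option for two-dimensional points. In the notation above, the learned weights correspond to a = w[0], b = w[1] and c = -bias; the data set is hypothetical and linearly separable, and the epoch limit is an assumed safeguard.

```python
def train_perceptron(points, labels, epochs=100):
    """Learn w and bias so that sign(w . x + bias) matches labels in {+1, -1}."""
    w = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for (x, y), label in zip(points, labels):
            activation = w[0] * x + w[1] * y + bias
            if label * activation <= 0:      # misclassified or on the boundary
                w[0] += label * x
                w[1] += label * y
                bias += label
                errors += 1
        if errors == 0:                      # a full pass without mistakes
            break
    return w, bias

points = [(2, 3), (3, 3), (3, 4), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]
w, bias = train_perceptron(points, labels)
print(f"separator: {w[0]}x + {w[1]}y = {-bias}")
```

The perceptron stops at the first separating line it finds. Other linear classifiers pick a different line for the same data, which is where the performance differences come from.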
Text Mining
Clustering
Topics
• Clustering.
• Motivation
• Quality of clustering
• Clustering methods.
• Flat clustering (K-means)
What is clustering
• Clustering: the process of grouping a set of objects into classes
of similar objects
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
• The commonest form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to
supervised data where a classification of examples is given
[Figure: multiple meanings of the word “cluster”; each meaning is represented by a set of documents.]
Helping information retrieval
• Cluster hypothesis - Documents in the same cluster behave similarly with respect to
relevance to information needs
• Therefore, to improve search recall:
• Cluster docs in corpus a priori
• When a query matches a doc D, also return other docs in the cluster containing D
• Hope if we do this: The query “car” will also return docs containing automobile
• Because clustering grouped together docs containing car with those containing
automobile.
Motivating example
• Word clustering
• grouping words with similar topic together
• 5 topics: shopping, tech, tagging, rdf, firefox
Issues of clustering
• How many clusters?
• Do you know that before clustering?
• Too many?
• Too few?
• Which distance measure to adopt? e.g. cosine vs. Euclidean?
Categorization of clustering algorithms
• Based on methodology.
• Flat algorithms
• Usually start with a random (partial) partitioning
• Refine it iteratively
• K means clustering
• (Model based clustering)
• Hierarchical algorithms
• Bottom-up, agglomerative
• (Top-down, divisive)
K-means
• Assumes documents are real-valued vectors.
• Clusters based on centroids (aka the center of gravity or mean)
of points in a cluster, c:
$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$
• Reassignment of instances to clusters is based on distance to
the current cluster centroids.
• (Or one can equivalently phrase it in terms of similarities)
K-Means Algorithm
• Select K random docs {s1, s2,… sK} as seeds.
• Until clustering converges (or other stopping criterion):
• For each doc di:
• Assign di to the cluster cj such that dist(di, sj) is minimal.
• (Next, update the seeds to the centroid of each cluster)
• For each cluster cj
• sj = µ(cj)
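The loop above can be sketched in plain Python. Euclidean distance, the "partition unchanged" stopping rule, the iteration cap and the fixed random seed are assumptions made for this example; the points are hypothetical.

```python
import math
import random

def mean(points):
    """Centroid (component-wise mean) of a non-empty list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # K random docs as seeds
    assignment = None
    for _ in range(max_iters):
        # Reassignment: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:     # partition unchanged: converged
            break
        assignment = new_assignment
        # Recomputation: move each seed to the centroid of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                      # keep the old seed if a cluster is empty
                centroids[j] = mean(members)
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```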
[Figure: K-means iterations: reassign clusters, compute centroids, reassign clusters, converged.]
When to stop?
• Several possibilities, e.g.,
• A fixed number of iterations.
• Doc partition unchanged.
• Centroid positions don’t change.
• Are the last two conditions the same?
Convergence
• Convergence?
• Why should the K-means algorithm ever reach a fixed
point?
• A state in which clusters don’t change.
• K-means is a special case of a general procedure known as
the Expectation Maximization (EM) algorithm.
• EM is known to converge.
• Number of iterations could be large.
• But in practice usually isn’t!
Convergence
• Define goodness measure of cluster k as sum of squared
distances from cluster centroid:
• $G_k = \sum_i (d_i - c_k)^2$ (sum over all $d_i$ in cluster k)
• $G = \sum_k G_k$
• Reassignment monotonically decreases G since each
vector is assigned to the closest centroid.
Convergence of K-Means
• Recomputation monotonically decreases each $G_k$ since
($m_k$ is the number of members in cluster k):
• $\sum_i (d_i - a)^2$ reaches its minimum for:
• $\sum_i -2 (d_i - a) = 0$
• $\sum_i d_i = \sum_i a$
• $m_k a = \sum_i d_i$
• $a = \frac{1}{m_k} \sum_i d_i = c_k$
• K-means typically converges quickly
Sensitivity to seed set selection
• Results can vary based on random seed selection.
• Some seeds can result in poor convergence rate, or convergence
to sub-optimal clusterings.
• Select good seeds using a heuristic (e.g., doc least similar to any existing
mean)
• Try out multiple starting points
• Initialize with the results of another method.
[Figure: example showing sensitivity to seeds, with points A to F.]
In the above, if you start with B and E as centroids you converge to {A,B,C}
and {D,E,F}. If you start with D and F you converge to {A,B,D,E} {C,F}.
How many clusters?
• Number of clusters K is given
• Partition n docs into
predetermined number of
clusters
• Finding the “right” number of
clusters is part of the problem
• Given docs, partition into an
“appropriate” number of subsets.
• E.g., for Google news - we know
the number of clusters (sports,
politics, finance).
Text Mining
Word Collocation
Word collocation
• A phrase is compositional if its meaning can be predicted from
the meaning of its parts
• Collocations have limited compositionality
• there is usually an element of meaning added to the combination
• Ex: strong tea
• Idioms are the most extreme examples of non-compositionality
• Ex: to hear it through the grapevine
Word collocation
• We cannot substitute near-synonyms for the components of a
collocation.
• Strong is a near-synonym of powerful
• strong tea ?powerful tea
• yellow is as good a description of the color of white wines
• white wine ?yellow wine
• Many collocations cannot be freely modified with additional
lexical material or through grammatical transformations
• weapons of mass destruction --> ?weapons of massive destruction
• to be fed up to the back teeth --> ?to be fed up to the teeth in the back
Types of collocations
• Verb particle/phrasal verb constructions
• to go down, to check out,…
• Proper nouns
• John Smith
• Terminological expressions
• concepts and objects in technical domains
• hydraulic oil filter
• Idioms
• to hear it through the grapevine
Why study word collocation
• In natural language generation
• The output should be natural
• make a decision ?take a decision
• In lexicography
• Identify collocations to list them in a dictionary
• To distinguish the usage of synonyms or near-synonyms
• In parsing
• To give preference to most natural attachments
• plastic (can opener) ? (plastic can) opener
• In corpus linguistics and psycholinguistics
• Ex: To study social attitudes towards different types of substances
• strong cigarettes/tea/coffee
• powerful drug
(Near-)Synonyms
• To determine if 2 words are synonyms: principle of substitutability:
• 2 words are synonyms if they can be substituted for one another in
some?/any? sentence without changing the meaning or acceptability of the
sentence
• How big/large is this plane?
• Would I be flying on a big/large or small plane?
• Method:
§ Select the most frequently occurring bigrams (sequence of 2 adjacent
words)
Example
• Except for “New York”, all bigrams are
pairs of function words
• Need some additional information to filter
these out
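A sketch of the frequency method with a simple filter. The stop-word list stands in for the "additional information" mentioned above and is an assumption (a part-of-speech filter is another option); the text is a hypothetical toy corpus.

```python
from collections import Counter

# Hypothetical, very small list of function words.
STOPWORDS = {"the", "of", "in", "to", "a", "and", "is", "on"}

def bigram_counts(tokens):
    """Count all pairs of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

def drop_function_word_bigrams(counts):
    """Keep only bigrams in which neither word is a function word."""
    return Counter({bigram: c for bigram, c in counts.items()
                    if not any(word in STOPWORDS for word in bigram)})

text = ("in the city of new york the mayor of new york "
        "spoke in the city hall of the city")
counts = bigram_counts(text.split())
filtered = drop_function_word_bigrams(counts)

print(counts.most_common(3))     # dominated by pairs with function words
print(filtered.most_common(3))   # "new york" rises to the top
```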
“The portfolio is fine except for the fact that the last
movement of sonata #6 is missing .”
• Assume bi-gram
Applications
• Machine translation, OCR (optical character recognition),
speech recognition, natural language generation.
• Pick the most likely sentence.
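"Pick the most likely sentence" can be sketched with a bigram model, where a sentence's probability is the product of P(word | previous word). The training sentences, the candidates and the add-one smoothing are assumptions made for this example.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Collect bigram counts, history counts and the vocabulary size."""
    history_counts, bigram_counts, vocabulary = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocabulary.update(tokens)
        for w1, w2 in zip(tokens, tokens[1:]):
            history_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return history_counts, bigram_counts, len(vocabulary)

def log_probability(sentence, model):
    """Sum of log P(w2 | w1) with add-one smoothing over the vocabulary."""
    history_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigram_counts[(w1, w2)] + 1) / (history_counts[w1] + vocab_size))
        for w1, w2 in zip(tokens, tokens[1:])
    )

# Hypothetical toy training corpus.
corpus = [
    "it is hard to recognize speech",
    "speech is hard to recognize",
    "it is a nice beach",
]
model = train_bigram_model(corpus)

candidates = ["it is hard to recognize speech",
              "it is hard to wreck a nice beach"]
best = max(candidates, key=lambda s: log_probability(s, model))
print(best)
```

Log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on long sentences.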
What’s the best n?
• Reliability vs discrimination
• “large green ___________”
tree? mountain? frog? car?
• “swallowed the large green ________”
pill? broccoli?
Example
• Estimate what comes next after “monstrous”
• Use the concordance function in NLTK on a corpus:
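NLTK's `Text` objects provide a `concordance` method that prints each occurrence of a word in context. The sketch below reimplements the idea with the standard library so that it runs without NLTK or a downloaded corpus; the token list is a made-up stand-in for a real corpus.

```python
from collections import Counter

def concordance(tokens, target, width=3):
    """Return (left context, word, right context) for each occurrence of target."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def next_word_counts(tokens, target):
    """Count the words that immediately follow target."""
    return Counter(tokens[i + 1] for i in range(len(tokens) - 1)
                   if tokens[i].lower() == target)

# Hypothetical toy text, not a real corpus.
tokens = ("the monstrous size of the whale and the monstrous pictures "
          "of whales and a monstrous size").split()

for left, word, right in concordance(tokens, "monstrous"):
    print(f"{left:>20} [{word}] {right}")
print(next_word_counts(tokens, "monstrous").most_common())
```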