
Text Mining

Research Computing Center

Sachin Joshi
8/27/16

Logistics
• Course outcomes:
• Python programming proficiency (to the extent of writing stand-alone programs);
• Linguistics basics;
• Mining algorithms in real-world applications;
• The ability to tackle more challenging NLP and text mining problems;
• Awareness of the values locked up in text data;
• Both computation- and data-driven thinking (good for big data and
analytics jobs).
• Some running and cool projects to brag about!


What is text mining

• Text mining is the text-data branch of data mining (data science).

Text data:
• Tweets
• Reviews
• Government reports
• Scientific papers
• News
• Books
• Messages

Non-text data:
• Images
• Videos
• Temperatures
• Time series
• Location
• Graph/networks


What is text mining
• Mining as reverse engineering: the real world shapes what human beings think, and their thoughts become text data; mining runs the arrows backwards.
• Infer what the real world is;
• Infer what human beings are thinking;
• Infer the language itself.
• The end goal: decision making!

What is text mining


• A more practical view: text mining sits at the intersection of computing, modeling, and linguistics, all aimed at the real world.

What is text mining
• Example

Text data (a knife review):
"Attractive Nakiri (double bevel). I have been using Shun Classic Santokus and utility knives for almost everything, so I felt a need to try something 'new.' This was it. This blade is quite thin and based on the specs it is made of a good quality steel. It isn't at the very top in terms of hardness, but I hope that also means that it will be less brittle and potentially easier to sharpen."

Knowledge extracted:
• Blade -> thin
• Steel -> good quality
• Blade -> less brittle
• Blade -> easy to sharpen

Decision: Prob(buying) = 95%

Text mining vs. other approaches


• Data Science: focus on data processing, with simple mining
models.
• Data Mining: focus on general mining techniques for more
general data formats.
• AI: a general area, providing some techniques for text mining.
• Database: focuses on structured data, while text data are highly
unstructured.
• Traditional NLP: some techniques can be used for text mining, but it focuses more on text analysis (beating a sentence to death).
Why text mining
• Texts are everywhere!
• Texts have valuable but hidden knowledge.
• Many useful and real-world applications
• Stock market (deep)
• Customer survey (Amazon.com)
• Policy and government (opengov.com)
• Question and answering systems (Baidu’s medical QA)
• Many more …

Stock market prediction


• Predict whether a stock will go up or down using opinions expressed in public forums and news.
• Entities: companies, countries, people, etc.
• Opinions: positive, negative, neutral, etc.
Customer relationship management

• Know what the customers like and don't like about a product.
• Possibly recommend alternative products.
• Sway customers' opinions via incentives.
• Retain leaving customers.

Scientific literature management


• Categorization of publications;
• Information retrieval;
• Discovering scientific hypothesis;
• Influential paper discovery;
• Trending topics for research
Question answering

How to do text mining

• Tools:
§ Computers: store and process big text data (programming and data structures).
§ Linguistics: human knowledge about syntax, semantics, etc. (of English).
§ Statistical and machine learning models: infer the hidden knowledge about the real world in the text data (calculus, linear algebra, probability and statistics).
Text Mining

Introduction

Outline
• Sentence segmentation.
• Word tokenization.
• Word normalization:
§ case folding
§ lemmatization
§ stemming.
• Text representations.
A big picture
• Decide the tokens/vocabulary of your corpus.
• Further tasks:
§ word collocation (phrase level)
§ classification, clustering, topic models (document level)
§ syntax parsing (sentence level)
§ semantic analysis: entity resolution, relation detection (all levels)
§ sentiment analysis (all levels).

Text processing pipeline

• Get the tokens! Tokenization & normalization feed every downstream representation and task:
§ vector space model
§ indexing and IR
§ topic modeling
§ clustering
§ classification
§ sequential models
Sentence segmentation
• Why: we sometimes want to study properties of sentences.
• What are the boundaries between sentences? Punctuation:
§ Question mark: "?"
§ Exclamation (!)
§ Semi-colon (;)
§ Period (.): but consider "500.00 dollars", "Ph.D."
§ Comma (,): "10,500 dollars"
§ Quotation ("): "Bye", I said
§ Ampersand (&): AT&T, Barnes & Noble
• More advanced methods are based on machine learning: check the surroundings of a punctuation mark to decide whether it is a boundary or not.
• NLTK uses the Punkt sentence segmenter.
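
A minimal sketch with NLTK's sent_tokenize, which wraps the pre-trained Punkt models (the example text is made up):

import nltk
nltk.download('punkt', quiet=True)  # fetch the pre-trained Punkt models
from nltk.tokenize import sent_tokenize

text = 'I paid $500.00 for the Ph.D. gown. "Bye," I said. Was it worth it?'
for sentence in sent_tokenize(text):
    print(sentence)
# Punkt knows "500.00" and "Ph.D." do not end sentences.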

Word tokenization
• A sequence of characters -> sequence of meaningful tokens.
• Example:

(Figure: the same sentence tokenized following IIR and following FSNLP; notice the difference between the two conventions.)
Tokenization
• How to define a valid token is task-dependent.
• A simple space separator is not enough: “San” and “Francisco”? “Mar 2015”?
• Do we care about phrases? “San Francisco”? “New York”?
• Special characters like “$10”? Or hashtag on Twitter “#LU”
• Apostrophes (’): “doesn’t” or “does” and “n’t”? In sentiment analysis, negative is
informative. “rock ‘n’ roll”, “Tom’s place”
§ Commas: “100, 000 dollars” or (“100” “000” “dollars”)
§ Hyphens: “soon-to-be” or (“soon” “to” “be”), “Hewlett-Packard”
§ Email addresses, dates, URLs (Usually treated separately).
§ In practice, no tokenization is perfect.
§ Instead, tokenization is usually done by fast automata programmed via regular expressions, as sketched below.
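
As a rough illustration of the regular-expression approach, a toy tokenizer; the pattern is an illustrative assumption, not a complete rule set:

import re

pattern = r"""(?x)                  # verbose mode
      \$?\d+(?:,\d{3})*(?:\.\d+)?   # money and numbers: $10, 100,000, 500.00
    | \#?\w+(?:['-]\w+)*            # words, hashtags (#LU), soon-to-be, Tom's
    | [.,;!?"()]                    # punctuation as separate tokens
"""
print(re.findall(pattern, "Soon-to-be #LU grads paid $10 at Tom's place."))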

Stop word removal


• Stop word list: a precompiled list of very frequent words.

• To remove or not to remove, that's the question:
• Some stop words have little meaning of their own: "the", "a", "for".
• Stop word removal can reduce data size.
• But stop words are critical elements in syntax and semantics. NLP usually keeps the stop words to facilitate the analysis of a whole sentence.
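
A minimal sketch of list-based removal with NLTK's built-in English stop word list (the token list is made up):

import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
tokens = ['the', 'blade', 'is', 'quite', 'thin', 'and', 'easy', 'to', 'sharpen']
content = [t for t in tokens if t not in stop]
print(content)  # the function words "the", "is", "and", "to" drop out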
Word normalization
• After tokenization, we may have two words that can belong to
the same class.
• Turn multiple words into a single class (may be incorrect).
• Examples:
• "is" and "was" may be considered equivalent.
• "USA" and "U.S.A." have the same meaning.
• There are various kinds of word normalization:
§ case folding.
§ lemmatization.
§ stemming.
§ semantic links (“auto” and “car”).

Case folding
• Usually we want to lower-case the capital letter at the beginning of a sentence.
• Example: “He went to church” -> “he”, “went”, “to”, “church”
• Counter-examples: “Kennedy was shot”? “USA” -> “usa”?
• Case can be informative.
• “US” -> “us”: country name vs. a pronoun, big loss of information.
• “C.A.T” -> “cat”: company name vs. an animal
• Rule-based: only lower-case the first letter of a sentence and all words in titles, leaving everything else untouched.
• Machine learning: sequence model with rich features.
Lemmatization
• Lemma -> lemmatization
• A lemma is a major entry or base form in an English dictionary.
• Examples:
§ “is”, “are”, “were” share the lemma “be”.
§ “dinner” and “dinners” share the lemma “dinner”
• Some information is lost: "He is reading detective stories." -> "He be read detective story."

Lemmatization
• More formally, we want to break a word into a few parts to
recover the most basic component in the word (morphology).
• A word consists of morphemes (pronounced /ˈmɔːrfiːmz/):
§ stem morphemes: the basic meaning of the word;
§ affix morphemes: added meanings.
§ Example: "dog" -> "dog", "cats" -> "cat" + "s"
§ "organization" -> "organize" -> "organ"
• Morphological Parsing is the technical term for this word
breaking process
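
A small sketch with NLTK's WordNet lemmatizer; note that it needs a part-of-speech hint to lemmatize verbs:

import nltk
nltk.download('wordnet', quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('dinners'))       # dinner (nouns are the default)
print(lemmatizer.lemmatize('are', pos='v'))  # be
print(lemmatizer.lemmatize('stories'))       # story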
Stemming (Porter Stemmer)
• A simple but crude rule-based lemmatization method.

• A word is passed through the stemmer multiple times, with the output of the last pass as the input to the current pass.
• Can have a lot of errors: a small test
• “organization” -> “organize” -> “organ”
• “noisy” -> “noise”

Example of different stemmers


Stemming vs. lemmatization
• Stemming is crude and rule-based.
• Lemmatization involves a dictionary and morphological analysis
of words. (requiring more linguistic knowledge).
• Example:
• Stemming: “saw” -> “s”
• Lemmatization: “saw” -> “see” (verb) or “saw” (noun)

• When is stemming useful and when is it harmful? Examples?

Text representation
• Vector space (or bag-of-words) models:
§ Word order does not matter (it is lost).
§ Boolean.
§ Term-frequency.
§ Term-frequency inverse-document-frequency (tf-idf).
§ All linear algebra concepts and operations hold in this vector space.
• Sequences of tokens (after text pre-processing):
§ Word order matters and shall be modeled.
§ Requires a set of math tools beyond linear algebra.
A corpus as a Boolean matrix

Tokens \ Documents: Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony      1 1 0 0 0 1
Brutus      1 1 0 1 0 0
Caesar      1 1 0 1 1 1
Calpurnia   0 1 0 0 0 0
Cleopatra   1 0 0 0 0 0
mercy       1 0 1 1 1 1
worser      1 0 1 1 1 0

Each document is represented by a binary vector ∈ {0,1}^|V|.

Issues with such a text representation? Think about finding relevant docs using the keyword "Antony".

Term frequency matrix


• Consider the number of occurrences of a term in a document:
• Each document is a count vector in ℕ^|V| (a column below).

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 157 73 0 0 0 0
Brutus 4 157 0 1 0 0
Caesar 232 227 0 2 1 1
Calpurnia 0 10 0 0 0 0
Cleopatra 57 0 0 0 0 0
mercy 2 0 3 5 5 1
worser 2 0 1 1 1 0

Relevance information is now better preserved: try to find docs containing "Antony".
Usually we take the log of the frequencies to avoid scaling problems.
Bag of words model
• Vector representation doesn’t consider the ordering of words in
a document
• "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
• This is called the bag of words model.
• In a sense, this is a step back: The positional index was able to
distinguish these two documents.

Document frequency
• Rare terms are more informative than frequent terms
• Recall stop words
• Consider a term (e.g., “Calpurnia”) in the query that is rare in
the collection.
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 157 73 0 0 0 0
Brutus 4 157 0 1 0 0
Caesar 232 227 0 2 1 1
Calpurnia 0 10 0 0 0 0
Cleopatra 57 0 0 0 0 0
mercy 2 0 3 5 5 1
worser 2 0 1 1 1 0

Think about finding docs relevant to the query “Caesar” & “Calpurnia”.
Document frequency
• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection (e.g., high,
increase, line)
• A document containing such a term is more likely to be relevant than
a document that doesn’t
• But it’s not a sure indicator of relevance.
• → For frequent terms like high, increase, and line, we still want positive weights,
• but lower weights than for rare terms.
• We will use document frequency (df) to capture this.

idf weight
• dft is the document frequency of t: the number of documents
that contain t
• dft is an inverse measure of the informativeness of t
• dft ≤ N
• We define the idf (inverse document frequency) of t by
idft = log10 ( N/dft )
• We use log(N/dft) instead of N/dft to "dampen" the effect of idf. (Think about N = 1M with dft = 100 vs. dft = 10.)
idf example, suppose N = 1 million
• There is one idf value for each term t in a collection.
term        dft         idft
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

idft = log10 ( N/dft )

tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its
idf weight.

w_{t,d} = log(1 + tf_{t,d}) × log10(N / df_t)


• Best known weighting scheme in information retrieval
• Note: the “-” in tf-idf is a hyphen, not a minus sign!
• Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
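
A minimal sketch of this weighting on a made-up toy corpus (base-10 logs on both factors, one common choice):

import math
from collections import Counter

docs = [['antony', 'brutus', 'caesar', 'caesar'],
        ['brutus', 'caesar', 'calpurnia'],
        ['mercy', 'worser', 'caesar']]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tf_idf(term, doc):
    tf = doc.count(term)
    if tf == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(N / df[term])

print(tf_idf('caesar', docs[0]))     # appears in every doc, so idf = 0
print(tf_idf('calpurnia', docs[1]))  # rare term gets a positive weight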
Binary → count → weight matrix

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 5.25 3.18 0 0 0 0.35


Brutus 1.21 6.1 0 1 0 0
Caesar 8.59 2.54 0 1.51 0.25 0
Calpurnia 0 1.54 0 0 0 0
Cleopatra 2.85 0 0 0 0 0
mercy 1.51 0 1.9 0.12 5.25 0.88
worser 1.37 0 0.11 4.15 0.25 1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Documents as vectors
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when you
apply this to a web search engine
• These are very sparse vectors - most entries are zero.
Documents in vector space with two terms

(Figure: documents plotted along two term axes, "Finance" and "Great".)

All concepts and operations in linear algebra apply

• Distance between two vectors.
• Angle between two vectors (cosine similarity).

Distance vs. similarity

Notation: di and q are all documents.
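
A minimal sketch of cosine similarity over tf-idf vectors (the weights are toy values):

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

d1 = [5.25, 1.21, 8.59]  # toy tf-idf weights for three terms
d2 = [3.18, 6.10, 2.54]
print(cosine(d1, d2))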

Summary
• Low level text processing.
• Bag-of-words or vector space text representations.
• Distance/similarity measures.

• Coming up:
• Classification based on the vector space of terms.
Text Mining

Text Classification

Text classification
• Why text classification
§ Spam detection;
§ Finding relevant documents;
§ Sentiment analysis
Formulation (Supervised learning)

§ Given:
§ A document d (usually in a vector space).
§ A fixed set of classes:
C = {c1, c2,…, cJ}
§ A training set D of documents each with a label in C
§ Determine:
§ A learning method or algorithm which will enable us to learn a classifier f
§ For a test document d, we assign it the class f(d) ∈ C

Classifiers
§ Supervised learning
§ Naive Bayes (simple, common).
§ k-Nearest Neighbors (simple, powerful)
§ Support-vector machines (new, generally more powerful)
§ … plus many other methods
§ No free lunch: requires hand-classified training data
§ But data can be built up (and refined) by amateurs
§ Many commercial systems use a mixture of methods
Classification using bag-of-words

f( "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet." ) = c

Classification using bag-of-words

f( great: 2, love: 2, recommend: 1, laugh: 1, happy: 1, ... ) = c
Features
§ Features = axes in the vector space.
§ Supervised learning classifiers can use any sort of feature
§ URL, email address, punctuation, capitalization, dictionaries, network
features
§ In the bag of words view of documents
§ We use only word features
§ we use all of the words (vocabulary) in the text (not a subset)

Feature Selection: Why?


§ Text collections have a large number of features
§ 10,000 – 1,000,000 unique words … and more
Feature Selection: Why?
§ Selection may make a particular classifier feasible
§ Some classifiers can’t deal with 1,000,000 features
§ Reduces training time
§ Training time for some methods is quadratic or worse in the number of features
§ Makes runtime models smaller and faster
§ Can improve generalization (performance)
§ Eliminates noise features
§ Avoids overfitting

Evaluating
§ Evaluation must be done on test data that are independent of
the training data
§ Sometimes use cross-validation (averaging results over multiple
training and test splits of the overall data)
§ Easy to get good performance on a test set that was available
to the learner during training (e.g., just memorize the test set)
§ Measures: precision, recall, F1, classification accuracy
§ Classification accuracy: r/n where n is the total number of test docs and r
is the number of test docs correctly classified
A running example
§ Classify webpages from CS departments into:
§ student, faculty, course, project
§ Train on ~5,000 hand-labeled web pages
§ Cornell, Washington, U.Texas, Wisconsin
§ Crawl and classify a new site (CMU) using Naïve Bayes

§ Results (figure).
Classification Using Vector Spaces

§ In vector space classification, training set corresponds to a


labeled set of points (equivalently, vectors)
§ Premise 1: Documents in the same class form a contiguous
region of space
§ Premise 2: Documents from different classes don’t overlap
(much)
§ Learning a classifier: build surfaces to delineate classes in the
space

Documents in a Vector Space

(Figure: documents from three classes, Government, Science, and Arts, occupy separate regions of the vector space.)

Test Document of what class?

(Figure: a new test document is plotted among the three class regions.)

Test Document = Government

(Figure: the test document falls in the Government region. Is this similarity hypothesis true in general?)

Our focus: how to find good separators


Rocchio Classifier
• Training:
• Use standard tf-idf weighted vectors to represent text documents
• For training documents in each category, compute a prototype vector by
summing the vectors of the training documents in the category.
• Prototype = centroid of members of class:

µ(c) = (1/|Dc|) Σ_{d ∈ Dc} v(d)

• where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.
• Testing: assign test documents to the category with the closest prototype
vector based on cosine similarity.
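
A minimal NumPy sketch of Rocchio training and testing under these definitions; the tiny tf-idf matrix and labels are made up:

import numpy as np

def train_rocchio(X, y):
    # one prototype (the centroid of the tf-idf vectors) per class
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def rocchio_classify(x, prototypes):
    # the class whose prototype is closest in cosine similarity wins
    return max(prototypes, key=lambda c: cosine(x, prototypes[c]))

X = np.array([[1., 0., 2.], [2., 0., 1.], [0., 3., 1.], [0., 2., 2.]])
y = np.array(['gov', 'gov', 'arts', 'arts'])
print(rocchio_classify(np.array([0., 1., 2.]), train_rocchio(X, y)))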

Illustration of Rocchio Text Categorization



Rocchio Properties
• Forms a simple generalization of the examples in each class (a prototype); this may be problematic.
• The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.
• Does not guarantee that classifications are consistent with the given training data. Why not?


Rocchio Anomaly
• Prototype models have problems with polymorphic (disjunctive)
categories.


k Nearest Neighbor Classification
• kNN = k Nearest Neighbor
• It is a supervised learning method.
• Training: storing the representations of the training examples in D.
• Testing: classify a document d into class c:
§ Define k-neighborhood N as k nearest neighbors of d
§ Count number of documents i in N that belong to c
§ Estimate P(c|d) as i/k
§ Choose as class argmaxc P(c|d) [ = majority class]
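
A compact NumPy sketch of the testing step, using cosine similarity over tf-idf rows (names and data shapes are illustrative):

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    # cosine similarity of the test doc against every training doc
    sims = (X_train @ x) / (np.linalg.norm(X_train, axis=1)
                            * np.linalg.norm(x) + 1e-12)
    top_k = np.argsort(sims)[-k:]          # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in top_k)
    return votes.most_common(1)[0][0]      # majority class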

Example: k = 6 (6NN)

(Figure: a test document with its 6 nearest neighbors among Government, Science, and Arts documents; estimate P(science | test doc) from the neighbor counts.)
Properties of kNN
• kNN performance sensitive to k.
• When k=1?
• Noise (i.e., an error) in the category label of a single training example.
• More robust alternative is to find the k most-similar examples
and return the majority category of these k examples.
• When k=the number of all documents?
• Value of k is typically odd to avoid ties; 3 and 5 are most
common.
• Time complexity when testing: O(n|V|), where n is the number of training documents and |V| is the vocabulary size.

Properties of kNN
• Nearest neighbor method depends on a similarity (or distance)
metric.
• Simplest for a continuous m-dimensional instance space: Euclidean distance.
• Simplest for an m-dimensional binary instance space: Hamming distance (number of feature values that differ).
• For text, cosine similarity of tf.idf weighted vectors is typically
most effective.

Illustration of 3 Nearest Neighbor for Text Vector Space


kNN vs. Rocchio


• Nearest Neighbor tends to handle polymorphic categories better
than Rocchio.

Why can kNN handle this?


Optimization for logistic regression

• Likelihood function: L(w) = ∏_i p_i^{y_i} (1 − p_i)^{1 − y_i}, where p_i = σ(w·x_i).
• Log-likelihood function: ℓ(w) = Σ_i [ y_i log p_i + (1 − y_i) log(1 − p_i) ].
• Gradient of the log-likelihood: ∇ℓ(w) = Σ_i (y_i − p_i) x_i.
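
A small NumPy sketch of gradient ascent on this log-likelihood (the data values are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_grad(w, X, y):
    # gradient of the log-likelihood: sum_i (y_i - p_i) x_i
    p = sigmoid(X @ w)
    return X.T @ (y - p)

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # first column: bias term
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
for _ in range(100):                 # a few steps of gradient ascent
    w += 0.1 * log_likelihood_grad(w, X, y)
print(w)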

Linear classification
• Many common text classifiers are linear classifiers
• Naïve Bayes
• Perceptron
• Rocchio
• Logistic regression
• Support vector machines (with linear kernel)
• Linear regression with threshold
• Despite this similarity, noticeable performance differences
• For separable problems, there is an infinite number of separating hyperplanes. Which
one do you choose?
• What to do for non-separable problems?
• Different training methods pick different hyperplanes
• Classifiers more powerful than linear often don’t perform better on text problems.
Linear classification
• Can find separating hyperplane by linear programming
(or can iteratively fit solution via perceptron):
• separator can be expressed as ax + by = c

Find a,b,c, such that


ax + by > c for red points
ax + by < c for blue points.

Example linear text classifier

• Class: "interest" (as in interest rate)
• Example features of a linear classifier:

wi      ti              wi      ti
0.70    prime           −0.71   dlrs
0.67    rate            −0.35   world
0.63    interest        −0.33   sees
0.60    rates           −0.25   year
0.46    discount        −0.24   group
0.43    bundesbank      −0.24   dlr

• To classify, find the dot product of the feature vector and the weights.

Which hyperplane?
• Lots of possible solutions for a,b,c.
• Some methods find a separating
hyperplane, but not the optimal one
[according to some criterion of expected goodness]
• E.g., perceptron
• Most methods find an optimal separating
hyperplane
• Which points should influence optimality?
• All points
• Linear/logistic regression
• Naïve Bayes
• Only “difficult points” close to decision
boundary
• Support vector machines

Properties of text classification


• High dimensional data: thousands or millions of features, some
relevant, many are irrelevant
• Documents are zero along almost all axes
• Most document pairs are very far apart (i.e., not strictly
orthogonal, but only share very common words and a few
scattered others)
• In classification terms: often document sets are separable, for
most any classification
• This is part of why linear classifiers are quite successful in this
domain
More than one class
• Multi-labeled classification
• A document can belong to 0, 1, or >1 classes.
• Decompose into n binary problems
• Quite common for documents
• Multi-class classification
• Classes are mutually exclusive.
• Each document belongs to exactly one class
• E.g., digit recognition. Digits are mutually exclusive
One-vs-all classification
• Build a separator between each class and its complementary
set (docs from all other classes).
• Given test doc, evaluate it for membership in each class.
• Assign document to class with:
• maximum score(s)
• maximum confidence(s)
• maximum probability (probabilities)
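
A minimal sketch of the one-vs-all scheme; make_binary_clf stands for any binary classifier factory exposing fit and decision_function (scikit-learn's LinearSVC, for example, fits this interface):

import numpy as np

class OneVsRest:
    """Train one binary separator per class; the class with max score wins."""
    def __init__(self, make_binary_clf):
        self.make_binary_clf = make_binary_clf  # binary classifier factory
        self.models = {}

    def fit(self, X, y):
        for c in np.unique(y):
            clf = self.make_binary_clf()
            clf.fit(X, (y == c).astype(int))    # class c vs. its complement
            self.models[c] = clf

    def predict(self, X):
        classes = list(self.models)
        scores = np.stack([self.models[c].decision_function(X)
                           for c in classes], axis=1)
        return [classes[i] for i in scores.argmax(axis=1)]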
Text Mining

Clustering

Topics
• Clustering.
• Motivation
• Quality of clustering
• Clustering methods.
• Flat clustering (K-means)
What is clustering
• Clustering: the process of grouping a set of objects into classes
of similar objects
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
• The commonest form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to
supervised data where a classification of examples is given

A data set with clear cluster structure


• Grouping the following points into 3 groups (clusters), based on
similarity/distance.
Motivating example
• Document clustering: words can have multiple meanings.
• (Figure: the word "cluster" has multiple meanings; each meaning is represented by a set of documents.)
Helping information retrieval
• Cluster hypothesis - Documents in the same cluster behave similarly with respect to
relevance to information needs
• Therefore, to improve search recall:
• Cluster docs in corpus a priori
• When a query matches a doc D, also return other docs in the cluster containing D
• Hope if we do this: The query “car” will also return docs containing automobile
• Because clustering grouped together docs containing car with those containing
automobile.

Cluster for “car” and “automobile”

Motivating example
• Word clustering: grouping words with similar topics together.
• (Figure: word clusters for 5 topics: shopping, tech, tagging, rdf, firefox.)
Issues of clustering
• How many clusters?
• Do you know that before clustering?
• Too many?
• Too few?
• Which distance measure to adopt? e.g. cosine vs. Euclidean?
Categorization of clustering algorithms
• Based on methodology.
• Flat algorithms
• Usually start with a random (partial) partitioning
• Refine it iteratively
• K means clustering
• (Model based clustering)
• Hierarchical algorithms
• Bottom-up, agglomerative
• (Top-down, divisive)

Categorization of clustering algorithms


• Based on results.
• Hard clustering: Each document belongs to exactly one cluster
• More common and easier to do
• Soft clustering: A document can belong to more than one cluster.
• Makes more sense for applications like creating browsable hierarchies
• You may want to put a pair of sneakers in two clusters: (i) sports apparel and
(ii) shoes
• You can only do that with a soft clustering approach.
Partitioning Algorithms
• Partitioning method: Construct a partition of n documents
into a set of K clusters
• Given: a set of documents and the number K
• Find: a partition of K clusters that optimizes the chosen
partitioning criterion
• Globally optimal solutions are intractable for many objective functions:
• they would require exhaustively enumerating all partitions.
• Effective heuristic methods: K-means and K-medoids
algorithms

K-means
• Assumes documents are real-valued vectors.
• Clusters based on centroids (aka the center of gravity or mean)
of points in a cluster, c:
µ(c) = (1/|c|) Σ_{x ∈ c} x
• Reassignment of instances to clusters is based on distance to
the current cluster centroids.
• (Or one can equivalently phrase it in terms of similarities)
K-Means Algorithm
• Select K random docs {s1, s2,… sK} as seeds.
• Until clustering converges (or other stopping criterion):
• For each doc di:
• Assign di to the cluster cj such that dist(xi, sj) is minimal.
• (Next, update the seeds to the centroid of each cluster)
• For each cluster cj
• sj = µ(cj)
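
A minimal NumPy sketch of the algorithm with Euclidean distance (the empty-cluster corner case is ignored for brevity):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random seed docs
    for _ in range(n_iters):
        # assignment step: each doc goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids unchanged: converged
            break
        centroids = new_centroids
    return labels, centroids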

A running example (K=2)

(Figure: pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged!)
When to stop?
• Several possibilities, e.g.,
• A fixed number of iterations.
• Doc partition unchanged.
• Centroid positions don’t change.
• Are the last two conditions the same?

Convergence
• Convergence?
• Why should the K-means algorithm ever reach a fixed
point?
• A state in which clusters don’t change.
• K-means is a special case of a general procedure known as
the Expectation Maximization (EM) algorithm.
• EM is known to converge.
• Number of iterations could be large.
• But in practice usually isn’t!
Convergence
• Define goodness measure of cluster k as sum of squared
distances from cluster centroid:
• Gk = Σ_{di ∈ cluster k} (di − ck)²
• G = Σ_k Gk
• Reassignment monotonically decreases G since each
vector is assigned to the closest centroid.

Convergence of K-Means
• Recomputation monotonically decreases each Gk, since (with mk the number of members in cluster k) Σ (di − a)² reaches its minimum when:
• Σ −2(di − a) = 0
• Σ di = mk a
• a = (1/mk) Σ di = ck
• K-means typically converges quickly.
Sensitivity to seed set selection
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
• Select good seeds using a heuristic (e.g., the doc least similar to any existing mean).
• Try out multiple starting points.
• Initialize with the results of another method.
• (Figure: example showing sensitivity to seeds. With six points A-F, starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.)
How many clusters?
• Number of clusters K is given
• Partition n docs into
predetermined number of
clusters
• Finding the “right” number of
clusters is part of the problem
• Given docs, partition into an
“appropriate” number of subsets.
• E.g., for Google news - we know
the number of clusters (sports,
politics, finance).
Text Mining

Word
Collocation

What is word collocation


• "An expression consisting of two or more words that correspond to some conventional way of saying things."
• "Collocations of a given word are statements of the habitual or customary places of that word."
• Examples: "stiff breeze", "strong tea", "powerful drug", "broad daylight", "weapons of mass destruction", "make up", "check in".
What is word collocation
• (Choueka, 1988)
[A collocation is defined as] “a sequence of two or more consecutive words, that
has characteristics of a syntactic and semantic unit, and whose exact and
unambiguous meaning or connotation cannot be derived directly from the
meaning or connotation of its components."
• Criteria:
• non-compositionality
• non-substitutability
• non-modifiability
• non-translatable word for word

Word collocation
• A phrase is compositional if its meaning can be predicted from
the meaning of its parts
• Collocations have limited compositionality
• there is usually an element of meaning added to the combination
• Ex: strong tea
• Idioms are the most extreme examples of non-compositionality
• Ex: to hear it through the grapevine
Word collocation
• We cannot substitute near-synonyms for the components of a
collocation.
• Strong is a near-synonym of powerful
• strong tea ?powerful tea
• yellow is as good a description of the color of white wines
• white wine ?yellow wine
• Many collocations cannot be freely modified with additional
lexical material or through grammatical transformations
• weapons of mass destruction --> ?weapons of massive destruction
• to be fed up to the back teeth --> ?to be fed up to the teeth in the back

Types of collocations
• Verb particle/phrasal verb constructions
• to go down, to check out,…
• Proper nouns
• John Smith
• Terminological expressions
• concepts and objects in technical domains
• hydraulic oil filter
• Idioms
• to hear it through the grapevine.
Why study word collocation
• In natural language generation
• The output should be natural
• make a decision ?take a decision
• In lexicography
• Identify collocations to list them in a dictionary
• To distinguish the usage of synonyms or near-synonyms
• In parsing
• To give preference to most natural attachments
• plastic (can opener) ? (plastic can) opener
• In corpus linguistics and psycholinguists
• Ex: To study social attitudes towards different types of substances
• strong cigarettes/tea/coffee
• powerful drug

(Near-)Synonyms
• To determine if 2 words are synonyms-- Principle of substitutability:
• 2 words are synonym if they can be substituted for one another in
some?/any? sentence without changing the meaning or acceptability of the
sentence
• How big/large is this plane?
• Would I be flying on a big/large or small plane?

• Miss Nelson became a kind of big / ?? large sister to Tom.


• I think I made a big / ?? large mistake.
Frequency based method
• Justeson and Katz’s filter
• Hypothesis:
§ if 2 words occur together very often, they must be interesting
candidates for a collocation

• Method:
§ Select the most frequently occurring bigrams (sequence of 2 adjacent
words)

Example
• (Figure: the most frequent bigrams in a corpus.) Except for "New York", all of the top bigrams are pairs of function words.
• Need some additional information to filter these out.

Tag Pattern   Example
A N           linear function
N N           regression coefficient
A A N         Gaussian random variable
A N N         cumulative distribution function
N A N         mean squared error
N N N         class probability function
N P N         degrees of freedom
Example

• Based on POS tags and frequency.
• A simple method that works very well; a sketch follows the tagged sentence below.

“The portfolio is fine except for the fact that the last
movement of sonata #6 is missing .”

[(’The’, ’DT’), (’portfolio’, ’NN’), (’is’, ’VBZ’), (’fine’, ’JJ’),


(’except’, ’IN’), (’for’, ’IN’), (’the’, ’DT’), (’fact’, ’NN’), (’that’,
’IN’), (’the’, ’DT’), (’last’, ’JJ’), (’movement’, ’NN’), (’of’,
’IN’), (’sonata’, ’NN’), (’#6’, ’CD’), (’is’, ’VBZ’), (’missing’,
’VBG’), (’.’, ’.’)]
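
A sketch of this POS-pattern filtering with NLTK, keeping only two of the Justeson-Katz patterns (the resource names are the usual NLTK ones):

import nltk
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

sentence = ("The portfolio is fine except for the fact that the last "
            "movement of sonata #6 is missing .")
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# keep only adjective-noun and noun-noun bigrams (two of the tag patterns)
patterns = {('JJ', 'NN'), ('NN', 'NN')}
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (t1[:2], t2[:2]) in patterns:   # truncate NNS/NNP etc. to NN
        print(w1, w2)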

Subtle difference between "strong" and "powerful"

• (Figure: collocation counts on a 14 million word corpus from the New York Times, Aug.-Nov. 1990.)
n-grams
• Similar to word collocation, studies the relationships between
words.
• n-grams are predictive: use the previous n-1 words to predict
the n-th word, think in a probabilistic way:
• Examples:
• Trigram: use the first two words to predict the third.
• "large green ___________" (tree? mountain? frog? car?)
• 5-gram: "swallowed the large green ________" (pill? broccoli?)
Why n-gram?
• Compute the probability of a sentence:
• Pr (“This is a valid one”) >> Pr (“ate top an in and”)

• Assume a bigram model; the sentence probability then factorizes as shown below.
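
Under the bigram assumption, the chain rule keeps only one word of history (the standard factorization):

Pr(w1 w2 … wn) = Pr(w1) × Pr(w2 | w1) × Pr(w3 | w2) × … × Pr(wn | wn−1)

For the example above: Pr("This is a valid one") = Pr(This) × Pr(is | This) × Pr(a | is) × Pr(valid | a) × Pr(one | valid).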

Applications
• Machine translation, OCR (optical character recognition),
speech recognition, natural language generation.
• Pick the most likely sentence.
What's the best n?
• Reliability vs. discrimination:
• "large green ___________" (tree? mountain? frog? car?)
• "swallowed the large green ________" (pill? broccoli?)
• Larger n: more information about the context of the specific instance (greater discrimination).
• Smaller n: more instances in training data, better statistical estimates (more reliability).

Example
• Estimate what comes next after “monstrous”
• Use the concordance function in NLTK on a corpus:
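
A sketch of that lookup, using the Moby Dick text from the NLTK book's examples (the corpus choice is an assumption):

import nltk
nltk.download('gutenberg', quiet=True)
from nltk.corpus import gutenberg

moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.concordance('monstrous')  # prints each occurrence with its context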


Estimate the probabilities

• Maximum likelihood estimate with a multinomial probabilistic model, based on word counts:

P(w2 | w1) = C(w1, w2) / C(w1)

• Problem: raw counts leave too few non-zeros and too many zeros; this is not the true distribution!

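A tiny sketch of these count-based estimates on a made-up sentence; unseen bigrams simply get probability zero, which is exactly the sparsity problem noted above:

from collections import Counter

tokens = "this is a test this is only a test".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_mle(w2, w1):
    # P(w2 | w1) = C(w1, w2) / C(w1); unseen bigrams get probability zero
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

print(p_mle('is', 'this'))   # 2/2 = 1.0
print(p_mle('only', 'is'))   # 1/2 = 0.5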
