cs4811-ch10c-clustering

The document discusses various aspects of symbol-based machine learning, focusing on clustering techniques and algorithms such as k-means and CLUSTER/2. It highlights the importance of clustering in discovering patterns in unclassified data and addresses challenges like the curse of dimensionality and determining the optimal number of clusters. Additionally, it explores applications of clustering in fields like data mining, document organization, and automatic directory construction.

Uploaded by

pudanasin


10c Machine Learning: Symbol-based

10.0 Introduction
10.1 A Framework for Symbol-based Learning
10.2 Version Space Search
10.3 The ID3 Decision Tree Induction Algorithm
10.4 Inductive Bias and Learnability
10.5 Knowledge and Learning
10.6 Unsupervised Learning
10.7 Reinforcement Learning
10.8 Epilogue and References
10.9 Exercises

Additional references for the slides:

Jeffrey Ullman’s clustering slides:
www-db.stanford.edu/~ullman/cs345-notes.html
Ernest Davis’ clustering slides:
www.cs.nyu.edu/courses/fall02/G22.3033-008/index.htm

Unsupervised learning

Example: a cholera outbreak in London

Many years ago, during a cholera outbreak in London, a
physician plotted the location of cases on a map.
Properly visualized, the data indicated that cases
clustered around certain intersections, where there were
polluted wells, not only exposing the cause of cholera,
but indicating what to do about the problem.

[Figure: map of cholera cases, each marked X, clustered around a few intersections]

Conceptual Clustering

The clustering problem

Given
• a collection of unclassified objects, and
• a means for measuring the similarity of objects
(distance metric),
find
• classes (clusters) of objects such that some
standard of quality is met (e.g., maximize the
similarity of objects in the same class).
Essentially, it is an approach to discovering a useful
summary of the data.

Conceptual Clustering (cont’d)

Ideally, we would like to represent clusters and
their semantic explanations. In other words, we
would like to define clusters intensionally (i.e.,
by general rules) rather than extensionally (i.e.,
by enumeration).
For instance, compare
{ X | X teaches AI at MTU CS}, and
{ John Lowther, Nilufer Onder}

Curse of dimensionality

• While clustering looks intuitive in 2
dimensions, many applications involve 10 or
10,000 dimensions
• High-dimensional spaces look different: the
probability of random points being close drops
quickly as the dimensionality grows

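The effect of dimensionality on closeness can be checked empirically. The following is a minimal sketch, assuming points drawn uniformly from the unit cube and an arbitrary closeness threshold of 0.5; the function name `fraction_close` and all parameter values are illustrative choices, not part of the slides.

```python
import math
import random

def fraction_close(dim, threshold=0.5, n_points=100, seed=0):
    """Fraction of point pairs in the unit cube [0, 1]^dim closer than threshold."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    close = pairs = 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            pairs += 1
            if math.dist(points[i], points[j]) < threshold:
                close += 1
    return close / pairs

# The fraction of "close" pairs falls quickly as the dimension grows.
for dim in (2, 10, 100):
    print(dim, fraction_close(dim))
```

With a fixed threshold, a large share of pairs are close in 2 dimensions, while in 100 dimensions essentially none are.
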
Higher dimensional examples

• The observation that customers who buy diapers are more
likely than average to buy beer allowed supermarkets to
place beer and diapers near each other, knowing that many
customers would walk between them. Placing potato
chips between them increased the sales of all three items.

Skycat software

Skycat software (cont’d)

• Skycat is a catalog of sky objects
• Objects are represented by their radiation in 9
dimensions (each dimension represents radiation in one
band of the spectrum)
• Skycat clustered 2 × 10^9 sky objects into groups of
similar objects, e.g., stars, galaxies, quasars, etc.
• The Sloan Digital Sky Survey is a newer, better effort to
catalog and cluster the entire visible universe.
Clustering sky objects by their radiation levels in
different bands allowed astronomers to distinguish
between galaxies, nearby stars, and many other kinds of
celestial objects.

Clustering CDs

• Intuition: music divides into categories and
customers prefer a few categories
• But what are categories really?
• Represent a CD by the customers who bought it
• Similar CDs have similar sets of customers and
vice versa

The space of CDs

• Think of a space with one dimension for each
customer
• Values in a dimension may be 0 or 1 only
• A CD’s point in this space is
(x1, x2, …, xn), where xi = 1 iff the ith customer
bought the CD
• Compare this with the correlated items matrix:
rows = customers
columns = CDs

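The CD space can be sketched in a few lines. This is an illustrative example only: the purchase records, the customer and CD names, and the use of Hamming distance to compare CD vectors are all assumptions made for the sketch.

```python
# Hypothetical purchase records: customer -> set of CDs bought.
purchases = {
    "ann": {"cd_jazz1", "cd_jazz2"},
    "bob": {"cd_jazz1", "cd_jazz2", "cd_rock1"},
    "cat": {"cd_rock1", "cd_rock2"},
    "dan": {"cd_rock1", "cd_rock2"},
}
customers = sorted(purchases)  # fixes the order of the dimensions

def cd_vector(cd):
    """Point (x1, ..., xn) with xi = 1 iff the ith customer bought the CD."""
    return [1 if cd in purchases[c] else 0 for c in customers]

def hamming(u, v):
    """Number of customers on which two CD vectors disagree."""
    return sum(a != b for a, b in zip(u, v))

print(cd_vector("cd_jazz1"))  # one 0/1 entry per customer
print(hamming(cd_vector("cd_jazz1"), cd_vector("cd_jazz2")))
print(hamming(cd_vector("cd_jazz1"), cd_vector("cd_rock2")))
```

CDs bought by the same customers end up at distance 0, while CDs with disjoint sets of customers are far apart.
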
Clustering documents

• Query “salsa” submitted to MetaCrawler returns 246
documents in 15 clusters, of which the top are:
  - Puerto Rico; Latin Music (8 docs)
  - Follow Up Post; York Salsa Dancers (20 docs)
  - music; entertainment; latin; artists (40 docs)
  - hot; food; chiles; sauces; condiments; companies (79 docs)
  - pepper; onion; tomatoes (41 docs)
• The clusters are: dance, recipe, clubs, sauces, buy,
mexican, bands, natural, …

Clustering documents (cont’d)

• Documents may be thought of as points in a
high-dimensional space, where each dimension
corresponds to one possible word.
• Clusters of documents in this space often
correspond to groups of documents on the same
topic, i.e., documents with similar sets of words may
be about the same topic
• Represent a document by a vector (x1, x2, …, xn),
where xi = 1 iff the ith word (in some order) appears in
the document
• n can be infinite

Analyzing DNA sequences

• Objects are sequences over {C, A, T, G}
• Distance between sequences is “edit
distance,” the minimum number of character inserts and
deletes needed to turn one into the other
• Note that there is a “distance,” but no
convenient space of points

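When only inserts and deletes are allowed, the edit distance between two strings equals the sum of their lengths minus twice the length of their longest common subsequence (LCS). The following is a small sketch of that computation; the function names are illustrative.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def indel_distance(x, y):
    """Minimum number of single-character inserts and deletes turning x into y."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(indel_distance("ACGT", "AGT"))  # delete the C
print(indel_distance("AC", "CA"))     # one delete plus one insert
```
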
Measuring distance

• To discuss whether a set of points is close enough
to be considered a cluster, we need a distance
measure D(x,y) that tells how far apart points x and y are.
• The axioms for a distance measure D are:

1. D(x,x) = 0: a point is distance 0 from itself
2. D(x,y) = D(y,x): distance is symmetric
3. D(x,y) ≤ D(x,z) + D(z,y): the triangle inequality
4. D(x,y) ≥ 0: distance is non-negative

K-dimensional Euclidean space

The distance between any two points, say
a = [a1, a2, … , ak] and b = [b1, b2, … , bk],
can be measured in several ways, such as:

1. Common distance (“L2 norm”): $\sqrt{\sum_{i=1}^{k} (a_i - b_i)^2}$
2. Manhattan distance (“L1 norm”): $\sum_{i=1}^{k} |a_i - b_i|$
3. Max of dimensions (“L∞ norm”): $\max_{i=1}^{k} |a_i - b_i|$

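The three distance measures for k-dimensional Euclidean space can be written directly from their definitions. This is a minimal sketch; the function names `l2`, `l1`, and `linf` and the sample points are illustrative.

```python
import math

def l2(a, b):
    """Common (Euclidean) distance: square root of the sum of squared differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def l1(a, b):
    """Manhattan distance: sum of absolute differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def linf(a, b):
    """Max of dimensions: largest absolute difference in any one dimension."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

a, b = [0, 0], [3, 4]
print(l2(a, b), l1(a, b), linf(a, b))  # 5.0 7 4
```

For any pair of points, the L∞ distance is at most the L2 distance, which is at most the L1 distance.
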
Non-Euclidean spaces

Here are some examples where a distance measure
makes sense without a Euclidean space.
• Web pages: Roughly 10^8-dimensional space where
each dimension corresponds to one word. Instead,
use vectors that deal with only the words actually
present in documents a and b.
• Character strings, such as DNA sequences: Instead,
use a metric based on the LCS (Longest Common
Subsequence).
• Objects represented as sets of symbolic, rather
than numeric, features: Instead, base similarity on the
proportion of features that they have in common.

Non-Euclidean spaces (cont’d)

object1 = {small, red, rubber, ball}
object2 = {small, blue, rubber, ball}
object3 = {large, black, wooden, ball}

similarity(object1, object2) = 3/4
similarity(object1, object3) = similarity(object2, object3) = 1/4
Note that it is possible to assign different
weights to features.

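The similarity values for the three objects can be reproduced with a short sketch. It assumes each object is described by the same four attributes; the attribute names (`size`, `color`, `material`, `shape`) and the optional `weights` argument are illustrative additions, not part of the slides.

```python
object1 = {"size": "small", "color": "red", "material": "rubber", "shape": "ball"}
object2 = {"size": "small", "color": "blue", "material": "rubber", "shape": "ball"}
object3 = {"size": "large", "color": "black", "material": "wooden", "shape": "ball"}

def similarity(x, y, weights=None):
    """Weighted proportion of attributes on which x and y agree.

    Assumes x and y have the same attributes; with no weights given,
    every attribute counts equally.
    """
    if weights is None:
        weights = {attr: 1.0 for attr in x}
    shared = sum(w for attr, w in weights.items() if x[attr] == y[attr])
    return shared / sum(weights.values())

print(similarity(object1, object2))  # 0.75
print(similarity(object1, object3))  # 0.25
```

Passing, say, `weights={"size": 1, "color": 3, "material": 1, "shape": 1}` makes a color mismatch count more heavily, which lowers the similarity of object1 and object2 from 3/4 to 1/2.
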
Approaches to Clustering

Broadly specified, there are two classes of
clustering algorithms:
1. Centroid approaches: We guess the centroid
(central point) in each cluster, and assign points
to the cluster of their nearest centroid.
2. Hierarchical approaches: We begin assuming
that each point is a cluster by itself. We
repeatedly merge nearby clusters, using some
measure of how close two clusters are (e.g.,
distance between their centroids), or how good a
cluster the resulting group would be (e.g., the
average distance of points in the cluster from the
resulting centroid).

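The hierarchical approach can be sketched as follows. This is one possible reading, assuming that closeness is the Euclidean distance between cluster centroids and that merging stops when k clusters remain; the function names and the stopping rule are assumptions for the example.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]

def agglomerate(points, k):
    """Start with one cluster per point; merge the two closest until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
print(agglomerate(points, 3))
```

This naive version recomputes all centroid distances on every merge, which is fine for a handful of points but slow for large data sets.
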
The k-means algorithm
• Pick k cluster centroids.
• Assign points to clusters by picking the
closest centroid to the point in question. As
points are assigned to clusters, the centroid of
the cluster may migrate.
Example: Suppose that k = 2 and we assign
points 1, 2, 3, 4, 5, in that order. Outline circles
represent points, filled circles represent
centroids.

[Figure: the points to be clustered, labeled 1 to 5]

The k-means algorithm example (cont’d)

[Figure: four panels of points 1 to 5, showing successive assignments]

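The assignment procedure in this example can be sketched as follows. It assumes the first k points serve as the initial centroids and that each centroid is kept at the mean of the points assigned to it so far; both choices, and the function name, are assumptions for illustration.

```python
import math

def sequential_kmeans(points, k):
    """Assign points in order to their nearest centroid, letting centroids migrate.

    Assumption: the first k points are used as the initial centroids.
    """
    centroids = [list(p) for p in points[:k]]
    counts = [0] * k
    labels = []
    for p in points:
        # Index of the centroid currently closest to p.
        i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        counts[i] += 1
        # Running mean: move the centroid toward p by 1/count of the gap.
        centroids[i] = [m + (x - m) / counts[i] for m, x in zip(centroids[i], p)]
        labels.append(i)
    return labels, centroids

labels, centroids = sequential_kmeans([(0, 0), (10, 0), (1, 0), (9, 0), (2, 0)], 2)
print(labels)     # cluster index for each point, in assignment order
print(centroids)  # final centroid positions
```

Because the centroids move as points arrive, the result can depend on the order in which the points are processed.
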
Issues

• How to initialize the k centroids?
Pick points sufficiently far away from any other
centroid, until there are k.
• As computation progresses, one can decide to
split one cluster and merge two, to keep the total
at k. A test for whether to do so might be to ask
whether doing so reduces the average distance
from points to their centroids.
• Having located the centroids of k clusters, we
can reassign all points, since some points that
were assigned early may actually wind up closer
to another centroid, as the centroids move about.

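Two of these points, initialization and reassignment, can be sketched briefly. The initialization below is a farthest-first variant (always take the point farthest from the centroids chosen so far), which is one way to realize "sufficiently far away" and not necessarily what the slides intend; starting from the first point is an arbitrary assumption.

```python
import math

def farthest_first_centroids(points, k):
    """Pick k well-separated starting centroids.

    Assumption: start from the first point, then repeatedly add the point
    whose distance to its nearest already-chosen centroid is largest.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids

def reassign(points, centroids):
    """Give every point the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]

points = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 8)]
centroids = farthest_first_centroids(points, 3)
print(centroids)
print(reassign(points, centroids))
```
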
Issues (cont’d)

• How to determine k?
One can try different values for k, stopping at the
smallest k such that increasing k does not
much decrease the average distance of points to
their centroids.

[Figure: a scatter of points (X) forming three apparent clusters]

Determining k
When k = 1, all the points are in one cluster, and the
average distance to the centroid will be high.

When k = 2, one of the clusters will be by itself and the
other two will be forced into one cluster. The average
distance of points to the centroid will shrink considerably.

[Figures: the same points grouped into one cluster and into two clusters]

Determining k (cont’d)
When k = 3, each of the apparent clusters should be a
cluster by itself, and the average distance from the
points to their centroids shrinks again.

When k = 4, then one of the true clusters will be artificially
partitioned into two nearby clusters. The average distance
to centroid will drop a bit, but not much.

[Figures: the same points grouped into three clusters and into four clusters]

Determining k (cont’d)

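[Figure: plot of average radius (vertical axis) against k = 1, 2, 3, 4 (horizontal axis)]

The failure of the average radius to drop much further
beyond k = 3 suggests that k = 3 is right. This
conclusion can be made even if the data is in so many
dimensions that we cannot visualize the clusters.
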
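The procedure for choosing k can be tried on a small synthetic data set. The sketch below assumes a basic batch k-means (repeated reassignment and centroid recomputation) with a deterministic farthest-first initialization; the data, the function names, and the iteration count are all made up for illustration.

```python
import math

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, k, iterations=20):
    """Basic batch k-means with farthest-first initialization (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return centroids

def average_radius(points, centroids):
    """Average distance from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)

# Three well-separated groups of four points each (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 9), (5, 10), (6, 9), (6, 10)]

radii = {k: average_radius(points, kmeans(points, k)) for k in range(1, 5)}
for k, r in radii.items():
    print(k, round(r, 2))
```

On this data the average radius falls sharply up to k = 3 and only slightly at k = 4, which is the pattern described for the plot.
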
The CLUSTER/2 algorithm

1. Select k seeds from the set of observed
objects. This may be done randomly or
according to some selection function.
2. For each seed, using that seed as a positive
instance and all other seeds as negative
instances, produce a maximally general
definition that covers all of the positive and
none of the negative instances (multiple
classifications of non-seed objects are
possible).

The CLUSTER/2 algorithm (cont’d)

3. Classify all objects in the sample according to
these descriptions. Replace each maximally
general description with a maximally specific
description that covers all objects in the
category (to decrease the likelihood that classes
overlap on unseen objects).
4. Adjust remaining overlapping definitions.
5. Using a distance metric, select an element
closest to the center of each class.
6. Repeat steps 1-5 using the new central elements
as seeds. Stop when clusters are satisfactory.

The CLUSTER/2 algorithm (cont’d)

7. If clusters are unsatisfactory and no
improvement occurs over several iterations,
select the new seeds closest to the edge of the
cluster.

The steps of a CLUSTER/2 run

[Figure: a COBWEB clustering for four one-celled organisms (Gennari et al., 1989)]

Note: we will skip the COBWEB algorithm.

Related communities

• data mining (in databases, over the web)
• statistics
• clustering algorithms
• visualization
• databases

Clustering vs. classification

• Clustering applies when the clusters are not known
in advance
• If the system of clusters is known, and the
problem is to place a new item into the proper
cluster, this is classification

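Once clusters are known, placing a new item can be as simple as picking the nearest cluster. The sketch below assumes each known cluster is summarized by a centroid; the labels, the centroid values, and the nearest-centroid rule itself are illustrative choices, and many other classifiers exist.

```python
import math

def classify(item, centroids):
    """Return the label of the known cluster whose centroid is nearest to item."""
    return min(centroids, key=lambda label: math.dist(item, centroids[label]))

# Hypothetical centroids of three already-discovered clusters.
centroids = {"A": (0.5, 0.5), "B": (10.5, 0.5), "C": (5.5, 9.5)}

print(classify((9, 1), centroids))
print(classify((5, 8), centroids))
```
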
Cluster structure

• Hierarchical vs. flat
• Overlap
  - Disjoint partitioning, e.g., partition congressmen by state
  - Multiple dimensions of partitioning, each disjoint, e.g.,
    partition congressmen by state; by party; by House/Senate
  - Arbitrary overlap, e.g., partition bills by congressmen
    who voted for them
• Exhaustive vs. non-exhaustive
• Outliers: what to do?
• How many clusters? How large?

More on document clustering
• Applications
  - Structuring search results
  - Suggesting related pages
  - Automatic directory construction / update
  - Finding near identical pages
    - Finding mirror pages (e.g., for propagating updates)
    - Eliminate near-duplicates from results page
    - Plagiarism detection
    - Lost and found (find identical pages at different URLs at
      different times)
• Problems
  - Polysemy, e.g., “bat,” “Washington,” “Banks”
  - Multiple aspects of a single topic
  - Ultimately amounts to general problem of information
    structuring
