
ClusteringSlides Stanford

This document discusses machine learning clustering techniques. Clustering involves partitioning a set of data points into groups (clusters) so that items within each cluster are more similar to each other than items in other clusters, based on a distance function. K-means clustering is introduced, which aims to partition data into k clusters by minimizing the sum of squared distances between data points and their assigned cluster centers. Examples are provided clustering European cities based on geographic distance and temperature. Uses of clustering include classification by assigning labels to clusters, identifying similar items, and detecting anomalies.

Machine Learning - Clustering

CS102
Spring 2020

Clustering CS102
Data Tools and Techniques
§ Basic Data Manipulation and Analysis
Performing well-defined computations or asking
well-defined questions (“queries”)
§ Data Mining
Looking for patterns in data
§ Machine Learning
Using data to build models and make predictions
§ Data Visualization
Graphical depiction of data
§ Data Collection and Preparation

Machine Learning
Using data to build models and make predictions
Supervised machine learning
• Set of labeled examples to learn from: training data
• Develop model from training data
• Use model to make predictions about new data
Unsupervised machine learning
• Unlabeled data, look for patterns or structure
(similar to data mining)

Clustering
Like classification, data items consist of values
for a set of features (numeric or categorical)
§ Medical patients
Feature values: age, gender, symptom1-severity,
symptom2-severity, test-result1, test-result2

§ Web pages
Feature values: URL domain, length, #images, heading1,
heading2, …, headingn

§ Products
Feature values: category, name, size, weight, price

Unlike classification, there is no label

Clustering
As in K-nearest neighbors, for any pair of data items
i1 and i2, a distance function computes
distance(i1,i2) from their feature values
Example:
Features - gender, profession, age, income, postal-code
person1 = (male, teacher, 47, $25K, 94305)
person2 = (female, teacher, 43, $28K, 94309)
distance(person1, person2)

distance() can be defined as inverse of similarity()
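
One way such a distance function could look for mixed features, as a minimal sketch. The slides do not define distance() for this example, so everything here is an assumption: categorical features contribute 0 when equal and 1 otherwise, numeric features contribute an absolute difference divided by an assumed range, and the result is the average. The names `Person`, `AGE_RANGE`, and `INCOME_RANGE` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    gender: str
    profession: str
    age: int
    income: float      # in dollars
    postal_code: str


# Assumed ranges used to put numeric features on a 0..1 scale (not from the slides).
AGE_RANGE = 100.0
INCOME_RANGE = 200_000.0


def distance(a: Person, b: Person) -> float:
    """Average of per-feature distances, each between 0 and 1."""
    parts = [
        0.0 if a.gender == b.gender else 1.0,
        0.0 if a.profession == b.profession else 1.0,
        min(abs(a.age - b.age) / AGE_RANGE, 1.0),
        min(abs(a.income - b.income) / INCOME_RANGE, 1.0),
        0.0 if a.postal_code == b.postal_code else 1.0,
    ]
    return sum(parts) / len(parts)


person1 = Person("male", "teacher", 47, 25_000, "94305")
person2 = Person("female", "teacher", 43, 28_000, "94309")
print(distance(person1, person2))
```

Treating postal codes as plain categories ignores that 94305 and 94309 are neighbors; a different choice of per-feature distance would give a different clustering.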

Clustering
GOAL: Given a set of data items, partition
them into groups (= clusters) so that items
within groups are close to each other based
on distance function
Ø Sometimes number of clusters is pre-specified
Ø Typically clusters need not be same size

Some Uses for Clustering
§ Classification!
• Assign labels to clusters
• Now have labeled training data
for future classification
§ Identify similar items
• For substitutes or recommendations
• For de-duplication
§ Anomaly (outlier) detection
• Items that are far from any cluster
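
The anomaly-detection use can be sketched as follows, assuming numeric features, Euclidean distance, and cluster centers that have already been computed. The names `find_outliers` and `threshold` and the sample values are illustrative, not from the slides.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_center_distance(item, centers):
    """Distance from an item to the closest cluster center."""
    return min(dist(item, c) for c in centers)


def find_outliers(items, centers, threshold):
    """Items whose nearest cluster center is farther away than threshold."""
    return [x for x in items if nearest_center_distance(x, centers) > threshold]


centers = [(0.0, 0.0), (10.0, 10.0)]   # hypothetical cluster centers
items = [(0.5, 0.2), (9.8, 10.1), (5.0, 5.0), (30.0, -4.0)]
print(find_outliers(items, centers, threshold=3.0))
```

The threshold is a judgment call; one possible choice is a multiple of the typical within-cluster distance.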

K-Means Clustering
Reminder: for any pair of data items i1 and i2
we have distance(i1,i2)
For a group of items, the mean value (centroid)
of the group is the item i (in the group or not)
that minimizes the sum of squared distance(i,i’)
for all i’ in the group
§ Error for each item: distance d from the mean
for its group; squared error is d²
§ Error for the entire clustering:
sum of squared errors (SSE)

Remind you of anything?
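
A minimal sketch of the centroid and SSE computations, assuming numeric features and Euclidean distance (in that case the centroid is the coordinate-wise mean). The function names and sample points are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(clusters):
    """Sum, over all clusters, of squared distance from each item to its cluster mean."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(dist(p, c) ** 2 for p in points)
    return total


clusters = [
    [(0.0, 0.0), (2.0, 0.0)],       # mean is (1, 0); each item is at distance 1
    [(10.0, 10.0), (10.0, 12.0)],   # mean is (10, 11); each item is at distance 1
]
print(sse(clusters))  # 4.0
```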

K-Means Clustering
Given set of data items and desired
number of clusters k, K-means groups the
items into k clusters minimizing the SSE

§ Extremely difficult to compute efficiently
Ø In fact, finding the lowest-SSE clustering is NP-hard in general
§ Most algorithms compute
an approximate solution
(might not be absolute
lowest SSE)
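
One widely used approximate method is Lloyd's algorithm, which the slides do not name. The sketch below assumes numeric features and Euclidean distance; the function name `k_means`, its parameters, and the sample points are illustrative. It alternates between assigning each item to its nearest center and moving each center to the mean of its items, and it stops at a local optimum that depends on the starting centers.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm. Returns (centers, assignment); not guaranteed lowest SSE."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # start from k of the data items
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each item goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break                        # nothing changed, so we have converged
        assignment = new_assignment
        # Update step: move each center to the mean of its items.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, assignment = k_means(points, k=2)
print(centers, assignment)
```

Running with several seeds and keeping the result with the lowest SSE is a common way to reduce the effect of a poor start.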

Clustering European Cities
By geographic distance, then by temperature
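
The two distance functions used for the cities could be sketched as below. The slides do not say how "actual distance" was computed, so the great-circle (haversine) formula is an assumption, as is using the absolute difference of a single temperature value. The coordinates are approximate and the function names are illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def geographic_distance(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def temperature_distance(t1, t2):
    """Absolute difference between two temperature values (assumed single number per city)."""
    return abs(t1 - t2)


paris = (48.8566, 2.3522)    # approximate coordinates
rome = (41.9028, 12.4964)    # approximate coordinates
print(round(geographic_distance(paris, rome)), "km")
print(temperature_distance(10.0, 14.5))
```

The same set of cities can fall into very different clusters depending on which of the two functions is supplied as distance().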

Clustering results shown (one figure per slide):
• Distance = actual distance, k = 5
• Distance = actual distance, k = 8, with cluster means
• Distance = actual distance, k = 2, with cluster means
• Distance = actual distance, k = 30
• Distance = temperature, k = 5
• Distance = temperature, k = 8, with means
• Distance = temperature, k = 2
• Distance = temperature, k = 3
• Distance = temperature, k = 30

