Introduction To Data Science Unsupervised Learning: CS 194 Fall 2015 John Canny
Lecture 8
Unsupervised Learning
Iterate:
• For every remaining data point x, compute D(x), the distance
from x to the closest cluster center.
• Choose a remaining point x randomly with probability
proportional to D(x)², and make it a new cluster center.
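A minimal NumPy sketch of this seeding rule (the function name, the Euclidean metric, and the uniform choice of the first center are assumptions, not from the slide):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """K-means++ seeding: after the first center, each new center is a
    data point sampled with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]           # first center: uniform at random
    for _ in range(k - 1):
        C = np.array(centers)                # (c, d) current centers
        # D(x): distance from each point to its closest existing center
        D = np.min(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)
        p = D**2 / np.sum(D**2)              # weights proportional to D(x)^2
        centers.append(X[rng.choice(n, p=p)])
    return np.array(centers)
```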
[Figure: matrix factorization schematic — a sparse M × features matrix S is approximated by the product A · B of an M × K factor A and a K × features factor B.]
Matrix Factorization
is a popular method for sparse, linear data (ratings, keyword/URL combinations, click-through rates (CTR) on ads).
$$\frac{dl}{dB} = -2A^T (S - AB)_{*S} + 2\,w_B B$$
where the subscript *S means the residual is evaluated only at the nonzero entries of S, and w_B is the weight of the regularizer on B.
The loss can then be minimized with SGD over the nonzero entries. This method is quite fast, but is prone to getting stuck in weak local optima.
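A sketch of that SGD loop, assuming the squared loss above with L2 regularizers on both factors (the function name, learning rate, and initialization scale are illustrative choices, not from the slide):

```python
import numpy as np

def sgd_mf(rows, cols, vals, m, n, k=16, lr=0.01, w_A=0.1, w_B=0.1,
           epochs=10, seed=0):
    """SGD on sum over nonzeros (S_ij - A_i . B_j)^2 + w_A|A|^2 + w_B|B|^2,
    where (rows[t], cols[t], vals[t]) enumerate the nonzero entries of S."""
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((m, k))
    B = 0.1 * rng.standard_normal((k, n))
    for _ in range(epochs):
        for i, j, s in zip(rows, cols, vals):
            err = s - A[i] @ B[:, j]         # residual at one nonzero of S
            a_i = A[i].copy()
            # step along -dl/dA_i and -dl/dB_j (the factor of 2 folded into lr)
            A[i]    += lr * (err * B[:, j] - w_A * a_i)
            B[:, j] += lr * (err * a_i - w_B * B[:, j])
    return A, B
```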
Matrix Factorization with MCMC
The local optima problem can be mitigated by using MCMC methods (Markov-Chain Monte-Carlo), which resample the factors rather than following the gradient. Each column B_i is updated around the ridge-regression solution
$$B_i = \left(w_B I + A^T D_i A\right)^{-1} A^T S_i$$
where D_i is a diagonal matrix selecting the observed (nonzero) entries of column i of S.
From Koren, Bell, and Volinsky, “Matrix Factorization Techniques for Recommender Systems,”
IEEE Computer, 2009.
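A small NumPy sketch of this per-column solve, treating the nonzeros of S as the observed entries (the function name is mine; a full MCMC sampler would add posterior noise around this mean rather than using it directly):

```python
import numpy as np

def solve_B_column(A, s_col, w_B):
    """One ridge solve B_i = (w_B I + A^T D_i A)^{-1} A^T S_i, where D_i
    selects the observed (here: nonzero) entries of column i of S."""
    obs = np.nonzero(s_col)[0]               # indices where column i is observed
    A_o = A[obs]                             # A^T D_i A == A_o.T @ A_o
    lhs = w_B * np.eye(A.shape[1]) + A_o.T @ A_o
    rhs = A_o.T @ s_col[obs]                 # A^T D_i S_i
    return np.linalg.solve(lhs, rhs)
```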
Outline
• Unsupervised Learning
• K-Means clustering
• DBSCAN
• Matrix Factorization
• Performance
Performance
There are several reasons to design for performance:
• Dataset size (assuming model quality improves with size)
• Model size (bigger models generally perform better)
Both of the above hold (though this is not obvious) for power-law
datasets. Because the data “tail” is long, models also need to be
long-tailed: a long-tailed model can capture less-frequent users and
features.