Module 5 - Clustering - Afterclassb
Module 5: Clustering
Netflix
§ Subscription services
§ A key aspect is being able to offer each customer accurate movie recommendations based on their own preferences and viewing history
Using other users’ rankings: Collaborative Filtering
§ Consider suggesting to Carl that he watch "Men in Black", since Amy rated
it highly and Carl and Amy seem to have similar preferences
Using movie information: Content Filtering
MovieLens Data
§ www.movielens.org is a movie recommendation website run by the
GroupLens Research Lab at the University of Minnesota.
MovieLens item dataset
Movies in the dataset are categorized as belonging to different genres.
Clustering
A task of unsupervised learning: segment the data and assign observations with similar traits to the same group (not prediction)
Visualize Clustering
An intermediate step for prediction
Clustering can be done before prediction, which can lead to higher accuracy.
Applications
§ What applications can you think of?
Types of clustering methods
§ There are many different algorithms for clustering.
– They differ in what defines a cluster and how clusters are found
Distance between points
Need to define the distance between two data points i = (x_i1, x_i2, ..., x_iR) and j = (x_j1, x_j2, ..., x_jR):

d_ij = √( Σ_{r=1}^{R} (x_ir − x_jr)² ) = √( (x_i1 − x_j1)² + (x_i2 − x_j2)² + ⋯ + (x_iR − x_jR)² )

– x_ir: value of point i in the r-th attribute
Distance Example
1: the movie belongs to the corresponding genre; 0: it does not.
Distance between “Toy Story” and “Batman Forever”
Toy Story: (0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Batman Forever: (1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

d = √( (0 − 1)² + (1 − 1)² + ⋯ + (1 − 0)² + ⋯ ) = √5

An alternative distance is the largest difference in any single attribute: max over r = 1, ..., R of |x_ir − x_jr|.
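A quick check of this computation in R (the two genre vectors are taken from the slide):

toy_story      <- c(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
batman_forever <- c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

# Euclidean distance: square root of the sum of squared differences
sqrt(sum((toy_story - batman_forever)^2))    # sqrt(5), about 2.236

# The same result from base R's dist()
dist(rbind(toy_story, batman_forever))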
Distance between clusters
§ Centroid distance
– Distance between centroids of clusters
– Centroid: the point whose value in each attribute is the average of the data points in the cluster
– Example: Cluster A = {(1, 0, 1), (1, 3, 5)}; centroid of Cluster A = (1, 3/2, 3)
Distance between clusters
§ Minimum distance
– Distance between points from different clusters that are the closest to each other
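A sketch of both cluster-distance definitions in R (Cluster A is the example above; the points in Cluster B are made up for illustration):

# Cluster A from the slide; Cluster B is illustrative
A <- rbind(c(1, 0, 1), c(1, 3, 5))
B <- rbind(c(4, 4, 0), c(6, 2, 1))

# Centroid distance: distance between the attribute-wise averages
centroid_A <- colMeans(A)    # (1, 1.5, 3)
centroid_B <- colMeans(B)
sqrt(sum((centroid_A - centroid_B)^2))

# Minimum distance: the closest pair of points across the two clusters
pairwise <- outer(1:nrow(A), 1:nrow(B),
                  Vectorize(function(i, j) sqrt(sum((A[i, ] - B[j, ])^2))))
min(pairwise)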
Normalization
§ Example: Company A: ($2 million, 30 years); the two variables are on very different scales
§ Normalization reflects the relative importance of each variable
– For data points 1 (x_1, y_1), 2 (x_2, y_2), ..., N (x_N, y_N), compute the mean and standard deviation of each variable (x̄ and σ_x, ȳ and σ_y) and rescale each value, e.g., (x_i − x̄) / σ_x
§ Not required in Hw4
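A minimal R sketch of this kind of rescaling using base R's scale(); the data frame and its values are illustrative assumptions:

# Illustrative data on very different scales:
# revenue in millions of dollars, age in years
companies <- data.frame(revenue = c(2, 15, 7, 40),
                        age     = c(30, 5, 12, 80))

# scale() subtracts each column's mean and divides by its standard deviation,
# so both variables contribute comparably to any distance computation
scale(companies)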
K-means clustering
Partition a dataset into k distinct, non-overlapping clusters in which each
observation belongs to the cluster with the nearest cluster centroid
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids
4. Re-assign each point to the closest cluster centroid
5. Re-compute cluster centroids
6. Repeat 4 and 5 until no improvement is made
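A minimal R sketch of these steps using base R's kmeans(); the example points and the choice k = 2 are illustrative assumptions:

set.seed(1)   # the algorithm starts from a random assignment

# Illustrative data: six points with two numeric attributes
pts <- data.frame(x = c(1, 2, 1.5, 8, 9, 8.5),
                  y = c(1, 1.5, 2, 8, 9, 8.5))

# Step 1: choose k; algorithm = "Lloyd" follows the assign/re-compute
# procedure described above. nstart = 10 reruns the algorithm from
# several random starting points (see the practical considerations below).
km <- kmeans(pts, centers = 2, nstart = 10, algorithm = "Lloyd")

km$cluster   # cluster assignment of each point
km$centers   # final cluster centroids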
K-means clustering algorithm
1. Specify the desired number of clusters k
K-means clustering algorithm
1. Specify the desired number of clusters: k = 2
K-means clustering algorithm
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids
K-means clustering algorithm
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids
4. Re-assign each point to the closest cluster centroid
K-means clustering algorithm
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids
4. Re-assign each point to the closest cluster centroid
5. Re-compute cluster centroids
6. Repeat 4 and 5 until no improvement is made
Practical considerations
• Can strategically select initial partition of points into clusters if you have
some knowledge about the data
• Can run algorithm several times with different random starting points
Hierarchical clustering
Build a hierarchy of clusters: Each observation starts in its own cluster, and
pairs of clusters are merged as one moves up the hierarchy
• A disadvantage of k-means clustering: need to pre-specify k
• Hierarchical clustering has no such requirement.
1. Start with each data point in a cluster containing only itself
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters
6. Repeat 4 and 5 until all data points belong to one single cluster
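A minimal R sketch of these steps using base R's dist() and hclust(); the example points are illustrative assumptions:

# Same illustrative points as in the k-means sketch above
pts <- data.frame(x = c(1, 2, 1.5, 8, 9, 8.5),
                  y = c(1, 1.5, 2, 8, 9, 8.5))

# Steps 2-6: dist() computes all pairwise Euclidean distances,
# and hclust() repeatedly merges the two nearest clusters.
# method = "single" merges by the minimum distance between clusters;
# other linkage choices (e.g., "complete", "centroid") use other definitions.
d  <- dist(pts, method = "euclidean")
hc <- hclust(d, method = "single")

plot(hc)   # dendrogram showing the order in which clusters were merged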
Hierarchical clustering
1. Start with each data point in a cluster containing only itself
Hierarchical clustering
1. Start with each data point in a cluster containing only itself
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
Hierarchical clustering
1. Start with each data point in a cluster containing only itself
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters
Hierarchical clustering
1. Start with each data point in a cluster containing only itself
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters
6. Repeat 4 and 5 until all data points belong to one single cluster
Display cluster process
The process by which data points are combined is shown with grey
lines.
Height of vertical lines represents distance between points or clusters along the bottom
Select clusters
Hierarchical clustering is done without first selecting # clusters desired.
How can we select # clusters from the dendrogram?
• Draw a horizontal line cutting across the dendrogram
• # vertical lines that the horizontal line crosses = # clusters
• Good choice if the horizontal line has a lot of “wiggle room”
• More “wiggle room” means two clusters are farther from each
other.
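Continuing the hclust() sketch above, the cut can be made with cutree(); the choice of 3 clusters below is an illustrative assumption:

# Cut the dendrogram from the earlier hclust() sketch into 3 clusters
clusters <- cutree(hc, k = 3)    # or cut at a given height with the h argument
table(clusters)                  # how many points fall in each cluster

# Show where a horizontal cut giving 3 clusters falls on the dendrogram
plot(hc)
rect.hclust(hc, k = 3, border = "red")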
Example: Select clusters
Comparison between the two methods
K-means clustering
• Pros
• Simple and fast
• Works well with both small and large datasets
• Cons
• # clusters has to be pre-specified
• Outliers may skew the centroid positions, resulting in poor clustering
Hierarchical clustering
• Pros
• Dendrogram is intuitive and provides a rich structure of the data
• Dendrogram can be used to assist in selecting # clusters (depending on
specific purpose as well)
• Cons
• Requires a lot of computation power (due to distance computations)
• Does not work well if the dataset is too large
• Sensitive to metric of distance between points and between clusters
Beyond movies: Mass personalization
“If I have 3 million customers on the web, I should have 3 million
stores on the web.” – Jeff Bezos, CEO of Amazon.com
• Help users find items they might not have searched for:
• A new favorite band
• An old friend who uses the same social media network
• A book or song they are likely to enjoy
More about Amazon
Item-to-item collaborative filtering
Example: Amazon’s recommendation
What is the edge?
§ In today’s digital age, businesses often have hundreds of thousands of items
to offer their customers
Recommendation methods used
Produce useful recommendations
R output of Oscars
Logit = −6.99 + 0.503 × Nom + 0.196 × GG

P(Y = 1 | Nom, GG) = 1 / (1 + exp(−(−6.99 + 0.503 × Nom + 0.196 × GG)))

Both coefficients are positive, and both are significant predictors.
Logistic Regression - Oscar
To check, we can look at the VIFs. Both are smaller than 10, so there is no multicollinearity problem.
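A minimal R sketch of how such a model could be fit and checked; the data frame oscars and the column names Win, Nom, and GG are assumptions for illustration:

library(car)   # for vif()

# Assumed data frame: one row per film, with Win = 1 if it won the Oscar,
# Nom = number of nominations, GG = number of Golden Globe wins
oscar_model <- glm(Win ~ Nom + GG, data = oscars, family = binomial)

summary(oscar_model)   # coefficients and p-values for Nom and GG
vif(oscar_model)       # values below 10 suggest no multicollinearity problem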
Logistic Regression - Oscar
5. Use the model to predict the probability of the following three movies
winning an Oscar.
Movie               Noms.   GG wins   Pr(Win)
The Artist          10      3         0.605
Midnight in Paris   4       1         0.015
Moneyball           6       0         0.018

Actual = 1: 13 FN, 14 TP
False Positive Rate = 7 / (111 + 7) = 0.059
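Continuing the sketch above, the predicted probabilities for these three films could be obtained with predict(); the nomination and Golden Globe counts are those shown in the table:

new_films <- data.frame(Nom = c(10, 4, 6),
                        GG  = c(3, 1, 0),
                        row.names = c("The Artist", "Midnight in Paris", "Moneyball"))

# type = "response" returns probabilities rather than log-odds
predict(oscar_model, newdata = new_films, type = "response")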
Clustering
§ Two data points: X = (3, 0, 1), Y = (2, 1, 3). What is the Euclidean distance
between X and Y?
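Applying the Euclidean distance formula from earlier:

d = √( (3 − 2)² + (0 − 1)² + (1 − 3)² ) = √(1 + 1 + 4) = √6 ≈ 2.45

The same computation in R:

sqrt(sum((c(3, 0, 1) - c(2, 1, 3))^2))   # about 2.449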