
IIMT 2641 Introduction to Business Analytics

Module 5: Clustering

1
Netflix
§ Subscription services
§ Key aspect is being able to offer customers accurate movie
recommendations based on a customer’s own preferences and viewing
history

2
Using other users’ rankings: Collaborative Filtering

§ Consider suggesting to Carl that he watch "Men in Black", since Amy rated
it highly and Carl and Amy seem to have similar preferences

3
Using movie information: Content Filtering

§ We saw that Amy liked "Men In Black”


– It was directed by Barry Sonnenfeld
– Classified in the genres of action, adventure, sci-fi and comedy
– It stars actor Will Smith

§ Consider recommending to Amy:


– Barry Sonnenfeld’s movie "Get Shorty"
– "Jurassic Park", which is in the genres of action, adventure, and sci-fi
– Will Smith’s movie "Hitch"

4
MovieLens Data
§ www.movielens.org is a movie recommendation website run by the
GroupLens Research Lab at the University of Minnesota.

§ They collect user preferences about movies and do collaborative filtering to


make recommendations.

§ We will use their movie database to try content filtering by clustering.


– Recommendation is more than predicting whether customer A likes a particular
movie . . .

5
MovieLens item dataset
Movies in the dataset are categorized as belonging to different genres.

Action Adventure Animation Children Comedy Crime


Documentary Drama Fantasy Film Noir Horror Musical
Mystery Romance Sci-Fi Thriller War Western

§ Each movie may belong to many genres.


Can we systematically find groups of movies with similar sets of genres?

6
Clustering
A task of unsupervised learning: segment data and assign those with similar
traits into the same groups (not prediction)

§ Identify hidden pattern/underlying trend/unusual characteristics, detect


anomalies
– Result: high intra-group similarity and low inter-group similarity
§ No particular outcome to predict in mind
– Training data has not been pre-labelled into a known class.
§ Compare: supervised learning ≈ predictive analytics
– Predict the outcome of an unknown object
– Linear regression, logistic regression, etc.

7
Visualize Clustering

8
An intermediate step for prediction
Clustering can be done before prediction, which can lead to higher accuracy.

§ Reveal patterns and relevant features/predictors


– Clustering medical history of patients may provide indicative signals about
heart attack.
– Clustering viewing history of customers may reveal their preferences and
recommend movies of interest.
§ Cluster data into groups and build a predictive model for each group
– Be careful not to overfit your model! (works best with large datasets)

9
Applications
§ What applications can you think of?

10
Applications
§ What applications can you think of?

§ Marketing: customer segmentation (discovery of distinct groups of


customers) for target marketing
§ Fraud detection: identify peer groups and normal behavior in those groups,
and then investigate outliers in each group
§ Insurance: identify customer groups with high average claim cost
§ Stock selection: identify groups of stocks that have similar trends

11
Types of clustering methods
§ There are many different algorithms for clustering.
– Differ in what makes a cluster and how to find them

§ We will cover the two most popular methods


– K-means clustering
– Hierarchical clustering

12
Distance between points
Need to define distance between two data points $i = (x_{i1}, x_{i2}, \ldots, x_{iK})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jK})$.

§ A natural choice is "Euclidean distance"

§ Distance between points i and j with K attributes is

$d_{ij} = \sqrt{\sum_{k=1}^{K} (x_{ik} - x_{jk})^2}$

– $k = 1, \ldots, K$: different attributes
– $x_{ik}$: value of point $i$ in the $k$-th attribute

13
Distance Example

Need to define distance between two data points. Each coordinate of a movie's vector corresponds to a genre: 1 means the movie belongs to that genre, 0 means it does not.

§ The movie "Toy Story" is categorized as Adventure, Animation, Children's, Comedy, and Fantasy:
– Toy Story: (0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)

§ The movie "Batman Forever" is categorized as Action, Adventure, Comedy, and Crime:
– Batman Forever: (1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
14
Distance between “Toy Story” and “Batman Forever”

Toy Story: (0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Batman Forever: (1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

d = ?
15
Distance between “Toy Story” and “Batman Forever”
Toy Story: (0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Batman Forever: (1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

"=
0−1 ' + 1−1 ' + ⋯+ 1 − 0 ' +⋯= 5

J (1-1) +--(1-8)+-
+

Other popular distance metrics (not required):


-...
=

Manhattan Distance: Sum of absolute values instead of squares


Maximum Coordinate Distance: Only consider attribute in which data points
deviate the most i (i die
.
<in)
,
....

j (35) <52 -- (n)


.

, .

max
I kiy-sjy)
16
y
=

1 ,
.

-, R .
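These distances can be checked with R's built-in dist() function. A minimal sketch, with the two genre vectors typed in by hand in the same 18-genre order as above:

```r
# Genre indicator vectors for the two movies, in the 18-genre order listed earlier
toy_story      <- c(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
batman_forever <- c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

two_movies <- rbind(toy_story, batman_forever)

dist(two_movies, method = "euclidean")  # sqrt(5), approx. 2.24
dist(two_movies, method = "manhattan")  # sum of absolute differences = 5
dist(two_movies, method = "maximum")    # largest single-coordinate difference = 1
```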
Distance between clusters

§ Centroid distance
– Distance between centroids of clusters
– Centroid: the point whose value in each attribute is the average of the data points in the cluster
– e.g., Cluster 1 = {(3, 0), (1, 2), (4, 6)} has centroid (8/3, 8/3); Cluster A = {(1, 0, 1), (1, 3, 5)} has centroid (1, 1.5, 3)
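The centroid arithmetic for these small examples can be checked in R with colMeans(); the points below are the ones given above.

```r
cluster1 <- rbind(c(3, 0), c(1, 2), c(4, 6))
colMeans(cluster1)   # centroid of Cluster 1: (8/3, 8/3), i.e. (2.67, 2.67)

clusterA <- rbind(c(1, 0, 1), c(1, 3, 5))
colMeans(clusterA)   # centroid of Cluster A: (1, 1.5, 3)
```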

17
Distance between clusters
§ Minimum distance
– Distance between points from different clusters that are the closest to


each other

18
Normalize data

§ Distance is influenced by the scale of variables, so we usually normalize by subtracting the mean and dividing by the standard deviation
– For example, if two attributes of companies are annual revenue (in dollars) and age of the company (in years), the revenue variable would dominate in calculating distance.
– e.g., Company A: ($2 million, 30 years) and Company B: ($3 million, 40 years): the revenue difference swamps the age difference unless the variables are normalized.
– However, if all variables are on the same scale, then normalization is not necessary.

§ For company $i$ with attributes $(x_i, y_i)$, the normalized attributes are $\left(\frac{x_i - \bar{x}}{\sigma_x}, \frac{y_i - \bar{y}}{\sigma_y}\right)$

§ Normalization reflects the relative importance of each variable

§ Not required in HW4

19
K-means clustering
Partition a dataset into k distinct, non-overlapping clusters in which each
observation belongs to the cluster with the nearest cluster centroid
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids
4. Re-assign each point to the closest cluster centroid
5. Re-compute cluster centroids
6. Repeat 4 and 5 until no improvement is made
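A minimal sketch of running this procedure in R with the built-in kmeans() function. The movie_genres matrix (one row per movie, one 0/1 column per genre) is a hypothetical placeholder for the MovieLens data; setting algorithm = "Lloyd" makes kmeans() follow the centroid-reassignment steps listed above, while the default algorithm is a refined variant.

```r
set.seed(1)  # the random initial assignment makes results depend on the seed

# movie_genres: hypothetical matrix with one row per movie, one 0/1 column per genre
# movie_genres <- scale(movie_genres)  # normalize first if columns were on different scales

km <- kmeans(movie_genres, centers = 10, algorithm = "Lloyd")  # 10 clusters, chosen arbitrarily here

km$size          # number of movies in each cluster
km$centers       # cluster centroids (average genre profile of each cluster)
head(km$cluster) # cluster assignment of the first few movies
```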

20
K-means clustering algorithm
1. Specify the desired number of clusters k

21
K-means clustering algorithm
1. Specify the desired number of clusters k (here, k = 2)

2. Randomly assign each data point to a cluster

22
K-means clustering algorithm
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids


23

K-means clustering algorithm
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids
4. Re-assign each point to the closest cluster centroid

24
K-means clustering algorithm
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids
4. Re-assign each point to the closest cluster centroid
5. Re-compute cluster centroids
6. Repeat 4 and 5 until no improvement is made

25
Practical considerations

• The number of clusters k can be selected from prior knowledge or by experimenting.

• Can strategically select initial partition of points into clusters if you have
some knowledge about the data

• Can run algorithm several times with different random starting points

26
Hierarchical clustering

Build a hierarchy of clusters: Each observation starts in its own cluster, and
pairs of clusters are merged as one moves up the hierarchy
• A disadvantage of k-means clustering: need to pre-specify k
• Hierarchical clustering has no such requirement.

1. Start with each data point in its own cluster (containing only itself)
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters
6. Repeat 4 and 5 until all data points belong to one single cluster
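A minimal R sketch of these steps, using the built-in dist() and hclust() functions; movie_genres is again a hypothetical 0/1 genre matrix, and the method argument controls how the distance between clusters is defined (for example, "single" is the minimum distance described earlier).

```r
d  <- dist(movie_genres, method = "euclidean")  # distances between all pairs of movies
hc <- hclust(d, method = "single")              # "single" = minimum distance between clusters

plot(hc)                        # draw the dendrogram
clusters <- cutree(hc, k = 10)  # cut the tree into 10 clusters
table(clusters)                 # cluster sizes
```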

27
Hierarchical clustering

1. Start with each data point in its own cluster (containing only itself)

28
Hierarchical clustering
1. Start with each data point in its own cluster (containing only itself)
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters

29

Hierarchical clustering
1. Start with each data point in its own cluster (containing only itself)
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters

30
Hierarchical clustering
1. Start with each data point in its own cluster (containing only itself)
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters

31
Hierarchical clustering
1. Start with each data point in its own cluster (containing only itself)
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters
6. Repeat 4 and 5 until all data points belong to one single cluster

32
Display cluster process
The process by which data points are combined is shown with grey
lines.

Height of vertical lines represents distance between points or clusters.

Data points are listed along the bottom.
33
Select clusters
Hierarchical clustering is done without first selecting # clusters desired.
How can we select # clusters from the dendrogram?
• Draw a horizontal line cutting across the dendrogram
• # vertical lines that the horizontal line crosses = # clusters
• Good choice if the horizontal line has a lot of “wiggle room”
• More “wiggle room” means two clusters are farther from each
other.
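The "horizontal line" idea can be tried directly on the hclust object from the earlier sketch: abline() draws the cut line on the dendrogram and cutree() returns the resulting clusters (the height of 3 below is an arbitrary illustration).

```r
plot(hc)                    # dendrogram from the earlier hclust() call
abline(h = 3, col = "red")  # horizontal cut line at an arbitrary height

cutree(hc, h = 3)   # clusters obtained by cutting at that height
cutree(hc, k = 4)   # or ask for a specific number of clusters directly
```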

34
Example: Select clusters

35
Comparison between the two methods
K-means clustering
• Pros
• Simple and fast
• Works well with both small and large datasets
• Cons
• # clusters has to be pre-specified
• Outliers may skew the centroid positions, resulting in poor clustering
Hierarchical clustering
• Pros
• Dendrogram is intuitive and provides a rich structure of the data
• Dendrogram can be used to assist in selecting # clusters (depending on
specific purpose as well)
• Cons
• Requires a lot of computation power (due to distance computations)
• Does not work well if datasets are too large
• Sensitive to the metric of distance between points and between clusters

37
Beyond movies: Mass personalization
“If I have 3 million customers on the web, I should have 3 million
stores on the web.” – Jeff Bezos, CEO of Amazon.com

• Recommendation systems build models about users’ preferences to


personalize the user experience.

• Help users find items they might not have searched for:
• A new favorite band
• An old friend who uses the same social media network
• A book or song they are likely to enjoy

38
More about Amazon
Item-to-item collaborative filtering

• Mixed system: “similar” items purchased by “similar customers”


• Content filtering: consider items to be similar if they have similar
attributes
• e.g. director/genre/actor of movies
• Amazon also considers items to be similar if they are “complements”
• Complements: customers tend to purchase them together
• Take into account both items’ attributes and customers’ behavior
• A variation is used by YouTube.

39
Example: Amazon’s recommendation

40
What is the edge?
§ In today’s digital age, businesses often have hundreds of thousands of items
to offer their customers

§ Excellent recommendation systems can make or break these businesses

§ Clustering algorithms, which are tailored to find similar customers or


similar items, form the backbone of many of these recommendation
systems

41
Recommendation methods used

Collaborative filtering: Netflix, Spotify, Amazon.com, LinkedIn, YouTube, Facebook

Content filtering: Netflix, IMDb, Rotten Tomatoes, Pandora (music and podcasts)

42
Produce useful recommendations

• Recommendation systems are potentially helpful when we browse


websites but may not have a specific item in mind.
• In today’s digital age, businesses often have thousands of items to offer
to their customers.
• Recommendation systems can make or break these businesses.
• Clustering algorithms, which are tailored to find similar customers or
similar items, form the backbone of many recommendation systems.

43
R output of Oscars

[R summary output of the logistic regression model]

44
Logistic Regression - Oscar

1. We run a logistic regression model predicting the likelihood of winning an Oscar as a function of the number of nominations and the number of Golden Globe wins. What is the estimated model?

$P(Y = 1 \mid \text{Nom}, \text{GG}) = \dfrac{1}{1 + e^{-(-6.99 + 0.503\,\cdot\,\text{Nom} + 0.796\,\cdot\,\text{GG})}}$

(equivalently, Logit = -6.99 + 0.503 · Nominations + 0.796 · Golden Globe wins)

2. Test H0: coefficient of OscarNominations = 0 versus HA: coefficient of OscarNominations ≠ 0 at a significance level of 0.001
• z-value = 0.5034/0.125 = 4.0272
• P-value = 2 · P(Z < −|z-value|) ≈ 0.00006
• P-value for the two-tail test < 0.001
• Reject H0.

3. Which variables are significant predictors of the likelihood of winning an Oscar? Which are not? (Significance level: 0.05)
• Oscar nominations and Golden Globe wins are both positive, significant predictors.
45
Logistic Regression - Oscar

4. Do you think the regression model has a multicollinearity problem?

To check, we can look at the VIF values. Both are smaller than 10, so there is no multicollinearity problem.
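One common way to run this check in R is the vif() function from the car package, applied to the fitted model (here the oscar_model object from the earlier sketch); values well below 10 are usually taken as no cause for concern.

```r
# install.packages("car")   # if the package is not installed yet
library(car)

vif(oscar_model)   # variance inflation factors for Nominations and GGWins
```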

46
Logistic Regression - Oscar


5. Use the model to predict the probability of the following three movies
winning an Oscar.

Movie Noms. GG wins Pr(Win)


The Artist 10 3
Midnight in Paris 4 1
Moneyball 6 0

6. We choose threshold t = 0.5, and the confusion matrix is shown below (1 means Oscar winner). What is the false positive rate? What is the overall accuracy?

            Predicted = 0   Predicted = 1
Actual = 0      111              7
Actual = 1       13             14

47
Logistic Regression - Oscar

5. Use the model to predict the probability of the following three movies winning an Oscar.

Movie               Noms.   GG wins   Pr(Win)
The Artist            10       3       0.605
Midnight in Paris      4       1       0.015
Moneyball              6       0       0.018

6. We choose threshold t = 0.5, and the confusion matrix is shown below (1 means Oscar winner). What is the false positive rate? What is the overall accuracy?

            Predicted = 0   Predicted = 1
Actual = 0    111 (TN)          7 (FP)
Actual = 1     13 (FN)         14 (TP)

False positive rate = 7/(111+7) = 0.059
Overall accuracy = (111+14)/(111+7+13+14) = 0.862
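The predicted probabilities and the two rates above can be reproduced with a few lines of R; plogis() is the logistic function 1/(1 + e^(-x)), and the coefficients are taken from the estimated model on the earlier slide.

```r
# estimated model: logit = -6.99 + 0.5034 * Nominations + 0.796 * GG wins
prob_win <- function(noms, gg) plogis(-6.99 + 0.5034 * noms + 0.796 * gg)

prob_win(10, 3)   # The Artist:        about 0.605
prob_win(4, 1)    # Midnight in Paris: about 0.015
prob_win(6, 0)    # Moneyball:         about 0.018

# confusion matrix counts at threshold t = 0.5
TN <- 111; FP <- 7; FN <- 13; TP <- 14
FP / (TN + FP)                     # false positive rate: 0.059
(TN + TP) / (TN + FP + FN + TP)    # overall accuracy:    0.862
```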

48
Clustering

§ Two data points: X = (3, 0, 1), Y = (2, 1, 3). What is the Euclidean distance
between X and Y?

§ True or False: The first step of Hierarchical clustering is to randomly assign


each data point to a cluster.

49
Clustering

§ Two data points: X = (3, 0, 1), Y = (2, 1, 3). What is the Euclidean distance
between X and Y?

$\sqrt{(3-2)^2 + (0-1)^2 + (1-3)^2} = \sqrt{6} \approx 2.45$

§ True or False: The first step of Hierarchical clustering is to randomly assign


each data point to a cluster.

False. The first step of hierarchical clustering is to start with each data point in its own cluster (containing only itself).

50
