
IIMT 2641 Introduction to Business Analytics

Module 5: Clustering

1
Netflix
§ Subscription services
§ Key aspect is being able to offer customers accurate movie
recommendations based on a customer’s own preferences and viewing
history

2
Using other users’ rankings: Collaborative Filtering

§ Consider suggesting to Carl that he watch "Men in Black", since Amy rated
it highly and Carl and Amy seem to have similar preferences

3
Using movie information: Content Filtering

§ We saw that Amy liked "Men In Black”


– It was directed by Barry Sonnenfeld
– Classified in the genres of action, adventure, sci-fi and comedy
– It stars actor Will Smith

§ Consider recommending to Amy:


– Barry Sonnenfeld’s movie "Get Shorty"
– "Jurassic Park", which is in the genres of action, adventure, and sci-fi
– Will Smith’s movie "Hitch"

4
MovieLens Data
§ www.movielens.org is a movie recommendation website run by the
GroupLens Research Lab at the University of Minnesota.

§ They collect user preferences about movies and do collaborative filtering to


make recommendations.

§ We will use their movie database to try content filtering by clustering.


– Recommendation is more than predicting whether customer A likes a particular
movie . . .

5
MovieLens item dataset
Movies in the dataset are categorized as belonging to different genres.

Action Adventure Animation Children Comedy Crime


Documentary Drama Fantasy Film Noir Horror Musical
Mystery Romance Sci-Fi Thriller War Western

§ Each movie may belong to many genres.


Can we systematically find groups of movies with similar sets of genres?

6
Clustering
A task of unsupervised learning: segment data and assign those with similar
traits into the same groups (not prediction)

§ Identify hidden pattern/underlying trend/unusual characteristics, detect


anomalies
– Result: high intra-group similarity and low inter-group similarity
§ No particular outcome to predict in mind
– Training data has not been pre-labelled into a known class.
§ Compare: supervised learning ≈ predictive analytics
– Predict the outcome of an unknown object
– Linear regression, logistic regression, etc.

7
Visualize Clustering

8
An intermediate step for prediction
Clustering can be done before prediction, which can lead to higher accuracy.

§ Reveal patterns and relevant features/predictors


– Clustering medical history of patients may provide indicative signals about
heart attack.
– Clustering viewing history of customers may reveal their preferences and
recommend movies of interest.
§ Cluster data into groups and build a predictive model for each group
– Be careful not to overfit your model! (works best with large datasets)

9
Applications
§ What applications can you think of?

10
Applications
§ What applications can you think of?

§ Marketing: customer segmentation (discovery of distinct groups of


customers) for target marketing
§ Fraud detection: identify peer groups and normal behavior in those groups,
and then investigate outliers in each group
§ Insurance: identify customer groups with high average claim cost
§ Stock selection: identify groups of stocks that have similar trends

11
Types of clustering methods
§ There are many different algorithms for clustering.
– Differ in what makes a cluster and how to find them

§ We will cover the two most popular methods


– K-means clustering
– Hierarchical clustering

12
Distance between points
Need to define distance between two data points $i = (x_{i1}, x_{i2}, \ldots, x_{iK})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jK})$.

§ A natural choice is "Euclidean distance"

§ Distance between points i and j with K attributes is

$d_{ij} = \sqrt{\sum_{k=1}^{K} (x_{ik} - x_{jk})^2}$

– $k = 1, \ldots, K$: different attributes
– $x_{ik}$: value of point $i$ in the $k$-th attribute

13
Distance Example

Need to define distance between two data points. Each coordinate of a movie's vector corresponds to a genre: 1 means the movie belongs to that genre, 0 means it does not.

§ The movie "Toy Story" is categorized as Adventure, Animation, Children's, Comedy, and Fantasy:
– Toy Story: (0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)

§ The movie "Batman Forever" is categorized as Action, Adventure, Comedy, and Crime:
– Batman Forever: (1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
14
Distance between “Toy Story” and “Batman Forever”

Toy Story: (0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Batman Forever: (1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

d = ?
15
Distance between “Toy Story” and “Batman Forever”
Toy Story: (0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Batman Forever: (1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

"=
0−1 ' + 1−1 ' + ⋯+ 1 − 0 ' +⋯= 5

J (1-1) +--(1-8)+-
+

Other popular distance metrics (not required):


-...
=

Manhattan Distance: Sum of absolute values instead of squares


Maximum Coordinate Distance: Only consider attribute in which data points
deviate the most i (i die
.
<in)
,
....

j (35) <52 -- (n)


.

, .

max
I kiy-sjy)
16
y
=

1 ,
.

-, R .
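These distances can be checked with R's built-in dist() function. A minimal sketch, with the two genre vectors typed in by hand in the same 18-genre order as above:

```r
# Genre indicator vectors for the two movies, in the 18-genre order listed earlier
toy_story      <- c(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
batman_forever <- c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

two_movies <- rbind(toy_story, batman_forever)

dist(two_movies, method = "euclidean")  # sqrt(5), approx. 2.24
dist(two_movies, method = "manhattan")  # sum of absolute differences = 5
dist(two_movies, method = "maximum")    # largest single-coordinate difference = 1
```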
Distance between clusters

§ Centroid distance
– Distance between centroids of clusters
– Centroid: the point whose value in each attribute is the average of the data points in the cluster
– e.g., Cluster 1 = {(3, 0), (1, 2), (4, 6)} has centroid (8/3, 8/3); Cluster A = {(1, 0, 1), (1, 3, 5)} has centroid (1, 1.5, 3)
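The centroid arithmetic for these small examples can be checked in R with colMeans(); the points below are the ones given above.

```r
cluster1 <- rbind(c(3, 0), c(1, 2), c(4, 6))
colMeans(cluster1)   # centroid of Cluster 1: (8/3, 8/3), i.e. (2.67, 2.67)

clusterA <- rbind(c(1, 0, 1), c(1, 3, 5))
colMeans(clusterA)   # centroid of Cluster A: (1, 1.5, 3)
```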

17
Distance between clusters
§ Minimum distance
– Distance between points from different clusters that are the closest to


each other

18
Normalize data

§ Distance is influenced by the scale of variables, so we usually normalize by subtracting the mean and dividing by the standard deviation
– For example, if two attributes of companies are annual revenue (in dollars) and age of the company (in years), the revenue variable would dominate in calculating distance.
– e.g., Company A: ($2 million, 30 years) and Company B: ($3 million, 40 years): the revenue difference swamps the age difference unless the variables are normalized.
– However, if all variables are on the same scale, then normalization is not necessary.

§ For company $i$ with attributes $(x_i, y_i)$, the normalized attributes are $\left(\frac{x_i - \bar{x}}{\sigma_x}, \frac{y_i - \bar{y}}{\sigma_y}\right)$

§ Normalization reflects the relative importance of each variable

§ Not required in HW4

19
K-means clustering
Partition a dataset into k distinct, non-overlapping clusters in which each
observation belongs to the cluster with the nearest cluster centroid
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids
4. Re-assign each point to the closest cluster centroid
5. Re-compute cluster centroids
6. Repeat 4 and 5 until no improvement is made
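A minimal sketch of running this procedure in R with the built-in kmeans() function. The movie_genres matrix (one row per movie, one 0/1 column per genre) is a hypothetical placeholder for the MovieLens data; setting algorithm = "Lloyd" makes kmeans() follow the centroid-reassignment steps listed above, while the default algorithm is a refined variant.

```r
set.seed(1)  # the random initial assignment makes results depend on the seed

# movie_genres: hypothetical matrix with one row per movie, one 0/1 column per genre
# movie_genres <- scale(movie_genres)  # normalize first if columns were on different scales

km <- kmeans(movie_genres, centers = 10, algorithm = "Lloyd")  # 10 clusters, chosen arbitrarily here

km$size          # number of movies in each cluster
km$centers       # cluster centroids (average genre profile of each cluster)
head(km$cluster) # cluster assignment of the first few movies
```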

20
K-means clustering algorithm
1. Specify the desired number of clusters k

21
K-means clustering algorithm
1. Specify the desired number of clusters k (here, k = 2)

2. Randomly assign each data point to a cluster

22
K-means clustering algorithm
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids


23

K-means clustering algorithm
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids
4. Re-assign each point to the closest cluster centroid

24
K-means clustering algorithm
1. Specify the desired number of clusters k
2. Randomly assign each data point to a cluster
3. Compute cluster centroids
4. Re-assign each point to the closest cluster centroid
5. Re-compute cluster centroids
6. Repeat 4 and 5 until no improvement is made

25
Practical considerations

• The number of clusters k can be selected from prior knowledge or by experimenting.

• Can strategically select initial partition of points into clusters if you have
some knowledge about the data

• Can run algorithm several times with different random starting points

26
Hierarchical clustering

Build a hierarchy of clusters: Each observation starts in its own cluster, and
pairs of clusters are merged as one moves up the hierarchy
• A disadvantage of k-means clustering: need to pre-specify k
• Hierarchical clustering has no such requirement.

1. Start with each data point in its own cluster (containing only itself)
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters
6. Repeat 4 and 5 until all data points belong to one single cluster
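A minimal R sketch of these steps, using the built-in dist() and hclust() functions; movie_genres is again a hypothetical 0/1 genre matrix, and the method argument controls how the distance between clusters is defined (for example, "single" is the minimum distance described earlier).

```r
d  <- dist(movie_genres, method = "euclidean")  # distances between all pairs of movies
hc <- hclust(d, method = "single")              # "single" = minimum distance between clusters

plot(hc)                        # draw the dendrogram
clusters <- cutree(hc, k = 10)  # cut the tree into 10 clusters
table(clusters)                 # cluster sizes
```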

27
Hierarchical clustering

1. Start with each data point in its own cluster (containing only itself)

28
Hierarchical clustering
1. Start with each data point in its own cluster (containing only itself)
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters

29

Hierarchical clustering
1. Start with each data point in its own cluster (containing only itself)
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters

30
Hierarchical clustering
1. Start with each data point in its own cluster (containing only itself)
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters

31
Hierarchical clustering
1. Start with each data point in its own cluster (containing only itself)
2. Compute the distance between each pair of clusters
3. Combine the two nearest clusters
4. Re-compute the distance between each pair of clusters
5. Combine the two nearest clusters
6. Repeat 4 and 5 until all data points belong to one single cluster

32
Display cluster process
The process by which data points are combined is shown with grey
lines.

Height of vertical lines represents distance between points or clusters.

Data points are listed along the bottom.
33
Select clusters
Hierarchical clustering is done without first selecting # clusters desired.
How can we select # clusters from the dendrogram?
• Draw a horizontal line cutting across the dendrogram
• # vertical lines that the horizontal line crosses = # clusters
• Good choice if the horizontal line has a lot of “wiggle room”
• More “wiggle room” means two clusters are farther from each
other.
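The "horizontal line" idea can be tried directly on the hclust object from the earlier sketch: abline() draws the cut line on the dendrogram and cutree() returns the resulting clusters (the height of 3 below is an arbitrary illustration).

```r
plot(hc)                    # dendrogram from the earlier hclust() call
abline(h = 3, col = "red")  # horizontal cut line at an arbitrary height

cutree(hc, h = 3)   # clusters obtained by cutting at that height
cutree(hc, k = 4)   # or ask for a specific number of clusters directly
```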

34
Example: Select clusters

35
Comparison between the two methods
K-means clustering
• Pros
• Simple and fast
• Works well with both small and large datasets
• Cons
• # clusters has to be pre-specified
• Outliers may skew the centroid positions, resulting in poor clustering
Hierarchical clustering
• Pros
• Dendrogram is intuitive and provides a rich structure of the data
• Dendrogram can be used to assist in selecting # clusters (depending on
specific purpose as well)
• Cons
• Requires a lot of computation power (due to distance computations)
• Does not work well if datasets are too large
• Sensitive to the metric of distance between points and between clusters

37
Beyond movies: Mass personalization
“If I have 3 million customers on the web, I should have 3 million
stores on the web.” – Jeff Bezos, CEO of Amazon.com

• Recommendation systems build models about users’ preferences to


personalize the user experience.

• Help users find items they might not have searched for:
• A new favorite band
• An old friend who uses the same social media network
• A book or song they are likely to enjoy

38
More about Amazon
Item-to-item collaborative filtering

• Mixed system: “similar” items purchased by “similar customers”


• Content filtering: consider items to be similar if they have similar
attributes
• e.g. director/genre/actor of movies
• Amazon also considers items to be similar if they are “complements”
• Complements: customers tend to purchase them together
• Take into account both items’ attributes and customers’ behavior
• A variation is used by YouTube.

39
Example: Amazon’s recommendation

40
What is the edge?
§ In today’s digital age, businesses often have hundreds of thousands of items
to offer their customers

§ Excellent recommendation systems can make or break these businesses

§ Clustering algorithms, which are tailored to find similar customers or


similar items, form the backbone of many of these recommendation
systems

41
Recommendation methods used

Collaborative filtering: Netflix, Spotify, Amazon.com, LinkedIn, YouTube, Facebook

Content filtering: Netflix, IMDb, Rotten Tomatoes, Pandora (music and podcasts)

42
Produce useful recommendations

• Recommendation systems are potentially helpful when we browse


websites but may not have a specific item in mind.
• In today’s digital age, businesses often have thousands of items to offer
to their customers.
• Recommendation systems can make or break these businesses.
• Clustering algorithms, which are tailored to find similar customers or
similar items, form the backbone of many recommendation systems.

43
R output of Oscars

[R summary output of the logistic regression model]

44
Logistic Regression - Oscar

1. We run a logistic regression model predicting the likelihood of winning an Oscar as a function of the number of nominations and the number of Golden Globe wins. What is the estimated model?

$P(Y = 1 \mid \text{Nom}, \text{GG}) = \dfrac{1}{1 + e^{-(-6.99 + 0.503\,\cdot\,\text{Nom} + 0.796\,\cdot\,\text{GG})}}$

(equivalently, Logit = -6.99 + 0.503 · Nominations + 0.796 · Golden Globe wins)

2. Test H0: coefficient of OscarNominations = 0 versus HA: coefficient of OscarNominations ≠ 0 at a significance level of 0.001
• z-value = 0.5034/0.125 = 4.0272
• P-value = 2 · P(Z < −|z-value|) ≈ 0.00006
• P-value for the two-tail test < 0.001
• Reject H0.

3. Which variables are significant predictors of the likelihood of winning an Oscar? Which are not? (Significance level: 0.05)
• Oscar nominations and Golden Globe wins are both positive, significant predictors.
45
Logistic Regression - Oscar

4. Do you think the regression model has a multicollinearity problem?

To check, we can look at the VIF values. Both are smaller than 10, so there is no multicollinearity problem.
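One common way to run this check in R is the vif() function from the car package, applied to the fitted model (here the oscar_model object from the earlier sketch); values well below 10 are usually taken as no cause for concern.

```r
# install.packages("car")   # if the package is not installed yet
library(car)

vif(oscar_model)   # variance inflation factors for Nominations and GGWins
```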

46
Logistic Regression - Oscar


5. Use the model to predict the probability of the following three movies
winning an Oscar.

Movie Noms. GG wins Pr(Win)


The Artist 10 3
Midnight in Paris 4 1
Moneyball 6 0

6. We choose threshold t = 0.5, and the confusion matrix is shown below (1 means Oscar winner). What is the false positive rate? What is the overall accuracy?

            Predicted = 0   Predicted = 1
Actual = 0      111              7
Actual = 1       13             14

47
Logistic Regression - Oscar

5. Use the model to predict the probability of the following three movies winning an Oscar.

Movie               Noms.   GG wins   Pr(Win)
The Artist            10       3       0.605
Midnight in Paris      4       1       0.015
Moneyball              6       0       0.018

6. We choose threshold t = 0.5, and the confusion matrix is shown below (1 means Oscar winner). What is the false positive rate? What is the overall accuracy?

            Predicted = 0   Predicted = 1
Actual = 0    111 (TN)          7 (FP)
Actual = 1     13 (FN)         14 (TP)

False positive rate = 7/(111+7) = 0.059
Overall accuracy = (111+14)/(111+7+13+14) = 0.862
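The predicted probabilities and the two rates above can be reproduced with a few lines of R; plogis() is the logistic function 1/(1 + e^(-x)), and the coefficients are taken from the estimated model on the earlier slide.

```r
# estimated model: logit = -6.99 + 0.5034 * Nominations + 0.796 * GG wins
prob_win <- function(noms, gg) plogis(-6.99 + 0.5034 * noms + 0.796 * gg)

prob_win(10, 3)   # The Artist:        about 0.605
prob_win(4, 1)    # Midnight in Paris: about 0.015
prob_win(6, 0)    # Moneyball:         about 0.018

# confusion matrix counts at threshold t = 0.5
TN <- 111; FP <- 7; FN <- 13; TP <- 14
FP / (TN + FP)                     # false positive rate: 0.059
(TN + TP) / (TN + FP + FN + TP)    # overall accuracy:    0.862
```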

48
Clustering

§ Two data points: X = (3, 0, 1), Y = (2, 1, 3). What is the Euclidean distance
between X and Y?

§ True or False: The first step of Hierarchical clustering is to randomly assign


each data point to a cluster.

49
Clustering

§ Two data points: X = (3, 0, 1), Y = (2, 1, 3). What is the Euclidean distance
between X and Y?

$\sqrt{(3-2)^2 + (0-1)^2 + (1-3)^2} = \sqrt{6} \approx 2.45$

§ True or False: The first step of Hierarchical clustering is to randomly assign


each data point to a cluster.

False. The first step of hierarchical clustering is to start with each data point in its own cluster (containing only itself).

50
