AIMLB PGP 2024 – Session 12

Artificial Intelligence and Machine Learning for Business (AIMLB)
Mukul Gupta (Information Systems Area)
What is customer segmentation?
• Grouping customers based on shared characteristics.
• This allows companies to refine their messaging, sales strategies, and products to target, advertise, and sell to those audiences more effectively.
Customer segmentation strategy
• STP approach: Segmentation, Targeting, and Positioning
Customer Segmentation Techniques
• Roughly, three categories:
  • Rule-based segmentation
    • Based on manually designed rules.
    • Segmentation is not portable to other analyses, so with a new goal, new knowledge, or new data, the whole rule system needs to be redesigned.
    • For example, you might categorize customers by the number of days since their first order (e.g., new vs. established customers).
  • Segmentation using binning
    • Binning data based on one or more features (see the sketch after this list).
    • This does not necessarily require domain knowledge, but the business goal must be clear.
  • Segmentation with zero knowledge
    • Common clustering algorithms can be applied.
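For instance, a minimal sketch of the first two approaches in Python, assuming a hypothetical customer table with columns days_since_first_order and annual_spend (the column names, thresholds, and bin counts are illustrative, not from the session):

```python
import pandas as pd

# Hypothetical customer data; values are illustrative only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "days_since_first_order": [12, 95, 400, 30, 800],
    "annual_spend": [120.0, 560.0, 2300.0, 90.0, 4100.0],
})

# Rule-based segmentation: manually designed thresholds on customer tenure.
def tenure_rule(days):
    if days <= 30:
        return "new"
    if days <= 365:
        return "active"
    return "established"

customers["tenure_segment"] = customers["days_since_first_order"].apply(tenure_rule)

# Binning-based segmentation: quartile bins on spend, no hand-written rules.
customers["spend_segment"] = pd.qcut(
    customers["annual_spend"], q=4, labels=["low", "mid", "high", "top"]
)

print(customers)
```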
What is Cluster Analysis?
• Given a set of objects, place them in groups such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Intra-cluster distances are minimized; inter-cluster distances are maximized
Applications of Cluster Analysis
• Understanding
  • Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
  • Example: discovered clusters of stocks that moved together, labelled by industry group:
    • Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
    • Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
    • Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
    • Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
• Summarization
  • Reduce the size of large data sets (e.g., clustering precipitation data in Australia)
Notion of a Cluster can be Ambiguous
• How many clusters? The same set of points can reasonably be grouped into two, four, or six clusters.

Types of Clustering
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
  • Partitional Clustering: a division of data objects into non-overlapping subsets (clusters)
  • Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
Partitional Clustering
(Figure: original points and one possible partitional clustering.)

Hierarchical Clustering
(Figure: nested clusters over points p1–p4 and the corresponding dendrogram.)

K-means Clustering
• Partitional clustering approach
• Number of clusters, 𝐾, must be specified
• Each cluster is associated with a centroid (center point)
• The center of a cluster is often a centroid, the average of all the points
in the cluster
• Each point is assigned to the cluster with the closest centroid
• The basic algorithm is very simple
Example of K-means Clustering
(Figure: cluster assignments and centroids over iterations 1–6; the centroids move at each step until the assignments stabilize.)
K-means Clustering – Details
• Simple iterative algorithm (a minimal sketch in code follows this slide):
  • Choose initial centroids;
  • repeat {assign each point to the nearest centroid; re-compute cluster centroids}
  • until centroids stop changing.
• Initial centroids are often chosen randomly.
  • Clusters produced can vary from one run to another
• K-means will converge for common proximity measures with an appropriately defined centroid.
• Most of the convergence happens in the first few iterations.
  • Often the stopping condition is changed to 'until relatively few points change clusters'
• Complexity is $O(n \cdot K \cdot I \cdot d)$
  • $n$ = number of points, $K$ = number of clusters, $I$ = number of iterations, $d$ = number of attributes
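A minimal NumPy sketch of this basic algorithm — random initial centroids, then alternating assignment and centroid re-computation until the centroids stop changing. It is illustrative only, not code from the session:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic K-means: X is an (n, d) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Choose K distinct data points as the initial centroids (random initialization).
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped changing
        centroids = new_centroids
    return labels, centroids

# Tiny example: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, K=2)
print(centroids)
```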
K-means Objective Function
• A common objective function (used with the Euclidean distance measure) is the Sum of Squared Errors (SSE)
  • For each point, the error is the distance to the nearest cluster center
  • To get the SSE, we square these errors and sum them:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

  • $x$ is a data point in cluster $C_i$ and $m_i$ is the centroid (mean) of cluster $C_i$
• SSE improves in each iteration of K-means until it reaches a local or global minimum.
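As a small illustration (a sketch, not from the session), the SSE can be computed directly from the points, the cluster labels, and the centroids; scikit-learn's KMeans reports the same quantity as `inertia_`:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances of each point to its assigned centroid (SSE)."""
    return float(np.sum((X - centroids[labels]) ** 2))

# Tiny worked example: two 1-D clusters {1, 2} and {4, 5} with centroids 1.5 and 4.5.
X = np.array([[1.0], [2.0], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5], [4.5]])
print(sse(X, labels, centroids))  # 1.0
```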
Two Different K-means Clusterings
(Figure: the same original points and two different K-means results — an optimal clustering and a sub-optimal clustering.)


Importance of Choosing Initial Centroids
(Figure: a different choice of initial centroids on the same data; the algorithm converges within five iterations, illustrating how the starting centroids determine which clustering is found.)
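In practice, a common mitigation (shown here as a sketch assuming scikit-learn; the data and parameters are illustrative) is to run K-means from many random initializations and keep the run with the lowest SSE, or to use the k-means++ seeding scheme; scikit-learn's `KMeans` supports both via `n_init` and `init`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with three blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in [(0, 0), (5, 0), (0, 5)]])

# k-means++ seeding plus 10 restarts; the run with the lowest SSE (inertia_) is kept.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)         # SSE of the best run
print(km.cluster_centers_)
```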
Limitations of K-means
• K-means has problems when clusters are of differing
  • Sizes
  • Densities
  • Non-globular shapes
• K-means has problems when the data contains outliers.
  • One possible solution is to remove outliers before clustering
Limitations of K-means: Differing Sizes
(Figure: original points vs. K-means with 3 clusters.)

Limitations of K-means: Differing Density
(Figure: original points vs. K-means with 3 clusters.)

Limitations of K-means: Non-globular Shapes
(Figure: original points vs. K-means with 2 clusters.)
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  • A tree-like diagram that records the sequences of merges or splits
(Figure: a clustering of six points and the corresponding dendrogram; the height of each join records when the merge occurred.)
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
  • Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• They may correspond to meaningful taxonomies
  • Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
  • Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or 𝑘 clusters) is left
  • Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains an individual point (or there are 𝑘 clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  • Merge or split one cluster at a time
Agglomerative Clustering Algorithm
• Key Idea: Successively merge the closest clusters
• Basic algorithm (a sketch in code follows this slide):
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  • Different approaches to defining the distance between clusters distinguish the different algorithms
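A minimal sketch of this agglomerative procedure using SciPy; the session does not name a library, and the data, metric, and linkage choices below are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy 2-D points forming two obvious groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# Step 1: proximity (condensed distance matrix); steps 3-6: successive merges.
D = pdist(X, metric="euclidean")
Z = linkage(D, method="average")   # group-average linkage; 'single' or 'complete' are other options

# Cut the resulting tree to obtain a desired number of clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree, given a plotting backend.
```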
Steps 1 and 2
• Start with clusters of individual points and a proximity matrix
(Figure: individual points p1–p12 and the initial proximity matrix indexed by p1, p2, p3, …)
Intermediate Situation
• After some merging steps, we have some clusters
(Figure: five clusters C1–C5 and the current proximity matrix indexed by C1, …, C5.)
Step 4
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
(Figure: the proximity matrix with the C2–C5 entry highlighted as the smallest inter-cluster distance.)
Step 5
• The question is "How do we update the proximity matrix?"
(Figure: after merging C2 and C5, the proximities of the new cluster C2 ∪ C5 to C1, C3, and C4 are unknown ("?") and must be recomputed.)
How to Define Inter-Cluster Similarity
(Figure, repeated across several slides: two clusters and their proximity matrix over p1–p5, highlighting which pairwise distances each definition below uses.)
• MIN (Single linkage)
• MAX (Complete linkage)
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  • Ward's Method uses squared error
MIN, MAX, and Group Average
• MIN (single linkage)
  • Can handle non-elliptical shapes
  • Sensitive to noise and outliers
• MAX (complete linkage)
  • Less susceptible to noise and outliers
  • Tends to break large clusters and is biased towards globular clusters
• Group average
  • Compromise between MIN and MAX
  • Less susceptible to noise and outliers
  • Biased towards globular clusters
Hierarchical Clustering: Comparison
(Figure: the same six points clustered with MIN, MAX, Group Average, and Ward's Method; the choice of linkage changes which points are grouped together and the order of the merges.)
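To make the comparison concrete, a small sketch (assuming scikit-learn; the data and parameters are illustrative, not from the session) clusters the same points with single, complete, average, and Ward linkage:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
# Two elongated, noisy bands, a case where the linkage choice matters.
X = np.vstack([
    np.column_stack([np.linspace(0, 4, 60), rng.normal(0.0, 0.1, 60)]),
    np.column_stack([np.linspace(0, 4, 60), rng.normal(1.5, 0.1, 60)]),
])

for method in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X)
    # Count how many points land in each of the two clusters under this linkage.
    print(method, np.bincount(labels))
```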
Hierarchical Clustering: Time and Space Requirements
• $O(N^2)$ space, since it uses the proximity matrix
  • $N$ is the number of points
• $O(N^3)$ time in many cases
  • There are $N$ steps, and at each step the proximity matrix, of size $N^2$, must be updated and searched
  • Complexity can be reduced to $O(N^2 \log N)$ time with some cleverness
Hierarchical Clustering: Problems and Limitations
• Once a decision is made to combine two clusters, it cannot be undone
• No global objective function is directly minimized
• Different schemes have problems with one or more of the following:
  • Sensitivity to noise
  • Difficulty handling clusters of different sizes and non-globular shapes
  • Breaking large clusters
Measures of Cluster Validity: Cohesion and Separation
• Cluster Cohesion: measures how closely related the objects in a cluster are
  • Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
  • Example: between-cluster sum of squares (SSB)
• Cohesion is measured by the within-cluster sum of squares (SSE):

$$SSE = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$$

• Separation is measured by the between-cluster sum of squares (SSB):

$$SSB = \sum_{i} |C_i| \, (m - m_i)^2$$

  where $|C_i|$ is the size of cluster $i$, $m_i$ is its centroid, and $m$ is the overall mean of the data
Unsupervised Measures: Cohesion and Separation
• Example: SSE and SSB
  • SSB + SSE = constant
  • Consider four one-dimensional points 1, 2, 4, 5 with overall mean $m = 3$; splitting them into {1, 2} and {4, 5} gives centroids $m_1 = 1.5$ and $m_2 = 4.5$
• K = 1 cluster:
  $SSE = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$
  $SSB = 4 \times (3-3)^2 = 0$
  $Total = 10 + 0 = 10$
• K = 2 clusters:
  $SSE = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$
  $SSB = 2 \times (3-1.5)^2 + 2 \times (3-4.5)^2 = 9$
  $Total = 1 + 9 = 10$
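A few lines of NumPy (a sketch) reproduce this check that SSE + SSB stays constant for the four points 1, 2, 4, 5:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()  # overall mean = 3

def sse_ssb(points, labels):
    """Within-cluster (SSE) and between-cluster (SSB) sums of squares."""
    sse = ssb = 0.0
    for k in np.unique(labels):
        cluster = points[labels == k]
        mk = cluster.mean()
        sse += np.sum((cluster - mk) ** 2)
        ssb += len(cluster) * (m - mk) ** 2
    return sse, ssb

print(sse_ssb(x, np.array([0, 0, 0, 0])))  # K=1: (10.0, 0.0)
print(sse_ssb(x, np.array([0, 0, 1, 1])))  # K=2: (1.0, 9.0) -> total is 10 in both cases
```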
Unsupervised Measures: Silhouette Coefficient
• The silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings
• For an individual point $i$:
  • Calculate $a$ = average distance of $i$ to the points in its own cluster
  • Calculate $b$ = min (average distance of $i$ to the points in another cluster)
  • The silhouette coefficient for the point is then $s = (b - a) / \max(a, b)$
(Figure: point $i$ with the within-cluster distances used to calculate $a$ and the distances to the nearest other cluster used to calculate $b$.)
• The value can vary between -1 and 1
  • It typically ranges between 0 and 1; the closer to 1, the better
• Can calculate the average silhouette coefficient for a cluster or a clustering
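A small sketch (assuming scikit-learn; the synthetic data is illustrative) of the silhouette coefficient for a whole clustering and for individual points:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(0)
# Two compact, well-separated blobs.
X = np.vstack([rng.normal((0, 0), 0.4, (50, 2)), rng.normal((4, 4), 0.4, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # average silhouette over all points
print(silhouette_samples(X, labels)[:5])  # per-point values s = (b - a) / max(a, b)
```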
Determining the Number of Clusters
• SSE is good for comparing two clusterings or two clusters
• SSE can also be used to estimate the number of clusters
(Figure: a sample data set and a plot of SSE versus the number of clusters K for K = 2 to 30; the point where the SSE curve stops dropping sharply and flattens out suggests the natural number of clusters.)
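A sketch of this "elbow" heuristic (assuming scikit-learn and matplotlib; the synthetic data and the range of K are illustrative): run K-means for a range of K, record the SSE (`inertia_`), and look for the K where the curve flattens.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with 5 well-separated blobs.
centers = rng.uniform(-10, 10, size=(5, 2))
X = np.vstack([rng.normal(c, 0.6, size=(100, 2)) for c in centers])

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.title("Elbow method: SSE vs. number of clusters")
plt.show()
```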
Thank You
