
CCST9047: The Age of Big Data
Lecture 5

Feb. 28, 2024


Notice

‣ We have a tutorial this week; homework 1 will be released after the tutorial.

‣ We have the make-up course in reading week.

‣ The first in-class quiz will be the week after reading week.

‣ We will allow a one-page cheat sheet.

‣ The quiz will cover the material up to today.

‣ We will open the portal for proposal submission; the proposal presentation will be in the same week as the quiz.
Recap of Previous Lecture

‣ Data Mining

‣ Non-trivial discovery of interesting patterns and insights from data

‣ Predictive Mining

‣ Classification: decision trees, linear classification

‣ Regression: linear regression, generalized linear regression

‣ Brief introduction of neural networks
Data Mining Tasks

‣ Predictive Mining (Previous lecture)

‣ Classification

‣ Regression

‣ Description Mining (This lecture)

‣ Clustering

‣ Association Rule Discovery

‣ Principal Component Analysis
Description Mining

‣ Find human-interpretable patterns that describe the data:

‣ Which data points are similar to each other?

‣ What are the relations between data points or features?

‣ What’s the trend of the data?

‣ Causal relationships

‣ You do not have a specific ground-truth target for the data.

‣ Unsupervised learning
Clustering

‣ What is clustering?

‣ Organizing data into clusters such that

‣ Intra-cluster data are similar

‣ Inter-cluster data are less similar

‣ Finding natural groupings among objects


Similarity

‣ How to measure the similarity between data?

‣ Use a distance function D(data1, data2) that measures the distance between two data points (Data 1 and Data 2).
Similarity

‣ For Data in the continuous space

‣ Data is formulated as a continuous vector

‣ Distance function can be

‣ ℓ2 norm: ∥x1 − x2∥2

‣ ℓ1 norm: ∥x1 − x2∥1
Similarity

‣ For a set of directions

‣ Data is formulated as a continuous vector

‣ Distance function can be the cosine function: D(x1, x2) = ⟨x1, x2⟩ / (∥x1∥2 ∥x2∥2)
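These distance measures are straightforward to compute with NumPy. A minimal sketch (the two vectors are made-up examples, not data from the lecture):

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])   # example vectors (illustrative only)
x2 = np.array([2.0, 0.0, 4.0])

# l2 (Euclidean) distance: ||x1 - x2||_2
l2_dist = np.linalg.norm(x1 - x2, ord=2)

# l1 (Manhattan) distance: ||x1 - x2||_1
l1_dist = np.linalg.norm(x1 - x2, ord=1)

# cosine similarity: <x1, x2> / (||x1||_2 * ||x2||_2)
cos_sim = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))

print(l2_dist, l1_dist, cos_sim)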
Similarity

‣ For a set of images

‣ We first need a feature extractor to get semantic features from the data.

‣ We then use the ℓ1 or ℓ2 norm to characterize the distance between these semantic features.
Clustering With Categorical Features

‣ If the data has categorical features, one can cluster datapoints directly.
Clustering With Categorical Features

Refund   Marital Status   Taxable Income   Cheat
Yes      Single           125K             No
Yes      Married          120K             No
Yes      Divorced         220K             No

Grouping by Refund:

Refund   Marital Status   Taxable Income   Cheat
No       Married          100K             No
No       Single           70K              No
No       Divorced         95K              Yes
No       Married          60K              No
No       Single           85K              Yes
No       Married          75K              No
No       Single           90K              Yes
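As a small illustration of grouping by a categorical feature, the table above can be split on the Refund column directly. A sketch assuming pandas is available (only a subset of rows is reproduced; the column names are mine):

import pandas as pd

# A few rows from the table above (illustrative subset)
df = pd.DataFrame({
    "Refund":         ["Yes", "Yes", "Yes", "No", "No", "No"],
    "MaritalStatus":  ["Single", "Married", "Divorced", "Married", "Single", "Divorced"],
    "TaxableIncomeK": [125, 120, 220, 100, 70, 95],
    "Cheat":          ["No", "No", "No", "No", "No", "Yes"],
})

# Grouping by the categorical feature "Refund" yields the two clusters directly
for refund_value, group in df.groupby("Refund"):
    print(f"Cluster Refund={refund_value}")
    print(group)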
Clustering by Key Words

‣ Each data point is a paper

‣ Each paper has a number of key words

‣ Clustering can be performed based on the key words
Clustering in Euclidean Space

‣ In more challenging cases, data in Euclidean space cannot be directly grouped.

‣ We do not know the labels of the data points.

‣ We do not know how many clusters should be considered.

‣ How do we build the clusters?
Clustering in Euclidean Space

‣ In more challenging cases, data in Euclidean space cannot be directly grouped.

[Figure: a scatter plot of unlabeled points; how should they be clustered?]
K-Means Clustering

‣ We first need to decide the number of clusters, i.e., K.

‣ Step 1 (initialization): decide K and initialize the K cluster centers (randomly), e.g., K = 3 centers k1, k2, k3.

[Figure: initial placement of the three centers among the data points.]
K-Means Clustering

‣ Step 2: assign all data points to the nearest center, then move each center to the mean of its new members.

[Figure: iteration 1 — points assigned to k1, k2, k3, and each center moved to the mean of its cluster.]
K-Means Clustering

‣ Step 3: after moving the centers, re-assign the data points to the nearest centers and move each center to the mean of its new members.

[Figure: iteration 2 — updated assignments and center positions.]
K-Means Clustering

‣ Step 4 and beyond: keep re-assigning the data and moving the centers until the clustering result no longer changes, i.e., no point changes its membership.

[Figure: the final clustering after convergence.]
General Algorithm of K-means

1. Decide on a value for K, the number of clusters.

2. Initialize the K cluster centers (randomly or using prior information).

3. Decide the class memberships of the data points by assigning them to the nearest cluster center.

4. Re-estimate the K cluster centers, assuming the memberships found above are correct.

5. Repeat steps 3–4 until no data point changed membership in the last iteration.
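A minimal NumPy sketch of the algorithm above, assuming ℓ2 distance and initialization by sampling K data points; it is an illustrative implementation, not the lecture's reference code:

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means on an (n, d) data array X with K clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: initialize centers by picking K distinct data points at random
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the nearest center (l2 distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no point changed membership
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers, labels

# Example usage on synthetic 2-D data with two well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])
centers, labels = kmeans(X, K=2)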
General Algorithm of K-means

• Decide the class memberships of data points by assigning them to the nearest cluster center.

• The nearest cluster center can be found using different similarity/distance measures.

• Re-estimate the K cluster centers, assuming the memberships found above are correct.

• Estimating the centers can be performed by calculating the mean, the median, or a weighted average.
Why K-means Works

‣ What’s a good partition?

‣ High intra-cluster similarity

‣ The average distance to the center of the cluster should be as small as possible:

min ∑_{k=1}^{K} ∑_{i∈Ck} ∥xi − μk∥₂²

where ∥xi − μk∥₂² is the squared distance from data point xi to the center μk of its cluster Ck.
Why K-means Works

‣ Find the best μk

‣ Take the derivative with respect to μk:

∂/∂μk ( ∑_{k=1}^{K} ∑_{i∈Ck} ∥xi − μk∥₂² ) = 2 ∑_{i∈Ck} (μk − xi)

‣ Setting the derivative to zero implies μk is the average of the data points in Ck.

‣ This is consistent with the goal of K-means.

‣ What if the distance is the ℓ1 norm? ⟹ one may use the median, rather than the average, to get the center.
Choice of K

‣ When K = 1 the objective function is 873.0; when K = 2 it is 173.1; when K = 3 it is 133.6.

[Figure: the same data set clustered with K = 1, K = 2, and K = 3.]

‣ K = 2 can already provide a reasonable clustering.
Choice of K

‣ We can calculate the objective ∑_{k=1}^{K} ∑_{i∈Ck} ∥xi − μk∥₂² and plot its values for k = 1 to 6.

‣ The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as “knee finding” or “elbow finding”.

[Figure: objective function value versus k; the value drops sharply from k = 1 to k = 2 and decreases only slowly afterwards.]

‣ k = 2 can already give a low objective value.
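A hedged sketch of the knee/elbow procedure, reusing the kmeans function sketched earlier; the synthetic data and the range k = 1..6 are illustrative:

import numpy as np

def kmeans_objective(X, centers, labels):
    """Sum of squared l2 distances from each point to its assigned center."""
    return sum(np.sum((X[labels == k] - centers[k]) ** 2) for k in range(len(centers)))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])

# Print (or plot) the objective for k = 1..6 and look for the abrupt change
for k in range(1, 7):
    centers, labels = kmeans(X, K=k)   # kmeans as sketched in the K-means section
    print(k, round(kmeans_objective(X, centers, labels), 2))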


Summary of K-means

‣ Strengths

‣ Simple, easy to implement and debug

‣ Intuitive objective function: optimizes intra-cluster similarity

‣ Efficient, especially when K ≪ n
Summary of K-means

‣ Weakness
‣ Only applicable when a cluster center can be calculated, so it cannot be applied to categorical data.

‣ Relatively sensitive to initialization; convergence to the global minimum is not guaranteed.

‣ Need to specify K.

‣ Unable to handle noisy data and outliers.

‣ Not suitable for discovering clusters with non-convex shapes.
Association Analysis

Customer   Purchase
1          Beer, Chips, Diapers
2          Apple, Beer, Diapers
3          Beer, Egg
4          Chips
5          Beer, Egg, Diapers
6          Apple, Beer
7          Beer, Chips, Diapers, Egg

Observation: customers who like to buy diapers also like to buy beer.
Association Rule Discovery

‣ Given a set of records, each of which contains some number of items from a given collection

‣ Produce dependency rules which will predict the occurrence of an item based on occurrences of other items

Customer   Purchase
1          Beer, Coke, Milk
2          Beer, Bread
3          Beer, Coke, Diaper, Milk
4          Beer, Bread, Diaper, Milk
5          Coke, Diaper, Milk

Rules discovered:
{Milk} → {Coke}
{Diaper, Milk} → {Beer}
Application of Association Rule Discovery

‣ Marketing and Sales promotion:

‣ Consider the rule: {Bagels, …} → {Chips}

‣ Chips as consequent

‣ Can be used to see what should be done to boost its sales

‣ Bagels in the antecedent

‣ Can be used to see which products will be affected if the store discontinues selling Bagels

‣ Bagels in antecedent and Chips in consequent

‣ Can be used to see what products should be sold with Bagels to promote the sale of Chips
Application of Association Rule Discovery

‣ Supermarket shelf management

‣ Goal: to identify items that are bought together by sufficiently many customers

‣ Approach: process the point-of-sale data collected with barcode scanners to find dependencies of items

‣ Classical rules:

‣ If a customer buys diapers and milk, then he is very likely to buy beer.

‣ So, don’t be surprised if you find six-packs stacked next to diapers!
Basic Association Rule Mining Algorithm

‣ Given a set of purchase transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

‣ Itemset: a collection of items, e.g., {Milk, Bread, Diaper}

‣ Support count (σ): frequency of occurrence of an itemset, e.g., σ({Milk, Bread}) = 3

‣ Support: fraction of transactions that contain an itemset, e.g., s({Milk, Bread}) = 3/5

Customer   Purchase
1          Bread, Milk
2          Bread, Diaper, Beer, Eggs
3          Milk, Diaper, Beer, Coke
4          Bread, Milk, Diaper, Beer
5          Bread, Milk, Diaper, Coke
Basic Association Rule Mining Algorithm

‣ Association Rule

‣ An implication expression of the form X → Y, where X and Y are itemsets

‣ Example: {Milk, Diaper} → {Beer}

‣ Rule Evaluation Metrics

‣ Support (s): fraction of transactions containing both X and Y

‣ Confidence (c): measures how often items in Y appear in the transactions that contain X

Customer   Purchase
1          Bread, Milk
2          Bread, Diaper, Beer, Eggs
3          Milk, Diaper, Beer, Coke
4          Bread, Milk, Diaper, Beer
5          Bread, Milk, Diaper, Coke

Example: s({Milk, Diaper, Beer}) = 2/5 and c({Milk, Diaper} → {Beer}) = 2/3
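A small sketch of how support and confidence can be computed for the transaction table above (transactions hard-coded for illustration):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X -> Y) = support(X union Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4  (= 2/5)
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.67 (= 2/3)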
Basic Association Rule Mining Algorithm

‣ Given a set of transactions T, the goal of association rule mining is to find all rules having

‣ Support is sufficiently large: s > s_threshold

‣ Confidence is sufficiently large: c > c_threshold

‣ Brute-force approach:

‣ List all possible rules

‣ Calculate the support and confidence of each rule

‣ Prune rules based on s_threshold and c_threshold

‣ Computationally expensive!!
Basic Association Rule Mining Algorithm

Customer   Purchase
1          Bread, Milk
2          Bread, Diaper, Beer, Eggs
3          Milk, Diaper, Beer, Coke
4          Bread, Milk, Diaper, Beer
5          Bread, Milk, Diaper, Coke

{Milk, Diaper} → {Beer}   (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper}   (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk}   (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper}   (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer}   (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer}   (s = 0.4, c = 0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
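To illustrate the decoupling, one can fix a frequent itemset first and then enumerate its binary partitions as candidate rules, filtering by confidence only. A sketch reusing the support and confidence helpers and the transactions list from the previous example:

from itertools import combinations

def rules_from_itemset(itemset, transactions, min_conf):
    """All rules X -> Y with X union Y = itemset and confidence >= min_conf."""
    items = set(itemset)
    s = support(items, transactions)   # identical for every rule from this itemset
    rules = []
    for r in range(1, len(items)):
        for antecedent in combinations(sorted(items), r):
            X, Y = set(antecedent), items - set(antecedent)
            c = confidence(X, Y, transactions)
            if c >= min_conf:
                rules.append((X, Y, s, c))
    return rules

for X, Y, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, min_conf=0.6):
    print(f"{X} -> {Y}  (s={s:.2f}, c={c:.2f})")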
Dimension Reduction
Given data points in d dimensions, convert them to data points in r < d dimensions with minimal loss of information.

‣ Clustering

‣ One way to summarize a complex real-valued data point with a single categorical variable.

‣ Dimension Reduction

‣ Another way to simplify complex high-dimensional data

‣ Summarize data with a lower-dimensional real-valued vector

‣ Given data points in d dimensions

‣ Convert them to data points in r < d dimensions with minimal loss of information
Data Compression

Reduce 2D data to 1D
Data Compression

One idea: only use one feature


Data Compression

Reduce 3D data to 2D: use two features out of three


Effective Projection

Criteria of good data compression

‣ Projection: x → proj(x), where proj: ℝ^d → ℝ^k

‣ Reconstruction: proj(x) → recon(proj(x)), where recon: ℝ^k → ℝ^d

‣ Projection error: ∥x − recon(proj(x))∥2

‣ We hope to find the projection approach that minimizes the projection error!
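A small sketch of the projection/reconstruction view with a linear map onto k orthonormal directions; the basis W here is arbitrary, whereas PCA (introduced below) chooses the directions that minimize this error:

import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2
x = rng.normal(size=d)                      # one data point in R^d

# Columns of W form an (arbitrary) orthonormal basis of a k-dimensional subspace
W, _ = np.linalg.qr(rng.normal(size=(d, k)))

proj = W.T @ x                              # proj(x): R^d -> R^k
recon = W @ proj                            # recon(proj(x)): R^k -> R^d
error = np.linalg.norm(x - recon)           # projection error ||x - recon(proj(x))||_2
print(error)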
A 2D example

‣ Projecting onto one of the original variables may not be good.

‣ Projecting onto PC1 could be a better choice.
Covariance

‣ Variance and covariance:

‣ Measure of the “spread” of a set of points around their center of mass (mean)

‣ Variance:

‣ Measure of the deviation from the mean for points in one dimension

‣ Covariance:

‣ Measure of how much each of the dimensions varies from the mean with respect to the others
Covariance Matrix

X = [X1, …, XK]

[Figure: the covariance matrix of X; each diagonal entry is the variance of one dimension, and each off-diagonal entry is the covariance between two different dimensions.]
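A sketch of computing a covariance matrix with NumPy (rows are samples, columns are dimensions; the data is synthetic):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # 200 samples, 3 dimensions

# Center the data and compute the covariance matrix by hand ...
Xc = X - X.mean(axis=0)
cov_manual = Xc.T @ Xc / (len(X) - 1)

# ... or let NumPy do it (rowvar=False: columns are the variables/dimensions)
cov = np.cov(X, rowvar=False)

print(np.allclose(cov, cov_manual))      # True
# Diagonal entries: variances of each dimension; off-diagonal entries: covariances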
Covariance for Computing Principal Components

‣ Given the data, how do we find the principal components/directions?

‣ PC1 and PC2 are orthogonal.

‣ It can be seen that the projection of the data onto PC1 has very large variance.

https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/DecisionTrees.pdf
Eigen-decomposition of the Covariance Matrix

‣ Let A = COV(X) denote the covariance matrix

‣ An eigenvalue λ and its corresponding eigenvector v satisfy Av = λv

‣ For a d × d matrix, we have a set of eigenvalues λ1, …, λd and corresponding eigenvectors v1, …, vd such that

Avi = λivi

‣ A larger λi implies higher variance of the data along the direction of vi
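A minimal PCA sketch via eigen-decomposition of the covariance matrix, on synthetic data; in practice one might equally use an SVD or a library routine such as sklearn.decomposition.PCA:

import numpy as np

def pca(X, k):
    """Project X (n, d) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    A = np.cov(Xc, rowvar=False)              # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(A)      # eigh: suited to symmetric matrices
    order = np.argsort(eigvals)[::-1]         # sort directions by decreasing variance
    components = eigvecs[:, order[:k]]        # top-k eigenvectors, shape (d, k)
    return Xc @ components, components, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated 2-D data
Z, components, variances = pca(X, k=1)
print(variances)   # a larger eigenvalue means more variance along that eigenvector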


Application of Principal Components

‣ Principal components of face images


Application of Principal Components

‣ Image representation and reconstruction


Application of Principal Components

‣ Calculating principal components using different sample sizes, then performing image reconstruction
Application of Principal Components

‣ Compressing the image by only using its principal components


Application of Principal Components
Other Dimension Reduction Methods

‣ PCA (principal component analysis):

‣ Finds linear projections that maximize the variance

‣ ICA (independent component analysis):

‣ Similar to PCA but with different statistical assumptions

‣ Auto-encoder-decoder:

‣ Train two neural networks to compress and reconstruct the data by minimizing the reconstruction error.
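A minimal auto-encoder sketch, assuming PyTorch is available; the architecture, dimensions, and training loop are illustrative choices, not taken from the lecture:

import torch
import torch.nn as nn

d, r = 20, 3                                  # input dimension d, code dimension r < d
encoder = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, r))
decoder = nn.Sequential(nn.Linear(r, 16), nn.ReLU(), nn.Linear(16, d))

X = torch.randn(256, d)                       # synthetic data
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    reconstruction = decoder(encoder(X))      # compress, then reconstruct
    loss = loss_fn(reconstruction, X)         # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()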
Summary

‣ Description Data Mining

‣ Clustering: K-means

‣ Association Rule Discovery

‣ Principal component analysis: eigen-decomposition of the covariance matrix.
