
CCST9047: The Age of Big Data
Lecture 5

Feb. 28, 2024


Notice

‣ We have a tutorial this week; homework 1 will be released after the tutorial.

‣ We have the make-up course in reading week.

‣ The first in-class quiz will be the week after reading week.

‣ We will allow a one-page cheat sheet.

‣ The quiz will cover the material up to today.

‣ We will open the portal for proposal submission; the proposal presentation will be in the same week as the quiz.
Recap of Previous Lecture

‣ Data Mining

‣ Non-trivial discovery of interesting patterns and insights from data

‣ Predictive Mining

‣ Classification: decision trees, linear classification

‣ Regression: linear regression, generalized linear regression

‣ Brief introduction of neural networks
Data Mining Tasks

‣ Predictive Mining (Previous lecture)

‣ Classification

‣ Regression

‣ Description Mining (This lecture)

‣ Clustering

‣ Association Rule Discovery

‣ Principal Component Analysis
Description Mining

‣ Find human-interpretable patterns that describe the data:

‣ Which data points are similar to each other?

‣ What are the relations between data points or features?

‣ What’s the trend of the data?

‣ Causal relationships

‣ You do not have a specific ground-truth target for the data.

‣ Unsupervised learning
Clustering

‣ What is clustering?

‣ Organizing data into clusters such that

‣ Intra-cluster data are similar

‣ Inter-cluster data are less similar

‣ Finding natural groupings among objects


Similarity

‣ How to measure the similarity between data?

‣ Use a distance function D(data1, data2) that measures the distance between two data points (Data 1 and Data 2).
Similarity

‣ For Data in the continuous space

‣ Data is formulated as a continuous vector

‣ Distance function can be

‣ ℓ2 norm: ∥x1 − x2∥2

‣ ℓ1 norm: ∥x1 − x2∥1
Similarity

‣ For a set of directions

‣ Data is formulated as a continuous vector

‣ Distance function can be the cosine function: D(x1, x2) = ⟨x1, x2⟩ / (∥x1∥2 ∥x2∥2)
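These distance measures are straightforward to compute with NumPy. A minimal sketch (the two vectors are made-up examples, not data from the lecture):

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])   # example vectors (illustrative only)
x2 = np.array([2.0, 0.0, 4.0])

# l2 (Euclidean) distance: ||x1 - x2||_2
l2_dist = np.linalg.norm(x1 - x2, ord=2)

# l1 (Manhattan) distance: ||x1 - x2||_1
l1_dist = np.linalg.norm(x1 - x2, ord=1)

# cosine similarity: <x1, x2> / (||x1||_2 * ||x2||_2)
cos_sim = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))

print(l2_dist, l1_dist, cos_sim)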
Similarity

‣ For a set of images

‣ We first need a feature extractor to get semantic features from the data.

‣ We then use the ℓ1 or ℓ2 norm to characterize the distance between these semantic features.
Clustering With Categorical Features

‣ If the data has categorical features, one can cluster datapoints directly.
Clustering With Categorical Features

Refund   Marital Status   Taxable Income   Cheat
Yes      Single           125K             No
Yes      Married          120K             No
Yes      Divorced         220K             No

Grouping by Refund:

Refund   Marital Status   Taxable Income   Cheat
No       Married          100K             No
No       Single           70K              No
No       Divorced         95K              Yes
No       Married          60K              No
No       Single           85K              Yes
No       Married          75K              No
No       Single           90K              Yes
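As a small illustration of grouping by a categorical feature, the table above can be split on the Refund column directly. A sketch assuming pandas is available (only a subset of rows is reproduced; the column names are mine):

import pandas as pd

# A few rows from the table above (illustrative subset)
df = pd.DataFrame({
    "Refund":         ["Yes", "Yes", "Yes", "No", "No", "No"],
    "MaritalStatus":  ["Single", "Married", "Divorced", "Married", "Single", "Divorced"],
    "TaxableIncomeK": [125, 120, 220, 100, 70, 95],
    "Cheat":          ["No", "No", "No", "No", "No", "Yes"],
})

# Grouping by the categorical feature "Refund" yields the two clusters directly
for refund_value, group in df.groupby("Refund"):
    print(f"Cluster Refund={refund_value}")
    print(group)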
Clustering by Key Words

‣ Each data point is a paper

‣ Each paper has a number of key words

‣ Clustering can be performed based on the key words
Clustering in Euclidean Space

‣ In more challenging cases, data in Euclidean space cannot be directly grouped.

‣ We do not know the labels of the data points.

‣ We do not know how many clusters should be considered.

‣ How do we build the clusters?
Clustering in Euclidean Space

‣ In more challenging cases, data in Euclidean space cannot be directly grouped.

[Figure: a scatter plot of unlabeled points; how should they be clustered?]
K-Means Clustering

‣ We first need to decide the number of clusters, i.e., K.

‣ Step 1 (initialization): decide K and initialize the K cluster centers (randomly), e.g., K = 3 centers k1, k2, k3.

[Figure: initial placement of the three centers among the data points.]
K-Means Clustering

‣ Step 2: assign all data points to the nearest center, then move each center to the mean of its new members.

[Figure: iteration 1 — points assigned to k1, k2, k3, and each center moved to the mean of its cluster.]
K-Means Clustering

‣ Step 3: after moving the centers, re-assign the data points to the nearest centers and move each center to the mean of its new members.

[Figure: iteration 2 — updated assignments and center positions.]
K-Means Clustering

‣ Step 4 and beyond: keep re-assigning the data and moving the centers until the clustering result no longer changes, i.e., no point changes its membership.

[Figure: the final clustering after convergence.]
General Algorithm of K-means

1. Decide on a value for K, the number of clusters.

2. Initialize the K cluster centers (randomly or using prior information).

3. Decide the class memberships of the data points by assigning them to the nearest cluster center.

4. Re-estimate the K cluster centers, assuming the memberships found above are correct.

5. Repeat steps 3–4 until no data point changed membership in the last iteration.
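A minimal NumPy sketch of the algorithm above, assuming ℓ2 distance and initialization by sampling K data points; it is an illustrative implementation, not the lecture's reference code:

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means on an (n, d) data array X with K clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: initialize centers by picking K distinct data points at random
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the nearest center (l2 distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no point changed membership
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers, labels

# Example usage on synthetic 2-D data with two well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])
centers, labels = kmeans(X, K=2)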
General Algorithm of K-means

• Decide the class memberships of data points by assigning them to the nearest cluster center.

• The nearest cluster center can be found using different similarity/distance measures.

• Re-estimate the K cluster centers, assuming the memberships found above are correct.

• Estimating the centers can be performed by calculating the mean, the median, or a weighted average.
Why K-means Works

‣ What’s a good partition?

‣ High intra-cluster similarity

‣ The average distance to the center of the cluster should be as small as possible:

min ∑_{k=1}^{K} ∑_{i∈Ck} ∥xi − μk∥₂²

where ∥xi − μk∥₂² is the squared distance from data point xi to the center μk of its cluster Ck.
Why K-means Works

‣ Find the best μk

‣ Take the derivative with respect to μk:

∂/∂μk ( ∑_{k=1}^{K} ∑_{i∈Ck} ∥xi − μk∥₂² ) = 2 ∑_{i∈Ck} (μk − xi)

‣ Setting the derivative to zero implies μk is the average of the data points in Ck.

‣ This is consistent with the goal of K-means.

‣ What if the distance is the ℓ1 norm? ⟹ one may use the median, rather than the average, to get the center.
Choice of K

‣ When K = 1 the objective function is 873.0; when K = 2 it is 173.1; when K = 3 it is 133.6.

[Figure: the same data set clustered with K = 1, K = 2, and K = 3.]

‣ K = 2 can already provide a reasonable clustering.
Choice of K

‣ We can calculate the objective ∑_{k=1}^{K} ∑_{i∈Ck} ∥xi − μk∥₂² and plot its values for k = 1 to 6.

‣ The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as “knee finding” or “elbow finding”.

[Figure: objective function value versus k; the value drops sharply from k = 1 to k = 2 and decreases only slowly afterwards.]

‣ k = 2 can already give a low objective value.
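A hedged sketch of the knee/elbow procedure, reusing the kmeans function sketched earlier; the synthetic data and the range k = 1..6 are illustrative:

import numpy as np

def kmeans_objective(X, centers, labels):
    """Sum of squared l2 distances from each point to its assigned center."""
    return sum(np.sum((X[labels == k] - centers[k]) ** 2) for k in range(len(centers)))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])

# Print (or plot) the objective for k = 1..6 and look for the abrupt change
for k in range(1, 7):
    centers, labels = kmeans(X, K=k)   # kmeans as sketched in the K-means section
    print(k, round(kmeans_objective(X, centers, labels), 2))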


Summary of K-means

‣ Strengths

‣ Simple, easy to implement and debug

‣ Intuitive objective function: optimizes intra-cluster similarity

‣ Efficient, especially when K ≪ n
Summary of K-means

‣ Weakness
‣ Only applicable when a cluster center can be calculated, so it cannot be applied to categorical data.

‣ Relatively sensitive to initialization; convergence to the global minimum is not guaranteed.

‣ Need to specify K.

‣ Unable to handle noisy data and outliers.

‣ Not suitable for discovering clusters with non-convex shapes.
Association Analysis

Customer   Purchase
1          Beer, Chips, Diapers
2          Apple, Beer, Diapers
3          Beer, Egg
4          Chips
5          Beer, Egg, Diapers
6          Apple, Beer
7          Beer, Chips, Diapers, Egg

Observation: customers who like to buy diapers also like to buy beer.
Association Rule Discovery

‣ Given a set of records, each of which contains some number of items from a given collection

‣ Produce dependency rules which will predict the occurrence of an item based on occurrences of other items

Customer   Purchase
1          Beer, Coke, Milk
2          Beer, Bread
3          Beer, Coke, Diaper, Milk
4          Beer, Bread, Diaper, Milk
5          Coke, Diaper, Milk

Rules discovered:
{Milk} → {Coke}
{Diaper, Milk} → {Beer}
Application of Association Rule Discovery

‣ Marketing and Sales promotion:

‣ Consider the rule: {Bagels, …} → {Chips}

‣ Chips as consequent

‣ Can be used to see what should be done to boost its sales

‣ Bagels in the antecedent

‣ Can be used to see which products will be affected if the store discontinues selling Bagels

‣ Bagels in antecedent and Chips in consequent

‣ Can be used to see what products should be sold with Bagels to promote the sale of Chips
Application of Association Rule Discovery

‣ Supermarket shelf management

‣ Goal: to identify items that are bought together by sufficiently many customers

‣ Approach: process the point-of-sale data collected with barcode scanners to find dependencies of items

‣ Classical rules:

‣ If a customer buys diapers and milk, then he is very likely to buy beer.

‣ So, don’t be surprised if you find six-packs stacked next to diapers!
Basic Association Rule Mining Algorithm

‣ Given a set of purchase transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

‣ Itemset: a collection of items, e.g., {Milk, Bread, Diaper}

‣ Support count (σ): frequency of occurrence of an itemset, e.g., σ({Milk, Bread}) = 3

‣ Support: fraction of transactions that contain an itemset, e.g., s({Milk, Bread}) = 3/5

Customer   Purchase
1          Bread, Milk
2          Bread, Diaper, Beer, Eggs
3          Milk, Diaper, Beer, Coke
4          Bread, Milk, Diaper, Beer
5          Bread, Milk, Diaper, Coke
Basic Association Rule Mining Algorithm

‣ Association Rule

‣ An implication expression of the form X → Y, where X and Y are itemsets

‣ Example: {Milk, Diaper} → {Beer}

‣ Rule Evaluation Metrics

‣ Support (s): fraction of transactions containing both X and Y

‣ Confidence (c): measures how often items in Y appear in the transactions that contain X

Customer   Purchase
1          Bread, Milk
2          Bread, Diaper, Beer, Eggs
3          Milk, Diaper, Beer, Coke
4          Bread, Milk, Diaper, Beer
5          Bread, Milk, Diaper, Coke

Example: s({Milk, Diaper, Beer}) = 2/5 and c({Milk, Diaper} → {Beer}) = 2/3
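A small sketch of how support and confidence can be computed for the transaction table above (transactions hard-coded for illustration):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X -> Y) = support(X union Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4  (= 2/5)
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.67 (= 2/3)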
Basic Association Rule Mining Algorithm

‣ Given a set of transactions T, the goal of association rule mining is to find all rules having

‣ Support is sufficiently large: s > s_threshold

‣ Confidence is sufficiently large: c > c_threshold

‣ Brute-force approach:

‣ List all possible rules

‣ Calculate the support and confidence of each rule

‣ Prune rules based on s_threshold and c_threshold

‣ Computationally expensive!!
Basic Association Rule Mining Algorithm

Customer   Purchase
1          Bread, Milk
2          Bread, Diaper, Beer, Eggs
3          Milk, Diaper, Beer, Coke
4          Bread, Milk, Diaper, Beer
5          Bread, Milk, Diaper, Coke

{Milk, Diaper} → {Beer}   (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper}   (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk}   (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper}   (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer}   (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer}   (s = 0.4, c = 0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
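To illustrate the decoupling, one can fix a frequent itemset first and then enumerate its binary partitions as candidate rules, filtering by confidence only. A sketch reusing the support and confidence helpers and the transactions list from the previous example:

from itertools import combinations

def rules_from_itemset(itemset, transactions, min_conf):
    """All rules X -> Y with X union Y = itemset and confidence >= min_conf."""
    items = set(itemset)
    s = support(items, transactions)   # identical for every rule from this itemset
    rules = []
    for r in range(1, len(items)):
        for antecedent in combinations(sorted(items), r):
            X, Y = set(antecedent), items - set(antecedent)
            c = confidence(X, Y, transactions)
            if c >= min_conf:
                rules.append((X, Y, s, c))
    return rules

for X, Y, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, min_conf=0.6):
    print(f"{X} -> {Y}  (s={s:.2f}, c={c:.2f})")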
Dimension Reduction
Given data points in d dimensions, convert them to data points in r < d dimensions with minimal loss of information.

‣ Clustering

‣ One way to summarize a complex real-valued data point with a single categorical variable.

‣ Dimension Reduction

‣ Another way to simplify complex high-dimensional data

‣ Summarize data with a lower-dimensional real-valued vector

‣ Given data points in d dimensions

‣ Convert them to data points in r < d dimensions with minimal loss of information
Data Compression

Reduce 2D data to 1D
Data Compression

One idea: only use one feature


Data Compression

Reduce 3D data to 2D: use two features out of three


Effective Projection

Criteria of good data compression

‣ Projection: x → proj(x), where proj: ℝ^d → ℝ^k

‣ Reconstruction: proj(x) → recon(proj(x)), where recon: ℝ^k → ℝ^d

‣ Projection error: ∥x − recon(proj(x))∥2

‣ We hope to find the projection approach that minimizes the projection error!
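A small sketch of the projection/reconstruction view with a linear map onto k orthonormal directions; the basis W here is arbitrary, whereas PCA (introduced below) chooses the directions that minimize this error:

import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2
x = rng.normal(size=d)                      # one data point in R^d

# Columns of W form an (arbitrary) orthonormal basis of a k-dimensional subspace
W, _ = np.linalg.qr(rng.normal(size=(d, k)))

proj = W.T @ x                              # proj(x): R^d -> R^k
recon = W @ proj                            # recon(proj(x)): R^k -> R^d
error = np.linalg.norm(x - recon)           # projection error ||x - recon(proj(x))||_2
print(error)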
A 2D example

‣ Projecting onto one of the original variables may not be good.

‣ Projecting onto PC1 could be a better choice.
Covariance

‣ Variance and covariance:

‣ Measure of the “spread” of a set of points around their center of mass (mean)

‣ Variance:

‣ Measure of the deviation from the mean for points in one dimension

‣ Covariance:

‣ Measure of how much each of the dimensions varies from the mean with respect to the others
Covariance Matrix

X = [X1, …, XK]

[Figure: the covariance matrix of X; each diagonal entry is the variance of one dimension, and each off-diagonal entry is the covariance between two different dimensions.]
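A sketch of computing a covariance matrix with NumPy (rows are samples, columns are dimensions; the data is synthetic):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # 200 samples, 3 dimensions

# Center the data and compute the covariance matrix by hand ...
Xc = X - X.mean(axis=0)
cov_manual = Xc.T @ Xc / (len(X) - 1)

# ... or let NumPy do it (rowvar=False: columns are the variables/dimensions)
cov = np.cov(X, rowvar=False)

print(np.allclose(cov, cov_manual))      # True
# Diagonal entries: variances of each dimension; off-diagonal entries: covariances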
Covariance for Computing Principal Components

‣ Given the data, how do we find the principal components/directions?

‣ PC1 and PC2 are orthogonal.

‣ It can be seen that the projection of the data onto PC1 has very large variance.

https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/DecisionTrees.pdf
Eigen-decomposition of the Covariance Matrix

‣ Let A = COV(X) denote the covariance matrix

‣ An eigenvalue λ and its corresponding eigenvector v satisfy Av = λv

‣ For a d × d matrix, we have a set of eigenvalues λ1, …, λd and corresponding eigenvectors v1, …, vd such that

Avi = λivi

‣ A larger λi implies higher variance of the data along the direction of vi
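A minimal PCA sketch via eigen-decomposition of the covariance matrix, on synthetic data; in practice one might equally use an SVD or a library routine such as sklearn.decomposition.PCA:

import numpy as np

def pca(X, k):
    """Project X (n, d) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    A = np.cov(Xc, rowvar=False)              # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(A)      # eigh: suited to symmetric matrices
    order = np.argsort(eigvals)[::-1]         # sort directions by decreasing variance
    components = eigvecs[:, order[:k]]        # top-k eigenvectors, shape (d, k)
    return Xc @ components, components, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated 2-D data
Z, components, variances = pca(X, k=1)
print(variances)   # a larger eigenvalue means more variance along that eigenvector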


Application of Principal Components

‣ Principal components of face images


Application of Principal Components

‣ Image representation and reconstruction


Application of Principal Components

‣ Calculating principal components using different sample sizes, then performing image reconstruction
Application of Principal Components

‣ Compressing the image by only using its principal components


Application of Principal Components
Other Dimension Reduction Methods

‣ PCA (principal component analysis):

‣ Finds linear projections that maximize the variance

‣ ICA (independent component analysis):

‣ Similar to PCA but with different statistical assumptions

‣ Auto-encoder-decoder:

‣ Train two neural networks to compress and reconstruct the data by minimizing the reconstruction error.
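A minimal auto-encoder sketch, assuming PyTorch is available; the architecture, dimensions, and training loop are illustrative choices, not taken from the lecture:

import torch
import torch.nn as nn

d, r = 20, 3                                  # input dimension d, code dimension r < d
encoder = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, r))
decoder = nn.Sequential(nn.Linear(r, 16), nn.ReLU(), nn.Linear(16, d))

X = torch.randn(256, d)                       # synthetic data
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    reconstruction = decoder(encoder(X))      # compress, then reconstruct
    loss = loss_fn(reconstruction, X)         # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()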
Summary

‣ Description Data Mining

‣ Clustering: K-means

‣ Association Rule Discovery

‣ Principal component analysis: eigen-decomposition of the covariance matrix.
