Lecture 5 of Big Data
‣ The first in-class quiz will be the week after reading week.
‣ Data Mining
‣ Predictive Mining
‣ Classification
‣ Regression
‣ Clustering
‣ Causal relationships
‣ Unsupervised learning
Clustering
‣ What’s Clustering
[Diagram: Data 1 and Data 2 compared with a distance function D(data1, data2)]
Similarity
‣ ℓ1 norm: ∥x1 − x2∥1
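As a concrete illustration (my own sketch, not from the slides), a distance function D(data1, data2) can be computed with the ℓ1 or ℓ2 norm using NumPy; the example vectors and function names below are mine:

```python
import numpy as np

# Hypothetical 2-D feature vectors for two data points.
data1 = np.array([1.0, 2.0])
data2 = np.array([4.0, 6.0])

def l1_distance(x1, x2):
    """ℓ1 norm of the difference: sum of absolute coordinate gaps."""
    return np.sum(np.abs(x1 - x2))

def l2_distance(x1, x2):
    """ℓ2 (Euclidean) norm of the difference."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

print(l1_distance(data1, data2))  # 7.0
print(l2_distance(data1, data2))  # 5.0
```

Smaller distances mean higher similarity, so either norm can serve as the D(data1, data2) in the diagram above.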
Similarity
‣ If the data has categorical features, one can cluster data points directly by grouping on a feature value.
Clustering With Categorical Features
Refund | Marital status | Taxable Income | Cheat
Yes    | Single         | 125K           | No
Yes    | Married        | 120K           | No
Yes    | Divorced       | 220K           | No

Grouping by refund:

Refund | Marital status | Taxable Income | Cheat
No     | Married        | 100K           | No
No     | Single         | 70K            | No
No     | Divorced       | 95K            | Yes
No     | Married        | 60K            | No
No     | Single         | 85K            | Yes
No     | Married        | 75K            | No
No     | Single         | 90K            | Yes
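A small sketch (my own, not the lecture's code) of "grouping by refund" with pandas; the column names follow the table above and only a subset of the rows is typed in:

```python
import pandas as pd

# A few rows of the Refund / Marital status / Taxable Income / Cheat table.
df = pd.DataFrame({
    "Refund":         ["Yes", "Yes", "Yes", "No", "No", "No"],
    "Marital status": ["Single", "Married", "Divorced", "Married", "Single", "Divorced"],
    "Taxable Income": ["125K", "120K", "220K", "100K", "70K", "95K"],
    "Cheat":          ["No", "No", "No", "No", "No", "Yes"],
})

# Each distinct value of the categorical feature defines one cluster.
for refund_value, group in df.groupby("Refund"):
    print(f"Cluster Refund={refund_value}:")
    print(group, "\n")
```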
Clustering by Key Words
K-Means Clustering
[Figure: scatter plot with three initial centers k1, k2, k3 placed among the data points]
K-Means Clustering
‣ Step 2: assign all data to the nearest center and move the centers to the means of the new clusters.
K-means Clustering: Iteration 1
Assign all objects to the nearest center.
Move a center to the mean of its members.
[Figure: iteration 1 — points assigned to the nearest of k1, k2, k3; centers moved to the cluster means]
K-Means Clustering
[Figure: a later iteration — centers k1, k2, k3 after re-assigning points and updating the means]
K-Means Clustering
‣ Step 4 or more: re-assign the data and move centers again, until the clustering result does not change.
K-means Clustering: Finished!
Re-assign and move centers, until no objects changed membership.
[Figure: converged clustering with centers k1, k2, k3; axis label: expression in condition 2]
General Algorithm of K-means
‣ Pick K initial centers; repeat: assign each data point to its nearest center (forming cluster Ck for center k), then move each center to the mean of its cluster Ck; stop when the assignments no longer change.
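Since the algorithm figure on this slide did not survive extraction, here is a minimal NumPy sketch of the procedure described on the previous slides; the function and variable names are my own, not the lecture's:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic K-means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K initial centers (here: k distinct random data points).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest center (ℓ2 distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Stop once no object changes membership.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Move each center to the mean of its current members.
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assignments

# Example: 90 points drawn around three locations, clustered with K = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in ([0, 0], [4, 4], [0, 4])])
centers, labels = kmeans(X, k=3)
print(centers)
```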
Why K-means Works
∂/∂μ_k [ ∑_{k=1}^{K} ∑_{i∈C_k} ∥x_i − μ_k∥_2^2 ] = 2 ∑_{i∈C_k} (μ_k − x_i)
What if the distance is ℓ1 norm? ⟹ one may use median to get the center
rather than average.
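Filling in the step the slide leaves implicit (my own wording): setting the derivative above to zero shows that, for a fixed cluster, the objective is minimized when the center is the cluster mean, which is why the "move to the mean" step never increases the objective.

```latex
\[
J = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert_2^2,
\qquad
\frac{\partial J}{\partial \mu_k}
  = 2 \sum_{i \in C_k} (\mu_k - x_i) = 0
\;\Longrightarrow\;
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{i \in C_k} x_i .
\]
```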
Choice of K
[Figures: three example panels (x-axis 1…10) and a plot of the Objective Function (0–900) versus k for k = 1…6]
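A sketch (mine, not the lecturer's) of how such a curve can be produced: run K-means for several values of K and record the final objective (within-cluster sum of squares). This uses scikit-learn, whose `inertia_` attribute holds exactly that value; the synthetic data and parameter choices are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three true clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 4], [0, 4])])

for k in range(1, 7):
    objective = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(objective, 1))
# The objective drops sharply until k reaches the true number of clusters
# and then flattens out; the "elbow" of the curve suggests a good K.
```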
‣ Strengths
‣ Weaknesses
  ‣ Only applicable when the center can be calculated, thus cannot be applied to categorical data.
  ‣ Need to specify K.
Customer | Purchase
1        | Beer, Chips, Diapers
2        | Apple, Beer, Diapers
3        | Beer, Egg
4        | Chips
5        | Beer, Egg, Diapers
6        | Apple, Beer
7        | Beer, Chips, Diapers, Egg

Customers who like to buy diapers also like to buy beer.
Association Rule Discovery
‣ Given a set of records, each of which contains some number of items from a given collection (a Customer / Purchase table)
‣ Classical rules:
  ‣ If a customer buys diaper and milk, then he is very likely to buy beer.
  ‣ Example: {Milk, Diaper} → {Beer}
‣ Chips as consequent
‣ Computationally expensive!!
Basic Association Rule Mining Algorithm
Customer | Purchase
1        | Bread, Milk
2        | Bread, Diaper, Beer, Eggs
3        | Milk, Diaper, Beer, Coke
4        | Bread, Milk, Diaper, Beer
5        | Bread, Milk, Diaper, Coke

{Milk, Diaper} → {Beer}    (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}    (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}    (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}    (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}    (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}    (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
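A small sketch (my own code) that recomputes the support and confidence numbers above directly from their definitions: support(X→Y) is the fraction of transactions containing X∪Y, and confidence(X→Y) = support(X∪Y) / support(X).

```python
# Transactions from the Customer / Purchase table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"Milk", "Diaper", "Beer"}))        # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}))   # 0.666...
print(confidence({"Milk", "Beer"}, {"Diaper"}))   # 1.0
```

Enumerating every candidate rule this way is exactly what makes the naive approach computationally expensive, which is why support and confidence are checked in separate stages.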
Dimension Reduction
Given data points in d dimensions
Convert them to data points in r<d dimensions
With minimal loss of information
‣ Clustering
‣ Dimension Reduction
Reduce 2D data to 1D
Data Compression
‣ Projection: x → proj(x), where proj: ℝ^d → ℝ^k
‣ Measure of the “spread” of a set of points around their center of mass (mean)
‣ Variance:
  ‣ Measure of the deviation from the mean for points in one dimension
‣ Covariance:
  ‣ Measure of how much each of the dimensions varies from the mean with respect to the others
Covariance Matrix
https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/DecisionTrees.pdf
Eigen-decomposition of the Covariance Matrix
A v_i = λ_i v_i
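A minimal sketch (my own) of the pipeline implied by the last few slides: center the data, form the covariance matrix, eigen-decompose it, and project onto the top-r eigenvectors; the function name and test data are assumptions.

```python
import numpy as np

def pca_project(X, r):
    """Project (n, d) data onto its top-r principal components."""
    X_centered = X - X.mean(axis=0)            # subtract the mean
    cov = np.cov(X_centered, rowvar=False)     # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # A v_i = lambda_i v_i
    order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
    top = eigvecs[:, order[:r]]                # d x r projection matrix
    return X_centered @ top                    # n x r reduced data

# Example: reduce 2-D data that mostly lies along a line down to 1-D.
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=100)])
print(pca_project(X, r=1).shape)               # (100, 1)
```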
Application of Principal Components
‣ Auto-encoder-decoder:
‣ Clustering: K-means
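To illustrate the “auto-encoder-decoder” view in the first bullet (my own sketch, not the lecture's code): the top-k eigenvectors act as a linear encoder, their transpose as the decoder, and the reconstruction error is small when the data is nearly low-dimensional; the synthetic data below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=200)   # nearly redundant dimension
X[:, 4] = X[:, 1] - X[:, 2]                        # exactly redundant dimension

mean = X.mean(axis=0)
cov = np.cov(X - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs[:, np.argsort(eigvals)[::-1][:3]]      # d x k encoder (k = 3)

codes = (X - mean) @ W                             # "encode": project to k dims
X_hat = codes @ W.T + mean                         # "decode": reconstruct d dims
print(np.mean((X - X_hat) ** 2))                   # small reconstruction error
```

The reduced `codes` can then be fed to K-means, matching the second bullet: clustering in the lower-dimensional space.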