Data Mining Unsupervised Techniques
and Application in
Business
(Unsupervised)
Dr Sunil D Lakdawala
Content
Market Basket Analysis
Overview and Applications
Example
Virtual Items
Problems with Big Data
How to handle those problems
Exercise – Groceries
Content (cont)
Clustering
Overview
K-Means Method
Hierarchical Clustering
Applications
Exercise - IRIS
Dimension Reduction
Overview
Principal Components Analysis (PCA)
ASSOCIATION
(Market Basket Analysis)
Overview and Applications
Unsupervised
Tells which items are bought together (at the same time or
within a certain time period)
Blue jeans, white shirt and black tie bought together
Beer and diapers bought together on weekends by
young couples
VCR is bought within 6 months of buying a TV
Share trading account opened within 3 months of
opening a demat account
Certain symptoms go with a certain disease
(runny nose, headache and high fever with flu)
People who buy “DW” books also buy “BI” books
1 X X
2 X X X
3 X
4 X
5 X X X
6 X X
7 X
8 X X
9 X
10 X
Example
Jeans → Shirt
(Antecedent) → (Consequent)
Support = 3/10 = 30% = P(Jeans, Shirt)
(# of times Jeans and Shirt bought together, out of 10 transactions)
Confidence = 3/4 = 75% = P(Jeans, Shirt) / P(Jeans)
(Conditional probability of buying Shirt, given Jeans)
Lift = Confidence / P(Shirt) = (3/4) / (6/10) = 30/24 = 1.25
= P(Jeans, Shirt) / (P(Jeans) * P(Shirt))
LIMITATION
1. Value not considered
2. Frequency not considered
Example
Jeans, Shirt → Tie
(Antecedent) → (Consequent)
Support = 1/10 = 10%
(# of times antecedent and consequent bought
together, out of 10 transactions)
Confidence = 1/3 = 33%
(Conditional probability of buying the consequent, given the antecedent)
Lift = Confidence / P(Consequent) = (1/3) / (1/2) = 2/3
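The support, confidence and lift figures in the two examples can be checked with a short sketch. The baskets below are hypothetical: they are chosen only so that the counts agree with the numbers quoted in the examples (Jeans in 4 of 10 baskets, Shirt in 6, Tie in 5), and "Belt" is a made-up filler item. BASKETS and rule_metrics are illustrative names, not part of any library.

```python
from fractions import Fraction

# Hypothetical baskets (assumption): built to match the counts in the examples,
# not taken from the original transaction table.
BASKETS = [
    {"Jeans", "Shirt"},
    {"Jeans", "Shirt", "Tie"},
    {"Shirt"},
    {"Tie"},
    {"Jeans", "Shirt"},
    {"Shirt", "Tie"},
    {"Jeans"},
    {"Shirt", "Tie"},
    {"Tie"},
    {"Belt"},
]


def rule_metrics(antecedent, consequent, baskets=BASKETS):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    n_ante = sum(antecedent <= b for b in baskets)                # baskets containing the antecedent
    n_cons = sum(consequent <= b for b in baskets)                # baskets containing the consequent
    n_both = sum((antecedent | consequent) <= b for b in baskets) # baskets containing both
    support = Fraction(n_both, n)             # P(antecedent, consequent)
    confidence = Fraction(n_both, n_ante)     # P(consequent | antecedent)
    lift = confidence / Fraction(n_cons, n)   # confidence / P(consequent)
    return support, confidence, lift


for rule in [({"Jeans"}, {"Shirt"}), ({"Jeans", "Shirt"}, {"Tie"})]:
    support, confidence, lift = rule_metrics(*rule)
    print(sorted(rule[0]), "->", sorted(rule[1]), support, confidence, lift)
```

Exact fractions are used instead of floats so the results can be compared directly with the hand calculation. A lift above 1 (Jeans → Shirt) means the consequent is bought more often with the antecedent than on its own; a lift below 1 (Jeans, Shirt → Tie) means less often.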
Virtual Items
Store Location, Time of purchase, Mode of
Payment, Customer Profile (Signed Transaction)
Virtual items are added to the other items in a transaction, e.g. Blue
Jeans and White Shirt bought at the Andheri shop
on a Sunday
Able to analyze time-wise and location-wise which
items go together (which items are bought
together during Diwali; which items are bought
together in a rich locality vs. a middle-class locality)
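One possible way to represent virtual items is to append them to each basket as ordinary items with a "name=value" label, so that the same association analysis treats them like any other item. This is a minimal sketch under that assumption; add_virtual_items and the "store"/"day" attribute names are hypothetical.

```python
def add_virtual_items(items, **attributes):
    """Return the basket plus one virtual item per transaction attribute."""
    # Each attribute (store location, day, payment mode, ...) becomes an item
    # such as "store=Andheri", so rules can mix real and virtual items.
    virtual = {f"{name}={value}" for name, value in attributes.items()}
    return set(items) | virtual


basket = add_virtual_items({"Blue Jeans", "White Shirt"}, store="Andheri", day="Sunday")
print(sorted(basket))
```

With baskets augmented this way, a rule such as "day=Sunday, Blue Jeans → White Shirt" can be scored with the same support, confidence and lift measures as a rule between real items.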
500 gm White Bread
Applications
Hotel Chain
Marketing Segmentation
Alumni
Exercise: IRIS
cluster distance
Iterative Method .. When to stop
Address Outliers
BI&A Overview
Effect of outlier on Clustering (K=2)
(Figure: scatter plot of points P1 to P7)
Method: K-means Clustering (cont)
Guidelines for K
K=1? K=N? K: large integer? K: small integer?
Plot “r” vs “K” and look for an elbow shape
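The iterative method, its stopping rule and the search for an elbow can be sketched in plain Python. Several things here are assumptions rather than the slide's own definitions: the six 2-D points are made up, the initial centroids are simply evenly spaced picks from the data, and the within-cluster sum of squares (WCSS) stands in for the quantity plotted against K, since the slide does not define "r".

```python
# Hypothetical 2-D points (assumption): two visibly separate groups.
POINTS = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Return (labels, centroids, wcss) for a basic K-means run."""
    n = len(points)
    # Simplified initialisation (assumption): k evenly spaced points from the data.
    centroids = [points[i * n // k] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # stop when no point changes cluster
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


# WCSS for every K from 1 to N; the "elbow" is where the drop flattens out.
wcss_by_k = {k: kmeans(POINTS, k)[2] for k in range(1, len(POINTS) + 1)}
for k, value in wcss_by_k.items():
    print(k, round(value, 3))
```

The two extremes behave as the guideline questions suggest: K=1 puts everything in one cluster and gives the largest WCSS, while K=N gives each point its own cluster and a WCSS of zero, so neither is informative on its own.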
Hierarchical Clustering
Business Interpretation
(Figure: dendrogram with Distance on one axis and Cluster on the other; leaves {A} {B} {C} {D} {E}; cut levels marked K=2, K=3 and K=N=5)
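A bottom-up (agglomerative) version of hierarchical clustering on five items A to E can be sketched as follows. The 1-D coordinates are hypothetical, and single linkage (distance between the closest members of two clusters) is an assumed choice, since the slides do not name a linkage rule.

```python
from itertools import combinations

# Hypothetical 1-D positions for the five items (assumption).
POSITIONS = {"A": 1.0, "B": 1.5, "C": 4.0, "D": 4.6, "E": 9.0}


def single_link(c1, c2, positions):
    """Single-linkage distance: smallest gap between any two members."""
    return min(abs(positions[a] - positions[b]) for a in c1 for b in c2)


def agglomerate(positions, k):
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters = [frozenset([name]) for name in positions]  # start with K = N
    while len(clusters) > k:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: single_link(pair[0], pair[1], positions),
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


for k in (5, 3, 2):
    print(k, agglomerate(POSITIONS, k))
```

Stopping the merging at a given K corresponds to cutting the dendrogram at the matching distance level; with these made-up positions, K=3 yields {A, B}, {C, D} and {E}, and K=2 merges the first two groups.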
Variables may be correlated or superfluous
Accuracy of classification / prediction reduces
(Figure: scatter plot annotated with λ1 and λ2)
PCA
Z1 = C11*X1 + C12*X2 +C13*X3 + C14*X4
….
Z4 = C41*X1 + C42*X2 +C43*X3 + C44*X4
[Z1]   [C11 C12 C13 C14] [X1]
[Z2] = [C21 C22 C23 C24] [X2]
[Z3]   [C31 C32 C33 C34] [X3]
[Z4]   [C41 C42 C43 C44] [X4]
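To make the coefficients concrete, the sketch below estimates the first row (C11 ... C14) for a small made-up data set with four variables. It assumes the usual PCA definition, where that row is the eigenvector of the covariance matrix with the largest eigenvalue, and it uses power iteration as a simple stand-in for a full eigen-decomposition. ROWS, covariance_matrix and first_component are hypothetical names.

```python
import math

# Hypothetical observations of X1..X4 (assumption): not real measurements.
ROWS = [
    (2.0, 1.0, 0.5, 3.0),
    (3.0, 1.6, 0.4, 2.6),
    (4.0, 2.1, 0.6, 2.1),
    (5.0, 2.4, 0.5, 1.4),
    (6.0, 3.1, 0.4, 1.0),
]


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Power iteration: return (largest eigenvalue, unit eigenvector)."""
    d = len(cov)
    c = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * c[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        c = [x / norm for x in w]
    eigenvalue = sum(c[i] * sum(cov[i][j] * c[j] for j in range(d)) for i in range(d))
    return eigenvalue, c


cov = covariance_matrix(ROWS)
lam1, c1 = first_component(cov)
# Z1 = C11*X1 + C12*X2 + C13*X3 + C14*X4 for the first observation
z1 = sum(coef * x for coef, x in zip(c1, ROWS[0]))
print(round(lam1, 4), [round(v, 4) for v in c1], round(z1, 4))
```

The eigenvalue returned alongside the coefficients is the variance captured by Z1; comparing it with the total variance (the sum of the diagonal of the covariance matrix) indicates how much information a single component retains.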