
Data Mining Unsupervised Techniques

This document discusses various unsupervised data mining techniques including market basket analysis, clustering, and dimension reduction. It provides an overview and examples of each technique. For market basket analysis, it discusses how to identify items frequently bought together and how to handle large datasets. For clustering, it explains k-means clustering and hierarchical clustering. It also discusses principal component analysis (PCA) as a method for dimension reduction.

Data Mining Techniques and Application in Business (Unsupervised)

Dr Sunil D Lakdawala
Content
 Market Basket Analysis
 Overview and Applications
 Example
 Virtual Items
 Problems with Big Data
 How to handle those problems
 Exercise – Groceries
Content (cont)
Clustering
 Overview
 K-Means Method
 Hierarchical Clustering
 Applications
 Exercise - IRIS
Dimension Reduction
 Overview
 Principal Components Analysis (PCA)
ASSOCIATION
(Market Basket Analysis)
Overview and Applications
 Unsupervised
 Tells what items are bought together (at the same time or
within a certain time period)
 Blue jeans, white shirt and black tie bought together
 Beer and diapers bought together on weekends by
young couples
 VCR bought within 6 months of buying a TV
 Share trading account opened within 3 months of
opening a demat account
 Certain symptoms go with a certain disease
(running nose, headache and high fever with flu)
 People who buy “DW” books also buy “BI” books

Data Mining - Market Basket Analysis 5


Overview and Applications (Cont)
 Tells what items are bought together (Cont)
 Which products should be advertised in Spanish?
(Hamburgers and Potato Chips in English and
Sausages and Doritos in Spanish)
 Improving WEB Page design for e-commerce sites
 Homeopathic working
 Hotel Chain Services
 Preventive Maintenance

Example
Sr No Jean Shirt Tie Sock Jacket

1 X X

2 X X X

3 X

4 X

5 X X X

6 X X

7 X

8 X X

9 X

10 X
Example
Jeans → Shirt
(Antecedent) (Consequent)
Support = 3/10 = 30% = P(Jeans, Shirt)
(fraction of transactions in which Jeans and Shirt are bought together)
Confidence = 3/4 = 75% = P(Jeans, Shirt) / P(Jeans)
(conditional probability of buying Shirt, given that Jeans are bought)
Lift = Confidence / P(Shirt) = (3/4) / (6/10) = 30/24 = 1.25
= P(Jeans, Shirt) / (P(Jeans) * P(Shirt))
LIMITATION
1. Value not considered
2. Frequency not considered
Example
Jeans, Shirt → Tie
(Antecedent) (Consequent)
Support = 1/10 = 10%
(fraction of transactions in which antecedent and consequent are bought
together)
Confidence = 1/3 = 33%
(conditional probability of buying the consequent, given the antecedent)
Lift = Confidence / P(Consequent) = (1/3) / (5/10) = 2/3
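
The support, confidence and lift figures in the two examples can be checked with a short Python sketch. The ten baskets below are hypothetical: they are chosen only so that the item counts match the figures quoted on the slides (Jeans in 4 baskets, Shirt in 6, Tie in 5), and the function names are illustrative rather than taken from any library.

```python
from fractions import Fraction

# Hypothetical baskets, chosen to match the counts quoted in the examples.
TRANSACTIONS = [
    {"Jeans", "Shirt"},
    {"Jeans", "Shirt", "Tie"},
    {"Shirt"},
    {"Tie"},
    {"Jeans", "Shirt", "Socks"},
    {"Shirt", "Tie"},
    {"Tie"},
    {"Jeans", "Jacket"},
    {"Shirt"},
    {"Tie"},
]


def probability(items, transactions):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= basket)
    return Fraction(hits, len(transactions))


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    support = probability(set(antecedent) | set(consequent), transactions)
    confidence = support / probability(antecedent, transactions)
    lift = confidence / probability(consequent, transactions)
    return support, confidence, lift


jeans_shirt = rule_metrics({"Jeans"}, {"Shirt"}, TRANSACTIONS)
jeans_shirt_tie = rule_metrics({"Jeans", "Shirt"}, {"Tie"}, TRANSACTIONS)
print("Jeans -> Shirt:", jeans_shirt)
print("Jeans, Shirt -> Tie:", jeans_shirt_tie)
```

A lift above 1 (first rule) means the consequent is bought more often with the antecedent than on its own; a lift below 1 (second rule) means it is bought less often.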
Virtual Items
 Store location, time of purchase, mode of
payment, customer profile (signed transaction)
 Virtual items are added to the real items of a transaction, e.g. Blue
Jeans and White Shirt bought at the Andheri shop
on Sunday
 Makes it possible to analyze time-wise and location-wise which
items go together (which items are bought
together during Diwali, which items are bought
together in a rich locality vs. a middle class locality)
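
One way to picture virtual items is as extra labels appended to each basket before the association analysis runs. The sketch below assumes a `store=` / `day=` naming scheme, which is purely illustrative.

```python
def add_virtual_items(items, store, weekday):
    """Return the basket with virtual items for store location and day added."""
    # The "store=" and "day=" prefixes are an assumed naming convention.
    return set(items) | {f"store={store}", f"day={weekday}"}


basket = add_virtual_items({"Blue Jeans", "White Shirt"},
                           store="Andheri", weekday="Sunday")
print(sorted(basket))
```

Rules mined from such baskets can then have a location or a day on either side, for example a rule whose antecedent includes `day=Sunday`.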

Problems of Large Data
For a menu having 100 items
# of combinations with 1 item: 100
# of combinations with 2 items: 4,950
# of combinations with 3 items: 161,700

A typical supermarket has 10,000 items. # of
combinations with 2 items: about 50 million; with 3 items:
about 167 billion!
Number of transactions: millions per year!!
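
These counts are binomial coefficients ("n choose k") and can be reproduced with Python's standard `math.comb`; the loop below is only a quick check of the arithmetic.

```python
from math import comb

# Number of distinct item combinations of size 1, 2 and 3.
for n_items in (100, 10_000):
    for size in (1, 2, 3):
        print(f"{n_items:>6} items, combinations of {size}: {comb(n_items, size):,}")
```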

Handling Problems of Big Data

 Pruning techniques: minimum support,
minimum confidence, minimum lift (typically 1)
 Use of taxonomies: e.g. instead of Coke,
Pepsi, etc., use soda, i.e. a higher category
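
A minimal sketch of pruning by minimum support follows. It counts single items first and builds pairs only from items that already meet the threshold, which is the idea behind Apriori-style pruning. The baskets and the 20% threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets for illustration only.
TRANSACTIONS = [
    {"Jeans", "Shirt"}, {"Jeans", "Shirt", "Tie"}, {"Shirt"}, {"Tie"},
    {"Jeans", "Shirt", "Socks"}, {"Shirt", "Tie"}, {"Tie"},
    {"Jeans", "Jacket"}, {"Shirt"}, {"Tie"},
]


def frequent_itemsets(transactions, min_support):
    """Return (frequent items, frequent pairs with counts) for a support threshold."""
    n = len(transactions)
    item_counts = Counter(item for basket in transactions for item in basket)
    frequent_items = {item for item, c in item_counts.items() if c / n >= min_support}

    # A pair can only be frequent if both of its items are frequent,
    # so rare items are dropped before pairs are generated.
    pair_counts = Counter()
    for basket in transactions:
        kept = sorted(basket & frequent_items)
        pair_counts.update(combinations(kept, 2))
    frequent_pairs = {pair: c for pair, c in pair_counts.items() if c / n >= min_support}
    return frequent_items, frequent_pairs


items, pairs = frequent_itemsets(TRANSACTIONS, min_support=0.2)
print(sorted(items))
print(pairs)
```

With these baskets, Socks and Jacket are pruned at the first step, so no pair containing them is ever counted.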

Taxonomy
(Figure: two category trees)
Carbonated Drink: Pepsi, Coke, ThumbsUp
Bakery Product: Bread, Cake
Lower levels shown in the figure: 200 ml Plastic; Britannia, Modern; Regular; 500 gm White Bread
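
Rolling items up to a higher taxonomy level can be sketched as a simple lookup. The mapping below covers only the two top categories named on the slide, and the baskets are hypothetical.

```python
# Item -> parent category, following the two top-level categories on the slide.
PARENT = {
    "Pepsi": "Carbonated Drink",
    "Coke": "Carbonated Drink",
    "ThumbsUp": "Carbonated Drink",
    "Bread": "Bakery Product",
    "Cake": "Bakery Product",
}


def roll_up(basket):
    """Replace each item by its parent category; unknown items are kept as-is."""
    return {PARENT.get(item, item) for item in basket}


# Hypothetical baskets: three different item pairs, one category pair.
baskets = [{"Pepsi", "Bread"}, {"Coke", "Cake"}, {"ThumbsUp", "Bread"}]
rolled = [roll_up(b) for b in baskets]
print(rolled)
```

At item level no pair repeats, but at category level every basket supports the same pair, which is how taxonomies reduce the number of combinations and raise support.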

CLUSTERING
Overview
Often the whole population is diverse, but it may consist
of a number of similar groups (clusters).
“Clustering” is an undirected knowledge discovery, or
unsupervised learning, technique of data mining. It can
spot such similar groups.
Once clusters have been detected, other methods must be
applied to figure out what each cluster means.
Clustering is similar to classification, but the classes / groups /
clusters are not predefined.

Data Mining - Clustering 15


Overview (Cont)

A common marketing strategy for the whole population may not work.
However, a separate marketing strategy for each cluster,
based on age, gender, income, marital status and years of
loyalty (how long the person has been a customer), might
work.

Applications and Exercise

Applications
 Hotel Chain

 Marketing Segmentation

 Offer by Pizza Hut

 Alumni

Exercise: IRIS

Method : K-means Clustering
K Means Clustering
 Decide K
 Decide input attributes
 Define Distance Function
 Goodness of cluster r: Intra-cluster distance / inter-cluster distance
 Iterative Method: when to stop
 Address Outliers
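
The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the points and starting centroids are hypothetical, Euclidean distance is an assumed choice of distance function, the loop stops when the centroids no longer move, and `goodness_ratio` is one possible reading of r (mean distance of points to their own centroid divided by mean distance between centroids).

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, initial_centroids, max_iter=100):
    """Assign points to the nearest centroid, recompute centroids, repeat."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # stopping rule: centroids did not move
            break
        centroids = updated
    return centroids, clusters


def goodness_ratio(centroids, clusters):
    """Assumed definition of r: mean intra-cluster / mean inter-cluster distance."""
    intra = [euclidean(p, c) for c, cluster in zip(centroids, clusters) for p in cluster]
    inter = [euclidean(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:]]
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


# Hypothetical data on comparable scales, K = 2.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, clusters = k_means(points, initial_centroids=[(1, 1), (8, 8)])
r = goodness_ratio(centroids, clusters)
print(centroids, r)
```

A small r means points sit close to their own centroid relative to the distance between centroids. Outliers are not handled here; a single far-away point would pull a centroid toward it.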

K Means Clustering
N = 50,000
K = ? = 3
X1: Age
X2: Income
P1 (18, 25,000)
P2 (43, 54,000)
D(Age1, Age2, Income1, Income2) = ?
-------------------------------
SI : (27, 28,000)
SII : (39, 32,000)
SIII : (59, 65,000)
r = Intra-cluster / Inter-cluster = 0.2
(Figure: scatter plot of Income against Age showing clusters SI, SII and SIII and points P1, P2 and P3)
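
The open question D(...) = ? can be explored with the points and cluster centres listed above. The sketch below uses plain Euclidean distance as one assumed choice, and then a rescaled version; the ranges used for rescaling (50 years, 50,000 income units) are hypothetical.

```python
import math

# (age, income) values taken from the slide.
POINTS = {"P1": (18, 25_000), "P2": (43, 54_000)}
CENTROIDS = {"SI": (27, 28_000), "SII": (39, 32_000), "SIII": (59, 65_000)}

# Assumed ranges, used only to put both attributes on a comparable scale.
AGE_RANGE, INCOME_RANGE = 50, 50_000


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def scaled(p):
    return (p[0] / AGE_RANGE, p[1] / INCOME_RANGE)


def scaled_euclidean(p, q):
    return euclidean(scaled(p), scaled(q))


def nearest_centroid(point, distance=euclidean):
    return min(CENTROIDS, key=lambda name: distance(point, CENTROIDS[name]))


raw = euclidean(POINTS["P1"], POINTS["P2"])
rescaled = scaled_euclidean(POINTS["P1"], POINTS["P2"])
print("raw distance P1-P2:", raw)            # dominated by the income difference
print("rescaled distance P1-P2:", rescaled)  # age and income both contribute
for name, point in POINTS.items():
    print(name, "->", nearest_centroid(point))
```

On the raw values the 25-year age gap barely changes the distance, because the income difference of 29,000 dominates; this is why the choice of distance function and scaling matters.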

BI&A Overview
Effect of outlier on Clustering (K=2)
(Figure: scatter plot of points P1 to P7)

Method : K-means Clustering (cont)

 Guidelines for K
 K=1? K=N? K : large integer? K: small integer?
 Plot “r” vs “K” and look for elbow shape
 Hierarchical Clustering
 Business Interpretation
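
The elbow guideline can be illustrated on a small one-dimensional example. Two swaps are assumed here and should be read as such: the within-cluster sum of squares is used in place of r (r is undefined at K=1, where there is no inter-cluster distance), and the starting centroids are picked evenly from the sorted data to keep the run deterministic. The ages are hypothetical.

```python
def k_means_1d(values, k, max_iter=100):
    """K-means on one attribute with a deterministic, evenly spread start."""
    data = sorted(values)
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def within_cluster_ss(values, k):
    """Sum of squared distances of each value to its cluster centroid."""
    centroids, clusters = k_means_1d(values, k)
    return sum((v - c) ** 2 for c, cluster in zip(centroids, clusters) for v in cluster)


# Hypothetical ages forming three loose groups.
ages = [18, 19, 21, 22, 40, 41, 43, 44, 60, 62, 63, 65]
wcss = {k: within_cluster_ss(ages, k) for k in range(1, 5)}
for k, value in wcss.items():
    print(f"K={k}: within-cluster sum of squares = {value:.1f}")
```

The measure drops sharply up to K=3 and only slightly after that, so the elbow suggests K=3 for this data. K=N would drive it to zero, which is why the elbow, not the minimum, is the guide.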

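Hierarchical Clustering - Example
Distance matrix:
     A    B    C    D    E
A         1    2    8    9
B              1    7    7
C                  10    8
D                        2
E

Dendrogram (Cluster on the horizontal axis, Distance on the vertical axis):
K=N=5: {A} {B} {C} {D} {E}
K=3 at distance 1
K=2 at distance 2
K=1 at distance D(E1,E2)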
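
The merge order can be traced with a short agglomerative sketch over the distance matrix from the slide. Single linkage (distance between two clusters = smallest distance between their members) is an assumption: the slide does not name the linkage, but this choice reproduces K=3 at distance 1 and K=2 at distance 2.

```python
from itertools import combinations

# Pairwise distances read from the upper triangle of the slide's matrix.
DIST = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 8, ("A", "E"): 9,
    ("B", "C"): 1, ("B", "D"): 7, ("B", "E"): 7,
    ("C", "D"): 10, ("C", "E"): 8,
    ("D", "E"): 2,
}


def point_distance(p, q):
    return DIST[(p, q)] if (p, q) in DIST else DIST[(q, p)]


def single_linkage(c1, c2):
    """Assumed cluster distance: the closest pair of members."""
    return min(point_distance(p, q) for p in c1 for q in c2)


def agglomerate(points):
    """Merge the two closest clusters until one remains; record each step."""
    clusters = [frozenset([p]) for p in points]
    history = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: single_linkage(*pair))
        height = single_linkage(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        history.append((height, sorted(sorted(c) for c in clusters)))
    return history


history = agglomerate("ABCDE")
for height, clusters in history:
    print(f"distance {height}: K={len(clusters)} {clusters}")
```

Under this assumption the final merge into K=1 happens at distance 7, the smallest distance between a member of {A, B, C} and a member of {D, E}. A different linkage (complete or average) would give different merge heights.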

Dimension Reduction
Overview
Objective:
 Too many variables, many of them
correlated or superfluous
 Accuracy of classification / prediction is reduced
 Cost of data collection and processing is high
 Need to reduce the number of variables


PCA: Principal Component Analysis (Cont)
 The transformation is defined in such a way that the
first principal component has the largest possible
variance (that is, it accounts for as much of the
variability in the data as possible)
 Each succeeding component in turn has the highest
variance possible under the constraint that it is
orthogonal to the preceding components.
 The resulting vectors form an uncorrelated
orthogonal basis set.
 PCA is sensitive to the relative scaling of the original
variables.
 Principal components are difficult to interpret, because
each one is a mix of all the original variables
PCA Eigenvalues
(Figure: scatter plot with the principal axes labelled λ1 and λ2)
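PCA
Z1 = C11*X1 + C12*X2 + C13*X3 + C14*X4
….
Z4 = C41*X1 + C42*X2 + C43*X3 + C44*X4

\[
\begin{pmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Z_4 \end{pmatrix}
=
\begin{pmatrix}
C_{11} & C_{12} & C_{13} & C_{14} \\
C_{21} & C_{22} & C_{23} & C_{24} \\
C_{31} & C_{32} & C_{33} & C_{34} \\
C_{41} & C_{42} & C_{43} & C_{44}
\end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix}
\]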

 Should we normalize X1, X2, ...?
 Z1, Z2, ... are uncorrelated with each other and have mean value
zero
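
The normalization question can be explored with a two-variable sketch. For two variables the eigenvalues and eigenvectors of the covariance matrix have a closed form, which is used here in place of a general eigen-solver; that substitution, the function names and the sample data are all assumptions made for illustration (the sample values are simply the ages and incomes that appear in the K-means slide).

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def standardize(values):
    m, sd = mean(values), math.sqrt(covariance(values, values))
    return [(v - m) / sd for v in values]


def pca_2d(x1, x2):
    """Closed-form PCA for two variables: eigenvalues, coefficients C, scores Z."""
    a, b, d = covariance(x1, x1), covariance(x1, x2), covariance(x2, x2)
    spread = math.hypot((a - d) / 2, b)
    eigenvalues = ((a + d) / 2 + spread, (a + d) / 2 - spread)
    theta = 0.5 * math.atan2(2 * b, a - d)  # angle of the first principal axis
    components = ((math.cos(theta), math.sin(theta)),
                  (-math.sin(theta), math.cos(theta)))
    m1, m2 = mean(x1), mean(x2)
    scores = [[c1 * (u - m1) + c2 * (v - m2) for u, v in zip(x1, x2)]
              for c1, c2 in components]
    return eigenvalues, components, scores


# Sample values only; not a real dataset.
age = [18, 27, 39, 43, 59]
income = [25_000, 28_000, 32_000, 54_000, 65_000]

_, raw_components, _ = pca_2d(age, income)
eigenvalues, std_components, (z1, z2) = pca_2d(standardize(age), standardize(income))
print("raw Z1 coefficients:", raw_components[0])           # income dominates
print("standardized Z1 coefficients:", std_components[0])  # equal weights
print("cov(Z1, Z2):", covariance(z1, z2))
```

Without normalization the first component is almost entirely income, because income has by far the larger variance. After standardizing, both variables get equal weight, and the resulting Z1 and Z2 have mean zero and zero covariance.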
