Factor Analysis and Segmentation
Disclaimer: This material is protected under copyright act AnalytixLabs ©, 2011-2016. Unauthorized use and/or duplication of this material or any part of this material, including data, in any form without explicit and written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal action.
Introduction to Segmentation
Segmentation
Each individual is so different that ideally we would want to reach out to each one of them in a different way.
Solution: Identify segments where people have similar characteristics and target each of these segments in a different way.
Total Population (1,000)
[Tree diagram: the population is split into segments with differing average delinquency age (0, 12, 15 and 75 days), average age (25-50 yrs.) and average utilization (40% to 90%)]
We can exclude the group with avg. delinquency age = 75 days from mailing
This type of segmentation is known as ‘Subjective Segmentation’. It gives the salient characteristics of
the best customers
Applications of Segmentation
Customer Segmentation
• Customer segmentation is the process of splitting your customer database into smaller groups. By
focusing on specific customer types, you will maximize customer lifetime value and better understand
who they are and what they need.
• Age and gender – younger customers are often more impulsive and frequent buyers, while female customers might have a higher long-term value
• Acquisition channel – e.g. customers from social media are often less valuable than customers navigating to your site directly
• First product purchased – pay close attention to the transaction value and product category to differentiate between price-focused and quality-focused customers
• Device type – e.g. customers using a mobile device typically spend less than customers on a desktop PC
• Recency, Frequency and Monetary value (RFM) of customer transactions – together these three dimensions form a complete segmentation strategy in their own right
• etc.
Applications of customer segmentation
Customer segmentation can also help other parts of your business. It will allow you to:
• Grow your business more quickly by focusing marketing campaigns on segments with a higher propensity to buy
• Improve customer lifetime value by identifying purchasing patterns and targeting customers when they are in the market
• Identify new product opportunities and improve the products you already have
• Optimize operations by focusing on the geographies, age groups, etc. with the most value
• Gain brand evangelists by incentivising them to comment, review or talk about your product with free gifts or discounts
• Reactivate customers who have churned and no longer interact with you
Types of customer Segmentation
Value-Based Segmentation: Customer ranking and segmentation according to current and expected/estimated customer value
Life-Stage Segmentation: Segmentation according to the current life stage to which the customer belongs
Behavioral Segmentation:
• Supervised (with a dependent variable): Segment customers using a predictive algorithm, based on a high number of factors that potentially drive a specific outcome. Used when data-driven segments are desired, but first and foremost the segments need to be differentiated on a specific outcome/metric (e.g. revenue). Typical algorithm: CHAID. Example: a telecom client segmented customers on the various factors that drive churn propensity and targeted high-churn segments with retention campaigns and offers.
• Unsupervised (without a dependent variable): Segment customers using a clustering algorithm, based on a high number of factors. Used when data-driven segments are desired and the segments need to be differentiated across many behavioral factors. Typical algorithms: TwoStep, K-Means. Example: a retail client segmented customers on behavioral shopping factors that included category spend, shopping frequency/tendency, and store/channel shopped to inform merchandising and offer strategy.
RFM Segmentation
RFM Segmentation: Steps (a minimal scoring sketch follows below)
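The step-by-step RFM slides are figures in the original deck. As a rough illustration of the idea rather than the deck's exact recipe, the sketch below (assuming a simple transactions table with customer_id, order_date and amount columns, which are hypothetical names) scores each customer 1-5 on Recency, Frequency and Monetary value using quantiles and concatenates the digits into an RFM segment.

```python
import pandas as pd

# Hypothetical transaction data: one row per order (column names are assumptions)
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
    "order_date": pd.to_datetime([
        "2016-01-05", "2016-03-20", "2015-11-02", "2016-02-14", "2016-03-01",
        "2016-03-28", "2015-12-19", "2016-01-22", "2016-02-11", "2016-03-15"]),
    "amount": [120.0, 80.0, 40.0, 300.0, 150.0, 90.0, 60.0, 25.0, 45.0, 500.0],
})

snapshot = tx["order_date"].max() + pd.Timedelta(days=1)

# Recency = days since last purchase, Frequency = number of orders, Monetary = total spend
rfm = tx.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

# Rank-based quintile scores (1-5); recency is reversed because smaller is better
rfm["R"] = pd.qcut(rfm["recency"].rank(method="first"), 5, labels=[5, 4, 3, 2, 1])
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])

rfm["RFM_segment"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
print(rfm.sort_values("RFM_segment", ascending=False))
```

Customers with high scores on all three dimensions (e.g. segment "555") are typically the best candidates for retention and cross-sell campaigns.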
Behavioral Segmentation - Clustering Techniques
• K-means
• Iteratively re-assign points to the nearest cluster center
• Agglomerative clustering (hierarchical)
• Start with each point as its own cluster and iteratively merge the closest
clusters
• Mean-shift clustering
• Estimate modes of pdf
• Spectral clustering
• Split the nodes in a graph based on assigned links with similarity weights
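As a minimal, hedged sketch of the first two techniques in this list (not code from the deck), the scikit-learn snippet below runs K-means and agglomerative (hierarchical) clustering on synthetic data; the density-based method (DBSCAN) is illustrated later in the section.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

# Synthetic customer-like data: 300 points around 3 centres
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.2, random_state=42)
X = StandardScaler().fit_transform(X)   # clustering is distance-based, so scale first

# K-means: iteratively re-assign points to the nearest cluster centre
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Agglomerative (hierarchical): start with singletons and merge the closest clusters
agglo = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

print("K-means cluster sizes:      ", np.bincount(kmeans.labels_))
print("Agglomerative cluster sizes:", np.bincount(agglo.labels_))
```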
Example: behavioral segment profiles

| Metric | Big Ticket | Frequent | Small Ticket | Infrequent | Returner | Overall |
|---|---|---|---|---|---|---|
| % Customers | 9.8 | 4.2 | 13.5 | 69.5 | 6.6 | 100.0 |
| % Revenue | 27.4 | 33.6 | 15.4 | 13.5 | 10.1 | 100.0 |
| Revenue per customer ($) | 1,038 | 8,618 | 1,209.1 | 220 | 1,613.5 | 1,077.2 |
| Visits per customer | 3.1 | 34.2 | 16.1 | 2.1 | 8.3 | 4.8 |
| Basket size ($) | 970.1 | 252.7 | 75.1 | 105.2 | 165.1 | 224.8 |
| Average departments shopped | 3.6 | 5.5 | 1.9 | 1.2 | 2.9 | 1.9 |
| Stores shopped | 1.1 | 3.0 | 1.8 | 1.1 | 1.2 | 1.7 |
| Returning propensity (%) | 0.3 | 6.5 | 5.5 | 0.3 | 25.5 | 3.2 |
| Shopped in December (%) | 15.1 | 70.8 | 53.3 | 19.4 | 23.3 | 26.6 |
| Shopped on Memorial Day (%) | 1.6 | 17.9 | 2.4 | 0.9 | 2.1 | 2.2 |
| Shopped on Labor Day (%) | 1.0 | 14.1 | 1.8 | 0.6 | 1.5 | 1.7 |
| Shopped on President's Day (%) | 0.7 | 12.0 | 1.8 | 0.6 | 1.8 | 1.5 |
| Average Discount Rate (%) | 14.8 | 11.4 | 6.6 | 4.5 | 10.6 | 11.2 |
| Customer lifetime (months) | 25.2 | 46.2 | 42.2 | 28.4 | 27.2 | 30.8 |
Note that key profile variables are not always the same as basis variables
used to generate the segmentation
Subjective Segmentation: Cluster Analysis Process
Step 1: Data cleaning and preparing the data set for analysis
Step 2: Creating new relevant variables
Step 3: Selection of variables
Overall population
K-Means clustering

| Customer | Weight | Age |
|---|---|---|
| Cust1 | 68 | 25 |
| Cust2 | 72 | 70 |
| Cust3 | 100 | 28 |

Which two of these customers are similar now? It is not certain that all these distance measures will result in the same clusters in the above example; the sketch below makes this concrete.
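To make the point concrete, here is a small illustration (not part of the deck) that computes pairwise Euclidean distances between the three customers before and after z-score standardization: the distances, and hence which customers look most alike, depend on the scale of the variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The three customers from the slide: columns are Weight, Age
X = np.array([[68, 25],    # Cust1
              [72, 70],    # Cust2
              [100, 28]])  # Cust3

def pairwise_dist(M):
    """Euclidean distance between every pair of rows, keyed by (customer i, customer j)."""
    return {(i + 1, j + 1): round(float(np.linalg.norm(M[i] - M[j])), 2)
            for i in range(len(M)) for j in range(i + 1, len(M))}

print("Raw Euclidean distances:        ", pairwise_dist(X))
print("Standardized (z-score) distances:", pairwise_dist(StandardScaler().fit_transform(X)))
```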
Density-Based Clustering
• Basic idea
– Clusters are dense regions in the data space, separated by
regions of lower object density
– A cluster is defined as a maximal set of density-connected points
– Discovers clusters of arbitrary shape
• Method
– DBSCAN
Density Definition
• ε-Neighborhood of p and ε-Neighborhood of q: all points within a radius ε of the point
• Density of p is “high” (MinPts = 4)
• Density of q is “low” (MinPts = 4)
Core, Border & Outlier
• Directly density-reachable
• An object q is directly density-reachable from object p if p is a core object and q is in p’s ε-neighborhood.
MinPts = 4
Density-reachability
• An object q is density-reachable from an object p if there is a chain of objects o1, …, on with o1 = p and on = q such that each oi+1 is directly density-reachable from oi
DBSCAN Algorithm: Example
• Parameters: ε = 2 cm, MinPts = 3
for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o and assign them to a new cluster
        else
            assign o to NOISE
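A minimal scikit-learn sketch of the same idea (not the deck's own code), run on synthetic "two moons" data where density-based clustering shines; eps corresponds to the ε-neighbourhood radius and min_samples to MinPts.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: a shape K-means handles badly but DBSCAN handles well
X, _ = make_moons(n_samples=300, noise=0.06, random_state=42)

# eps plays the role of the epsilon-neighbourhood radius, min_samples of MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                      # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))
```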
DBSCAN: Sensitive to Parameters
DBSCAN: Determining EPS and MinPts
• The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
• Noise points have their k-th nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its k-th nearest neighbor
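As a hedged sketch of that procedure (illustrative, not from the deck), the code below computes each point's distance to its k-th nearest neighbour and sorts it; the "knee" of this curve is a common choice of Eps.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.06, random_state=42)

k = 5  # typically k = MinPts
# Ask for k + 1 neighbours because the nearest "neighbour" of each training point is itself
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dist[:, -1])            # sorted distance to the k-th nearest neighbour

# In a notebook you would plot k_dist and look for the knee; here we print a few quantiles
for q in (0.50, 0.90, 0.95, 0.99):
    print(f"{int(q * 100)}th percentile of {k}-NN distance: {np.quantile(k_dist, q):.3f}")
```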
When DBSCAN Works Well
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
[Figure: original points vs. the clusters DBSCAN finds with MinPts = 4, Eps = 9.92]
[Figure: the general classification workflow — a training set of labelled records (Tid, Attrib1, Attrib2, Attrib3, Class) is fed to a learning algorithm to induce a model, which is then applied to a test set whose class labels (e.g. Tid 11 and 15) are unknown]
Classification Techniques
• Decision Tree
• Naïve Bayes
• Nearest Neighbor
• Rule-based Classification
• Logistic Regression
• Support Vector Machines
• Ensemble methods
• ……
Example of a Decision Tree

Training data:

| Tid | Refund | Marital Status | Taxable Income | Cheat |
|---|---|---|---|---|
| 1 | Yes | Single | 125K | No |
| 2 | No | Married | 100K | No |
| 3 | No | Single | 70K | No |
| 4 | Yes | Married | 120K | No |
| 5 | No | Divorced | 95K | Yes |
| 6 | No | Married | 60K | No |
| 7 | Yes | Divorced | 220K | No |
| 8 | No | Single | 85K | Yes |
| 9 | No | Married | 75K | No |
| 10 | No | Single | 90K | Yes |

One tree that fits this data (root split on MarSt):
• MarSt = Married → NO
• MarSt = Single or Divorced → test Refund
  – Refund = Yes → NO
  – Refund = No → test TaxInc
    – TaxInc < 80K → NO
    – TaxInc > 80K → YES

There could be more than one tree that fits the same data!
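To make the example concrete, here is a small scikit-learn sketch (an illustration, not part of the deck) that fits a decision tree to the ten records above; because more than one tree fits the same data, the learned tree need not match either of the trees drawn in the slides.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 10-record training set from the slide
df = pd.DataFrame({
    "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital": ["Single", "Married", "Single", "Married", "Divorced",
                "Married", "Divorced", "Single", "Married", "Single"],
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in K
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(df[["Refund", "Marital", "Income"]])   # one-hot encode the categoricals
y = df["Cheat"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```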
Decision Tree Classification Task
[Figure: the training set (Tid 1-10 with Attrib1, Attrib2, Attrib3 and Class) is passed to a tree induction algorithm that learns a decision tree; the tree is then applied as the model to the test set (e.g. Tid 11 and 15, Class = ?)]
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Using a tree with Refund at the root (Refund → MarSt → TaxInc), start from the root and follow the branches that match the record:
• Refund = No → move to the MarSt node
• Marital Status = Married → reach the leaf labelled NO
• The test record is therefore classified as Cheat = No
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm
– CART
– ID3, C4.5
– SLIQ, SPRINT
– ……
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t
• If Dt contains records that all belong to the same class, t is a leaf node labelled with that class; otherwise, split Dt on an attribute test and apply the procedure recursively to each child
[Figure: Hunt’s algorithm applied to the Refund / Marital Status / Taxable Income data — successive splits on Refund, Marital Status (Single, Divorced vs. Married) and Taxable Income (< 80K vs. >= 80K) produce leaves labelled Cheat and Don’t Cheat]
Tree Induction
• Greedy strategy
– Split the records based on an attribute test that optimizes certain
criterion
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?
• Splitting on an ordinal attribute such as Size: what about the split {Small, Large} vs. {Medium}? (it violates the order of the attribute values)
Splitting Based on Continuous Attributes
• Binary split: a single threshold test, e.g. Taxable Income > 80K? → Yes / No
• Multi-way split: discretize into ranges, e.g. Taxable Income? → < 10K, …, > 80K
How to Determine the Best Split
• Before splitting, the parent node has impurity M0
• Nodes with a homogeneous class distribution are preferred: a node with counts (C0: 9, C1: 1) has low impurity, while (C0: 5, C1: 5) has high impurity
• For each candidate test (e.g. attribute A? vs. attribute B?), compute the weighted impurity of the resulting child nodes (M12 for the split on A, M34 for the split on B)
• Gain = M0 – M12 vs. M0 – M34: choose the test with the larger gain
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
Measure of Impurity: GINI
• Gini Index for a given node t:
  GINI(t) = 1 − Σ_j [p(j|t)]²
  (NOTE: p(j|t) is the relative frequency of class j at node t)
• Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
• Minimum (0) when all records belong to one class, implying most useful information
Examples (two-class node counts → Gini): (C1=0, C2=6) → 0.000; (C1=1, C2=5) → 0.278; (C1=2, C2=4) → 0.444; (C1=3, C2=3) → 0.500
Examples for computing GINI
GINI(t) = 1 − Σ_j [p(j|t)]²
When a node is split into k partitions (children), the quality of the split is computed as
GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)
where n_i = number of records at child i and n = number of records at the parent node.
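A short illustrative sketch that implements these two formulas and reproduces the node examples above (the split at the end is made up for demonstration).

```python
import numpy as np

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for the class counts at node t."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(1.0 - np.sum(p ** 2))

def gini_split(children):
    """Weighted Gini of a split: sum_i (n_i / n) * GINI(i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# The example nodes from the slide
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, "-> Gini =", round(gini(counts), 3))

# A made-up split of a (7, 5) parent into children (5, 2) and (2, 3)
print("Split Gini =", round(gini_split([[5, 2], [2, 3]]), 3))
```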
Alternative Splitting Criterion: Entropy
• Entropy at a given node t:
  Entropy(t) = − Σ_j p(j|t) · log₂ p(j|t)
• Information Gain for a split of the parent node p into k partitions:
  GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)
  where n_i is the number of records in partition i and n the number of records in the parent node.
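A matching illustrative sketch for entropy and information gain, using a made-up split of a parent node with 5 records of each class.

```python
import numpy as np

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0 * log(0) treated as 0."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent_counts, children_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)."""
    n = float(sum(parent_counts))
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

# Made-up example: a (5, 5) parent split into children (4, 1) and (1, 4)
print("Entropy(parent) =", round(entropy([5, 5]), 4))
print("Gain of split   =", round(information_gain([5, 5], [[4, 1], [1, 4]]), 4))
```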
Stopping Criteria for Tree Induction
• Stop expanding a node when all its records belong to the same class
• Stop when all records have identical (or nearly identical) attribute values
• Early termination (pre-pruning)
Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for
many simple data sets
Underfitting and Overfitting (Example)
Circular points: 0.5 ≤ sqrt(x₁² + x₂²) ≤ 1
Triangular points: sqrt(x₁² + x₂²) < 0.5 or sqrt(x₁² + x₂²) > 1
Underfitting and Overfitting
Overfitting
Occam’s Razor
• Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If generalization error improves after trimming,
• replace sub-tree by a leaf node.
– Class label of leaf node is determined from majority class of
instances in the sub-tree
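scikit-learn's built-in post-pruning uses cost-complexity pruning (the ccp_alpha parameter) rather than the sub-tree replacement described above; the hedged sketch below grows a full tree and then evaluates progressively stronger pruning on held-out data (ideally a separate validation set, not the final test set).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow the tree to its entirety, then consider increasingly aggressive pruning levels
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best = None
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    acc = tree.score(X_te, y_te)          # generalization estimate on held-out data
    if best is None or acc > best[1]:
        best = (alpha, acc, tree.get_n_leaves())

print(f"best ccp_alpha={best[0]:.4f}, held-out accuracy={best[1]:.3f}, leaves={best[2]}")
```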
Handling Missing Attribute Values
Computing Impurity Measure

| Tid | Refund | Marital Status | Taxable Income | Class |
|---|---|---|---|---|
| 1 | Yes | Single | 125K | No |
| 2 | No | Married | 100K | No |
| 3 | No | Single | 70K | No |
| 4 | Yes | Married | 120K | No |
| 5 | No | Divorced | 95K | Yes |
| 6 | No | Married | 60K | No |
| 7 | Yes | Divorced | 220K | No |
| 8 | No | Single | 85K | Yes |
| 9 | No | Married | 75K | No |
| 10 | ? (missing) | Single | 90K | Yes |

Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813

Class counts by Refund value:
• Refund = Yes: Class = Yes: 0, Class = No: 3
• Refund = No: Class = Yes: 2, Class = No: 4
• Refund = ?: Class = Yes: 1, Class = No: 0

Split on Refund (using only the 9 records with a known Refund value):
• Entropy(Refund = Yes) = 0
• Entropy(Refund = No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
• Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551
• Gain = 0.9 × (0.8813 − 0.551) = 0.9 × 0.3303 ≈ 0.297
Distribute Instances
• The 9 records with a known Refund value go to the usual child: the Refund = Yes child holds (Class = Yes: 0, Class = No: 3) and the Refund = No child holds (Class = Yes: 2, Class = No: 4)
• Record Tid 10 (Refund = ?, Single, 90K, Class = Yes) is sent to both children: assign it to the Refund = Yes (left) child with weight 3/9 and to the Refund = No (right) child with weight 6/9
Classify Instances
Other Issues
• Data Fragmentation
• Search Strategy
• Expressiveness
• Tree Replication
Data Fragmentation
Search Strategy
• Finding an optimal decision tree is computationally hard; the algorithm presented so far uses a greedy, top-down, recursive partitioning strategy
• Other strategies?
  – Bottom-up
  – Bi-directional
Expressiveness
[Figure: a decision tree with splits x < 0.43, then y < 0.47 and y < 0.33, together with the corresponding axis-parallel rectangular regions in the (x, y) unit square, each region dominated by a single class]
• The borderline between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Oblique Decision Trees
• The test condition may involve multiple attributes, e.g. x + y < 1 separating Class = + from Class = –
• A single oblique split can capture a boundary that would otherwise require many axis-parallel splits
Precision, Recall and F-measure
From the confusion matrix of PREDICTED CLASS vs. ACTUAL CLASS, with a = true positives, b = false negatives and c = false positives:
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
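A tiny sketch of the same formulas; the counts a, b and c below are made up for illustration.

```python
def precision_recall_f1(a, b, c):
    """a = true positives, b = false negatives, c = false positives."""
    p = a / (a + c)
    r = a / (a + b)
    f = 2 * r * p / (r + p)          # equivalently 2a / (2a + b + c)
    return p, r, f

p, r, f = precision_recall_f1(a=40, b=10, c=20)
print(f"precision={p:.2f}, recall={r:.2f}, F-measure={f:.2f}")
```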
Methods of Estimation
• Holdout
– Reserve 2/3 for training and 1/3 for testing
• Random subsampling
– Repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
• Stratified sampling
– oversampling vs undersampling
• Bootstrap
– Sampling with replacement
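An illustrative scikit-learn sketch of the holdout and k-fold approaches (the dataset and classifier are just placeholders, not from the deck).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Holdout: reserve 2/3 for training and 1/3 for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0, stratify=y)
holdout_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

# k-fold cross-validation: train on k-1 partitions, test on the remaining one
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(f"holdout accuracy:    {holdout_acc:.3f}")
print(f"10-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```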
Take-away Message
• What’s classification?
• How to use decision tree to make predictions?
• How to construct a decision tree from training data?
• How to compute gini index, entropy, misclassification error?
• How to avoid overfitting by pre-pruning or post- pruning decision
tree?
• How to evaluate classification model?
Objective Segmentation: Decision Trees
[Decision tree output: N in node = 50,000; Average = 0.4; splits on Age and wine/liquor purchase behaviour]
• Rule6 (0.31, 38.4): No wine purchase in Sep to Nov 2012; Age > 46; Total wine transactions in Sep to Nov 2012 in {4, 5, 6, 7, 11}; Average unit price of liquor purchase in Sep to Nov 2012 <= 457.778
• Rule7 (0.14, 7.3): No wine purchase in Sep to Nov 2012; Age > 46; Total wine transactions in Sep to Nov 2012 in {4, 5, 6, 7, 11}; Average unit price of liquor purchase in Sep to Nov 2012 in (457.778, 1550]
• Rule8 (35.9): No wine purchase in Sep to Nov 2012; Age <= 46
• Rule9 (4.7): Average unit price of wine purchase in Sep to Nov 2012 <= 980; Age <= 46
Decision Trees: CHAID Segmentation
CHAID Algorithm
Introduction to Factor Analysis - PCA
Look at the cricket team players data below
| Player | Avg Runs | Total Wickets | Height | Not Outs | Highest Score | Best Bowling |
|---|---|---|---|---|---|---|
| 1 | 45 | 3 | 5.5 | 15 | 120 | 1 |
| 2 | 50 | 34 | 5.2 | 34 | 209 | 2 |
| 3 | 38 | 0 | 6 | 36 | 183 | 0 |
| 4 | 46 | 9 | 6.1 | 78 | 160 | 3 |
| 5 | 37 | 45 | 5.8 | 56 | 98 | 1 |
| 6 | 32 | 0 | 5.10 | 89 | 183 | 0 |
| 7 | 18 | 123 | 6 | 2 | 35 | 4 |
| 8 | 19 | 239 | 6.1 | 3 | 56 | 5 |
| 9 | 18 | 96 | 6.6 | 5 | 87 | 7 |
| 10 | 16 | 83 | 5.9 | 7 | 32 | 7 |
| 11 | 17 | 138 | 5.10 | 9 | 12 | 6 |
Describe the players
• To find a linear combination of the original variables that has the largest variance
possible.
• Need some restriction on the entries in the linear combination or problem is not
well defined.
[Figure: scatter plot of LTG (x-axis) vs. LTN (y-axis) with two candidate linear combinations drawn through the point cloud: 0.7071·LTN + 0.7071·LTG and 0.5034·LTN + 0.8641·LTG]
What is happening?
• Standard regression problem with response y and regressors X1, X2, …, Xp.
• And you may want to summarize these p responses with one number (“index”)
that best captures the diversity in responses.
• E.g. it is common to add the responses, or average them, perhaps being sensitive to questions that are reverse coded.
• Already should be clear to you that a simple averaging may not be the best way
to summarize the original p questions.
Reduction of Dimension
• Often able to replace the original variables X1, X2, …, Xp with a few new variables,
say, U1, U2, …, Uk where k is much smaller than p.
• By plotting the first two or three pairs of these new variables you can often see
structure you wouldn’t otherwise be able to see (e.g. clustering).
Interpretation
• In rarer cases the new variables, U1, U2, …, Uk, are interpretable and point to
some new facet of the study.
• As you will see, however, one must be very careful with this use of Principal
Components since it is a prime opportunity to go astray and over interpret.
• Look for weights a11, a12, …, a1p such that U1 = a11·X1 + a12·X2 + … + a1p·Xp has the largest variance, subject to the restriction that a11² + a12² + … + a1p² = 1
• The numbers a11, a12, …, a1p are called different things in different books. In SAS they are arrayed in a column and called the first principal component “eigenvector”.
• If the Xi variables have had their individual means subtracted off, then the new variable U1 is called the first principal component, or in most texts, the first principal component score.
What’s Next?
• Look for weights a21, a22, …, a2p such that U2 = a21·X1 + a22·X2 + … + a2p·Xp has the next largest variance, subject to the restriction that a21² + a22² + … + a2p² = 1
• The numbers a21, a22, …, a2p are called different things in different books. In SAS they are arrayed in a column and called the second principal component “eigenvector”.
• If the Xi variables have had their individual means subtracted off, then the new variable U2 is called the second principal component, or in most texts, the second principal component score.
What’s New?
• Any two arrays of weights will cross-multiply and sum to 0. Example: a11·a21 + a12·a22 + … + a1p·a2p = 0
• This is the same as saying that any two of the new variables will be uncorrelated. Example: corr(U1, U2) = 0.
How Far Does This Go?
• We will look at two or three criteria for how many of these scores to
construct. We’ll start with our common sense.
• Most of the time it is not as hard as it might sound. Basically, we will look
at “how much variance” in the original data is summarized by each new
component variable.
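As a hedged sketch tying this back to the cricket-player data from the earlier slide, the code below standardizes the six performance variables and inspects how much variance each principal component summarizes, along with the first eigenvector of weights.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Cricket-player data from the earlier slide: columns are Avg Runs, Total Wickets,
# Height, Not Outs, Highest Score, Best Bowling
X = np.array([
    [45,   3, 5.5, 15, 120, 1],
    [50,  34, 5.2, 34, 209, 2],
    [38,   0, 6.0, 36, 183, 0],
    [46,   9, 6.1, 78, 160, 3],
    [37,  45, 5.8, 56,  98, 1],
    [32,   0, 5.1, 89, 183, 0],
    [18, 123, 6.0,  2,  35, 4],
    [19, 239, 6.1,  3,  56, 5],
    [18,  96, 6.6,  5,  87, 7],
    [16,  83, 5.9,  7,  32, 7],
    [17, 138, 5.1,  9,  12, 6],
])

# Mean-centre and scale each variable, then extract the principal components
Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)

print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("First eigenvector (weights a11..a1p):", np.round(pca.components_[0], 3))

scores = pca.transform(Z)[:, :2]   # first two principal component scores per player
print("First two PC scores for player 1:", np.round(scores[0], 3))
```

One would expect the leading component here to contrast batting-oriented variables with bowling-oriented ones, but that is for the explained-variance output and loadings to confirm rather than something asserted by the deck.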
Two Basic Constructs
• Mean
• Projection
Recall
[Scree plot: eigenvalue (y-axis, roughly 0 to 3) against number of components (1 to 5); the elbow suggests two components]
Loadings
Definition
Component loadings are the ordinary product-
moment correlation between each original variable
and each component score.
Interpretation
By looking at the component loadings one can
ascertain which of the original variables tend to
“load” on a given new variable. This may facilitate
interpretations, creation of subscales, etc.
Q&A
Contact us
Visit us on: http://www.analytixlabs.in/
Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/