
Factor Analysis and Segmentation

Disclaimer: This material is protected under copyright, AnalytixLabs © 2011-2016. Unauthorized use and/or duplication of this material or any part of this material, including data, in any form without explicit and written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal action.
Introduction to Segmentation

Segmentation
Each individual is so different that ideally we would want to reach out to each one of them in a different way.

Problem: The volume is too large for customization at the individual level.

Solution: Identify segments where people have similar characteristics and target each of these segments in a different way.

Segmentation is for better targeting.

Cluster Analysis – Example

Business Example
Consider a portfolio of 1,000 customers with credit accounts. The business wants to apply different strategies to different groups of people. How can the company group them into similar groups?

In this case we need some profiling, as below:

Total Population (1,000), profiled into four groups:
  Group 1: Avg. delinquency age = 15 days, Avg. age = 33 yrs., Avg. utilization = 60%
  Group 2: Avg. delinquency age = 12 days, Avg. age = 25 yrs., Avg. utilization = 90%
  Group 3: Avg. delinquency age = 0 days,  Avg. age = 35 yrs., Avg. utilization > 80%
  Group 4: Avg. delinquency age = 75 days, Avg. age = 50 yrs., Avg. utilization = 40%

We can exclude the group with avg. delinquency age = 75 days from mailing.

This type of segmentation is known as ‘Subjective Segmentation’. It gives the salient characteristics of the best customers.
Applications of Segmentation
Customer Segmentation
Customer Segmentation:
• Customer segmentation is the process of splitting your customer database into smaller groups. By
focusing on specific customer types, you will maximize customer lifetime value and better understand
who they are and what they need.

Typically customers differ in terms of:


• Products they are interested in
• Marketing channels they interact with (e.g. offline media like TV and press, social networks etc.)
• The maximum amount they can pay for a product (willingness to pay)
• Types of promotions and benefits they expect (discounts, free shipping)
• Buying patterns and frequency.
Key variables to use for customer segmentation
Geographical location – knowing where customers live can give you a good idea on their income and lifestyle (you can
also incorporate databases like Experian Mosaic)

Age and gender – younger customers are often more impulsive and frequent buyers while female customers might
have a higher long-term value

Acquisition channel – e.g. customers from social media are often less valuable than customers navigating to your site directly

First product purchased – pay close attention to the transaction value and product category to differentiate between
price-focused and quality-focused customers

Device types – e.g. customers using a mobile device typically spend less than customers on a desktop PC

Recency, Frequency and Monetary value of customer transactions – together these three form a complete segmentation strategy (RFM) in their own right

etc…
Applications of customer segmentation
Customer segmentation can help other parts of your business. It will allow you to:

 Improve customer retention by providing products tailored for specific segments

 Increase profits by leveraging disposable incomes and willingness to spend

 Grow your business quicker by focusing marketing campaigns on segments with higher propensity to buy

 Improve customer lifetime value by identifying purchasing patterns and targeting customers when they are in the market

 Retain customers by appearing as relevant and responsive

 Identify new product opportunities and improve the products you already have

 Optimize operations by focusing on geographies, age groups etc. with the most value

 Increase sales by offering free shipping to high frequency buyers

 Offer improved customer support to VIP customers

 Gain brand evangelists by incentivising them to comment, review or talk about your product with free gifts or discounts

 Reactivate customers who have churned and no longer interact with you
Types of customer Segmentation

 Value Based Segmentation: Customer ranking and segmentation according to current and
expected/estimated customer value

 Life Stage Segmentation: Segmentation according to the current life stage to which he/she belongs

 Loyalty Segmentation: Segmentation according to current and previous value

 Behavioral Segmentation: Customer segmentation based on behavioral attributes


There are 3 approaches to behavioral segmentation
1. Rule-based (hypothesis driven)
   Description: Segment customers manually, based on 1 to 3 factors, to drive a specific business objective.
   When to do: Only a couple of factors are thought to drive the segments; there is a known hypothesis to cut the data to create segments.
   Suggested technique: Cross-tabs and conditional data cuts.
   Client example: A cable client segmented prospects on their potential telecom spend and used the segmentation to align sales resources and offers to improve go-to-market strategy.

2. Supervised (with a dependent variable)
   Description: Segment customers using a predictive algorithm, based on a high number of factors that potentially drive a specific outcome.
   When to do: Data-driven segments desired, but first and foremost segments need to be differentiated on a specific outcome/metric (e.g. revenue).
   Suggested technique: CHAID.
   Client example: A telecom client segmented customers on various factors that drive churn propensity and targeted high-churn segments with retention campaigns and offers.

3. Unsupervised (without a dependent variable)
   Description: Segment customers using a clustering algorithm, based on a high number of factors.
   When to do: Data-driven segments desired; segments need to be differentiated across many behavioral factors.
   Suggested technique: TwoStep, K-Means.
   Client example: A retail client segmented customers on behavioral shopping factors that included category spend, shopping frequency/tendency, and store/channel shopped, to inform merchandising and offer strategy.
RFM Segmentation

RFM Segmentation – Steps
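The step slides are image-only in the source deck, so here is a hedged pandas sketch of the usual RFM recipe: compute Recency, Frequency and Monetary value per customer, score each on quintiles, and combine the scores into a segment label. The column names, the synthetic data and the quintile scheme are my assumptions, not taken from the deck.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 500
    # Hypothetical transaction data: one row per transaction
    tx = pd.DataFrame({
        "customer_id": rng.integers(1, 101, size=n),   # 100 customers
        "date": pd.Timestamp("2016-01-01") + pd.to_timedelta(rng.integers(0, 180, size=n), unit="D"),
        "amount": rng.gamma(2.0, 50.0, size=n).round(2),
    })

    snapshot = tx["date"].max() + pd.Timedelta(days=1)

    # Step 1: raw Recency, Frequency, Monetary value per customer
    rfm = tx.groupby("customer_id").agg(
        recency=("date", lambda d: (snapshot - d.max()).days),
        frequency=("date", "count"),
        monetary=("amount", "sum"),
    )

    # Step 2: quintile scores; 5 = best (most recent, most frequent, highest spend)
    rfm["R"] = pd.qcut(rfm["recency"].rank(method="first"), 5, labels=[5, 4, 3, 2, 1]).astype(int)
    rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
    rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)

    # Step 3: combine into a segment label, e.g. "555" = best customers
    rfm["RFM_segment"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
    print(rfm.sort_values("RFM_segment", ascending=False).head())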
Behavioral Segmentation - Clustering Techniques
• K-means
  • Iteratively re-assign points to the nearest cluster center
• Agglomerative clustering (hierarchical)
  • Start with each point as its own cluster and iteratively merge the closest clusters
• Mean-shift clustering
  • Estimate the modes of the probability density function (pdf)
• Spectral clustering
  • Split the nodes in a graph based on assigned links with similarity weights

As we go down this list, the clustering strategies have a greater tendency to transitively group points even if they are not nearby in feature space.
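To make the techniques concrete, here is a small sketch (my own illustration; scikit-learn is an assumed library choice, the deck itself does not prescribe one) that runs K-means, agglomerative and mean-shift clustering on the same synthetic behavioral data.

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans, AgglomerativeClustering, MeanShift

    # Synthetic "behavioral" data: 300 customers, 4 underlying segments, 5 variables
    X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=42)
    X = StandardScaler().fit_transform(X)   # standardize so no variable dominates

    models = {
        "k-means": KMeans(n_clusters=4, n_init=10, random_state=42),
        "agglomerative": AgglomerativeClustering(n_clusters=4),
        "mean-shift": MeanShift(),           # estimates the number of clusters itself
    }

    for name, model in models.items():
        labels = model.fit_predict(X)
        sizes = np.bincount(labels)
        print(f"{name:15s} clusters found: {len(sizes)}  sizes: {sizes}")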
Behavioral Segmentation: Hierarchical Vs. Non-hierarchical
Behavioral Segmentation: Subjective Segmentation-Cluster Analysis
Highest value segment

                                  Big Ticket   Frequent   Small Ticket   Infrequent   Returner   Overall
% Customers                             9.8        4.2           13.5         69.5        6.6     100.0
% Revenue                              27.4       33.6           15.4         13.5       10.1     100.0
Revenue per customer ($)              1,038      8,618        1,209.1          220    1,613.5   1,077.2
Visits per customer                     3.1       34.2           16.1          2.1        8.3       4.8
Basket size ($)                       970.1      252.7           75.1        105.2      165.1     224.8
Average departments shopped             3.6        5.5            1.9          1.2        2.9       1.9
Stores shopped                          1.1        3.0            1.8          1.1        1.2       1.7
Returning propensity (%)                0.3        6.5            5.5          0.3       25.5       3.2
Shopped in December (%)                15.1       70.8           53.3         19.4       23.3      26.6
Shopped on Memorial Day (%)             1.6       17.9            2.4          0.9        2.1       2.2
Shopped on Labor Day (%)                1.0       14.1            1.8          0.6        1.5       1.7
Shopped on President's Day (%)          0.7       12.0            1.8          0.6        1.8       1.5
Average Discount Rate (%)              14.8       11.4            6.6          4.5       10.6      11.2
Customer lifetime (months)             25.2       46.2           42.2         28.4       27.2      30.8

Note that key profile variables are not always the same as the basis variables used to generate the segmentation.
Subjective Segmentation: Cluster Analysis Process

Process flow for cluster analysis:
  Step 1: Selection of variables
  Step 2: Data cleaning and preparing the data set for analysis
  Step 3: Creating new relevant variables
  Step 4: Tackling the outliers
  Step 5: Treatment of missing values
  Step 6: Multicollinearity check
  Step 7: Standardization
  Step 8: Getting the cluster solution
  Step 9: Checking the optimality of the solution
Subjective Segmentation: K-Means Clustering Algorithm

K-Means clustering proceeds as follows:
  1. Start with the overall population.
  2. Fix the number of clusters.
  3. Calculate the distance of each case from all cluster centers.
  4. Assign each case to the nearest cluster.
  5. Recalculate the cluster centers.
  6. Reassign cases after changing the cluster centers.
  7. Continue until there is no significant change between two iterations.
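A minimal NumPy sketch of these steps (my own illustration, not the deck's code); the synthetic 2-D data and the choice of k = 3 are assumptions made for the example.

    import numpy as np

    def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # Step 2: fix the number of clusters and pick k random cases as initial centers
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 3: distance of each case from all cluster centers
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            # Step 4: assign each case to the nearest cluster
            labels = dists.argmin(axis=1)
            # Step 5: recalculate the cluster centers (keep the old center if a cluster empties)
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            # Step 7: stop when the centers no longer change significantly
            if np.linalg.norm(new_centers - centers) < tol:
                break
            centers = new_centers
        return labels, centers

    X = np.vstack([np.random.default_rng(1).normal(m, 1.0, size=(50, 2)) for m in (0, 5, 10)])
    labels, centers = kmeans(X, k=3)
    print(centers.round(2))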
Calculating the distance

Which two of the customers are similar?

            Weight
  Cust1         68
  Cust2         72
  Cust3        100

Which two of the customers are similar now?

            Weight      Age
  Cust1         68       25
  Cust2         72       70
  Cust3        100       28

Which two of the customers are similar in this case?

            Weight      Age     Income
  Cust1         68       25     60,000
  Cust2         72       70      9,000
  Cust3        100       28     62,000
Distance Measures
• Euclidean distance
• City-block (Manhattan) distance
• Chebyshev distance
• Minkowski distance
• Mahalanobis distance
• Maximum distance
• Cosine similarity
• Simple correlation between observations
• Minimum distance
• Weighted distance

Note: these measures will not necessarily produce the same clusters in the example above, as the sketch below illustrates.
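A small sketch (my own illustration) computing two of these measures for the three customers above; note how the unscaled income column dominates the Euclidean distance until the variables are standardized.

    import numpy as np
    from scipy.spatial.distance import cdist

    # Weight, Age, Income for Cust1..Cust3 (from the example above)
    X = np.array([
        [68,  25, 60000.0],
        [72,  70,  9000.0],
        [100, 28, 62000.0],
    ])

    print(cdist(X, X, metric="euclidean").round(1))   # income dominates: Cust1 looks closest to Cust3
    print(cdist(X, X, metric="cityblock").round(1))   # Manhattan distance

    # After standardizing each column, weight and age matter again
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    print(cdist(Z, Z, metric="euclidean").round(2))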
Density-Based Clustering

• Basic idea
  – Clusters are dense regions in the data space, separated by regions of lower object density
  – A cluster is defined as a maximal set of density-connected points
  – Discovers clusters of arbitrary shape
• Method
  – DBSCAN

Density Definition

• ε-Neighborhood – objects within a radius of ε from an object:
      N_ε(p) := { q | d(p, q) ≤ ε }
• "High density" – the ε-neighborhood of an object contains at least MinPts objects.

(Figure: the ε-neighborhood of p and the ε-neighborhood of q; the density of p is "high" (MinPts = 4), the density of q is "low" (MinPts = 4).)
Core, Border & Outlier

Given ε and MinPts, categorize the objects into three exclusive groups:

• A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points at the interior of a cluster.
• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
• A noise point (outlier) is any point that is neither a core point nor a border point.

(Figure: ε = 1 unit, MinPts = 5.)
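A compact sketch (my own, not from the deck) that labels points as core, border or noise directly from these definitions.

    import numpy as np
    from scipy.spatial.distance import cdist

    def label_points(X, eps, min_pts):
        """Label each point as 'core', 'border' or 'noise' following the definitions above."""
        D = cdist(X, X)
        # ε-neighborhood sizes; each point counts itself, as in the usual convention
        neighbor_counts = (D <= eps).sum(axis=1)
        is_core = neighbor_counts >= min_pts          # "high density": at least MinPts neighbours
        labels = np.where(is_core, "core", "noise").astype(object)
        # A border point is a non-core point inside some core point's ε-neighborhood
        for i in np.where(~is_core)[0]:
            if np.any(is_core & (D[i] <= eps)):
                labels[i] = "border"
        return labels

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.uniform(-3, 3, (10, 2))])
    print(dict(zip(*np.unique(label_points(X, eps=0.5, min_pts=5), return_counts=True))))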
Example

(Figure: original points and the resulting point types – core, border and outliers – with ε = 10, MinPts = 4.)
Density-Reachability

• Directly density-reachable
  – An object q is directly density-reachable from object p if p is a core object and q is in p's ε-neighborhood.
  – In the figure (MinPts = 4): q is directly density-reachable from p; p is not directly density-reachable from q. Density-reachability is asymmetric.

• Density-reachable (directly and indirectly)
  – A point p is directly density-reachable from p2, p2 is directly density-reachable from p1, and p1 is directly density-reachable from q, so p ← p2 ← p1 ← q form a chain.
  – In the figure (MinPts = 7): p is (indirectly) density-reachable from q, but q is not density-reachable from p.
DBSCAN Algorithm: Example

• Parameters
  • ε = 2 cm
  • MinPts = 3

for each o ∈ D do
    if o is not yet classified then
        if o is a core object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
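For a runnable equivalent, here is a hedged sketch using scikit-learn's DBSCAN (my choice of library; the deck only gives the pseudocode above). In scikit-learn, noise points receive the label -1.

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.cluster import DBSCAN

    # Two crescent-shaped clusters plus noise: a case where K-means struggles
    X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

    db = DBSCAN(eps=0.2, min_samples=5).fit(X)
    labels = db.labels_

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"clusters found: {n_clusters}, noise points: {n_noise}")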
DBSCAN: Sensitive to Parameters
DBSCAN: Determining EPS and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor and look for the "elbow"
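A brief sketch of that k-distance heuristic (my own implementation, not from the deck): sort each point's distance to its k-th nearest neighbor and inspect the curve for an elbow as a candidate ε.

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.neighbors import NearestNeighbors

    X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

    k = 5  # use k = MinPts
    # n_neighbors=k+1 because the first neighbor returned is the point itself (distance 0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    k_dist = np.sort(dists[:, -1])            # sorted distance to the k-th neighbor

    # The elbow of this sorted curve is a reasonable eps; print a few quantiles as a proxy
    print(np.quantile(k_dist, [0.5, 0.9, 0.95, 0.99]).round(3))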
When DBSCAN Works Well

(Figure: original points and the clusters found.)

• Resistant to noise
• Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well

(Figures: original points and DBSCAN results with (MinPts = 4, Eps = 9.92) and (MinPts = 4, Eps = 9.75).)

• Cannot handle varying densities
• Sensitive to parameters: hard to determine the correct set of parameters
Take-away Message

• The basic idea of density-based clustering


• The two important parameters and the definitions of neighborhood
and density in DBSCAN
• Core, border and outlier points
• DBSCAN algorithm
• DBSCAN’s pros and cons
Objective Segmentation
Classification: Definition

• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Illustrating Classification Task

Training Set:
  Tid   Attrib1   Attrib2   Attrib3   Class
   1    Yes       Large     125K      No
   2    No        Medium    100K      No
   3    No        Small      70K      No
   4    Yes       Medium    120K      No
   5    No        Large      95K      Yes
   6    No        Medium     60K      No
   7    Yes       Large     220K      No
   8    No        Small      85K      Yes
   9    No        Medium     75K      No
  10    No        Small      90K      Yes

The training set is fed to a learning algorithm, which induces a model (Induction / Learn Model). The model is then applied to previously unseen records (Deduction / Apply Model).

Test Set:
  Tid   Attrib1   Attrib2   Attrib3   Class
  11    No        Small      55K      ?
  12    Yes       Medium     80K      ?
  13    Yes       Large     110K      ?
  14    No        Small      95K      ?
  15    No        Large      67K      ?
Examples of Classification Task

• Predicting tumor cells as benign or malignant

• Classifying credit card transactions as legitimate or fraudulent

• Classifying emails as spam or normal emails

• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques

• Decision Tree
• Naïve Bayes
• Nearest Neighbor
• Rule-based Classification
• Logistic Regression
• Support Vector Machines
• Ensemble methods
• ……
Example of a Decision Tree

Training Data:
  Tid   Refund   Marital Status   Taxable Income   Cheat
   1    Yes      Single           125K             No
   2    No       Married          100K             No
   3    No       Single            70K             No
   4    Yes      Married          120K             No
   5    No       Divorced          95K             Yes
   6    No       Married           60K             No
   7    Yes      Divorced         220K             No
   8    No       Single            85K             Yes
   9    No       Married           75K             No
  10    No       Single            90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
  Refund?
    Yes -> NO
    No  -> MarSt?
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES
             Married          -> NO

Another Example of Decision Tree

The same training data also fits this tree:
  MarSt?
    Married          -> NO
    Single, Divorced -> Refund?
                          Yes -> NO
                          No  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

There could be more than one tree that fits the same data!
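As a hedged illustration (not the deck's own code), the same training data can be fed to scikit-learn's DecisionTreeClassifier; the categorical attributes are one-hot encoded first.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # The 10-record training data from the slide
    df = pd.DataFrame({
        "Refund": ["Yes","No","No","Yes","No","No","Yes","No","No","No"],
        "MarSt":  ["Single","Married","Single","Married","Divorced","Married",
                   "Divorced","Single","Married","Single"],
        "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
        "Cheat":  ["No","No","No","No","Yes","No","No","Yes","No","Yes"],
    })

    X = pd.get_dummies(df[["Refund", "MarSt"]]).join(df["TaxInc"])
    y = df["Cheat"]

    tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))

    # Classify a previously unseen record: Refund=No, MarSt=Married, TaxInc=80K
    # (the slide's later walk-through assigns Cheat = "No" to this record)
    new = pd.DataFrame([{c: 0 for c in X.columns} | {"Refund_No": 1, "MarSt_Married": 1, "TaxInc": 80}])
    print(tree.predict(new))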
Decision Tree Classification Task

The flow is the same as in "Illustrating Classification Task" above: the training set (Tid 1 to 10) is fed to a tree induction algorithm, which learns a decision tree model (Induction / Learn Model); the decision tree is then applied to the test set (Tid 11 to 15) to deduce their class labels (Deduction / Apply Model).
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the test conditions:
  1. Refund? The record has Refund = No, so take the "No" branch to MarSt.
  2. MarSt? The record has Marital Status = Married, so take the "Married" branch.
  3. That branch is a leaf labelled NO, so assign Cheat = "No".
Decision Tree Classification Task (recap)

As above: tree induction on the training set learns a decision tree, which is then applied to the test set to deduce class labels.
Decision Tree Induction

• Many Algorithms:
– Hunt’s Algorithm
– CART
– ID3, C4.5
– SLIQ, SPRINT
– ……
General Structure of Hunt's Algorithm

• Let Dt be the set of training records that reach a node t
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset

(The running example is the 10-record Refund / Marital Status / Taxable Income / Cheat training data shown earlier.)
Hunt's Algorithm (applied to the Refund / Marital Status / Taxable Income / Cheat data)

Step 1: Split on Refund.
  Refund?
    Yes -> Don't Cheat
    No  -> Don't Cheat (this subset is still mixed, so keep splitting)

Step 2: Refine the Refund = No branch using Marital Status.
  Refund?
    Yes -> Don't Cheat
    No  -> Marital Status?
             Single, Divorced -> Cheat (still mixed, so keep splitting)
             Married          -> Don't Cheat

Step 3: Refine the Single/Divorced branch using Taxable Income.
  Refund?
    Yes -> Don't Cheat
    No  -> Marital Status?
             Single, Divorced -> Taxable Income?
                                   < 80K  -> Don't Cheat
                                   >= 80K -> Cheat
             Married          -> Don't Cheat
Tree Induction

• Greedy strategy
– Split the records based on an attribute test that optimizes a certain criterion

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?

• Depends on attribute types


– Nominal
– Ordinal
– Continuous

• Depends on number of ways to split


– 2-way split
– Multi-way split
Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as distinct values.
    CarType -> {Family}, {Sports}, {Luxury}

• Binary split: divides values into two subsets; need to find the optimal partitioning.
    CarType -> {Sports, Luxury} vs {Family}   OR   CarType -> {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes

• Multi-way split: use as many partitions as distinct values.
    Size -> {Small}, {Medium}, {Large}

• Binary split: divides values into two subsets; need to find the optimal partitioning.
    Size -> {Small, Medium} vs {Large}   OR   Size -> {Small} vs {Medium, Large}

• What about this split?  Size -> {Small, Large} vs {Medium}
  (It groups non-adjacent values and so violates the order of the attribute.)
Splitting Based on Continuous Attributes

• Different ways of handling
  – Discretization to form an ordinal categorical attribute
  – Binary decision: (A < v) or (A >= v)
    • consider all possible splits and find the best cut
    • can be more computation intensive

  (i) Binary split:     Taxable Income > 80K?  ->  Yes / No
  (ii) Multi-way split: Taxable Income?  ->  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
Tree Induction

• Greedy strategy
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to determine the Best Split

Before splitting: 10 records of class C0 and 10 records of class C1. Three candidate test conditions:

  On-Campus?    Yes: C0=6, C1=4       No: C0=4, C1=6
  Car Type?     Family: C0=1, C1=3    Sports: C0=8, C1=0    Luxury: C0=1, C1=7
  Student ID?   c1 … c10: C0=1, C1=0 each       c11 … c20: C0=0, C1=1 each

Which test condition is the best?
How to determine the Best Split

• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:

    C0: 5, C1: 5  ->  non-homogeneous, high degree of impurity
    C0: 9, C1: 1  ->  homogeneous, low degree of impurity
How to Find the Best Split

Before splitting, the parent node has class counts C0: N00, C1: N01 and impurity M0.

  Candidate attribute A splits the records into nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21),
  with impurities M1 and M2 and weighted average M12.
  Candidate attribute B splits the records into nodes N3 and N4, with impurities M3 and M4 and weighted average M34.

  Gain = M0 - M12 for A versus M0 - M34 for B: choose the split with the larger gain.
Measures of Node Impurity

• Gini Index

• Entropy

• Misclassification error
Measure of Impurity: GINI

• Gini index for a given node t:

      GINI(t) = 1 - Σ_j [p(j | t)]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

  – Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
  – Minimum (0) when all records belong to one class, implying most useful information

    C1: 0        C1: 1        C1: 2        C1: 3
    C2: 6        C2: 5        C2: 4        C2: 3
    Gini=0.000   Gini=0.278   Gini=0.444   Gini=0.500
Examples for computing GINI

      GINI(t) = 1 - Σ_j [p(j | t)]²

  C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0

  C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                 Gini = 1 - (1/6)² - (5/6)² = 0.278

  C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                 Gini = 1 - (2/6)² - (4/6)² = 0.444
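A short sketch (my own) that reproduces these numbers.

    def gini(counts):
        """Gini index of a node given its class counts."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
        print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5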
Splitting Based on GINI

• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as:

      GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index

• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought

  Parent: C1 = 6, C2 = 6, Gini = 0.500

  Split on B:   N1: C1 = 5, C2 = 2      N2: C1 = 1, C2 = 4

  Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
  Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320

  Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
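The same kind of helper (my own sketch, repeated here so the snippet is self-contained) verifies the weighted split.

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    n1, n2 = [5, 2], [1, 4]
    n = sum(n1) + sum(n2)
    gini_children = sum(n1) / n * gini(n1) + sum(n2) / n * gini(n2)
    print(round(gini([6, 6]), 3), round(gini_children, 3))   # 0.5 and 0.371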
Entropy

• Entropy at a given node t:

      Entropy(t) = - Σ_j p(j | t) log₂ p(j | t)

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

  – Measures the purity of a node
    • Maximum (log₂ nc) when records are equally distributed among all classes, implying least information
    • Minimum (0.0) when all records belong to one class, implying most information
Examples for computing Entropy

      Entropy(t) = - Σ_j p(j | t) log₂ p(j | t)

  C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Entropy = - 0 log₂ 0 - 1 log₂ 1 = - 0 - 0 = 0

  C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                 Entropy = - (1/6) log₂ (1/6) - (5/6) log₂ (5/6) = 0.65

  C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                 Entropy = - (2/6) log₂ (2/6) - (4/6) log₂ (4/6) = 0.92
Splitting Based on Information Gain

• Information Gain:

      GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) · Entropy(i)

  Parent node p is split into k partitions; n_i is the number of records in partition i.

  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
  – Used in ID3 and C4.5
Splitting Criteria based on Classification Error

• Classification error at a node t:

      Error(t) = 1 - max_i P(i | t)

• Measures the misclassification error made by a node.
  • Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
  • Minimum (0.0) when all records belong to one class, implying most interesting information
Examples for Computing Error

      Error(t) = 1 - max_i P(i | t)

  C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Error = 1 - max(0, 1) = 1 - 1 = 0

  C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                 Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

  C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                 Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
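A small sketch (my own) reproducing the entropy and classification-error examples above.

    import math

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    def classification_error(counts):
        n = sum(counts)
        return 1.0 - max(counts) / n

    for counts in ([0, 6], [1, 5], [2, 4]):
        print(counts, round(entropy(counts), 2), round(classification_error(counts), 3))
        # entropy: 0.0, 0.65, 0.92   error: 0.0, 0.167, 0.333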
Comparison among Splitting Criteria
For a 2-class problem:
Tree Induction

• Greedy strategy
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the


same class

• Stop expanding a node when all the records


have similar attribute values

• Early termination (to be discussed later)


Decision Tree Based Classification

• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for
many simple data sets
Underfitting and Overfitting (Example)

500 circular and 500 triangular data points.

  Circular points:   0.5 ≤ sqrt(x1² + x2²) ≤ 1
  Triangular points: sqrt(x1² + x2²) > 1  or  sqrt(x1² + x2²) < 0.5
Underfitting and Overfitting

Overfitting
Occam’s Razor

• Given two models of similar errors, one should prefer


the simpler model over the more complex model

• For complex models, there is a greater chance


that it was fitted accidentally by errors in data

• Therefore, one should include model complexity


when evaluating a model
How to Address Overfitting

• Pre-Pruning (Early Stopping Rule)


– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions:
• Stop if number of instances is less than some user-specified threshold
• Stop if the class distribution of instances is independent of the available features
• Stop if expanding the current node does not improve impurity measures (e.g., Gini or
information gain).
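In practice, libraries expose pre-pruning as hyperparameters. A hedged scikit-learn sketch (my example; the parameter values are illustrative, not from the deck):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # Pre-pruning: stop growing early via depth / leaf-size / impurity-gain thresholds
    pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                                    min_impurity_decrease=0.001, random_state=0).fit(X_tr, y_tr)
    full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

    print("full tree   train/test accuracy:",
          round(full.score(X_tr, y_tr), 3), round(full.score(X_te, y_te), 3))
    print("pre-pruned  train/test accuracy:",
          round(pruned.score(X_tr, y_tr), 3), round(pruned.score(X_te, y_te), 3))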
How to Address Overfitting

• Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If generalization error improves after trimming,
• replace sub-tree by a leaf node.
– Class label of leaf node is determined from majority class of
instances in the sub-tree
Handling Missing Attribute Values

• Missing values affect decision tree construction in


three different ways:
– Affects how impurity measures are computed
– Affects how to distribute instance with missing value to child
nodes
– Affects how a test instance with missing value is
classified

Computing Impurity Measure

Training data: the Refund / Marital Status / Taxable Income / Class table, but record 10 has a missing Refund value (Refund = ?).

Before splitting:
  Entropy(Parent) = -0.3 log₂(0.3) - 0.7 log₂(0.7) = 0.8813

Counts by Refund value:
                 Class = Yes   Class = No
  Refund = Yes        0             3
  Refund = No         2             4
  Refund = ?          1             0

Split on Refund:
  Entropy(Refund = Yes) = 0
  Entropy(Refund = No)  = -(2/6) log₂(2/6) - (4/6) log₂(4/6) = 0.9183
  Entropy(Children)     = 0.3 × 0 + 0.6 × 0.9183 = 0.551

  Gain = 0.9 × (0.8813 - 0.551) = 0.297
  (The factor 0.9 is the fraction of records with a known Refund value.)
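A quick check of these numbers (my own sketch):

    import math

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    parent = entropy([3, 7])                                     # 0.8813
    children = 0.3 * entropy([0, 3]) + 0.6 * entropy([2, 4])     # 0.551
    gain = 0.9 * (parent - children)                             # 0.9 = fraction with known Refund
    print(round(parent, 4), round(children, 4), round(gain, 4))  # 0.8813 0.551 0.2973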
Distribute Instances

Record 10 (Refund = ?, Single, 90K, Class = Yes) cannot be sent down a single branch of the Refund split, so it is distributed to both children in proportion to the known records:

  Refund = Yes:  Class = Yes: 0 + 3/9,   Class = No: 3
  Refund = No:   Class = Yes: 2 + 6/9,   Class = No: 4

The probability that Refund = Yes is 3/9, and the probability that Refund = No is 6/9.
Assign record 10 to the left child (Refund = Yes) with weight 3/9 and to the right child (Refund = No) with weight 6/9.
Classify Instances

New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?

Weighted class counts accumulated at the MarSt node:

                Married   Single   Divorced   Total
  Class = No       3         1        0         4
  Class = Yes     6/9        1        1        2.67
  Total           3.67       2        1        6.67

Probability that Marital Status = Married is 3.67 / 6.67.
Probability that Marital Status = {Single, Divorced} is 3 / 6.67.

The record is routed down both MarSt branches of the tree (Refund -> MarSt -> TaxInc) with these weights.
Other Issues

• Data Fragmentation
• Search Strategy
• Expressiveness
• Tree Replication
Data Fragmentation

• Number of instances gets smaller as you traverse down


the tree

• Number of instances at the leaf nodes could be too small to


make any statistically significant decision
Search Strategy

• Finding an optimal decision tree is NP-hard

• The algorithm presented so far uses a greedy, top-down,


recursive partitioning strategy to induce a reasonable solution

• Other strategies?
– Bottom-up
– Bi-directional
Expressiveness

• Decision trees provide an expressive representation for learning discrete-valued functions
– But they do not generalize well to certain types of Boolean functions
• Example: parity function:
– Class = 1 if there is an even number of Boolean attributes with truth value = True
– Class = 0 if there is an odd number of Boolean attributes with truth
value = True
• For accurate modeling, must have a complete tree

• Not expressive enough for modeling continuous variables


– Particularly when test condition involves only a single attribute at-a-time
Decision Boundary

(Figure: 2-D points in the unit square, partitioned by the tree below; shading shows the class regions.)

  x < 0.43?
    Yes -> y < 0.47?
             Yes -> leaf with class counts 4 : 0
             No  -> leaf with class counts 0 : 4
    No  -> y < 0.33?
             Yes -> leaf with class counts 0 : 3
             No  -> leaf with class counts 4 : 0

• The border line between two neighboring regions of different classes is known as the decision boundary.
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
Oblique Decision Trees

(Figure: a single oblique split, x + y < 1, separating Class = + from the other class.)

• Test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive
Metrics for Performance Evaluation

• Focus on the predictive capability of a model
  – Rather than how fast it is to classify or build models, scalability, etc.
• Confusion Matrix:

                          PREDICTED CLASS
                         Class=Yes   Class=No
  ACTUAL    Class=Yes     a (TP)      b (FN)
  CLASS     Class=No      c (FP)      d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation

                          PREDICTED CLASS
                         Class=Yes   Class=No
  ACTUAL    Class=Yes     a (TP)      b (FN)
  CLASS     Class=No      c (FP)      d (TN)

• Most widely-used metric:

      Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy

• Consider a 2-class problem


– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

• If model predicts everything to be class 0, accuracy is


9990/10000 = 99.9 %
– Accuracy is misleading because model does not detect any class
1 example
Cost-Sensitive Measures

      Precision (p) = a / (a + c)
      Recall (r)    = a / (a + b)
      F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
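A tiny sketch (my own) computing these from confusion-matrix counts.

    def prf(a, b, c, d):
        """a=TP, b=FN, c=FP, d=TN, following the slide's notation."""
        accuracy = (a + d) / (a + b + c + d)
        precision = a / (a + c)
        recall = a / (a + b)
        f_measure = 2 * recall * precision / (recall + precision)
        return accuracy, precision, recall, f_measure

    # Example counts (made up): TP=40, FN=10, FP=5, TN=945
    print([round(x, 3) for x in prf(40, 10, 5, 945)])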
Methods of Estimation
• Holdout
– Reserve 2/3 for training and 1/3 for testing
• Random subsampling
– Repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
• Stratified sampling
– oversampling vs undersampling
• Bootstrap
– Sampling with replacement
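A hedged scikit-learn sketch (my example) of two of these estimation methods: a holdout split and k-fold cross-validation.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, n_features=8, random_state=0)
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)

    # Holdout: reserve 1/3 for testing
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
    print("holdout accuracy:", round(clf.fit(X_tr, y_tr).score(X_te, y_te), 3))

    # k-fold cross-validation: train on k-1 partitions, test on the remaining one
    scores = cross_val_score(clf, X, y, cv=5)
    print("5-fold accuracies:", scores.round(3), "mean:", round(scores.mean(), 3))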
Take-away Message

• What is classification?
• How to use a decision tree to make predictions?
• How to construct a decision tree from training data?
• How to compute Gini index, entropy, and misclassification error?
• How to avoid overfitting by pre-pruning or post-pruning a decision tree?
• How to evaluate a classification model?
Objective Segmentation: Decision Trees

Root: N in node: 50,000; Average: 0.4

Split on: Average unit price of wine purchase in Sep to Nov 2012
  <= 980       -> N in node: 5,658;  Average: 0.7
  > 980        -> N in node: 1,206;  Average: 5.8
  No purchase  -> N in node: 42,229; Average: 0.2

<= 980 branch, split on Age:
  <= 46 -> N in node: 2,283;  Average: 0.0
  > 46  -> N in node: 3,375;  Average: 1.2
No-purchase branch, split on Age:
  <= 46 -> N in node: 17,619; Average: 0.0
  > 46  -> N in node: 24,610; Average: 0.4

(<= 980, Age > 46), split on: Total number of items bought in the sub-product level wine in Aug to Sep 2012:
  <= 4 -> N in node: 916;   Average: 2.4
  > 4  -> N in node: 2,459; Average: 0.7
(No purchase, Age > 46), split on: Total number of wine transactions in Aug to Sep 2012:
  1, 2, 3        -> N in node: 1,187;  Average: 1.6
  4, 5, 6, 7, 11 -> N in node: 23,425; Average: 0.3

(No purchase, Age > 46, transactions 4,5,6,7,11), split on: Average unit price of liquor purchase in Sep to Nov 2012:
  <= 457.778       -> N in node: 18,874; Average: 0.3
  (457.778, 1550]  -> N in node: 3,585;  Average: 0.1
  > 1550           -> N in node: 964;    Average: 1.0
Decision Tree Example – Business Rules
Business rule statistics and description

Rule1 – Propensity to buy: 5.80; % Customers: 2.5
  Average unit price of wine purchase in Sep to Nov 2012 = >980
Rule2 – Propensity to buy: 2.40; % Customers: 1.9
  Average unit price of wine purchase in Sep to Nov 2012 = <=980; Age = >46; Total number of items bought in sub-product level wine in Aug to Sep 2012 = <=4
Rule3 – Propensity to buy: 1.60; % Customers: 2.4
  No wine purchase in Sep to Nov 2012; Age = >46; Total wine transactions in Sep to Nov 2012 = 1,2,3
Rule4 – Propensity to buy: 1.04; % Customers: 2.0
  No wine purchase in Sep to Nov 2012; Age = >46; Total wine transactions in Sep to Nov 2012 = 4,5,6,7,11; Average unit price of liquor purchase in Sep to Nov 2012 = >1,550
Rule5 – Propensity to buy: 0.69; % Customers: 5.0
  Average unit price of wine purchase in Sep to Nov 2012 = <=980; Age = >46; Total number of items bought in sub-product level wine in Aug to Sep 2012 = >4
Rule6 – Propensity to buy: 0.31; % Customers: 38.4
  No wine purchase in Sep to Nov 2012; Age = >46; Total wine transactions in Sep to Nov 2012 = 4,5,6,7,11; Average unit price of liquor purchase in Sep to Nov 2012 = <=457.778
Rule7 – Propensity to buy: 0.14; % Customers: 7.3
  No wine purchase in Sep to Nov 2012; Age = >46; Total wine transactions in Sep to Nov 2012 = 4,5,6,7,11; Average unit price of liquor purchase in Sep to Nov 2012 = (457.778,1550]
Rule8 – Propensity to buy: 0.00; % Customers: 35.9
  No wine purchase in Sep to Nov 2012; Age = <=46
Rule9 – Propensity to buy: 0.00; % Customers: 4.7
  Average unit price of wine purchase in Sep to Nov 2012 = <=980; Age = <=46
Decision Trees: CHAID Segmentation
CHAID Algorithm
Introduction to Factor Analysis - PCA

Look at the Cricket Team Players Data below:

  Player   Avg Runs   Total Wickets   Height   Not Outs   Highest Score   Best Bowling
    1         45            3          5.5        15           120             1
    2         50           34          5.2        34           209             2
    3         38            0          6          36           183             0
    4         46            9          6.1        78           160             3
    5         37           45          5.8        56            98             1
    6         32            0          5.10       89           183             0
    7         18          123          6           2            35             4
    8         19          239          6.1         3            56             5
    9         18           96          6.6         5            87             7
   10         16           83          5.9         7            32             7
   11         17          138          5.10        9            12             6
Describe the players

• If we have to describe or segregate the players, do we really need Avg Runs, Total Wickets, Height, Not Outs, Highest Score and Best Bowling separately?
• Can we simply take:
  • Avg Runs + Not Outs + Highest Score as one factor?
  • Total Wickets + Height + Best Bowling as a second factor?

Defining these imaginary variables – linear combinations of the original variables – to reduce the dimensions is called PCA or FA.
Purpose of PCA

• To find a linear combination of the original variables that has the largest variance
possible.

• Some restriction is needed on the entries in the linear combination, or the problem is not well defined.

• Usually require sums of the squares of weights to be 1.


Example: Helmet Data

(Figure: scatter plot of LTN versus LTG, with two candidate directions drawn through the cloud of points: 0.7071*LTN + 0.7071*LTG and 0.5034*LTN + 0.8641*LTG.)
What is happening?

• Trying to find a direction where the physical scatter of points is most


clearly “jutting out”
• This “diversity” may be just what you are looking for in your data
• Why would anyone want to find such directions?
Principal Components Regression

• Standard regression problem with response y and regressors X1, X2, …, Xp.

• X1, X2, …, Xp may be exactly collinear or nearly so.

• Least squares estimates of regression coefficients are not possible, or not


reliable in that case.

• Can use Principal Components to address the problem.


Intelligent Index Formation

• May have answers to p questions, say X1, X2, …, Xp.

• And you may want to summarize these p responses with one number (“index”)
that best captures the diversity in responses.

• E.g. it is common to add the responses, or average them, perhaps being sensitive to questions that are reverse coded.

• Already should be clear to you that a simple averaging may not be the best way
to summarize the original p questions.
Reduction of Dimension

• Often able to replace the original variables X1, X2, …, Xp with a few new variables,
say, U1, U2, …, Uk where k is much smaller than p.

• By plotting the first two or three pairs of these new variables you can often see
structure you wouldn’t otherwise be able to see (e.g. clustering).
Interpretation

• In rarer cases the new variables, U1, U2, …, Uk, are interpretable and point to
some new facet of the study.

• As you will see, however, one must be very careful with this use of Principal
Components since it is a prime opportunity to go astray and over interpret.

• This is often where PCA is confused with Factor Analysis.


How Does PCA Work?

• Look for weights a11, a12, …, a1p such that U1 = a11·X1 + a12·X2 + … + a1p·Xp has the largest variance, subject to the restriction that a11² + a12² + … + a1p² = 1.
• The numbers a11, a12, …, a1p are called different things in different books. In SAS they are arrayed in a column and called the first principal component “eigenvector”.
• If the Xi variables have had their individual means subtracted off, then the new variable U1 is called the first principal component, or in most texts, the first principal component score.
What’s Next?

• Look for weights a21, a22, …, a2p such that U2 = a21·X1 + a22·X2 + … + a2p·Xp has the next largest variance, subject to the restriction that a21² + a22² + … + a2p² = 1.

• The numbers a21, a22, …, a2p are called different things in different books. In SAS
they are arrayed in a column and called the second principal component
“eigenvector”.

• If the Xi variables have had their individual means subtracted off, then the new
variable U2 is called the second principal component, or in most texts, the
second principal component score.
What’s News?

• Any two arrays of weights will cross-multiply and sum to 0. Example: a11·a21 + a12·a22 + … + a1p·a2p = 0.
• Same as saying: any two of the new variables will be uncorrelated. Example:
corr(U1,U2)=0.
How Far Does This Go?

• Until the original data are described adequately.

• We will look at two or three criteria for how many of these scores to
construct. We’ll start with our common sense.

• Most of the time it is not as hard as it might sound. Basically, we will look
at “how much variance” in the original data is summarized by each new
component variable.
Two Basic Constructs

• Weights (used “a” to denote).


• Weights arrayed in columns and called “eigenvectors” on SAS output.
• Weights come from looking at all pairwise covariances associated with the
original p variables.
• Scores (used “u” to denote).
• Scores called “principal components” and are the new variables.
• Typically use Weights for interpretation and development of subscales.
• Typically use Scores for clustering and as a substitution for the original
data.
Geometry

(Figure: a data cloud with its mean marked; the principal component direction passes through the mean, each point is projected onto that direction, and the distance of the projection from the mean is essentially the score.)
Recall

  Component   Eigenvalue    Difference    Proportion   Cumulative
      1       4.19711750    3.52963341      0.8394       0.8394
      2       0.66748410    0.57285125      0.1335       0.9729
      3       0.09463284    0.05392125      0.0189       0.9918
      4       0.04071159    0.04065762      0.0081       1.0000
      5       0.00005397                    0.0000       1.0000
Scree Plots

(Figure: SCREE plot for the hospital data – eigenvalue versus number of components (1 to 5); the elbow after the second eigenvalue suggests two components.)
Loadings

Definition
Component loadings are the ordinary product-
moment correlation between each original variable
and each component score.
Interpretation
By looking at the component loadings one can ascertain which of the original variables tend to "load" on a given new variable. This may facilitate interpretations, creation of subscales, etc.
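A hedged sketch (my own, using scikit-learn; the eigenvalue output shown earlier is from SAS) that computes eigenvalues, weights (eigenvectors), scores and loadings for the cricket data from the earlier slide.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Cricket players data from the earlier slide (heights entered as listed)
    df = pd.DataFrame({
        "avg_runs":      [45, 50, 38, 46, 37, 32, 18, 19, 18, 16, 17],
        "total_wickets": [3, 34, 0, 9, 45, 0, 123, 239, 96, 83, 138],
        "height":        [5.5, 5.2, 6.0, 6.1, 5.8, 5.10, 6.0, 6.1, 6.6, 5.9, 5.10],
        "not_outs":      [15, 34, 36, 78, 56, 89, 2, 3, 5, 7, 9],
        "highest_score": [120, 209, 183, 160, 98, 183, 35, 56, 87, 32, 12],
        "best_bowling":  [1, 2, 0, 3, 1, 0, 4, 5, 7, 7, 6],
    })

    Z = StandardScaler().fit_transform(df)    # standardize so each variable has unit variance
    pca = PCA().fit(Z)

    print("eigenvalues:", pca.explained_variance_.round(3))
    print("proportion of variance:", pca.explained_variance_ratio_.round(3))

    weights = pca.components_                 # rows are the eigenvectors (the a_ij weights)
    scores = pca.transform(Z)                 # principal component scores U1, U2, ...

    # Loadings per the definition above: eigenvector scaled by sqrt(eigenvalue),
    # approximately the correlation of each original variable with each score
    loadings = weights.T * np.sqrt(pca.explained_variance_)
    print(pd.DataFrame(loadings[:, :2], index=df.columns, columns=["PC1", "PC2"]).round(2))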
Q&A
Contact us
Visit us on: http://www.analytixlabs.in/

For course registration, please visit: http://www.analytixlabs.co.in/course-registration/

For more information, please contact us: http://www.analytixlabs.co.in/contact-us/

Or email: [email protected]

Call us, we would love to speak with you: (+91) 88021-73069

Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/
