Segmentation Tutorial

Uploaded by

Roshan Velpula
Copyright
© © All Rights Reserved

Licensed to Roshan Velpula, ESSEC Business School ([email protected]). Do not copy or distribute.

TUTORIAL

Segmentation and Classification

Copyright © 2020-22 by DecisionPro Inc.

This document is primarily intended to be used in conjunction with the Enginius
software suite. To order copies or request permission to reproduce materials, go to
http://www.enginius.biz. No part of this publication may be reproduced, stored in a
retrieval system, used in a spreadsheet, or transmitted in any form or by any means
–electronic, mechanical, photocopying, recording or otherwise– without the
permission of DecisionPro, Inc. v220907


Overview
Segmentation and classification are analytic techniques that help firms compare and group customers who share
common characteristics (i.e., segmentation variables) into homogeneous segments and identify ways to target
particular segments of customers in a market on the basis of external variables (i.e., descriptor variables).

Segmentation refers to the process of classifying customers into homogeneous groups (segments), such that each
group of customers shares enough characteristics to make it viable for the firm to design specific offerings
or products for it. This application identifies customer segments using needs-based variables called basis variables.
Cluster analysis helps firms to:

✓ better understand their customers.

✓ identify different segments in a market.

✓ choose attractive customer segments to target with their marketing programs.

Getting Started
To apply segmentation and classification analysis, you can use your own data or use a template preformatted by the
Enginius software. Because the Segmentation model requires a specific data format, users with their own data should
review the preformatted template to become familiar with the appropriate structure. The next section explains how
to create an easy-to-use template to enter your own data.


The following section, “Creating a Template”, applies only if you need to
enter your own data for analysis. If you are using one of our supplied
cases, or the tutorial, you may want to move ahead to the “Entering
Your Data” portion of this tutorial.

Creating a Template
From the Enginius Dashboard, click the Templates dropdown and select Segmentation to open the dialog box to create
a Segmentation template.


The options for the Segmentation template are as follows:

▪ Segmentation data:

1. Number of segmentation variables: These variables serve as the basis for segmentation
and are often called basis variables. They might include customers' needs, wants,
expectations, or preferences.
2. Number of respondents: The number of customers or respondents in the data that
need to be clustered.

▪ Descriptor data (optional):

• Include descriptor data: Check the box if you have descriptor variables. Descriptor
variables, or descriptors, are variables that do not contribute to the segment definition
but can be used to describe the segments (e.g., age, gender).

• Number of descriptor variables: The total number of descriptor variables in your
study.

▪ Out-of-sample classification data (optional – option only available when descriptor data is also
included in template):


• Include classification data: Check the box if you have classification data. Classification
data refers to individuals for whom you only have descriptors available, and wish to
classify into their most likely segments.

• Number of respondents to classify: The number of individuals with descriptor data only
that you would like to classify into segments.

Note: the check box at the bottom of the dialog box will cause the template to populate with sample
(random) data that will allow you to run Segmentation modeling immediately so you can preview the
output produced.


It is not always clear whether a specific variable should be treated as a
segmentation variable or descriptor variable. This choice might depend
on the context, the managerial question, or the product category.

When in doubt, ask yourself the following questions: (1) Does this piece
of information tell me what that customer wants, in which case it should
be treated as a segmentation variable, or (2) does this piece of information
tell me who that customer is, in which case it should be treated as a
descriptor variable? For example, “gender” would fall in the second
category most of the time, whereas “need for timely information” usually
falls in the first category.

After selecting the desired model options, click Run to generate the data collection template. The software generates
the required data blocks depending on whether you have included descriptor data (and classification data) and fills
them with random data (if selected):


Entering Your Data


A typical segmentation setup contains one or two data blocks that hold the segmentation data and, optionally, the descriptor (discrimination) data.
If running classification, an additional data block is used for the data to be classified.

▪ Segmentation data are required for the segmentation model. This data block contains the
respondent identifier and a column for each segmentation variable collected in the study. The
data within each column must be scaled using the same scale (e.g., 1–10), but each column can
have a different scale (e.g., 1–10 for satisfaction, 1–5 for convenience). Typically, segmentation
variables are numerical values (interval or ratio scale). The data set contains one row per
respondent in your study. If you must use basis variables that are nominal (e.g., “male” “female”),
then you can apply latent class segmentation analysis (see appendix).

▪ Descriptor data constitute an optional data block, depending on whether your study has
collected descriptor (also called discrimination) data. Recall that descriptor data enable you to differentiate one
customer from another (e.g., age, income, gender). Again, data within a column must be scaled
using the same scale, but different columns may use different scales. Typically, descriptor
variables are numerical (interval or ratio scale) or nominal (“male”, “female”). Each respondent
in your study appears in a separate row.

▪ Classification data is an optional data block. The block will contain columns that match your
Descriptor data block that is already being used for analysis.
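
As an illustration only, the three data blocks can be pictured as simple tables keyed by respondent. The sketch below uses made-up variable names and values (they are not taken from the OfficeStar data) and hypothetical helper functions. It checks the structural rule stated above, that the classification block uses the same columns as the descriptor block, and it additionally assumes that the segmentation and descriptor blocks list the same respondents.

```python
# Hypothetical data blocks; names and values are illustrative only.
segmentation = {
    "R1": {"price_sensitivity": 6, "need_for_support": 2},
    "R2": {"price_sensitivity": 3, "need_for_support": 7},
}
descriptors = {
    "R1": {"age": 34, "employees": 12},
    "R2": {"age": 51, "employees": 250},
}
classification = {
    "N1": {"age": 29, "employees": 8},
}


def columns(block):
    """Return the set of column names used across all rows of a block."""
    names = set()
    for row in block.values():
        names.update(row)
    return names


def check_blocks(segmentation, descriptors, classification):
    """Return a list of structural problems (empty if none are found)."""
    problems = []
    # Assumption: descriptor rows describe the same respondents that were segmented.
    if set(segmentation) != set(descriptors):
        problems.append("respondent identifiers differ between blocks")
    # Classification columns must match the descriptor columns.
    if columns(classification) != columns(descriptors):
        problems.append("classification columns do not match descriptors")
    return problems


print(check_blocks(segmentation, descriptors, classification))
```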


Running Segmentation Analyses


For the remainder of this tutorial, we will use the “OfficeStar:
Segmentation” data set that is available with the Enginius Segmentation
tutorial. To access the data set, open “Segmentation” under the Tutorials
dropdown in the Enginius Dashboard. This will automatically load the
“OfficeStar: Segmentation” data.

After you enter and/or upload your data, click the Run Segmentation Analysis button in the upper left corner to begin
the analysis.

Analysis options

When you click Run Segmentation Analysis, several analysis options are presented:


Segmentation method
You may specify the number of segments (clusters) to develop during the analysis or you can allow Enginius to
determine the appropriate number of segments. If you allow Enginius to determine the number of segments
automatically, it will do so strictly on a mathematical basis. This may, or may not, be appropriate from a management
perspective but can be useful as a starting point for manually determining the number of segments.


Usually, a segmentation analysis consists of two steps when manually
determining the number of segments (i.e., using “Force number of
segments” option). First, you run the analysis with a large number of
segments (up to 9). Second, on the basis of analysis from the initial
report (discussed subsequently), you can determine the number of
segments to retain for further analysis.

Segmentation data
The dropdown box that appears under Segmentation data allows you to select the data block that corresponds to
your segmentation data. In OfficeStar, this data block is named “Segmentation data”.

A check box under the Segmentation data section allows you to choose whether to Standardize data. This option
scales all variables to 0 mean and unit variance before the analysis. Choosing this option is recommended if you have
measured the variables on different scales.
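
To make the effect of the Standardize data option concrete, here is a minimal sketch of column standardization (z-scores) in plain Python. The function name and the ratings are hypothetical, and the sketch assumes the population standard deviation; the tutorial does not say which variant Enginius uses.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation (an assumption); a constant column
    # would give 0 here and cannot be standardized.
    sds = [pstdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Hypothetical ratings: column 1 on a 1-10 scale, column 2 on a 1-5 scale.
ratings = [[2, 1], [4, 3], [6, 5]]
for row in standardize_columns(ratings):
    print([round(v, 3) for v in row])
```

After this step, a one-unit difference means "one standard deviation" in every column, so a variable measured on a wide scale no longer dominates the distances between customers.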

Descriptor analysis
To perform a Descriptor analysis, check the box beside “Run descriptor analysis” and then choose the data block where
your descriptor data is located.

You may also choose to run Classification analysis if you have provided classification data; classification analysis is
described in the “Interpreting the Classification Output” section. Classification analysis is not typically included until
the segmentation analysis is complete.

Advanced options
Checking the Advanced checkbox will provide two additional options for running your segmentation analysis.


Segmentation method
For the segmentation method, you may choose either Hierarchical clustering or K-means (Hierarchical clustering is
the default if the Advanced option is not checked).

▪ Hierarchical clustering builds up or breaks down the data, customer by customer (row by row).
Due to the computational requirements, hierarchical clustering is not suitable for large data
sets.
Note: if there are more than 2,000 data points, Enginius will use K-means regardless of which
method is selected.
▪ K-means partitioning breaks the data into a pre-specified number of segments and then
reallocates or swaps customers to improve some measure of effectiveness.
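
The K-means idea described above can be sketched with the textbook Lloyd's algorithm, which alternates between assigning each customer to the nearest segment center and recomputing the centers. This is a generic illustration under stated assumptions (random initial centers, Euclidean distance, hypothetical function name and data); it is not necessarily how Enginius implements K-means.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of points."""
    rng = random.Random(seed)
    # Start from k customers picked at random (an assumption of this sketch).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each customer joins the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no customer changed segment, so the solution is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a segment is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical customers described by two segmentation variables.
customers = [(1, 1), (1.5, 2), (3, 1), (8, 8), (9, 11), (10, 8.5)]
centers, segments = kmeans(customers, k=2)
print(segments)
```

Because the result can depend on the starting centers, it is common to rerun the procedure with several seeds and keep the solution with the smallest within-segment error.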

Data transformation
The Data transformation dropdown under the Segmentation data section allows you to select a method to pre-process
the data.

▪ None. This option indicates you want to use the original data.
▪ Standardization (by column). This will standardize the data by column (variable), so that columns
measured on different scales become comparable.


▪ Standardization (by row). This will standardize the data by row (respondent), so that data will be
measured as a deviation from each respondent's average response. This method is only valid if
variables are already measured on the same scale.
▪ Box-Cox normalization. This will apply Box-Cox normalization to the data which reduces data
skewness and the effect of potential outliers.
▪ Factorization. This will transform the data using factorization, and then segment the (weighted)
factor scores instead. This will remove smaller factors (i.e., noise) from the data.
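
Two of the transformations listed above are simple enough to sketch directly. The code below shows row centering (each answer expressed as a deviation from that respondent's own average) and the standard Box-Cox power transform. The function names are hypothetical, the transform parameter `lam` is supplied by hand here, and the tutorial does not state how Enginius chooses it or whether row standardization also rescales the deviations.

```python
import math


def center_rows(rows):
    """Express each answer as a deviation from that respondent's own average."""
    centered = []
    for row in rows:
        avg = sum(row) / len(row)
        centered.append([value - avg for value in row])
    return centered


def box_cox(values, lam):
    """Box-Cox power transform for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The limit of (v**lam - 1) / lam as lam approaches 0 is log(v).
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Hypothetical answers from two respondents on the same 1-10 scale.
print(center_rows([[7, 5, 3], [2, 2, 5]]))
# A skewed variable is pulled in by the log transform (lam = 0).
print(box_cox([1, 10, 100], lam=0))
```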

After selecting all the options, click the Run button found at the bottom of the Segmentation analysis setup window to
begin the analysis. By default, the report will output as a web page.

Reminder: Clicking the world icon beside the “Run” option will
allow you to choose a different output format for the report.

You will see a pop-up indicating the segmentation analysis is underway. Your report will output in the format chosen
(Microsoft, PDF, or Zip format may automatically download to your hard drive).

Interpreting the Segmentation Results


The report generated by segmentation analysis contains several sections, depending on the options chosen. The
results described below were generated with these model settings:


The first section depicts the number of segments (either chosen by the user or automatically chosen by Enginius).
These segments are depicted in 3 different displays: dendrogram, silhouette chart, and scree plot.

Dendrogram

Dendrograms provide graphical representations of the loss of information generated by grouping different clusters
(or customers) together.

At one extreme (upper part of the dendrogram), all customers group into one cluster, and the loss of information is
maximum, because they all receive undifferentiated treatment, regardless of their characteristics.


At the other extreme (lower part of the dendrogram), customers appear in separate, small clusters, and only those
customers very similar to one another group together (“similar” or “close” in this context refers to the distance between
two customers in terms of the segmentation variables).

When reviewing a dendrogram, look for significant distances or “jumps” in the distances (using the scale on the Y axis).
For example, the OfficeStar example contains a very large jump when moving from three to two clusters. Grouping
these three clusters into two generates a significant loss of information; in other words, it results in grouping within
the same cluster customers who are very dissimilar. In the preceding example, a three-cluster solution seems to be
the best approach.
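
The "look for the largest jump" rule can be written down as a small calculation. The sketch below assumes you already have the height of each successive merge read off the Y axis of the dendrogram; the function name and the heights are hypothetical and do not come from the OfficeStar report.

```python
def clusters_before_largest_jump(merge_heights):
    """Suggest how many clusters to keep, given dendrogram merge heights.

    merge_heights lists the height (loss of information) of each successive
    merge, from the first merge to the last, for n = len(merge_heights) + 1
    customers.
    """
    n = len(merge_heights) + 1
    jumps = [b - a for a, b in zip(merge_heights, merge_heights[1:])]
    # Index of the merge that follows the largest jump in height.
    costly_merge = jumps.index(max(jumps)) + 1
    # Stopping just before that merge leaves this many clusters.
    return n - costly_merge


heights = [0.5, 0.7, 1.0, 5.2, 6.0]  # hypothetical, 6 customers
print(clusters_before_largest_jump(heights))
```

With these made-up heights, the costly merge is the one that would reduce three clusters to two, so the function suggests keeping three. As with the automatic option in Enginius, this is a purely mathematical starting point and still needs a managerial check.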

Scree Plot

The scree plot compares the sum of squared errors (SSE) for each cluster solution. A good cluster solution might be
the one at which the decrease in SSE slows dramatically, creating an 'elbow'. Such an elbow may not always exist.
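
For readers who want to see what the SSE measures, the following sketch computes it for a given assignment of customers to clusters: the squared distance from each customer to the centroid of its cluster, summed over all customers. The function name and data points are hypothetical.

```python
def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for label in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            (x - c) ** 2 for p in members for x, c in zip(p, centroid)
        )
    return total


points = [(1, 1), (1, 3), (8, 8), (8, 10)]  # hypothetical customers
print(sse(points, [0, 0, 0, 0]))  # everyone in one segment
print(sse(points, [0, 0, 1, 1]))  # two segments
```

Adding clusters never increases the SSE, which is why the scree plot is read for the point where the improvement levels off rather than for its minimum.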

The above charts are simply a graphical representation of the clustering output. For a more detailed understanding
of cluster members and attributes, you must analyze the other segmentation output as well.

Segment description
This section of the report contains the statistical output of the clustering process: Segment size, Segment
description, Segment differences, and a spatial depiction of segments and segmentation variables.

Segment size: The population of each segment in count and percent is shown in the table below.


Segment description: Average value of each segmentation variable, overall and for each segment (centroid).
Segmentation variables that are statistically different from the rest of the population are highlighted in red (lower) or
green (higher).
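
The segment size and segment description tables boil down to counts and averages, as the sketch below illustrates. The function name and data are hypothetical, and the sketch only computes the averages; the statistical test behind the red and green highlighting is not described in this tutorial and is not reproduced here.

```python
from collections import Counter


def segment_summary(rows, labels):
    """Return (sizes, shares, centroids, overall) for a segmented data set."""
    sizes = Counter(labels)
    shares = {seg: count / len(labels) for seg, count in sizes.items()}
    # Overall average of each segmentation variable.
    overall = [sum(col) / len(rows) for col in zip(*rows)]
    centroids = {}
    for seg in sizes:
        members = [r for r, lab in zip(rows, labels) if lab == seg]
        # Centroid: average of each variable within the segment.
        centroids[seg] = [sum(col) / len(members) for col in zip(*members)]
    return sizes, shares, centroids, overall


# Hypothetical answers on two segmentation variables, with segment labels.
rows = [[8, 2], [6, 4], [1, 9], [3, 7], [2, 8]]
labels = [1, 1, 2, 2, 2]
sizes, shares, centroids, overall = segment_summary(rows, labels)
print(dict(sizes), shares)
print(centroids, overall)
```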

Segment differences per segment: Expanding on the previous chart colors, the shade of cell color indicates to what
extent a segment is statistically different from the rest of the population on each segmentation variable.

Segment space: Spatial representation of segments and segmentation variables, using principal component analysis.
Because only the first two dimensions of the PCA are displayed, and these two dimensions capture only part of the
variance in the data, some differences between segments might not appear here. Note that segmentation variables
with no variance, if any, have been excluded.

Two clusters that appear to overlap in the first two dimensions might actually be distinct on other dimensions.
Consequently, this chart is a useful guide, especially to see which segmentation variables are correlated, but may be
misleading if used to select the optimal number of segments.


Segment membership: The chart below shows an excerpt of the respondents mapped to their segment. The complete
membership list is only available in the Excel formatted report.


Segment profiles (only available when data is NOT standardized)

A spider chart is displayed showing the averages of the segment variables across all segments.

To make each segment and its segmentation variables easier to visualize, a chart is created to represent the profile of each
segment. For each segment, the segmentation variables are sorted in decreasing order of magnitude.

A. The colored dots represent the average of the segment.
B. The horizontal lines represent the standard deviations within that segment.
C. The vertical, gray lines represent the averages of the rest of the population, after excluding
members of the segment under scrutiny.
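
The three elements of the profile chart (A, B, and C above) correspond to three simple statistics. The sketch below computes them for one variable and one segment; the function name and values are hypothetical, and it assumes the sample standard deviation, which the tutorial does not specify.

```python
from statistics import mean, stdev


def profile(values, labels, segment):
    """Summarize one variable for one segment versus everyone else."""
    inside = [v for v, lab in zip(values, labels) if lab == segment]
    outside = [v for v, lab in zip(values, labels) if lab != segment]
    return {
        "segment_mean": mean(inside),  # the colored dot (A)
        "segment_sd": stdev(inside),   # the horizontal line (B), sample SD assumed
        "rest_mean": mean(outside),    # the vertical gray line (C)
    }


# Hypothetical answers on one segmentation variable, with segment labels.
values = [8, 6, 1, 3, 2]
labels = [1, 1, 2, 2, 2]
print(profile(values, labels, segment=2))
```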


Descriptor analysis

The next section of the report shows the output of Descriptor analysis (if selected). This portion of the report will show
information regarding:

▪ Segment sizes depict the number of respondents who appear in each cluster, along with the
proportion of the whole population that each cluster represents.

▪ Descriptor variables depict the means of each descriptor variable for each cluster.

▪ Descriptor function reflects the correlation of the variables with each significant descriptor
function and thus indicates the predictive ability of each descriptor function.

▪ Confusion matrix depicts how well the descriptor data predict correct clusters. Two matrices
are available, one showing the actual data counts and the other showing percentages for these
same data.

▪ Classification weights and classification coefficients are intermediary results required to run
further classification analyses on external data. These matrices are of no particular interest in
themselves and cannot be easily interpreted, but they are necessary to carry out further classification
analyses.

Descriptors
This table reports the descriptor variable averages of each segment. The more the segments differ on these averages, the easier it
will be to predict segment membership based on descriptor data alone.

Descriptor data per segment: Average value of each descriptor variable, overall and within each cluster. Descriptor
variables that are statistically different from the rest of the population are highlighted in red (lower) or green (higher).

Descriptor differences per segment. Expanding on the previous chart colors, the shade of cell color indicates to what
extent the distribution of a descriptor variable in a segment is statistically different from the rest of the population.


Descriptor space
Spatial representation of segments and descriptor variables, using principal component analysis. Because only the
first two dimensions of the PCA are displayed, and these two dimensions capture only part of the variance in the data,
some differences between segments might not appear here. Note that descriptor variables with no variance, if any,
have been excluded.

If two or more segments fully overlap, it is unlikely that they could be clearly separated based on descriptor data alone.

However, two segments that seem to overlap on two dimensions may be more clearly separated on other dimensions.
Consequently, the confusion matrix is a better guide to assess the quality of segment discrimination.

Classification model

Often, segmentation variables may not be available to managers, but descriptors may be.

In this section, we explore whether descriptors alone could predict segment membership with sufficient accuracy. The
confusion matrix and hit rates (reported below) indicate whether the model is accurate enough.

For descriptor analysis, Enginius uses a multinomial logit model (similar to the one used to predict 'choices between
multiple alternatives (A/B/C)' in the predictive modeling module).

The largest segment is selected as the reference category (the default option, coded as a dummy), and the model identifies which
descriptor variables are the most significant predictors of cluster membership. If a descriptor variable is highly predictive, its
p-values will be close to zero, and the corresponding cells will appear in green (or red).
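
To show how a multinomial logit model turns descriptor values into segment probabilities, here is a generic sketch. The reference segment has a utility of zero by construction, and every other segment has an intercept and one weight per descriptor. The function name, coefficients, and descriptors are invented for illustration; they are not Enginius output, and the sketch covers prediction only, not the estimation of the coefficients.

```python
import math


def segment_probabilities(x, coefficients, reference):
    """Multinomial logit probabilities for one respondent.

    coefficients maps each non-reference segment to (intercept, weights).
    The reference segment has utility 0 by construction.
    """
    utilities = {reference: 0.0}
    for segment, (intercept, weights) in coefficients.items():
        utilities[segment] = intercept + sum(w * v for w, v in zip(weights, x))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(utilities.values())
    exps = {s: math.exp(u - top) for s, u in utilities.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}


# Hypothetical coefficients for two descriptors (e.g., age and firm size).
coefs = {"B": (0.5, [0.02, -0.01]), "C": (-1.0, [0.00, 0.03])}
probs = segment_probabilities([40, 25], coefs, reference="A")
print({s: round(p, 3) for s, p in probs.items()})
```

A coefficient is read relative to the reference segment: a positive weight for segment B means that higher values of that descriptor make B more likely than A.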


Model coefficients

P-values

Confusion matrix
The confusion matrix compares actual segment membership (obtained from the segmentation analysis and the
original segmentation variables) and predicted segment membership (obtained from the descriptor analysis and the
descriptors alone). When actual and predicted segment memberships coincide, the diagonal elements will be
comparatively large, indicating that the descriptor model is accurate.

The plot below shows the graphic representation of the confusion matrix. Bubbles along the diagonal show where
respondents were correctly classified.
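
The confusion matrix and hit rate can be reproduced by hand from two lists of segment memberships, as in the sketch below. The function names and memberships are hypothetical, and the layout (actual segments in rows, predicted segments in columns) is an assumption of this sketch rather than a statement about the Enginius report.

```python
def confusion_matrix(actual, predicted, segments):
    """Rows are actual segments, columns are predicted segments (counts)."""
    index = {seg: i for i, seg in enumerate(segments)}
    matrix = [[0] * len(segments) for _ in segments]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def hit_rate(matrix):
    """Share of respondents on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


actual = [1, 1, 1, 2, 2, 3, 3, 3]      # hypothetical memberships
predicted = [1, 1, 2, 2, 2, 3, 1, 3]
m = confusion_matrix(actual, predicted, segments=[1, 2, 3])
print(m)
print(hit_rate(m))
```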


Model predictions
This table details the probabilities of each member of the segmentation dataset to belong to each cluster (as predicted
by the descriptor model and the descriptors alone). The segment with the highest probability is retained, and is
compared to the actual segment membership to measure model accuracy and classification errors. The complete list
is only available in the Excel formatted report.


Interpreting the Classification Output

Introduction

If you ran the analysis with descriptor data, the software estimated the best way to predict the cluster to which an
individual is most likely to belong, based solely on descriptor data. This is useful, for example, to predict whether young people
(age as a descriptor variable) are more likely to be price sensitive (price sensitivity as a segmentation variable), or
whether businesses in certain industries require more support than others.

The ability to recover segment membership from descriptor variables is best summarized by the confusion
matrix and hit rate (see above).

Once this descriptor analysis has been applied to the original dataset, it can be applied again to external customers
for whom descriptor data—but no segmentation data—is available. The process of classifying customers among
segments, based on a preceding segmentation analysis, but using descriptor data only, is called classification analysis.

Note that this classification of customers across segments is our best guess based on descriptor analysis. It is not
perfect, and some customers might be misclassified, that is, they are the closest to segment A in terms of needs, but
their descriptor variables send us astray and predict they are more likely to belong to segment B.


Classification analysis is usually applied to new customers, for whom
segmentation data is not available. For learning purposes, you can also
apply it to descriptor data of customers for whom segmentation data is
available, and see how well segment memberships are recovered. This
analysis is automatically done when you run a segmentation analysis,
and its results are summarized by the confusion matrix.

Interpreting the results


The Classification output shows the result of applying the descriptor model to the respondents to be classified.
Because segmentation variables and actual segment membership are unavailable, the actual accuracy of the model
predictions is unknown and can only be inferred from the previous section.

Segment size


Model predictions
This table details the probabilities of each member of the classification dataset to belong to each cluster (as predicted
by the descriptor model and the descriptors alone). The segment with the highest probability is retained.
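
The "highest probability is retained" rule is a simple maximum over each row of the predictions table. The sketch below applies it to made-up probabilities for two new respondents; the function name, identifiers, and segment names are hypothetical.

```python
def assign_segments(probabilities):
    """Keep, for each respondent, the segment with the highest probability."""
    return {
        respondent: max(probs, key=probs.get)
        for respondent, probs in probabilities.items()
    }


# Hypothetical predicted probabilities for two new respondents.
predicted = {
    "N1": {"A": 0.15, "B": 0.70, "C": 0.15},
    "N2": {"A": 0.48, "B": 0.12, "C": 0.40},
}
print(assign_segments(predicted))
```

Respondent N2 is a close call between segments A and C, which illustrates why the retained segment is a best guess and why some respondents may be misclassified.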
