Comparison of Segmentation Approaches
by Beth Horn and Wei Huang
2 Decision Analyst: Comparison of Segmentation Approaches Copyright © 2016 Decision Analyst. All rights reserved.
Table 1: Segmentation Items

Attribute Battery: How satisfied are you currently with each of the following things in your life? (Each item was rated on a three-point scale: not satisfied, somewhat satisfied, and completely satisfied.)

1. Amount of exercise I get
2. My current weight
3. My breakfast choices
4. My circle of friends
5. Clothes in my closet
6. My coworkers
7. My dinner choices
8. My faith
9. My financial situation
10. My fitness level
11. My health
12. My hobbies or leisure activities
13. My home
14. My home’s yard or landscaping
15. My job or livelihood
16. My last vacation
17. My level of education
18. My level of energy
19. My level of happiness
20. My lifestyle
21. My lunch choices
22. My reflection in the mirror
23. My security and personal safety
24. My social activities
25. My spouse (or significant other or close friend)
26. Community I live in
27. My success at following a diet
28. My travel opportunities
29. Vehicle I drive

Related Items

Question | Scale Rated
30. How would you describe your physical health overall? | Excellent, Very good, Good, Fair, Poor
31. How would you describe your emotional health overall? | Excellent, Very good, Good, Fair, Poor
32. How would you describe the level of stress in your life? | A lot of stress, Moderate stress, Minor stress, No stress
33. How would you best describe the quality of your diet (i.e., what you eat and drink) overall? | Very healthy, Somewhat healthy, Somewhat unhealthy, Very unhealthy

We selected an attribute battery containing 29 items plus an additional four items (overall physical health, overall emotional health, level of stress, and overall quality of diet). Each item in the attribute battery related to satisfaction with components of the respondent’s life, and it was rated on a three-point satisfaction scale (not satisfied, somewhat satisfied, and completely satisfied). The four additional items were rated on either 4-point or 5-point categorical scales. The segmentation items appear in Table 1.

A factor score was computed for each respondent for each of the five factors from Table 2 on page 4 using the regression method. Factor scores are standardized values with a mean of zero and a standard deviation of one. Higher factor scores indicate that the respondents are more satisfied with the items in the factor or have rated the items in the factor more positively.

Each respondent was then assigned to the factor for which he or she had the highest and most positive score.

The results of the factor segmentation classification are shown in Table 2 on page 4.

Factor Segmentation Conclusions

An advantage of this segmentation method is that the results are very clear. The respondents in the “Fitness” segment have the highest standardized score on the “Fitness” factor across all segments. We can say that these respondents are satisfied with the attributes of the “Fitness” factor (such as my current weight and my fitness level) but not as satisfied with Home and Work Environment, Social Support, Diet, and Health. A similar pattern emerges across all segments. Another plus is that it is relatively simple to execute, as most statistical software packages perform factor analysis.

As an artifact of the method, respondents tend to have a high score on the one factor that describes the segment to which they have been assigned and low scores on the other factors. This may not be realistic. For example, we

Table 2: Factor Segmentation - Average Factor Scores by Segment

Factor (rows) by Segment (columns) | Fitness | Home and Work Environment | Social Support | Diet | Health
Percent of Respondents | 25% | 23% | 18% | 19% | 21%
Fitness | 0.984 | -0.450 | -0.419 | -0.271 | -0.212
Home and Work Environment | -0.166 | 0.872 | -0.087 | -0.204 | -0.256
Social Support | -0.184 | -0.135 | 0.906 | -0.233 | -0.237
Diet | -0.121 | -0.272 | -0.252 | 0.935 | -0.262
Health | -0.114 | -0.305 | -0.283 | -0.326 | 0.931

Note: The values in the table are standardized scores (z-scores) with a standard deviation of one. A higher factor score indicates higher levels of satisfaction with the items contained within the factor.

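The assignment rule behind Table 2 (each respondent goes to the factor on which he or she scores highest, and segment profiles are the average factor scores of the assigned respondents) can be sketched in a few lines of Python. This is only an illustration: the function names and the three example respondents are hypothetical rather than data from the study, and breaking ties by taking the first factor in list order is an assumption the paper does not address.

```python
from collections import defaultdict

# Factor names as listed in Table 2.
FACTORS = ["Fitness", "Home and Work Environment", "Social Support", "Diet", "Health"]


def assign_segment(factor_scores):
    """Return the factor with the highest score for one respondent.

    Ties go to the first factor in FACTORS order (an assumption).
    """
    best_index = max(range(len(FACTORS)), key=lambda i: factor_scores[i])
    return FACTORS[best_index]


def average_scores_by_segment(respondents):
    """Group respondents by assigned segment and average each factor score."""
    groups = defaultdict(list)
    for scores in respondents:
        groups[assign_segment(scores)].append(scores)
    return {
        segment: [sum(column) / len(rows) for column in zip(*rows)]
        for segment, rows in groups.items()
    }


# Hypothetical standardized factor scores (not from the study).
example = [
    [1.2, -0.3, -0.5, 0.1, 0.0],
    [0.8, -0.1, -0.2, -0.4, 0.2],
    [-0.6, 0.2, 1.1, -0.3, -0.1],
]
segments = [assign_segment(r) for r in example]
profile = average_scores_by_segment(example)
```

Because every respondent is placed by a single maximum, each segment's average is necessarily high on its own factor, which is the pattern visible on the diagonal of Table 2.
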
can probably think of people we know who are satisfied with both Fitness and Social Support or both Diet and Health or perhaps who are dissatisfied with all five factors. Factor segmentation might fail to capture the multifaceted nature of consumers.

K-Means Cluster Analysis

This method can use as input the factor scores (such as those developed using factor analysis), the individual attributes, or a combination. In this paper, the 33 individual attributes were used as the segmentation variables.

Because k-means does not handle variables of different scales very well, the individual attributes were transformed into a common metric, the z-score. These standardized scores have a mean of zero and a standard deviation of one. The higher a variable’s score, the higher the actual rating on that particular variable. These standardized attributes were then used as input into a k-means procedure.

The algorithm is affected by the order of the records in the data set; thus, various seed numbers and sorting schemes were explored. A five-cluster solution was selected where many of the attributes’ standard scores were significantly different across the clusters. To aid interpretation, the clusters (segments) were named.

Unlike factor segmentation, k-means clustering will often reveal segments of respondents who are highly satisfied or dissatisfied on more than one attribute dimension. To further illustrate, factor scores were calculated for each of the k-means clusters.

In Table 3 on page 5, we can see that members of the Satisfied With Environment But Not With Fitness segment are satisfied with Home and Work Environment and Social Support, but are not satisfied with their Fitness. Members of the Ultra Satisfied With Life segment are satisfied with everything, but especially satisfied by their Fitness and Diet.

K-Means Cluster Analysis Conclusions

K-means cluster analysis overcomes one of the potential shortfalls of factor segmentation by describing the multidimensionality of attitudes and behaviors. Consumers can be satisfied or dissatisfied with more than one lifestyle area, for example. K-means also offers F-statistics that provide information about each attribute’s

Table 3: K-Means - Average Factor Scores by Segment

Factor (rows) by Segment (columns) | Ultra Dissatisfied With Life | Dissatisfied With Fitness & Health | Satisfied With Fitness But Not With Environment | Satisfied With Environment But Not With Fitness | Ultra Satisfied With Life
Percent of Respondents | 16% | 23% | 26% | 19% | 15%
Fitness | -0.433 | -0.802 | 0.563 | -0.315 | 1.121
Home and Work Environment | -0.712 | 0.039 | -0.389 | 0.623 | 0.564
Social Support | -0.854 | 0.097 | -0.349 | 0.673 | 0.491
Diet | -0.615 | -0.144 | -0.059 | 0.216 | 0.695
Health | -0.627 | -0.342 | 0.169 | 0.258 | 0.566

Note: The values in the table are standardized scores (z-scores) with a standard deviation of one. A higher factor score indicates higher levels of satisfaction with the items contained within the factor.

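The k-means workflow described in this paper (convert each attribute to z-scores, then run the procedure from several starting seeds) can be sketched as follows. It is a plain Lloyd's-algorithm illustration in standard-library Python, not the statistical package procedure the authors ran; the function names, the toy ratings, and the rule of keeping the run with the smallest within-cluster sum of squares are all assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to mean 0 and standard deviation 1.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    columns = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, seed, max_iter=100):
    """One run of Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)  # starting centers depend on the seed
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in rows]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [mean(col) for col in zip(*members)]
    wss = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
    return labels, wss


def best_of_seeds(rows, k, seeds):
    """Run k-means once per seed and keep the run with the lowest within-cluster SS."""
    return min((kmeans(rows, k, s) for s in seeds), key=lambda result: result[1])


# Toy ratings on two different scales (hypothetical, not study data).
ratings = [[1, 5], [1, 4], [2, 5], [3, 1], [3, 2], [2, 1]]
standardized = zscore_columns(ratings)
labels, wss = best_of_seeds(standardized, k=2, seeds=range(5))
```

Comparing the labels returned under different seeds (or after re-sorting the rows) is one simple way to check how stable a solution is, in the spirit of the seed and sorting checks the authors describe.
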
contribution to differentiating the clusters. These statistics can be used to simplify the segmentation by allowing the analyst to omit attributes that have a small impact on the cluster solution.

K-means, though, assumes that all underlying variables are continuous (interval level data). Segmentation inputs that are count, ordinal, or ranked variables are not appropriate. Transformations of such attributes to a common metric must be accomplished before clustering. Another disadvantage to k-means is that the outcome is affected by the order of the data records. Various ordering schemes can be explored to test the robustness of the k-means solutions.

K-means also requires the analyst to specify the number of clusters desired. In some statistical packages, the procedure provides limited statistics to guide the analyst in identifying the optimal number of clusters. For example, the FASTCLUS procedure in SAS® (SAS Institute Inc., 2008) prints the approximate expected overall R² and the cubic-clustering criterion that can be used to evaluate cluster solutions. Unfortunately, both statistics are rendered useless if the segmentation inputs are correlated (which is true in many cases). In the end, the analyst must use additional statistical testing, plotting of differences among the attributes across clusters, and a good dose of personal judgment to arrive at the optimal segmentation solution.

TwoStep Cluster Analysis

Factor scores or individual attributes can serve as input into TwoStep cluster analysis. Additionally, TwoStep can handle categorical variables, such as demographics (e.g., gender, ethnicity) rated on a satisfaction scale. For the current analysis, the 33 individual attributes, classified as categorical, were used as the segmentation variables.

To determine the number of clusters, the analyst can specify the number or have the procedure select the number of clusters, based on the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC). There is also a provision for handling respondents who do not meet the criteria for inclusion in any cluster.

Table 4: TwoStep Cluster - Average Factor Scores by Segment

Factor (rows) by Segment (columns) | Ultra Dissatisfied With Life | Dissatisfied With Fitness & Health | Satisfied With Fitness But Not With Environment | Satisfied With Environment But Not With Fitness | Ultra Satisfied With Life
Percent of Respondents | 10% | 30% | 28% | 24% | 8%
Fitness | -0.466 | -0.749 | 0.450 | 0.173 | 1.355
Home and Work Environment | -0.733 | -0.057 | -0.265 | 0.465 | 0.727
Social Support | -0.970 | -0.024 | -0.278 | 0.596 | 0.547
Diet | -0.778 | -0.171 | -0.102 | 0.414 | 0.796
Health | -0.747 | -0.330 | 0.142 | 0.332 | 0.729

Note: The values in the table are standardized scores (z-scores) with a standard deviation of one. A higher factor score indicates higher levels of satisfaction with the items contained within the factor.

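The BIC and AIC mentioned as guides to the number of clusters have standard textbook definitions, sketched below. This is a generic illustration, not the exact computation performed inside TwoStep or Latent GOLD; the function names are hypothetical, and the log-likelihoods, parameter counts, and sample size in the example are made-up numbers rather than results from the study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def best_by_bic(candidates, n_obs):
    """Pick the number of clusters whose model has the lowest BIC.

    candidates maps number of clusters -> (log-likelihood, parameter count).
    """
    return min(candidates, key=lambda k: bic(*candidates[k], n_obs))


# Hypothetical fits for 2 to 5 clusters (made-up values, not study results).
fits = {2: (-5200.0, 20), 3: (-5100.0, 30), 4: (-5050.0, 40), 5: (-5040.0, 50)}
chosen = best_by_bic(fits, n_obs=1000)
```

Both criteria trade model fit against the number of parameters, so the cluster count they favor depends on how quickly the likelihood improves as clusters are added. That is one reason to treat an automatically selected count as a starting point rather than a final answer.
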
These “outlier” respondents are grouped together so that they can be excluded from further profiling.

The number of clusters produced by each procedure was intended to be the same to facilitate comparisons among methods. Yet the automatic determination of clusters was implemented in TwoStep to identify what the “optimal” statistical solution might be, assuming no outliers. The optimal number of clusters ranged from two to three, based on different orderings of the records in the data file.

A five-cluster solution, in contrast, produced more interesting differentiation among the clusters. TwoStep provides statistics (chi-square statistics for categorical variables and t-statistics for continuous variables) that quantify the relative contribution of each variable to the formation of a cluster. In the five-cluster solution, all except five of the attributes were significant contributors. Using this information, we omitted the five attributes (my faith, my last vacation, my spouse [or significant other or close friend], community I live in, and vehicle I drive) and ran the analysis again to refine the segmentation solution. The profile of the segments is shown in Table 4. The five segments were assigned the same names used in the k-means profile to aid comparison.

The profile of the clusters produced by TwoStep was similar to the profile of the clusters developed by k-means. For example, both profiles showed a segment of respondents, Ultra Satisfied With Life, whose members are happy with most aspects of life, and another segment, Ultra Dissatisfied With Life, whose members are woefully depressed.

As shown in Table 4, TwoStep also reveals segments of respondents who are satisfied or dissatisfied on more than one factor. Respondents who are in the Satisfied With Fitness But Not With Environment segment, for example, are satisfied with Fitness, but dissatisfied with Home and Work Environment and Social Support. Members of the Ultra Dissatisfied With Life segment are very unhappy with everything.

TwoStep Cluster Analysis Conclusions

TwoStep cluster analysis has advantages versus the methods previously discussed. One advantage deals
with the range of cluster sizes. Factor segmentation and k-means tend to produce clusters that are very similar in size, as shown previously (ranging from 15% to 26%). TwoStep yielded clusters that had a larger size range (8% to 30%). Having a segmentation solution that contains clusters of different sizes has more face validity. For example, we could imagine that consumers who are really happy with life and those who are very unhappy with life comprise a smaller group than those who are more middle-of-the-road.

Another advantage is that TwoStep can use variables that have differing scale types. Factor segmentation and k-means cannot treat variables as categorical; the variables must be considered continuous or transformed in some manner (e.g., to standard scores). In TwoStep, though, categorical attributes can be specified as such. This can encourage better separation among the segments and easier interpretation of the results.

Yet there are disadvantages to the TwoStep method. Like k-means clustering, TwoStep is influenced by the order of the records in the data set. Sorting the data records in several ways can help the analyst understand how the cluster profiles change with different orderings.

In addition, respondents with any missing values are excluded from the analysis altogether. This could decrease the sample size available for segmentation if a large number of respondents skip or refuse to answer critical segmentation questions.

TwoStep gives some guidance as to the optimal number of clusters via the BIC and AIC, whereas factor segmentation and k-means do not. However, in this paper and in the experience of the authors, the automatic-clustering routine yields too few clusters and is not usually useful. Still, the AIC or BIC can be used as a starting point for further consideration as the analyses proceed with additional clusters.

Overall, TwoStep represents a mathematical improvement over factor segmentation and k-means in its handling of categorical variables and in providing statistics that help determine the number of clusters.

Latent Class (LC) Cluster Analysis

LC cluster analysis, as implemented by Latent GOLD® 4.5 (Statistical Innovations Inc., 2008), allows the analyst to select any number of segmentation inputs or indicators and covariates (such as demographics) for the model. The indicators are dependent variables that are used to define or measure the latent classes in an LC cluster model. They are the primary drivers that determine the segmentation. The secondary drivers are the covariates, which can be demographics or critical outcome variables, such as purchase intent for a new product. Covariates can be treated as either active (allowed to influence the clustering) or inactive (serve as profiling variables only) in the analysis.

Segment solutions for two different model structures are reported. The first model used the 29 satisfaction attributes as indicators and the four additional items (overall physical health, overall emotional health, level of stress, and overall quality of diet) as active covariates. In the second model the 29 satisfaction attributes were considered covariates, while the other four variables became nominal indicators. (Transformation of the data is not needed in LC cluster analysis; the model treats each variable according to its own type: nominal, ordinal, count, rank, or continuous.)

Similar to TwoStep cluster, LC cluster analysis provides a set of cluster model selection tools, including the BIC. Statistically, the lower the BIC, the better the model describes the data. The BIC value was still decreasing

Table 5: LC Cluster Analysis Approach 1 - Average Factor Scores by Segment

Factor (rows) by Segment (columns) | Ultra Dissatisfied With Life | Dissatisfied With Fitness & Health | Satisfied With Fitness But Not With Environment | Satisfied With Environment But Not With Fitness | Ultra Satisfied With Life
Percent of Respondents | 14% | 23% | 30% | 21% | 12%
Fitness | -0.439 | -0.882 | 0.502 | -0.100 | 1.182
Home and Work Environment | -0.783 | 0.084 | -0.357 | 0.539 | 0.673
Social Support | -0.909 | 0.105 | -0.352 | 0.656 | 0.555
Diet | -0.657 | -0.173 | -0.062 | 0.301 | 0.722
Health | -0.594 | -0.356 | 0.126 | 0.261 | 0.614

Note: The values in the table are standardized scores (z-scores) with a standard deviation of one. A higher factor score indicates higher levels of satisfaction with the items contained within the factor.

for models that contained more than five clusters for each of the three LC cluster models tested. Thus, statistically, more than five clusters would be optimal for this data. To facilitate comparison with the other techniques reported in this paper, however, the five-cluster model solution was selected for each of the LC cluster models tested.

LC Cluster Analysis: Approach 1

In this approach, overall physical health, overall emotional health, level of stress, and overall quality of diet were used as active covariates in the model. The model’s covariates play a less important role (i.e., show less differentiation among the segments) in the analysis than do the indicators (the 29 satisfaction attributes).

Likewise, the average scores for the factors in Table 5 on page 8 are very similar to the factor scores shown for k-means and TwoStep.

For the final variation on the LC cluster analysis, overall

on page 8), there is some overlap among segment membership (52% to 65%) between Latent Class Approach 1 and Approach 2. Yet classifying overall physical health, overall emotional health, level of stress, and overall quality of diet as indicators and classifying the satisfaction attributes as covariates (Approach 2) did yield segments with somewhat stronger profiles than did Approach 1, especially in the Satisfied With Environment But Not With Fitness segment.

The Satisfied With Fitness But Not With Environment segment is neither strongly satisfied nor dissatisfied in any dimension. However, because these respondents are moderately dissatisfied about their Social Support, it indicates they could be on the verge of a downslide and might respond favorably to products/services that increase their emotional well-being. Satisfied With Environment But Not With Fitness respondents have the highest home and work satisfaction, yet they feel their fitness level is lacking. These respondents might be

and products for weight loss that fit with their busy
yields for each respondent the probability of belonging to each segment. Respondents are assigned to the cluster to which they have the highest probability of belonging. Indeed, respondents could be assigned to more than one cluster, based on their probabilities.

The ability to consider segmentation inputs as either indicators or covariates allows the analyst to uncover potentially useful segments that may not be identified using other methods. For example, in LC Cluster Analysis: Approach 2, somewhat stronger segments were found by modeling several overarching outcome variables as indicators and attitudes as covariates.

LC cluster analysis provides model selection criteria, as does TwoStep cluster analysis. Yet in our data, TwoStep’s automatic cluster selection feature found two to three clusters as optimal for the data. LC cluster analysis found that more than five clusters were optimal, statistically. Relying on TwoStep’s automatic selection of clusters might lead the analyst to overlook key marketing segments.

LC cluster analysis, however, can take longer to run than other approaches, especially with data sets that contain thousands of respondents. For large, complex segmentation projects, the authors have experienced run times of several hours using a high-speed computer. LC cluster analysis requires advanced knowledge of statistics to help the analyst wade through the myriad of options available. Because LC cluster analysis can handle so many variables, it is tempting to add more segmentation inputs than are really necessary. The analyst must guard against the urge to place “everything but the kitchen sink” into the model. Undue complexity makes interpreting the segmentation solution more difficult.

Implications for Marketing and Research

Within the confines of our empirical test, each segmentation method yielded a different segmentation solution. Indeed, within the same method, different variable classifications and ordering of data records can produce dissimilar solutions. Consider that there are even more techniques available with which to segment and endless permutations of variables that can be included in the analysis. The options are overwhelming.

Taking a step back, though, it can be helpful to consider how the segmentation solution will be used before selecting a technique. The segmentation methods discussed in this paper can provide unique benefits given particular business objectives.

If the objective is marketing communications, factor segmentation might be the approach to use. The analysis is simple to execute, and the results are fairly straightforward. Respondents are assigned to the segment for which they have the highest factor score; each segment is represented by one attitudinal or behavioral theme. This makes targeting a particular consumer group easier. Consumers in the Diet segment might be targeted with a message such as “Product X is a healthful lunch choice,” while consumers in the Fitness segment might receive messages such as “Product X will help you maintain optimal fitness.”

If the business objective is new product development, it is vital to understand how consumers group together according to needs. The cluster analyses (k-means, TwoStep, and latent class) best accomplish grouping respondents according to their patterns of needs. The resulting segments are based on multiple needs, attitudes, and behaviors. Segments defined by various
need states allow product developers to create new
products or line extensions that can meet core needs of
consumers within a particular segment. Product Y might
be developed, for instance, to address several need
states among consumers in the Sort of Dissatisfied
With Life segment—improve health and fitness,
successfully follow a diet, and decrease weight.

Examine the data.
Are there many different types of scales represented in your segmentation inputs? Select the method that best accounts for the differences in variable types. Do you have long attribute lists? Try factoring or other data reduction methods to decrease the number of variables that enter into the segmentation. There are countless ways in which variables can be combined and factored.

Know the techniques.
We discussed four methods in this paper. There are others as well, such as discriminant analysis, principal components analysis, and so forth. Review the strengths and weaknesses of each technique and understand the software to which you have access.

because previous research yielded groups that had weak relationships with key measures, such as purchase intent for new or existing products and messaging components for promotion strategy. At the initial planning stage, it is vital to understand which key metrics are important to the client and craft an analysis plan to include these metrics. In LC cluster analysis, for instance, attitudinal and behavioral variables can be selected as cluster model indicators, and new product purchase intent and demographics can be covariates in the model. Modeling the data in this way can increase the likelihood that certain attitudes and behaviors are “linked” to different levels of purchase intent. Such results can help the client company determine which segments to target first (groups that are likely to purchase the product) and how
segment members to communicate with them, then the segments are not useful. Segmentation solutions that accomplish these objectives should be favored over other solutions.

Although there can be a great deal of sophistication in the analysis stage, segmentation is not a purely scientific pursuit. Sadly, there are no magic buttons to press to generate the “best” segments. Given that the data have been modeled with the most appropriate technique(s) available and that the basics are addressed, category experience and expert judgment are the final guides to the selection of the “best” segmentation solution.

The dataset used consisted of 4,156 respondents from the Decision Analyst’s Health and Nutrition Strategist™ research study. The data were collected online in 2006 using a nationally representative sample of adults in the U.S. recruited from the American Consumer Opinion® panel. The Health and Nutrition Strategist™ is a massive, integrated knowledge base of food and

References

T. Chiu, D. Fang, J. Chen, Y. Wang, and C. Jeris (2001). A Robust and Scalable Clustering Algorithm for Mixed Type Attributes in Large Database Environment. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA: ACM, pp. 263–268.

SAS Institute Inc. (2008). SAS/STAT® 9.2 User’s Guide. Cary, NC: SAS Institute Inc.

SPSS Inc. (2001). The SPSS TwoStep Cluster Component: A Scalable Component Enabling More Efficient Customer Segmentation. Technical report,

Statistical Innovations Inc. (2008). Latent GOLD® 4.5. Belmont, MA.

T. Zhang, R. Ramakrishnan, and M. Livny (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proceedings of the ACM SIGMOD Conference on Management of Data, Montreal, Canada:

Decision Analyst is a leading international marketing research and analytical consulting firm. The company
specializes in advertising testing, strategy research, new product ideation, new product research, and
advanced modeling for marketing-decision optimization.