
Cluster Analysis Detail Steps

This document provides steps for performing cluster analysis, including: (1) formulating the problem by identifying dependent and independent variables, (2) deciding which variables to use as the basis for clustering, (3) selecting a distance measure such as Euclidean distance, (4) choosing a clustering procedure such as hierarchical or k-means, and (5) deciding on the optimal number of clusters. Key factors discussed include avoiding multicollinearity among variables, standardizing data measured on different scales, interpreting results, and evaluating solutions using metrics such as iteration counts, agglomeration coefficients, and ANOVA. The document uses examples from burger and NFL datasets to illustrate the steps and the interpretation of cluster analysis outputs.



MKTG3850 Cluster Analysis Page 1

Steps to Cluster Analysis

Formulate the Problem

Approach the problem with a dependent-variable (DV) and independent-variable (IV) mindset: the DV is the group assignment, and the IVs are the reasons observations (rows) belong to one group and not the other group(s).

In the Burger dataset, the groups would separate on calories for two reasons. One, there is a range of values from less than 200 calories to more than 1,000 calories. Two, calories reflect the content of the item: as sodium, protein, or carbohydrates increase, the calorie count increases.

In addition to calories, I would consider what items would explain membership in a cluster. A wheat flour wrap compared to an almond flour wrap would not explain much, if anything. Bun compared to bunless could explain something. On the other hand, from a nutrition view, carbohydrates would serve as a better variable than bun compared to bunless. From an analytics view, carbohydrates measured in grams serves as the better measure because carbohydrates is a ratio variable while bun versus bunless is a nominal variable.

In the NFL dataset, I would look for groups based on wins because we have been using
wins throughout the semester. Separately, we could perform cluster analysis around (a)
offensive variables, (b) defensive variables, or (c) special teams variables. For our
purposes, though, we have been considering wins.

Decide Which Variables to Use as Bases for Clustering

Select variables for inclusion. This is the most critical of all the steps: the inclusion and exclusion of variables drives your entire cluster effort. Too often we rely on intuition and data availability to drive this decision.

There are several approaches to this step. One, review the managerial question to
understand the objective that you (as the analyst) are trying to accomplish through
cluster analysis. This objective should serve as your guide throughout the analysis.

Two, compute a correlation matrix. As you add more variables or columns to the analysis, multicollinearity becomes a problem. In the Burger dataset, in reviewing the correlation coefficients for the three measures of fat, I observe coefficients high enough (greater than .8) to consider removing at least two from the analysis (see Correlation tab).

In reviewing the correlation coefficients involving sugar, I see that the relationship between sugar and protein appears weak rather than strong, so collinearity is not a concern for that pair.
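As a sketch of this screening step, the pair-wise correlations can be computed and any pair above the .8 threshold flagged. The column names and values here are hypothetical stand-ins for the Burger dataset's fat and sugar measures.

```python
import numpy as np

# Hypothetical stand-ins for Burger dataset columns: saturated fat is
# constructed to track total fat closely, while sugar is unrelated.
rng = np.random.default_rng(0)
total_fat = rng.normal(20, 5, 100)
saturated_fat = 0.4 * total_fat + rng.normal(0, 0.5, 100)
sugar = rng.normal(8, 3, 100)

corr = np.corrcoef([total_fat, saturated_fat, sugar])

# Flag any pair of variables with |r| > .8 as a removal candidate.
high_pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)
              if abs(corr[i, j]) > 0.8]
```

With these inputs, only the (total_fat, saturated_fat) pair exceeds the threshold, so one of the two could be dropped before clustering.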

Three, perform a factor analysis. Items that load on a factor above .8 or below .6 could be considered for removal: very high loadings signal redundancy, while low loadings signal a lack of explanatory power.

Select Distance Measure

Euclidean distance is the most common. Squared Euclidean distance amplifies the dissimilarity between groups. City-block (Manhattan) distance appears in textbooks because it is easier to calculate with less robust packages.

We need to standardize our data because distance serves as our measure. Standardizing the data allows us to include variables measured on different scales or magnitudes. A variable measured in dollars, such as household income for a market, and a variable measured as a percentage, such as the percent of households with at least one person holding a college degree, reflect different magnitudes. To include them in a distance-based analysis such as cluster analysis, I should standardize both variables.

Remember, when standardizing variables, you (as the analyst) lose or forfeit the ability to make interpretations or inferences about those variables. To make interpretations or inferences about them, you (as the analyst) should look at the unstandardized values AFTER a cluster solution has been reached.
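A minimal sketch of the standardization step, using hypothetical income and degree-attainment values; z-scoring puts both on a common scale before distances are computed.

```python
import numpy as np

# Hypothetical market-level variables on very different scales.
income = np.array([52_000.0, 87_000.0, 61_000.0, 45_000.0, 73_000.0])
pct_degree = np.array([0.31, 0.52, 0.38, 0.24, 0.44])

def standardize(x):
    """Z-score: subtract the mean, divide by the sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

z_income = standardize(income)
z_degree = standardize(pct_degree)
# Both now have mean 0 and unit variance, so neither variable dominates a
# distance-based procedure simply because of its magnitude.
```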

Select Clustering Procedure

Two types of clustering procedures exist: (1) hierarchical and (2) nonhierarchical. Hierarchical is good for smaller datasets or when the number of groups is unknown. Hierarchical clustering also generates the dendrogram that looks good in presentations.

Several techniques exist to generate hierarchical clusters. Ward's method focuses on variance; you (as the analyst) should know the relationship between variance and distance. Ward's typically generates clusters with roughly equal membership. Centroid linkage is good for smaller samples. You (as the analyst) should develop a solution with one method, such as Ward's, and verify it with another, such as centroid. Nearest neighbor and farthest neighbor possess a similar complementary relationship to Ward's and centroid.
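The develop-with-one-method, verify-with-another workflow can be sketched with SciPy; the two-blob data here is a hypothetical stand-in for a small standardized dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two well-separated blobs standing in for, say,
# low-calorie and high-calorie items.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),   # "low" items
               rng.normal(5, 0.3, (10, 2))])  # "high" items

# Develop a two-group solution with Ward's (variance-based) linkage ...
ward_labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
# ... then verify it with a second method, here centroid linkage.
cent_labels = fcluster(linkage(X, method="centroid"), t=2, criterion="maxclust")

# The two methods agree if the partitions match (labels may be swapped).
agree = ((ward_labels == cent_labels).all()
         or (ward_labels == 3 - cent_labels).all())
```

When the two methods disagree on membership, that disagreement itself is a signal to revisit the variables or the number of groups.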

Nonhierarchical clustering has become known as k-means; the k refers to the number of groups. Unlike hierarchical clustering, you (as the analyst) must specify the number of groups at the outset. Nonhierarchical clustering will not generate a dendrogram, but it performs better than hierarchical clustering when dealing with large datasets.
MKTG3850 Cluster Analysis Page 3

In this approach, the package picks a random observation to serve as the initial or seed value and then groups observations so as to minimize distance to that seed value. The package repeats the process until the solution converges.
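The seed-assign-update loop just described can be sketched in plain NumPy. This is a simplified Lloyd's algorithm; real packages add safeguards such as multiple random starts and empty-cluster handling.

```python
import numpy as np

def kmeans(X, k, max_iter=50, seed=0):
    """Minimal k-means sketch: pick k random rows as seed centers, assign
    each observation to its nearest center, recompute the centers, and
    repeat until the change in the centers reaches zero."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random seeds
    for _ in range(max_iter):
        # Squared Euclidean distance from every row to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # change reached zero: converged
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: two well-separated groups of observations.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(8, 0.5, (20, 2))])
labels, centers = kmeans(X, k=2)
```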

In terms of groups, three to eight typically serves as a good rule of thumb. In the NFL dataset, we are probably looking at three to five groups if we think about wins as the variable separating groups. In the Dunnhumby dataset, we are probably considering no more than nine, though that many would be unusual. In the Burger dataset, we are most likely looking at three to five groups.

Two groups usually generate a high group and a low group and, as a consequence, are probably not interesting. A two-group cluster solution could work for subsetting a dataset so that we have one cluster of observations that are A and another cluster of observations that are B. A two-group solution should lead to more analysis on each group.

More than eight groups would be difficult to justify and require additional analysis for
support.

To verify a nonhierarchical cluster solution, running an ANOVA on the dependent variable should provide sufficient support. Alternatively, a hierarchical approach should verify the number of clusters established in the nonhierarchical approach.
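As a sketch of that verification, a one-way ANOVA on the dependent variable across cluster assignments should yield a significant F. The standardized-calorie values here are hypothetical.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical standardized calories for items assigned to three clusters.
rng = np.random.default_rng(3)
cal_c1 = rng.normal(-1.0, 0.3, 15)
cal_c2 = rng.normal(0.0, 0.3, 15)
cal_c3 = rng.normal(1.2, 0.3, 15)

f_stat, p_value = f_oneway(cal_c1, cal_c2, cal_c3)
# A significant result (p < .05) supports the claim that the clusters
# differ on the dependent variable.
```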

Decide on Number of Clusters

In hierarchical clustering, as the distance at which groups merge increases, you lose information. Distance is your guide in these procedures. The more distance needed, the more heterogeneous the membership within the groups becomes, because you are losing information related to the uniqueness of each group.

The dendrogram will display distance in relation to the number of groups formed. Typically, the greatest distance shown occurs when combining the final two groups into one large group.

Some packages, including Enginius (née Marketing Engineering), will display a stress test based on the amount of distance (y-axis) in relation to the number of groups (x-axis). In these graphs, you (as the analyst) are looking for an inflection point where the amount of distance shown on the y-axis flattens or plateaus along the x-axis.

In other packages, including SPSS, such a line graph can be created by charting the coefficients from the Agglomeration Schedule. Alternatively, you (as the analyst) can subtract the distances between successive stages where two observations or groups are merged.

In the Burger dataset, I included the standardized scores of calories, total fat, sodium, cholesterol, carbohydrates, fiber, and sugars. I then ran a hierarchical cluster analysis using centroid linkage and squared Euclidean distance.

In looking at the agglomeration schedule (see Agg Schedule tab), the coefficient to form the first group (observation 129 and observation 142) is .004, as shown in the coefficient column. The next group (observation 149 and observation 150) is then formed, and the coefficient is .013. The amount of distance from the first group to the second group is .009 (.013 - .004). That is not a lot of distance.

Scroll to the end of the schedule. The distance to form one group by combining the remaining two groups is 4.634 (52.730 - 48.096). That is a lot of distance compared to the amount needed to form the first two groups. To form two groups by combining the remaining three groups takes 20.675 (48.096 - 27.421). Looking at the differences, I am considering either a six-group, a four-group, or a three-group solution. Remember, I want to work with as few groups as possible while retaining as much information as possible.
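The subtraction this walkthrough performs by hand can be scripted: difference successive agglomeration coefficients and look for the large jumps. The list below abridges the schedule to the stages quoted in the text, so the jump between the second and third entries spans many omitted stages.

```python
# Coefficients quoted from the Burger agglomeration schedule (abridged:
# the first two stages, then the last three).
coeffs = [0.004, 0.013, 27.421, 48.096, 52.730]

# Distance added at each successive stage.
jumps = [round(b - a, 3) for a, b in zip(coeffs, coeffs[1:])]
# jumps[0] is the .009 between the first two merges; jumps[-1] is the
# 4.634 required to combine the final two groups into one. Large jumps
# late in the schedule mark where further merging loses too much
# information, suggesting where to stop.
```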

In looking at the cluster membership (see Cluster Membership tab), cluster 2 from a seven-cluster solution merges with cluster 1. In looking at the table, it would seem reasonable for those two types of food items to group.

Alternatively, you (as the analyst) could work with a nonhierarchical approach. I prefer that you (as the student) take this approach for this course so that you gain experience with it. Nonhierarchical clustering has gained popularity among digital analysts because it performs better with large datasets.

Using the Burger dataset, there are several pieces of the output that I want to look at. First, I want to consider the Iteration History. Some researchers will set the iteration limit at 10, others at 20, and some at 50. In the Iteration History, you (as the analyst) want to know how many iterations it took for the change in the cluster centers to reach zero (0). The fewer the iterations, the more support for that cluster solution.

At two clusters, it takes 10 iterations for the values to reach zero in both groups (see 2
Cluster tab). At three clusters, the solution does not converge at the tenth iteration (see 3
Cluster tab). At six clusters, the solution converges at the fourth iteration (see 6 Cluster
tab).

Second, the Final Cluster Centers will provide the centroid value of each variable for each cluster. Looking across variables, you (as the analyst) can get a sense of whether each cluster is high, medium, or low on each variable. Comparing clusters variable by variable remains possible because you (as the analyst) relied on standardized values.
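A minimal sketch of what the Final Cluster Centers table reports: the mean of each (standardized) variable within each cluster, using hypothetical labels and data.

```python
import numpy as np

# Hypothetical standardized observations and their cluster assignments.
X = np.array([[-1.2, -0.9],
              [-0.8, -1.1],
              [ 1.0,  0.9],
              [ 1.1,  1.2]])
labels = np.array([0, 0, 1, 1])

# Final cluster centers: the within-cluster mean of each variable.
centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
# Because the inputs are standardized, a positive center value means the
# cluster sits above the overall average on that variable.
```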

Graphing the values provided in the Final Cluster Centers table as a bar chart will provide a visual display that makes interpretation easier. In the two-cluster solution, I see that I have a high group and a low group.

Third, you (as the analyst) can determine the heterogeneity or dissimilarity between groups using the values provided in the Distance between Final Cluster Centers. In the two-cluster solution, the distance between clusters is 3.37, which provides enough support that heterogeneity has been achieved.

At four clusters, the clusters appear sufficiently heterogeneous or dissimilar (see 4 Cluster tab). At six clusters, cluster 2 and cluster 6 could be homogeneous or similar based on calories.

To resolve this issue, you (as the analyst) should conduct an ANOVA with cluster membership serving as the independent variable and standardized calories as the dependent variable. The post-hoc analysis must be included. In the post-hoc analysis, clusters 1 and 2 do not appear different from the other clusters (see 6 Cluster ANOVA). Therefore, a six-cluster solution should no longer be considered. Support exists for both the four-cluster and five-cluster solutions.

Finally, back in the cluster analysis output, the ANOVA table provides two needed elements. One, if a variable lacks significance, then the variable should be removed from the cluster analysis. Two, based on the F-ratio values, a comparison between variables can be made. In the two-cluster solution, calories (F = 240.967) carries more weight than sugars (F = 22.67), fiber (F = 28.134), and carbohydrates (F = 53.611). That is, the package relies more on calories than on the other variables to form the groups (see 2 Cluster tab).
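The variable-weight comparison can be sketched the same way: compute a one-way F per variable across the cluster groups and compare magnitudes. The data here is hypothetical, with calories built to separate the two groups far more strongly than sugars.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(4)
labels = np.repeat([0, 1], 30)
# Hypothetical standardized variables: calories separates the clusters
# strongly, sugars only weakly.
calories = np.where(labels == 0, -1.0, 1.0) + rng.normal(0, 0.3, 60)
sugars = np.where(labels == 0, -0.2, 0.2) + rng.normal(0, 0.3, 60)

f_cal, _ = f_oneway(calories[labels == 0], calories[labels == 1])
f_sug, _ = f_oneway(sugars[labels == 0], sugars[labels == 1])
# A larger F-ratio means the procedure leans on that variable more
# heavily to form the groups.
```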

Interpret

You (as the analyst) need to decide on the number of clusters and support that decision.
Then, you (as the analyst) should develop the mean values of each unstandardized
variable for each cluster level, and finish by naming each cluster group. Finally, provide
a recommendation.

In the Burger dataset, I arrived at a five-cluster solution because clusters 2 through 4 appear unique enough that a managerial recommendation would exist for each group (see 5 Cluster Means tab). At a four-cluster solution, which is also defensible, too much information would be lost.

A fast-casual outlet could introduce a sandwich with (a) flavored cheeses, sauce, and
three beef hamburger patties (Big and Beefy), (b) cheese, sauce, and bacon with either
fried chicken or a lot of turkey (Sweet and Salty), or (c) unbreaded chicken breast or
turkey with no cheese or sauce (Flat Fiber).
