K-means Clustering

USING RAPIDMINER

Sanchit Kumar | Data Warehousing and Data Mining | April 27, 2016
Problem Statement

Sonia is a program director for a major health insurance provider. Recently she has
been reading medical journals and other articles, and has found a strong emphasis
on the influence of weight, gender and cholesterol on the development of coronary
heart disease. The research she’s read confirms time after time that there is a
connection between these three variables, and while there is little that can be done
about one’s gender, there are certainly life choices that can be made to alter one’s
cholesterol and weight. She begins brainstorming ideas for her company to offer
weight and cholesterol management programs to individuals who receive health
insurance through her employer. As she considers where her efforts might be most
effective, she finds herself wondering if there are natural groups of individuals who
are most at risk for high weight and high cholesterol, and if there are such groups,
where the natural dividing lines between the groups occur.

Algorithm Used
K-MEANS ALGORITHM

Formally, given a data set, D, of n objects, and k, the number of clusters to form,
the k-means algorithm organizes the objects into k partitions, where each
partition represents a cluster. The clusters are formed to optimize an objective
partitioning criterion, such as a dissimilarity function based on distance, so that
the objects within a cluster are “similar” to one another and “dissimilar” to objects
in other clusters in terms of the data set attributes.
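
The definition above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard assign-and-update loop, not the implementation used inside RapidMiner; the function names, the `max_steps` and `seed` parameters, and the toy points are all assumptions made for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_steps=100, seed=0):
    """Partition `points` into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct observations chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_steps):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the partition is stable
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Made-up (weight, cholesterol) pairs for illustration only.
toy = [(110, 150), (115, 160), (120, 155), (210, 240), (220, 250), (215, 245)]
centroids, labels = k_means(toy, k=2)
```

Here the dissimilarity function is squared Euclidean distance, so each pass can only lower the total distance between objects and their cluster means.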

Data Set Used

Using the insurance company’s claims database, Sonia extracts three attributes for
547 randomly selected individuals. The three attributes are the insured’s weight in
pounds as recorded on the person’s most recent medical examination, their last
cholesterol level determined by blood work in their doctor’s lab, and their gender.
As is typical in many data sets, the gender attribute uses 0 to indicate Female and 1
to indicate Male. We will use this sample data from Sonia’s employer’s database to
build a cluster model to help Sonia understand how her company’s clients, the
health insurance policy holders, appear to group together on the basis of their
weights, genders and cholesterol levels.

A data set has been prepared for this example, and is available as
Chapter06DataSet.csv on the book’s (Data Mining for the Masses) companion web
site.
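
Outside RapidMiner, a file of this shape could be read with Python's standard `csv` module. The sketch below assumes the header names `Weight`, `Cholesterol` and `Gender`, which should be checked against the actual file, and it parses three made-up rows instead of the real data.

```python
import csv
import io

# Made-up rows standing in for Chapter06DataSet.csv; the header names are assumed.
sample_csv = """Weight,Cholesterol,Gender
102,111,1
115,135,0
190,204,1
"""

def load_observations(handle):
    """Read rows into (weight, cholesterol, gender) tuples of numbers."""
    reader = csv.DictReader(handle)
    return [
        (float(row["Weight"]), float(row["Cholesterol"]), int(row["Gender"]))
        for row in reader
    ]

observations = load_observations(io.StringIO(sample_csv))
```

To read the real file, the same function could be given `open("Chapter06DataSet.csv", newline="")` in place of the in-memory string.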

Applications of the Algorithm

K-means clustering is very flexible in its ability to group observations together. For
this example, it does not necessarily predict which insurance policy holders will or
will not develop heart disease. It simply takes known indicators from the attributes
in a data set, and groups them together based on those attributes’ similarity to
group averages. Because any attributes that can be quantified can also have means
calculated, k-means clustering provides an effective way of grouping observations
together based on what is typical or normal for that group. It also helps us
understand where one group begins and the other ends, or in other words, where
the natural breaks occur between groups in a data set.
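
The idea of a natural break can be made concrete: once the group averages (centroids) are known, an observation belongs to whichever centroid it is closest to, and the break lies where two centroids are equally far away. The sketch below uses two hypothetical (weight, cholesterol) centroids chosen for illustration; they are not values from this analysis.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])),
    )

# Hypothetical group averages, not taken from the Centroid Table.
centroids = [(120.0, 150.0), (200.0, 230.0)]

lighter = nearest_centroid((150, 180), centroids)  # closer to the first average
heavier = nearest_centroid((170, 200), centroids)  # closer to the second average
```

With these two centroids the dividing line passes through the midpoint (160, 190): observations on one side are grouped with the first average and those on the other side with the second.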

The k-Means operator in RapidMiner allows data miners to set the number of
clusters they wish to generate, to dictate the number of sample means used to
determine the clusters, and to use a number of different algorithms to evaluate
means. While fairly simple in its set-up and definition, k-Means clustering is a
powerful method for finding natural groups of observations in a data set.

Screenshots

Fig 1. Process View

Fig 2. Cluster Model

Fig 3. Centroid Table

Fig 4. Folder View of Cluster 3

Fig 5. Filtered View of the Data Belonging to Cluster 3

Evaluation

Sonia’s major objective in the hypothetical scenario posed at the beginning of this
report was to find natural breaks between different types of heart disease
risk groups. Using the k-Means operator in RapidMiner, we have identified four
clusters for Sonia, and we can now evaluate their usefulness in addressing Sonia’s
question.

We see in the screenshots that cluster 3 has the highest average weight and
cholesterol. With 0 representing Female and 1 representing Male, a gender mean of
0.591 indicates that about 59% of the members of this cluster are men, so men
outnumber women in it.
Knowing that high cholesterol and weight are two key indicators of heart disease
risk that policy holders can do something about, Sonia would likely want to start
with the members of cluster 3 when promoting her new programs. She could then
extend her programming to include the people in clusters 2 and 0, which have the
next incrementally lower means for these two key risk factor attributes.
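
The reading of the gender mean can be checked with a little arithmetic: when an attribute is coded 0 or 1, its mean is simply the share of records coded 1. The counts below are made up to produce a similar figure and are not the actual membership of cluster 3.

```python
# Made-up gender codes (0 = Female, 1 = Male) for a cluster of 22 people.
gender_codes = [1] * 13 + [0] * 9

# The mean of a 0/1 attribute equals the fraction of records coded 1.
share_male = sum(gender_codes) / len(gender_codes)
rounded_share = round(share_male, 3)  # 13 / 22, to three decimal places
```

Any mean above 0.5 therefore means men are the majority in the cluster, and any mean below 0.5 means women are.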

References

1. Book: Data Mining for the Masses - Dr. Matthew A. North
2. Book: Data Mining: Concepts and Techniques - Han, Kamber, Pei
3. Dataset: Data Mining for the Masses - Site
4. Video: K-Means Clustering in RapidMiner - YouTube
