
CLUSTERING

BIG DATA ANALYTICS - SESSION 6

ANNE MUDYA YOLANDA, S.Stat., M.Si.


CLUSTERING

Course Learning Outcome: Able to explain and understand the theory and methods of advanced data analytics (clustering) and to work through its case studies.

Topics:
• Clustering concepts
• Clustering algorithms and methodology
• A clustering case study
• Using software for clustering analysis

CLUSTERING

Clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, unsupervised refers to the problem of finding hidden structures within unlabeled data. Clustering techniques are unsupervised in the sense that the data scientist does not determine, in advance, the labels to apply to the clusters. The structure of the data describes the objects of interest and determines how best to group the objects.



For example, based on customers' personal income, it is straightforward to divide the customers into three groups depending on arbitrarily selected values:

• Earn less than $10,000
• Earn between $10,000 and $99,999
• Earn $100,000 or more



In this case, the income levels were chosen somewhat subjectively based on easy-to-communicate points of delineation. However, such groupings do not indicate a natural affinity of the customers within each group.

In other words, there is no inherent reason to believe that the customer making $90,000 will behave any differently than the customer making $110,000.

As additional dimensions are introduced by adding more variables about the customers, the task of finding meaningful groupings becomes more complex. For instance, suppose variables such as age, years of education, household size, and annual purchase expenditures were considered along with the personal income variable.

What are the naturally occurring groupings of customers? This is the type of question that clustering analysis can help answer.
Clustering techniques are utilized in marketing, economics, and various branches of science.

Clustering is a method often used for exploratory analysis of the data. In clustering, there are no predictions made. Clustering finds the similarities between objects according to the object attributes and groups the similar objects into clusters.
K-means Clustering

Given a collection of objects, each with n measurable attributes, k-means is an analytical technique that, for a chosen value of k, identifies k clusters of objects based on the objects' proximity to the center of the k groups. The center is determined as the arithmetic average (mean) of each cluster's n-dimensional vector of attributes.
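In symbols (notation assumed here, not given in the slides): for a cluster $C_j$ of points $\mathbf{p} \in \mathbb{R}^n$, the center is the componentwise mean

$$\mathbf{q}_j = \frac{1}{|C_j|} \sum_{\mathbf{p} \in C_j} \mathbf{p}$$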



This Figure illustrates
three clusters of objects
with two attributes. Each
object in the dataset is
represented by a small
dot color-coded to the
closest large dot, the
mean of the cluster.

Clustering is often used as a lead-in to
classification. Once the clusters are
identified, labels can be applied to each
cluster to classify each group based on
its characteristics.

Clustering is primarily an exploratory technique to discover hidden structures of the data, possibly as a prelude to more focused analysis or decision processes.

Some specific applications of k-means are image processing, medicine, and customer segmentation.
Image Processing
Video is one example of the growing
volumes of unstructured data being
collected.
Within each frame of a video, k-means
analysis can be used to identify objects in
the video.
For each frame, the task is to determine
which pixels are most similar to each other.
The attributes of each pixel can include
brightness, color, and location, the x and y
coordinates in the frame.
With security video images, for example,
successive frames are examined to identify
any changes to the clusters.
These newly identified clusters may indicate unauthorized access to a facility.
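A sketch of what this could look like in R (the pixels data frame and its columns are hypothetical, not from the slides):

# Hypothetical frame data: one row per pixel, with location and color
# pixels <- data.frame(x = ..., y = ..., r = ..., g = ..., b = ...)
px <- scale(pixels[, c("x", "y", "r", "g", "b")])  # put attributes on one scale
seg <- kmeans(px, centers = 5, nstart = 10)        # 5 pixel clusters per frame
table(seg$cluster)                                 # size of each cluster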



Medical
Patient attributes such as age, height,
weight, systolic and diastolic blood
pressures, cholesterol level, and other
attributes can identify naturally
occurring clusters.
These clusters could be used to target
individuals for specific preventive
measures or clinical trial participation.
Clustering, in general, is useful in
biology for the classification of plants
and animals as well as in the field of
human genetics.



Customer Segmentation

Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns.
For example, a wireless provider may
look at the following customer attributes:
monthly bill, number of text messages,
data volume consumed, minutes used
during various daily periods, and years
as a customer.
The wireless company could then look at
the naturally occurring clusters and
consider tactics to increase sales or
reduce the customer churn rate, the
proportion of customers who end their
relationship with a particular company.
The k-means algorithm to find k clusters can be described in the following four steps:

1. Choose the value of k and the k initial guesses for the centroids. In this example, k = 3, and the initial centroids are indicated by the points shaded in red, green, and blue.

2. Compute the distance from each data point (x_i, y_i) to each centroid, and assign each point to the closest centroid. This association defines the first k clusters. In two dimensions, the distance d between any two points (x_1, y_1) and (x_2, y_2) in the Cartesian plane is typically expressed using the Euclidean distance measure:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

3. Compute the centroid, the center of mass, of each newly defined cluster from Step 2. The computed centroids in Step 3 are the lightly shaded points of the corresponding color.

4. Repeat Steps 2 and 3 until the algorithm converges to an answer: assign each point to the closest centroid computed in Step 3, compute the centroid of the newly defined clusters, and repeat until the algorithm reaches the final answer.
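For illustration, here is a minimal base-R sketch of these four steps (the function simple_kmeans and the input matrix pts are assumptions; in practice R's built-in kmeans() is used, as the later slides show):

# Minimal k-means sketch following the four steps above
simple_kmeans <- function(pts, k, max_iter = 100) {
  # Step 1: choose k initial centroids (here, k random data points)
  centroids <- pts[sample(nrow(pts), k), , drop = FALSE]
  membership <- rep(0L, nrow(pts))
  for (iter in seq_len(max_iter)) {
    # Step 2: Euclidean distance from every point to every centroid
    d <- as.matrix(dist(rbind(centroids, pts)))[-(1:k), 1:k, drop = FALSE]
    new_membership <- apply(d, 1, which.min)
    # Step 4: converged once cluster membership stops changing
    if (all(new_membership == membership)) break
    membership <- new_membership
    # Step 3: recompute each centroid as the mean of its cluster
    for (j in 1:k) {
      centroids[j, ] <- colMeans(pts[membership == j, , drop = FALSE])
    }
  }
  list(cluster = membership, centers = centroids)
}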
Convergence
Convergence is reached when the
computed centroids do not
change or the centroids and the
assigned points oscillate back and
forth from one iteration to the
next. The latter case can occur
when there are one or more points
that are equal distances from the
computed centroid.
A SIMPLE ILLUSTRATION
Determining the Number of Clusters

The value of k can be chosen based on a reasonable guess or some predefined requirement. However, even then, it would be good to know how much better or worse having k clusters versus k - 1 or k + 1 clusters would be in explaining the structure of the data. Next, a heuristic using the Within Sum of Squares (WSS) metric is examined to determine a reasonably optimal value of k.
WSS is the sum of the squares of the distances between each data point and the closest centroid:

$$\mathrm{WSS} = \sum_{i=1}^{M} \left\lVert p_i - q^{(i)} \right\rVert^2$$

where the term $q^{(i)}$ indicates the closest centroid that is associated with the ith point. If the points are relatively close to their respective centroids, the WSS is relatively small. Thus, if k + 1 clusters do not greatly reduce the value of WSS from the case with only k clusters, there may be little benefit to adding another cluster.
Using R to Perform a K-means Analysis

The task is to group 620 high school seniors based on their grades in three subject areas: English, mathematics, and science. The grades are averaged over their high school career and assume values from 0 to 100.

The following R code establishes the necessary R libraries and imports the CSV file containing the grades.

library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(graphics)
library(grid)
library(gridExtra)

# Import the student grades
grade_input <- as.data.frame(read.csv("c:/data/grades_km_input.csv"))
The following R code formats the grades for processing. The data file contains four
columns. The first column holds a student identification (ID) number, and the other
three columns are for the grades in the three subject areas. Because the student ID is
not used in the clustering analysis, it is excluded from the k-means input matrix,
kmdata.

kmdata_orig <- as.matrix(grade_input[, c("Student", "English", "Math", "Science")])
kmdata <- kmdata_orig[, 2:4]   # drop the Student ID column
kmdata[1:10, ]                 # inspect the first 10 rows
To determine an appropriate value for k, the k-means algorithm is
used to identify clusters for k = 1, 2, …, 15. For each value of k, the
WSS is calculated. If an additional cluster provides a better
partitioning of the data points, the WSS should be markedly smaller
than without the additional cluster.

The following R code loops through several k-means analyses for the
number of centroids, k, varying from 1 to 15. For each k, the option
nstart=25 specifies that the k-means algorithm will be repeated 25
times, each starting with k random initial centroids. The
corresponding value of WSS for each k-means analysis is stored in the wss vector.
wss <- numeric(15)
for (k in 1:15) {
  wss[k] <- sum(kmeans(kmdata, centers = k, nstart = 25)$withinss)
}
Using the basic R plot function, each WSS is plotted against the
respective number of centroids, 1 through 15.
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within Sum of Squares")
As can be seen, the WSS is greatly reduced when k increases from one to
two. Another substantial reduction in WSS occurs at k = 3. However, the
improvement in WSS is fairly linear for k > 3. Therefore, the k-means analysis
will be conducted for k = 3. The process of identifying the appropriate value
of k is referred to as finding the “elbow” of the WSS curve.
Number of Clusters

There are two alternatives: the choice of the number of clusters can be based on the data or based on the researcher/data analyst. The preferable approach is to "let the data speak," that is, to base the choice on statistics.
Number of clusters based on the data analyst

A retailer wants to identify several shopping profiles in order to activate new and targeted retail outlets. The budget only allows him to open three types of outlets. A partition into three clusters follows naturally, although it is not necessarily the optimal one.
Number of clusters based on the data

Clustering of shopping profiles is expected to detect a new market niche. For market segmentation purposes, it is less advisable to constrain the analysis to a fixed number of clusters. Statistical tests are not always unequivocal, leaving some room for the researcher's experience and discretion; statistical rigidity should be balanced with the knowledge gained from, and the interpretability of, the final classification.
Determining the Optimal Number of Clusters from Hierarchical Methods

Graphical:
• Scree diagram: look for the largest jump in the merging distance.

Statistical:
• Within sum of squares
• Silhouette coefficient
• Pseudo F

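The slides do not show the call that creates the variable km; presumably it is along these lines (a sketch reusing kmdata and the k = 3 choice above):

km <- kmeans(kmdata, centers = 3, nstart = 25)   # fit the final model with k = 3
km                                               # print the fitted object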


The displayed contents of the variable km include the following:
• The location of the cluster means
• A clustering vector that defines the membership of each student in a corresponding cluster: 1, 2, or 3
• The WSS of each cluster
• A list of all the available k-means components
The large circles represent the
location of the cluster means
provided earlier in the display of
the km contents. The small dots
represent the students
corresponding to the
appropriate cluster by assigned
color: red, blue, or green. In
general, the plots indicate the
three clusters of students: the
top academic students (red),
the academically challenged
students (green), and the other
students (blue) who fall
somewhere between those two
groups. The plots also highlight
which students may excel in one
or two subject areas but
struggle in other areas
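The plotting code itself is not shown in the slides; a minimal sketch of such a view (assuming the km object above; the cluster-to-color mapping is arbitrary) is:

# Scatterplot matrix of the three subjects, color-coded by assigned cluster
pairs(kmdata, col = c("red", "green", "blue")[km$cluster])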
Assigning labels to the identified clusters is
useful to communicate the results of an
analysis. In a marketing context, it is common to
label a group of customers as frequent
shoppers or big spenders. Such designations
are especially useful when communicating the
clustering results to business users or
executives. It is better to describe the
marketing plan for big spenders rather than
Cluster #1.



The heuristic using WSS can provide at least several possible k values to consider. When the number of attributes is relatively small, a common approach to further refine the choice of k is to plot the data to determine how distinct the identified clusters are from each other. In general, the following questions should be considered:

• Are the clusters well separated from each other?
• Do any of the clusters have only a few points?
• Do any of the centroids appear to be too close to each other?
In the first case, ideally the plot would look like the one shown in
Figure 4.7, when n = 2. The clusters are well defined, with
considerable space between the four identified clusters. However,
in other cases, such as Figure 4.8, the clusters may be close to each
other, and the distinction may not be so obvious.
In such cases, it is important to apply some judgment on whether
anything different will result by using more clusters. For example,
Figure 4.9 uses six clusters to describe the same dataset as used in
Figure 4.8. If using more clusters does not better distinguish the
groups, it is almost certainly better to go with fewer clusters.
Reasons to Choose and Cautions

K-means is a simple and straightforward method for defining clusters. Once clusters and their associated centroids are identified, it is easy to assign new objects (for example, new customers) to a cluster based on the object's distance from the closest centroid. Because the method is unsupervised, using k-means helps to eliminate subjectivity from the analysis.

Although k-means is considered an unsupervised method, there are still several decisions that the practitioner must make:

• What object attributes should be included in the analysis?
• What unit of measure (for example, miles or kilometers) should be used for each attribute?
• Do the attributes need to be rescaled so that one attribute does not have a disproportionate effect on the results?
• What other considerations might apply?
Object Attributes
Regarding which object attributes (for example, age and income) to use in the
analysis, it is important to understand what attributes will be known at the time a new
object will be assigned to a cluster. For example, information on existing customers’
satisfaction or purchase frequency may be available, but such information may not be
available for potential customers.
The data scientist may have a choice of a dozen or more attributes to use in the clustering analysis. Whenever possible and supported by the data, it is best to reduce the number of attributes. Too many attributes can minimize the
impact of the most important variables. Also, the use of several similar attributes can
place too much importance on one type of attribute. For example, if five attributes
related to personal wealth are included in clustering analysis, the wealth attributes
dominate the analysis and possibly mask the importance of other attributes, such as
age.

When dealing with the problem of too many attributes, one useful approach is to
identify any highly correlated attributes and use only one or two of the correlated
attributes in the clustering analysis. A scatterplot matrix is a useful tool to visualize the
pair-wise relationships between the attributes.
Scatterplot matrix for seven attributes
The strongest relationship is observed to be
between Attribute3 and Attribute7. If the value of
one of these two attributes is known, it appears
that the value of the other attribute is known with
near certainty. Other linear relationships are also
identified in the plot. For example, consider the
plot of Attribute2 against Attribute3. If the value of
Attribute2 is known, there is still a wide range of
possible values for Attribute3. Thus, greater
consideration must be given prior to dropping one
of these attributes from the clustering analysis.
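As a sketch (assuming a data frame attrs with columns Attribute1 through Attribute7, which is not given in the slides), such a matrix and the underlying correlations can be produced with:

# Pairwise scatterplots to spot strongly related attributes
pairs(attrs)
# Numeric companion: the correlation matrix flags highly correlated pairs
round(cor(attrs), 2)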

Another option to reduce the number of attributes is to combine several attributes into one measure. For example, instead of using two attribute variables, one for Debt and one for Assets, a Debt to Asset ratio could be used. This option also addresses the problem when the magnitude of an attribute is not of real interest, but the relative magnitude is a more important measure.
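In R, such a combination is a one-liner (a sketch; the attrs data frame with Debt and Assets columns is assumed):

# Replace two correlated wealth attributes with one relative measure
attrs$DebtToAssets <- attrs$Debt / attrs$Assets
attrs <- subset(attrs, select = -c(Debt, Assets))  # drop the originals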
Units of Measure

From a computational perspective, the k-means algorithm is somewhat indifferent to the units of measure for a given attribute (for example, meters or centimeters for a patient's height). However, the algorithm will identify different clusters depending on the choice of the units of measure. For example, suppose that k-means is used to cluster patients based on age in years and height in centimeters. If the height is then rescaled from centimeters to meters by dividing by 100, the resulting clusters would be slightly different.
Rescaling
Attributes that are expressed in dollars are common in clustering analyses and can differ in magnitude
from the other attributes. For example, if personal income is expressed in dollars and age is expressed in
years, the income attribute, often exceeding $10,000, can easily dominate the distance calculation with
ages typically less than 100 years. Although some adjustments could be made by expressing the income
in thousands of dollars (for example, 10 for $10,000), a more straightforward method is to divide each
attribute by the attribute’s standard deviation. The resulting attributes will each have a standard
deviation equal to 1 and will be without units. Returning to the age and height example, the standard
deviations are 23.1 years and 36.4 cm, respectively.
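A sketch of this rescaling in R (a numeric data frame patients with age and height columns is assumed):

# Divide each attribute by its standard deviation (unit-free attributes)
patients_std <- scale(patients, center = FALSE, scale = apply(patients, 2, sd))
apply(patients_std, 2, sd)  # each column now has standard deviation 1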
With the rescaled attributes for age and height, the borders of the resulting
clusters now fall somewhere between the two earlier clustering analyses. Such an
occurrence is not surprising based on the magnitudes of the attributes of the
previous clustering attempts. Some practitioners also subtract the means of the
attributes to center the attributes around zero. However, this step is unnecessary
because the distance formula is only sensitive to the scale of the attribute, not its
location.

In many statistical analyses, it is common to transform typically skewed data, such


as income, with long tails by taking the logarithm of the data. Such transformation
can also be applied in k-means, but the Data Scientist needs to be aware of what
effect this transformation will have
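For example (a sketch; the income column is assumed):

# Compress the long right tail of income before clustering;
# log1p(x) = log(1 + x) also handles zero incomes
attrs$log_income <- log1p(attrs$income)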
Additional Considerations
The k-means algorithm is sensitive to the starting positions of the initial centroids.
Thus, it is important to rerun the k-means analysis several times for a particular
value of k to ensure the cluster results provide the overall minimum WSS. As seen
earlier, this task is accomplished in R by using the nstart option in the kmeans()
function call.

Distance and validation measures:
• Euclidean distance
• Cosine similarity
• Manhattan distance
• Silhouette coefficient
K-means: R

# Working directory and illustration data (paths as in the original slides)
setwd("D:/ANNE/Kuliah S2 --- Pemodelan Klasifikasi/data")
data <- read.csv("ilustrasikm.csv")

cluster <- kmeans(data, 3)                         # k-means with k = 3
plot(data[, 1], data[, 2], col = cluster$cluster)  # points colored by cluster
points(cluster$centers, pch = 9)                   # overlay the centroids
Silhouette Coefficient
# Read the illustration data (semicolon-separated)
ilustrasi <- read.csv("D:/anne/ClusterAnalysis/ilustrasi2a.csv", header = TRUE, sep = ";")
head(ilustrasi)

hasilgerombol <- kmeans(ilustrasi, centers = 3, iter.max = 10)
hasilgerombol$cluster        # cluster membership of each observation
hasilgerombol$tot.withinss   # total within-cluster sum of squares

# Plot WSS against the number of clusters to look for an "elbow"
wssplot <- function(data, nc = 15, seed = 1234) {
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))   # WSS for k = 1
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
wssplot(ilustrasi, nc = 10)
library("cluster")
jarak <- as.matrix(dist(ilustrasi))
hasilgerombol <- kmeans(ilustrasi, centers=3, iter.max =10)
sil.3 <- mean(silhouette(hasilgerombol$cluster,dmatrix=jarak)[,3])

hasilgerombol <- kmeans(ilustrasi, centers=4, iter.max =10)


sil.4 <- mean(silhouette(hasilgerombol$cluster,dmatrix=jarak)[,3])
c(sil.3, sil.4)
Additional Algorithms

k-means does not handle categorical data.

k-modes is a commonly used method for clustering categorical data, based on the number of differences in the respective components of the attributes.

Partitioning Around Medoids (PAM): the medoids are the objects in each cluster that minimize the sum of the distances from the medoid to the other objects in the cluster.
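A sketch of PAM applied to the earlier illustration data (pam() is provided by the cluster package loaded above):

# Partitioning Around Medoids with k = 3
pam_fit <- pam(ilustrasi, k = 3)
pam_fit$medoids            # the representative object of each cluster
table(pam_fit$clustering)  # cluster sizes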
SUMMARY
References

EMC Education Services (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Wiley.


Let's work together

Ways to reach out:
Email: [email protected]
Website: http://stat.fmipa.unri.ac.id/
