Clustering: Big Data Analysis - Meeting 6
YOLANDAMUDYANNEE
CLUSTERING
Clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, unsupervised refers to the problem of finding hidden structures within unlabeled data.
Clustering techniques are unsupervised in the sense that the
data scientist does not determine, in advance, the labels to
apply to the clusters.
The structure of the data describes the objects of interest and
determines how best to group the objects.
Clustering is often used as a lead-in to
classification. Once the clusters are
identified, labels can be applied to each
cluster to classify each group based on
its characteristics.
The following R code loads the necessary R libraries and imports the CSV file containing the grades.

library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(graphics)
library(grid)
library(gridExtra)

# import the student grades
grade_input = as.data.frame(read.csv("c:/data/grades_km_input.csv"))
The following R code formats the grades for processing. The data file contains four
columns. The first column holds a student identification (ID) number, and the other
three columns are for the grades in the three subject areas. Because the student ID is
not used in the clustering analysis, it is excluded from the k-means input matrix,
kmdata.
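The formatting step described above can be sketched as follows. The real grades_km_input.csv is not available here, so a synthetic stand-in is generated; the grade column names (English, Math, Science) are assumptions for illustration.

```r
# synthetic stand-in for grades_km_input.csv (assumed column names)
set.seed(42)
grade_input <- data.frame(Student = 1:120,
                          English = round(rnorm(120, 75, 10)),
                          Math    = round(rnorm(120, 70, 12)),
                          Science = round(rnorm(120, 72, 11)))

# exclude the student ID column; k-means should see only the three grades
kmdata <- as.matrix(grade_input[, c("English", "Math", "Science")])
```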
The following R code loops through several k-means analyses for the
number of centroids, k, varying from 1 to 15. For each k, the option
nstart=25 specifies that the k-means algorithm will be repeated 25
times, each starting with k random initial centroids. The
corresponding value of WSS for each k-means analysis is stored in the wss vector.
wss <- numeric(15)
for (k in 1:15)
  wss[k] <- sum(kmeans(kmdata, centers=k, nstart=25)$withinss)
Using the basic R plot function, each WSS is plotted against the
respective number of centroids, 1 through 15.
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within Sum of Squares")
As can be seen, the WSS is greatly reduced when k increases from one to
two. Another substantial reduction in WSS occurs at k = 3. However, the
improvement in WSS is fairly linear for k > 3. Therefore, the k-means analysis
will be conducted for k = 3. The process of identifying the appropriate value
of k is referred to as finding the “elbow” of the WSS curve.
There are two alternatives for determining the number of clusters k: based on the data (for example, the elbow of the WSS curve) or based on the judgment of the researcher/data analyst.
When dealing with the problem of too many attributes, one useful approach is to
identify any highly correlated attributes and use only one or two of the correlated
attributes in the clustering analysis. A scatterplot matrix is a useful tool to visualize the
pair-wise relationships between the attributes.
Scatterplot matrix for seven attributes
The strongest relationship is observed to be
between Attribute3 and Attribute7. If the value of
one of these two attributes is known, it appears
that the value of the other attribute is known with
near certainty. Other linear relationships are also
identified in the plot. For example, consider the
plot of Attribute2 against Attribute3. If the value of
Attribute2 is known, there is still a wide range of
possible values for Attribute3. Thus, greater
consideration must be given prior to dropping one
of these attributes from the clustering analysis.
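The scatterplot-matrix check described above can be sketched with base R's pairs() function. The seven-attribute data set is not provided, so three synthetic attributes reproduce the two patterns discussed: Attribute7 almost fully determined by Attribute3, and Attribute2 only loosely related to Attribute3.

```r
# synthetic attributes mimicking the relationships discussed in the text
set.seed(1)
Attribute3 <- rnorm(200)
Attribute7 <- Attribute3 + rnorm(200, sd = 0.05)    # near-certain linear relation
Attribute2 <- 0.5 * Attribute3 + rnorm(200, sd = 1) # wide scatter remains
attrs <- data.frame(Attribute2, Attribute3, Attribute7)

pairs(attrs)  # scatterplot matrix of the pair-wise relationships
cor(attrs)    # correlation of Attribute3 with Attribute7 is close to 1
```

With a correlation this close to 1, one of Attribute3/Attribute7 could be dropped, while Attribute2 and Attribute3 should both be retained.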
Euclidean distance
Cosine similarity
Manhattan distance
Silhouette Coefficient
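The distance and similarity measures listed above can be worked out in base R for two illustrative points a and b (the Silhouette Coefficient, which builds on such distances, is covered in the next section):

```r
# two illustrative points
a <- c(1, 0)
b <- c(3, 4)

d_euc  <- sqrt(sum((a - b)^2))  # Euclidean distance: sqrt(4 + 16) = sqrt(20)
d_man  <- sum(abs(a - b))       # Manhattan distance: 2 + 4 = 6
cos_ab <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))  # cosine: 3/5 = 0.6
```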
k-means: R

setwd("D:/ANNE/Kuliah S2 --- Pemodelan Klasifikasi/data")
data <- read.csv("ilustrasikm.csv")
cluster <- kmeans(data, 3)
plot(data[,1], data[,2], col=cluster$cluster)
points(cluster$centers, pch=9)
Silhouette Coefficient
ilustrasi <- read.csv("D:/anne/ClusterAnalysis/ilustrasi2a.csv", header=T, sep=";")
head(ilustrasi)
hasilgerombol <- kmeans(ilustrasi, centers=3, iter.max=10)
hasilgerombol$cluster
hasilgerombol$tot.withinss

wssplot <- function(data, nc=15, seed=1234){
  # WSS for k = 1 is the total sum of squares
  wss <- (nrow(data)-1)*sum(apply(data, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}
wssplot(ilustrasi, nc=10)
Silhouette Coefficient
library("cluster")
# distance matrix between all observations
jarak <- as.matrix(dist(ilustrasi))
hasilgerombol <- kmeans(ilustrasi, centers=3, iter.max=10)
# average silhouette width for the k = 3 solution (column 3 is sil_width)
sil.3 <- mean(silhouette(hasilgerombol$cluster, dmatrix=jarak)[,3])
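The same computation can be repeated across several candidate values of k, choosing the one with the largest average silhouette width. As a sketch (ilustrasi2a.csv is not available here, so two well-separated synthetic clusters stand in, for which k = 2 should score highest):

```r
library(cluster)

# synthetic stand-in: two well-separated clusters of 50 points each
set.seed(1234)
dat <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
jarak <- dist(dat)

# average silhouette width for k = 2..5
sil <- sapply(2:5, function(k) {
  km <- kmeans(dat, centers = k, nstart = 25)
  mean(silhouette(km$cluster, jarak)[, 3])
})
names(sil) <- 2:5
round(sil, 3)  # the best k has the largest average silhouette width
```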
Ways to reach out
Website: http://stat.fmipa.unri.ac.id/