Abstract— Information is one of the most important things in our lives, yet humans are naturally impatient when searching for information on the Internet: users want the right answer instantaneously, with minimal effort. News headlines can be used to categorize news types appropriately, and an appropriate news type makes it easier to choose a particular topic. Similarity between titles can be used to cluster news based on the news title. For that reason, the dataset of this research contains titles from online news sites. TF-IDF is used as the document preprocessing method, K-Means as the clustering method, and the Elbow method to optimize the number of clusters. The purity method is applied as an internal evaluation of the news title clustering. The SSE (Sum of Squared Errors) for each number of clusters is calculated and compared to optimize the number of clusters in the Elbow method, and the result of that comparison is evaluated using the internal evaluation called purity, where the purity value measures the conformity between a cluster and the ideal cluster. From the calculation of the Elbow method, the most optimal number of clusters is 8: there is a gap of 0.228 between the SSE values of 7 and 8 clusters, so the elbow shape is formed. The purity evaluation generates a value of 0.514 at 8 clusters, the highest value and the one closest to one among all tested numbers of clusters, which means it is the most ideal. The conclusion is that the Elbow method can be used to optimize the number of clusters in the K-Means clustering method.

Keywords—Clustering; K-Means; Elbow Method; Purity; TF-IDF
I. INTRODUCTION

Thousands and even millions of news items are generated by various news sites every day. Political, economic, socio-cultural, and entertainment news is continuously published through the Internet, while humans are naturally impatient when searching for information on the Internet; users want to get the right answer instantaneously with minimum effort [1].

In many applications, the title is the first thing users notice. Although it is only a word or a phrase, it can dramatically change the overall message of the content. A proper title is very important; therefore it should be descriptive, concise, and grammatically correct [2].

News headlines can be used to categorize news types, and an appropriate news type makes it easier for users to choose the particular topic they want. Grouping news by title can be done by looking at similarities between titles [3].

The three problems in the clustering process are: (i) determining the measure of similarity between different elements, (ii) applying efficient algorithms to find the groups of elements that are most similar in an unsupervised way, and (iii) obtaining descriptions that can characterize the elements of a cluster [4].

K-means is a reliable algorithm for the clustering process. The headline data are clustered so that appropriate classes are obtained. The problem that often arises is the determination of the number of clusters: the right number of news clusters will show the maximum resemblance within each generated class. The Elbow method can be a principled method to determine the exact value for the number of clusters, and the purity value is used to evaluate the result of the method, with the aim of producing the expected number of clusters.
II. RELATED WORKS

K-Means is a popular and simple clustering technique, but its results depend on the chosen cluster centers, so it can easily get stuck in a local optimum: since the cluster centers are randomly selected, they may be poorly chosen. A study by Aditi Anand Shetkar et al. proposed the K-means++ algorithm to solve this problem by spreading the cluster centers evenly. In K-means++, the first cluster center is selected randomly, and each subsequent center is chosen based on a computed probability. K-means++ gives better results than K-means [5].

Ahmad Izzuddin et al. used Principal Component Analysis (PCA) to reduce lecturer performance data before the k-means algorithm was applied, which proved effective in improving model quality. The study used the Davies-Bouldin Index (DBI) to measure cluster validity; the result showed that the DBI of the combined PCA and k-means approach is smaller than that of conventional k-means [6].

Ni Putu Eka Merliana et al. used the Elbow method to determine the best number of clusters in the k-means algorithm. First, the SSE value of each candidate number of clusters is calculated. Second, when the SSE value has dropped drastically and subsequently shows no significant change, the current point is taken as the elbow; that point is the optimal k value [7].

The cluster validity of the k-means and fuzzy c-means algorithms has been measured using purity and entropy. Entropy uses external class information, measured with reference to the class labels: the lower the entropy, the better the clustering, and entropy grows when the true classes of the objects within a cluster are more diverse, so a higher entropy means a worse cluster; the amount of disorder is found using entropy [8]. Satya Chaitanya Sripada et al. mentioned that a higher purity value indicates a good cluster, whereas entropy is an inverted measurement: the lower the entropy value, the better the clustering result.

Budi Santoso et al. used a Genetic Algorithm (GA) to optimize the initial cluster centers of the k-means algorithm. The k-means algorithm was applied to lecturer performance data from the Faculty of Computer Science of Brawijaya University in 2016. The result showed that the GA-K-means algorithm achieved a cluster quality 2.74% higher than the k-means algorithm without the Genetic Algorithm, where the cluster quality was measured using the Silhouette Coefficient method [9].

Kamaljit Kaur et al. compared the K-means method with Median-based K-means on datasets from the UCI machine learning repository, in which the Median-based k-means was used to select the initial centroids. The result showed that determining the initial centroids first is better than random selection [10].
III. METHOD

A. Proposed System
Optimizing the number of clusters of the K-Means method is the main purpose of this study. The Elbow method is chosen to determine the number of clusters, while purity is used to evaluate the result.
Figure 1. The proposed system: news titles are preprocessed (tokenization, stop word removal, term weighting, principal component analysis); k-means is run and the SSE is calculated for increasing numbers of clusters until the elbow is formed; the Elbow method determines the number of clusters and the result is evaluated using purity.

The dataset contains news titles collected from online news sites. The data are preprocessed in four steps: first tokenization, then stop word removal and TF-IDF term weighting, and finally dimensionality reduction using principal component analysis.

The K-means clustering method is applied with the number of clusters ranging from 2 to 10, while the SSE (Sum of Squared Errors) is calculated and recorded for each run. The SSE values are plotted on a graph to determine the corner of the elbow. Those results are then evaluated using purity, and each number of clusters generates its own purity value.

B. K-means
K-means is a simple unsupervised learning algorithm used to group data based on the Euclidean distance between data points [11]. K-means is a fast and simple clustering method with a small number of iterations. The algorithm divides the data into k sections, where the number of clusters is chosen by the user. The computer randomly selects objects and assigns them to one of the k clusters; the distance between each object and the center of each cluster is then calculated, resulting in an optimal cluster solution in which objects within a particular cluster are adjacent to each other [12].
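To make this step concrete, the following minimal sketch clusters a handful of invented titles with scikit-learn's KMeans on TF-IDF vectors; the titles, the choice of scikit-learn, and n_clusters=2 are illustrative assumptions, not taken from the paper's dataset.

```python
# A minimal sketch of the k-means step, assuming the titles have
# already been vectorized (here with TF-IDF). The titles are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

titles = [
    "election results announced today",
    "stock market falls sharply",
    "new football season kicks off",
    "parliament debates new budget",
]

X = TfidfVectorizer().fit_transform(titles)            # document-term matrix
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

print(km.labels_)   # cluster assignment of each title
print(km.inertia_)  # SSE: sum of squared distances to the closest center
```

The km.inertia_ attribute is the SSE quantity recorded later for the elbow analysis.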
C. Preprocessing
1) Tokenize
Tokenization is the process of dividing a sentence into words by omitting commas, spaces, and special symbols in the sentence. This process generates tokens [5].

2) Stop word removal
Prepositions, articles, pronouns, etc. are the most common words in text documents and do not give any meaning to the document. These words are eliminated, as they are not required for text mining applications [13].

3) Weighting
A weighting step is done by calculating the frequency of terms in the documents; one of the methods used is TF-IDF. After the preprocessing of the documents, each document in the dataset is represented as an N-dimensional vector in the term space, where N denotes the number of words/terms. The document vectors are subjected to a standard weighting scheme such as Term Frequency-Inverse Document Frequency (TF-IDF). We need to calculate the Term Frequency (TF), the Inverse Document Frequency (IDF), and finally TF * IDF, i.e. the product of TF and IDF:

TF-IDF = TF * IDF
TF = K / T, where K = the number of occurrences of a particular word in document d and T = the number of words in document d.
IDF = D / DF, where D = the number of documents in the dataset and DF = the number of documents containing the particular word [5].
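As an illustration, the sketch below implements the three preprocessing steps exactly as defined above, including the paper's non-logarithmic IDF = D/DF (the more common TF-IDF variant uses log(D/DF)); the stop word list and sample sentences are toy assumptions.

```python
# A sketch of the preprocessing chain: tokenization, stop word removal,
# then TF = K/T and IDF = D/DF as defined in the paper (no logarithm).
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "it"}  # toy list

def tokenize(sentence):
    # split on non-letter characters, dropping commas, spaces, symbols
    return [t.lower() for t in re.findall(r"[A-Za-z]+", sentence)]

def preprocess(sentence):
    return [t for t in tokenize(sentence) if t not in STOP_WORDS]

def tf_idf(raw_docs):
    docs = [preprocess(d) for d in raw_docs]
    D = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency per word
    weights = []
    for d in docs:
        counts = Counter(d)                         # K: occurrences per word
        T = len(d)                                  # T: words in this document
        weights.append({w: (k / T) * (D / df[w]) for w, k in counts.items()})
    return weights

print(tf_idf(["The election results announced", "Election of the new budget"]))
```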
D. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a technique for constructing new variables that are linear combinations of the original variables. The maximum number of these new variables equals the number of the old variables, and the new variables are uncorrelated with each other [6]. The algorithm is used to reduce data consisting of several ratios into a few indexes, each of which is a linear combination of all the initial ratios [14].
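A minimal sketch of this reduction step follows, using scikit-learn's PCA on a random stand-in matrix, since the paper's actual weighted vectors are not available; the 1000 x 2000 input shape is an assumption, while the 500-component target mirrors the reduction reported in the results section.

```python
# A sketch of the PCA reduction of the weighted document vectors.
# The input matrix is a random stand-in for 1000 weighted title vectors.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((1000, 2000))                 # stand-in TF-IDF-weighted data

pca = PCA(n_components=500)                  # 500 components, as in the paper
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (1000, 500)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```

With a sparse TF-IDF matrix, scikit-learn's TruncatedSVD is the usual drop-in substitute, since dense PCA requires centering the data.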
E. Elbow method
This method looks at the percentage of variance explained as a function of the number of clusters. It is based on the idea that there is an optimal number of clusters for the k-means algorithm, beyond which adding another cluster does not contribute significantly [15]. The value of k is increased one by one, and the Sum of Squared Errors (SSE) is recorded for each value.
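The procedure can be sketched as follows: run k-means for each candidate k, record the SSE (inertia in scikit-learn), and look for the point where the drop between consecutive k values flattens out. The random stand-in data is an assumption; the k range of 2 to 10 is taken from the paper's experiment.

```python
# A sketch of the elbow procedure: record SSE for k = 2..10 and
# inspect where the decrease in SSE stops being significant.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((1000, 50))            # stand-in for the reduced title vectors

sse = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    sse[k] = km.inertia_              # SSE for this number of clusters

# the elbow is where the SSE drop between k and k+1 flattens out
drops = {k: sse[k] - sse[k + 1] for k in range(2, 10)}
print(sse)
print(drops)
```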
IV. RESULT

B. Stop word removal
The next process is stop word removal, which removes prepositions, articles, pronouns, etc.

C. Term weighting
The next step is term weighting, applied to the data, which is then optimized using the PCA technique so that the result is 500 data dimensions.

From the above processes, the data are processed using the k-means algorithm, starting from n = 2 up to n = 10. In this process, the SSE value of each run is calculated and recorded.

The next step is to compare the optimal cluster number obtained with the Elbow method against the purity calculation for each value of n. Purity is the sum, over all clusters, of the maximum number of members of a single class in each cluster, divided by the total amount of data, using the class labels as categories:

purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|

where ω_k is the set of members of cluster k, c_j is the set of members of class j, and N is the total number of data points. In this study, the data processed amounted to 1000 titles with 4 labels, namely a, b, c, and d, so the purity value is the sum of the maximum value of each cluster divided by 1000. The purity values for n = 2 to n = 10 are shown in Table XI below.
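A small sketch of this purity computation is given below; the helper name and the toy labels are assumptions. Applied to the counts in Table II below, it reproduces the reported value: (66 + 250) / 1000 = 0.316.

```python
# A sketch of the purity evaluation: for each cluster, take the largest
# single-class count, sum these maxima, and divide by N.
from collections import Counter

def purity(labels_true, labels_pred):
    # group the true class labels by predicted cluster
    clusters = {}
    for true, pred in zip(labels_true, labels_pred):
        clusters.setdefault(pred, []).append(true)
    # majority-class count per cluster, summed, divided by N
    return sum(max(Counter(m).values()) for m in clusters.values()) / len(labels_true)

# toy check: 4 points, 2 clusters; cluster 0 is pure, cluster 1 is mixed
print(purity(["a", "a", "b", "c"], [0, 0, 1, 1]))  # (2 + 1) / 4 = 0.75
```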
TABLE II. DATA CLUSTER N=2
Category    a    b    c    d
Cluster 1   0    0    66   0
Cluster 2   250  250  184  250

With the number of clusters equal to 2, most of the data from every category is placed into cluster 2; only 66 titles from category c become members of cluster 1. The maximum value in cluster 1 is 66 and in cluster 2 it is 250, which yields a purity value of (66 + 250) / 1000 = 0.316.

Furthermore, the values n = 3 to n = 7 produce purity values ranging from 0.409 to 0.438, still lower than the purity value at n = 8, which is 0.514. The successive clustering results for n = 3 to n = 7 are shown in the following tables.

TABLE III. DATA CLUSTER N=3
Category    a    b    c    d

TABLE VI. DATA CLUSTER N=6
Category    a    b    c    d
Cluster 1   0    30   0    0
Cluster 2   0    0    50   0
Cluster 3   248  140  172  247
Cluster 4   0    0    28   0
Cluster 5   0    54   0    0
Cluster 6   2    26   0    3

TABLE VII. DATA CLUSTER N=7
Category    a    b    c    d
Cluster 1   0    30   0    0
Cluster 2   248  140  170  247
Cluster 3   0    0    40   0
Cluster 4   0    0    22   0
Cluster 5   0    0    18   0
Cluster 6   0    54   0    0
Cluster 7   2    26   0    3

The clustering result for n = 8, shown in Table VIII, produces a purity value of 0.514, which is the vertex of the elbow; based on the Elbow method, the value at this point should be the highest. This is supported by the value at n = 9, where the purity value decreases to 0.503, as shown in Table IX.
TABLE IX. DATA CLUSTER N=9
Category    a    b    c    d
Cluster 1   0    29   0    0
Cluster 2   0    29   0    2
Cluster 3   0    1    9    0
Cluster 4   0    0    39   0
Cluster 5   0    0    20   0
Cluster 6   0    0    18   0
Cluster 7   0    52   0    0
Cluster 8   20   39   22   77
Cluster 9   230  100  142  171

The maximum purity value is 0.514, which is generated at n = 8 and n = 10: the purity value increases back to 0.514 at n = 10, whose clustering result can be seen in Table X.
TABLE X. DATA CLUSTER N=10
Category    a    b    c    d
Cluster 1   0    21   0    0
Cluster 2   0    0    18   0
Cluster 3   0    32   0    0
Cluster 4   19   47   23   89
Cluster 5   229  92   141  158
Cluster 6   0    0    20   0
Cluster 7   0    1    9    0
Cluster 8   0    38   0    0
Cluster 9   0    0    39   0
Cluster 10  2    19   0    3
TABLE XI. PURITY
k value   purity
2         0.316
3         0.409
4         0.422
5         0.433
6         0.436
7         0.438
8         0.514
9         0.503
10        0.514

The clustering processes using the k-means algorithm resulted in the groupings shown in the tables above. Table XI displays the purity calculation result for each specified value of n.

Figure 2. SSE and Purity

In Figure 2, the SSE and purity curves intersect at the value n = 8, where the SSE value is 0.427 and the purity reaches its maximum of 0.514.

V. CONCLUSION

The Elbow method is used for determining the number of clusters in the k-means algorithm adopted in this study. The optimal point for determining the k value is the point at which there is a significant change in the SSE value, so that an angle is formed. The experiment was conducted on news headline data and evaluated internally through purity. The purity test results showed that the number of clusters selected by the Elbow method is the same as the best result of the internal purity evaluation measurement.
REFERENCES
[1] N. Gali, R. Mariescu-Istodor, and P. Fränti, "Using linguistic features to automatically extract web page title," Expert Syst. Appl., vol. 79, pp. 296–312, 2017.
[2] H. Yunhua et al., "Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval," Res. Dev. Inf. Retr., pp. 250–257, 2005.
[3] I. Blokh and V. Alexandrov, "News clustering based on similarity analysis," Procedia Comput. Sci., vol. 122, pp. 715–719, 2017.
[4] A. Ahmad and L. Dey, "A k-mean clustering algorithm for mixed numeric and categorical data," Data Knowl. Eng., vol. 63, no. 2, pp. 503–527, 2007.
[5] A. A. Shetkar and S. Fernandes, "Text Categorization of Documents using K-Means and K-Means++ Clustering Algorithm," Int. J. Recent Innov. Trends Comput. Commun., vol. 4, pp. 485–489, Jun. 2016.
[6] A. Izzuddin, "Optimasi Cluster pada Algoritma K-Means dengan Reduksi Dimensi Dataset Menggunakan Principal Component Analysis untuk Pemetaan Kinerja Dosen," Energy J. Ilm. Ilmu-Ilmu Tek., vol. 5, no. 2, pp. 41–46, 2015.
[7] N. P. E. Merliana et al., "Analisa Penentuan Jumlah Cluster Terbaik Pada Metode K-Means," Semin. Nas. Multi Disiplin Ilmu & Call Papers Unisbank, pp. 978–979.
[8] S. C. Sripada, "Comparison of Purity and Entropy of K-Means Clustering and Fuzzy C Means Clustering," Indian J. Comput. Sci. Eng., vol. 2, no. 3, pp. 343–346, 2011.
[9] B. Santoso, I. Cholissodin, and B. D. Setiawan, "Optimasi K-Means untuk Clustering Kinerja Akademik Dosen Menggunakan Algoritme Genetika," J. Pengemb. Teknol. Inf. dan Ilmu Komput., vol. 1, no. 12, pp. 1652–1659, 2017.
[10] K. Kaur, D. Singh Dhaliwal, and R. Kumar Vohra, "Statistically Refining the Initial Points for K-Means Clustering Algorithm," Int. J. Adv. Res. Comput. Eng. Technol., vol. 2, no. 11, pp. 2278–1323, 2013.
[11] S. S. Jamadar and P. D. Y. Loni, "Efficient Cluster Head Selection Method Based On K-means Algorithm to Maximize Energy of Wireless Sensor Networks," Int. Res. J. Eng. Technol., vol. 3, no. 8, pp. 1579–1583, Aug. 2016.
[12] M. Anusha, "An Enhanced K-Means Genetic Algorithms for Optimal Clustering," Int. Conf. Comput. Intell. Comput. Res., vol. 14, pp. 1–13, 2016.
[13] M. Raghuvanshi and R. Patel, "An Improved Document Clustering with Multiview Point Similarity/Dissimilarity measures," Int. J. Eng. Comput. Sci., vol. 6, no. 2, pp. 20285–20288, 2017.
[14] S. Sharma, Applied Multivariate Techniques. John Wiley & Sons, Inc., 1996.
[15] P. Bholowalia and A. Kumar, "EBK-Means: A Clustering Technique based on Elbow Method and K-Means in WSN," Int. J. Comput. Appl., vol. 105, no. 9, pp. 17–24, 2014.
[16] E. Muningsih, "Optimasi jumlah cluster k-means dengan metode elbow untuk pemetaan pelanggan," Pros. Semin. Nas. ELINVO, pp. 105–114, Sep. 2017.
[17] H. Jain and R. Grover, "Clustering Analysis with Purity Calculation of Text and SQL Data using K-means Clustering Algorithm," IJAPRR, vol. IV, pp. 47–58, 2017.