
Integrated Intelligent Research (IIR) International Journal of Web Technology

Volume: 04 Issue: 02 December 2015 Page No.77-80


ISSN: 2278-2389

Applying Clustering Techniques for Efficient Text Mining in Twitter Data
K.T. Mathuna, I. Elizabeth Shanthi, K. Nandhini
Department of Computer Science, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore

Abstract- Knowledge is the ultimate output of decisions made on a dataset. The Internet revolution has shrunk global distances to the touch of a handheld electronic device. Usage of social media sites has increased over the past decades. One of the most popular social media microblogs is Twitter, which has millions of users worldwide. In this paper, Twitter data is analyzed through the text contained in hashtags. After preprocessing, clustering algorithms are applied to the text data. The clusters formed are compared using various parameters. Visualization techniques are used to portray the results, from which inferences such as time series and topic flow can easily be made. The observed results show that the hierarchical clustering algorithm performs better than the other algorithms.

Keywords- Text mining, Visualization, Twitter, Clustering, Time series analysis, Social media.

I. INTRODUCTION

The way people communicate has changed greatly in recent years, one of the most important changes being social media networking [1]. Nowadays, social media is used not only for personal networking but also for commercial purposes. Happenings in the real world are shared and communicated via the Internet, which forms a bridge between users irrespective of global distance. The most popular social sites include Facebook, Twitter and YouTube [2]. Social media has opened up many research opportunities because of the increased amount of information.

A. Background

This paper deals with the analysis of topic and event detection on social microblogs, particularly Twitter, which is a widely used and fast-growing microblog. More than 500 million users access Twitter, more than 302 million of them are active users, and together they generate about 340 million messages every day [1], [3]. People upload their opinions and real-world happenings to this public site. Current topics and trends are the main features of Twitter. Twitter provides the “hashtag (#)”, which is used to assign topics to tweets. If a hashtag is used by many people, its topic becomes a current trending topic [4]. Hence gathering, mining and analyzing tweets is important in all areas. This paper aims to analyze efficient text mining algorithms based on clustering techniques.

Text Mining- Text mining, also known as text data mining or knowledge discovery from textual databases, is generally the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents. Text mining is similar to data mining, except that data mining works on information stored in a structured manner, whereas text mining uses texts that are unstructured or semi-structured. The data is collected from the World Wide Web, LinkedIn, government portals, Facebook, Twitter, blogs, news articles, digital libraries, electronic mail and so on. Approximately 80% of organizational information is stored in unstructured form [5].

B. Applications and Issues in Text Mining

Text mining has applications in almost every area. Text is used everywhere and has to be mined for information retrieval and knowledge gathering. It is applied in fields like publishing, media, telecommunications, research, banking, insurance, finance, government administration, legal documents, health care, business intelligence, national security, etc. All these fields have benefited and still have room for improvement. The major challenge arises from the natural language processing of large amounts of information to extract the required knowledge. The complexity of extraction and its computational cost create the need for further improvement. Unstructured text documents in multiple forms make the retrieval process challenging. Research is still needed on issues such as text mining that uses different intermediate forms, integration of domain knowledge, analysis of social media networks, etc. [6].

C. Integration of Text Mining and Visualization

Visualization aids understanding of the content extracted from the raw database. It also gives a clear picture of the information that has to be delivered. Text mining integrated with visualization provides a better and faster understanding of the interpreted results. Text visualization has two forms, topic based and feature based. In the topic-based method, topics and events are visualized through visualization techniques. These techniques include tag clouds, which depict keywords or named entities using features like colour, size and layout based on usability and importance. An information landscape provides a geographical view of a large set of documents for analysis. The Text Flow method combines topic mining and interactive visualization techniques to visually analyze the evolution of topics over time. In the feature-based method, word clouds are commonly generated to provide an intuitive visual summary of documents by displaying the keywords in a compact layout. The Facet Atlas method integrates a node-link diagram with a density map to visually analyze the multifaceted relations of the documents [7].

The organization of the paper is as follows. Section II deals with recent research. Section III describes the methodology and algorithms used. Section IV explains the experimental results and discussion, and Section V gives the conclusion.

II. LITERATURE REVIEW

Recent research on Twitter data has covered various topics. Density-based clustering, naïve Bayes and other techniques used in this research, along with their observations, are listed in Table 1.

Table 1: Review of the Related Work

| Authors | Technique(s) used | Observations | Year |
|---|---|---|---|
| Chung-Hong Lee [8] | Density-based clustering method is used to mine microblogging text streams. | Temporal and spatial features of real-world events are analyzed and estimated. | 2012 |
| Rischan Mafrur et al. [4] | Naïve Bayes classification technique was applied and the word cloud visualization method was used. | Analysis of the election campaign through Twitter shows the positive and negative comments from people. | 2014 |
| Farhan Hassan Khan et al. [9] | Proposed a hybrid model for classification of tweets. Enhanced Emoticon Classifier (EEC), Improved Polarity Classifier (IPC) and SentiWordNet Classifier (SWNC) were used. | The hybrid model showed better accuracy. It decreases the neutral opinion by correctly classifying into positive or negative sentiment. | 2014 |
| Isti Surjandari et al. [10] | Support vector machine (SVM), naïve Bayes and decision tree algorithms are used to classify. Chi-squared test and Marascuilo procedure were performed for framing association rules. | Support vector machine resulted in higher accuracy in analyzing public opinion. | 2015 |
| Janez Kranjc et al. [11] | ClowdFlows, a cloud-based scientific workflow platform with widget memory and halting mechanism, was created. Word cloud and stream-based visualization is given. | A web service is built to apply the models on unlabeled tweets. | 2015 |
| Shakira Banu Kaleel et al. [12] | Locality-sensitive hashing (LSH) and the K-means algorithm are used to form clusters of tweets for detecting and trending events. Word cloud and Google Maps are used. | Locality-sensitive hashing (LSH) improved accuracy and runtime. | 2015 |

In the recent studies, support vector machine, naïve Bayes classifier, density-based clustering and k-means clustering algorithms were applied, and the observations show that work has been done on sentiment analysis to improve accuracy. Work based on current issues and real-world happenings still has wide scope for research.

III. METHODOLOGY

A. Overview of Methodology

Tweets are gathered from Twitter for the text mining process. User-generated Twitter data contains slang, noise and grammatical mistakes, which have to be cleaned to improve the quality of the tweet features [11]. Preprocessing of text data involves stop word removal, stemming, converting upper case to lower case, and removing punctuation and numbers. This makes the text more content specific. Stop word removal removes words that carry no meaning on their own. Articles, prepositions, pronouns and conjunctions are the most common stop words, including words like is, the, an, but, for, etc. These words have to be removed to make text processing less complex and to reduce the number of words for retrieval [13, 14]. Stemming removes the affixes of a word, leaving the root word. For example, the words study, studied, studying and studies all give the root word “study”. This method reduces the size of the indexing structure, as the number of distinct index terms is reduced [14, 15]. Finally, converting upper case to lower case and removing punctuation and numbers reduce the retrieval complexity. These preprocessing methods make the text corpus less complex for text mining.

Figure 1. Overview of Methodology

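
The preprocessing steps named in the methodology (lowercasing, removing punctuation and numbers, stop word removal, stemming) can be illustrated with a minimal sketch. The paper reports working in R and gives no code, so the Python below is only an illustration under assumptions: the stop-word list and the suffix rules are hypothetical and far simpler than a real stop-word list or the Porter stemmer cited in [15].

```python
import re

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"is", "the", "an", "a", "but", "for", "and", "of", "to", "in"}

# Hypothetical suffix rules, tried in order. This is NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "y"), ("ied", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def preprocess(tweet):
    """Lowercase, drop punctuation and digits, remove stop words, then stem."""
    text = tweet.lower()
    # Replace everything except letters and whitespace (punctuation, digits, '#').
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("Studying for the 2 exams is HARD!!! #study")
```

With these rules, study, studied, studying and studies all reduce to "study", matching the example in the text. A rule set this small will mis-stem many words (for instance, it strips the final "s" from any sufficiently long word), which is why a proper stemmer is preferred in practice.
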
The preprocessed text data is given as input to the clustering algorithms. Clustering is an unsupervised technique that aims to group similar words in the same cluster and dissimilar words in different clusters [16]. K-means, Entropy Weighted K-means and Hierarchical clustering are among the most commonly used algorithms for text mining. K-means is a basic and popular text mining algorithm and is used for large data sets. The Entropy Weighted K-means (EWKM) algorithm is mainly used in the Rattle data mining tool; it includes a weighting scheme for clustering and is used for high-dimensional data [17, 18]. The Hierarchical clustering algorithm forms clusters based on the dissimilarity between observations. It decides when a cluster has to be split and when clusters have to be joined [16]. These algorithms are applied in the Rattle environment in the R language. Visualization is the communication of data through interactive graphical user interfaces. In text mining, visualization methods can improve and simplify the discovery or extraction of relevant patterns of information [19].
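
To make the idea of dissimilarity-based hierarchical clustering concrete, the following is a minimal sketch of the bottom-up (agglomerative) variant with average linkage. It is not the Rattle/R implementation used in the paper: the Euclidean dissimilarity, the average-linkage rule, the function names and the toy vectors are all assumptions chosen for illustration.

```python
import math


def euclidean(p, q):
    """Dissimilarity between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(points, c1, c2):
    """Mean pairwise dissimilarity between two clusters of point indices."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(points, k):
    """Repeatedly join the two closest clusters until only k remain."""
    clusters = [[i] for i in range(len(points))]  # start: one point per cluster
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # join the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy term vectors (made-up data), one row per word.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (5.0, 5.0)]
result = agglomerative(vectors, 3)
```

On the toy data the two nearby pairs are joined first and the distant point stays alone, so `result` is `[[0, 1], [2, 3], [4]]`. This brute-force loop recomputes all linkages at every step and is only suitable for small inputs.
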

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

The Twitter dataset is taken from the Sentiment140 website for academics [20]. The R language and environment for statistical computing and graphics is used; it provides a wide variety of statistical and graphical techniques and is highly extensible. Tweets are preprocessed with the text preprocessing methods and then passed to the clustering algorithms. The K-means, Entropy Weighted K-means and Hierarchical clustering algorithms form clusters of similar data: words that are similar in tweets are grouped together in a single cluster, and dissimilar words are grouped in different clusters. The clusters formed are evaluated using commonly used internal cluster validation parameters. Average Between, Average Silhouette Width, Pearson gamma, Dunn and Dunn2 are calculated for all three algorithms and compared to find the best-performing algorithm. The experiment was performed on the preprocessed Twitter data with the number of clusters increasing from 2 to 20, but the optimal and standard values for the data set were obtained for 6 to 10 clusters. The optimal number of clusters was determined from the experiments conducted. Table 2 lists the observed values of these parameters for the clustering algorithms used. It is clear from Table 2 that the Hierarchical algorithm performs better than the other two clustering algorithms for 6 to 10 clusters.
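
Two of the internal validation parameters named above, Average Silhouette Width and the Dunn index, can be sketched directly from their standard definitions. The paper does not show how the values were computed, so the code below is an illustration only: it assumes Euclidean distance, at least two clusters, and at least one cluster containing two distinct points; the toy points and labels are made up.

```python
import math


def dist(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: s(i) is taken as 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    separation = min(dist(points[i], points[j])
                     for i, j in pairs if labels[i] != labels[j])
    diameter = max(dist(points[i], points[j])
                   for i, j in pairs if labels[i] == labels[j])
    return separation / diameter


# Toy data: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(points, [0, 0, 1, 1])
bad = average_silhouette_width(points, [0, 1, 0, 1])
dunn = dunn_index(points, [0, 0, 1, 1])
```

For both measures, higher values indicate more compact and better separated clusters, so the labeling that follows the two visible groups scores higher than the one that mixes them.
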
In this research work, visualization methods are applied to the results obtained from the Hierarchical clustering algorithm. A word cloud is used to display the frequency of words that occur in the Twitter data. Time series analysis is an important issue in the current scenario: day-to-day happenings are recorded and interpreted for knowledge discovery, information retrieval and decision making. An area plot is created using the Plotly package in the R language; it shows the timeline of the topic that is tweeted. The frequency of a topic that is currently tweeted determines the current trending news, which is shown through 2D histograms. Figure 2(a) shows the word cloud of tweets, Figure 2(b) shows the time series visualization of the Twitter data and Figure 2(c) shows the topic frequency of the Twitter data.
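
The quantities behind these plots are simple counts: word frequencies drive the sizes in a word cloud, and per-day counts of a topic give the points of a time series. The sketch below computes both with the Python standard library. It is not the Plotly/R code used in the paper; the sample tweets, dates and function names are made up for illustration.

```python
from collections import Counter
from datetime import date

# Made-up sample: (posting date, preprocessed tokens) for each tweet.
tweets = [
    (date(2015, 6, 1), ["rain", "flood", "city"]),
    (date(2015, 6, 1), ["flood", "relief"]),
    (date(2015, 6, 2), ["flood", "rain"]),
    (date(2015, 6, 3), ["election", "result"]),
]


def word_frequencies(tweets):
    """Count how often each token occurs across all tweets (word cloud sizes)."""
    counts = Counter()
    for _, tokens in tweets:
        counts.update(tokens)
    return counts


def daily_topic_counts(tweets, topic):
    """Number of tweets per day that mention the topic, in date order."""
    per_day = Counter()
    for day, tokens in tweets:
        if topic in tokens:
            per_day[day] += 1
    return sorted(per_day.items())


top_words = word_frequencies(tweets).most_common(2)
flood_series = daily_topic_counts(tweets, "flood")
```

Here `top_words` holds the two most frequent tokens with their counts, and `flood_series` is a date-ordered list of (day, count) pairs that could be passed to any plotting library.
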

Table 2. Comparison of Clustering Algorithms


Figure 2. (a) Word cloud visualization, (b) Time series analysis, (c) Topic trends of Twitter data.

V. CONCLUSIONS AND FUTURE SCOPE

The results of this study show the clustering of Twitter data. Twitter is a fast-growing and widely used social networking site on the World Wide Web. Mining of Twitter data has gained importance in the past decade. Integrating text mining and visualization provides better knowledge for information discovery and decision making. This work can be further applied to medical text mining, government portals, etc., so that it could be used for national security and for rehabilitation after natural disasters.

REFERENCES

[1] Samar M. Alqhtani, Suhuai Luo and Brian Regan, “Fusing Text and Image for Event Detection in Twitter,” The International Journal of Multimedia and Its Applications (IJMA), vol. 7, No. 1, pp. 27-35, February 2015.
[2] J. Mingers, “The paucity of multimethod research: a review of the information systems literature,” Information Systems Journal, vol. 13, pp. 233-249, July 2003.
[3] Twitter: https://en.wikipedia.org/wiki/Twitter.
[4] Rischan Mafrur, M Fiqri Muthohar, Gi Hyun Bang, Do Kyeong Lee, Kyungbaek Kim and Deokjai Choi, “Twitter Mining: The Case of 2014 Indonesian Legislative Elections,” International Journal of Software Engineering and Its Applications, vol. 8, No. 10, pp. 191-202, 2014.
[5] Aurangzeb Khan, Baharum Baharudin, Lam Hong Lee, Khairullah Khan, “A Review of Machine Learning Algorithms for Text-Documents Classification,” Journal of Advances in Information Technology, vol. 1, No. 1, February 2010.
[6] K.L. Sumathy, M. Chidambaram, “Text Mining: Concepts, Applications, Tools and Issues – An Overview,” International Journal of Computer Applications, vol. 80, No. 4, pp. 29-32, October 2013.
[7] Guo-Dao Sun, Ying-Cai Wu, Rong-Hua Liang and Shi-Xia Liu, “A Survey of Visual Analytics Techniques and Applications: State-of-the-Art Research and Future Challenges,” Journal of Computer Science and Technology, vol. 28, No. 5, pp. 852-86, September 2013.
[8] Chung-Hong Lee, “Mining spatio-temporal information on microblogging streams using a density-based online clustering method,” Expert Systems with Applications, Elsevier, vol. 39, No. 10, pp. 9623–9641, August 2012.
[9] Farhan Hassan Khan, Saba Bashir, Usman Qamar, “TOM: Twitter opinion mining framework using hybrid classification scheme,” Decision Support Systems, Elsevier, vol. 57, pp. 245–257, January 2014.
[10] Isti Surjandari, Muthia Szami Naffisah and M. Irfan Prawiradinata, “Text Mining of Twitter Data for Public Sentiment Analysis of Staple Foods Price Changes,” Journal of Industrial and Intelligent Information, Engineering and Technology Publishing, vol. 3, No. 3, doi: 10.12720/jiii.3.3.253-257, September 2015.
[11] Janez Kranjc, Jasmina Smailovic, Vid Podpecan, Miha Grcar, Martin Znidarsic, Nada Lavrac, “Active learning for sentiment analysis on data streams: Methodology and workflow implementation in the ClowdFlows platform,” Information Processing and Management, Elsevier, vol. 51, No. 2, pp. 187–203, March 2015.
[12] Shakira Banu Kaleel, Abdolreza Abhari, “Cluster-discovery of Twitter messages for event detection and trending,” Journal of Computational Science, Elsevier, vol. 6, pp. 47–57, January 2015.
[13] Dr. S. Vijayarani, Ms. J. Ilamathi, Ms. Nithya, “Preprocessing Techniques for Text Mining - An Overview,” International Journal of Computer Science & Communication Networks, vol. 5(1), pp. 7-16.
[14] Ms. Nikita P. Katariya, Prof. M. S. Chaudhari, “Text Preprocessing For Text Mining Using Side Information,” International Journal of Computer Science and Mobile Applications, vol. 3, Issue 1, pp. 01-05, January 2015.
[15] C. Ramasubramanian, R. Ramya, “Effective Pre-Processing Activities in Text Mining using Improved Porter’s Stemming Algorithm,” International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, Issue 12, pp. 4536-4538, December 2013.
[16] Divya Nasa, “Text Mining Techniques- A Survey,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 2, Issue 4, April 2012.
[17] Anil Kumar Tiwari, Lokesh Kumar Sharma, G. Rama Krishna, “Entropy Weighting Genetic k-Means Algorithm for Subspace Clustering,” International Journal of Computer Applications (0975 – 8887), vol. 7, No. 7, pp. 27-30, October 2010.
[18] Zhongying Zhao, Shengzhong Feng, Qiang Wang, Joshua Zhexue Huang, Graham J. Williams, Jianping Fan, “Topic oriented community detection through social objects and link analysis in social networks,” Knowledge Based Systems, Elsevier, vol. 26, pp. 164-173, February 2012.
[19] Andreas Hotho, Andreas Nürnberger, Gerhard Paaß, “A Brief Survey of Text Mining,” In Ldv Forum, vol. 20, No. 1, pp. 19-62, May 2005.
[20] Sentiment140 for academics: http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip.
