Clustering Report
Submitted by
Instructor
Prof. Edward A. Fox
Abstract
1 Introduction
2 Literature Review
  2.0.1 Flat clustering algorithms
  2.0.2 Hierarchical clustering algorithms
  2.0.3 Clustering in Solr
  2.0.4 Data Collection
  2.0.5 Mahout clustering
  2.0.6 Clustering Evaluation
  2.0.7 Cluster Labeling
3 Requirements
4 Design
  4.0.8 Workflow
  4.0.9 Tools
  4.0.10 Programming Languages
  4.0.11 Dependencies
5 Implementation
  5.0.12 Milestones and Deliverables
  5.0.13 Timeline
6 Evaluation
  6.1 Silhouette Scores
  6.2 Confusion Matrix
  6.3 Human Judgement
    6.3.1 Clustering Result for Ebola Data Set
  6.4 Clustering Statistics
Appendices
A User manual
  A.0.3 Pre-requisites
  A.0.4 Data preparation
  A.0.5 Data Clustering
  A.0.6 Cluster Labeling
  A.0.7 Cluster output
  A.0.8 Hierarchical Clustering
  A.0.9 Working with many collections
B Developers manual
  B.0.10 Solr Installation
  B.0.11 Mahout Installation
  B.0.12 Solr and Mahout Integration
  B.0.13 Clustering Webpages
  B.0.14 Clustering tweets
  B.0.15 Hierarchical Clustering
  B.0.16 Clustering small collection of tweets
  B.0.17 Code Listings
  B.0.18 The Cluster Labeling Process
C File Inventory
Acknowledgement
References
List of Figures
List of Tables
Chapter 1
Introduction
We deal with clustering in almost every aspect of daily life. Clustering is the subject of
active research in several fields such as statistics, pattern recognition, and machine learn-
ing. In data mining, clustering deals with very large data sets with different attributes
associated with the data. This imposes unique computational requirements on relevant
clustering algorithms. A variety of algorithms have recently emerged that meet these re-
quirements and have been successfully applied to real-life data mining problems [1]. Clustering
methods are divided into two basic types: hierarchical and flat clustering. Within each
of these types there exists a wealth of subtypes and different algorithms for finding the
clusters. Flat clustering algorithm goal is to create clusters that are coherent internally,
and clearly different from each other. The data within a cluster should be as similar as
possible; data in one cluster should be as dissimilar as possible from documents in other
clusters. Hierarchical clustering builds a cluster hierarchy that can be represented as a
tree of clusters. Each cluster can be represented as child, a parent and a sibling to other
clusters. Even though hierarchical clustering is superior to flat clustering in representing
the clusters, it has a drawback of being computationally intensive in finding the relevant
hierarchies [8].
The initial goal of the project is to use flat clustering methods to partition data into
semantically related clusters. Further, based upon the clustering quality and our understanding
of the data, we enhance the cluster representation using hierarchical clustering. This may also
result in a hybrid of flat and hierarchical cluster arrangements. Clustering algorithms
provided in the Apache Mahout library will be used in our work [2]. Mahout is a suite
of general-purpose, scalable machine learning libraries. It builds on Apache Hadoop [4]
for large-scale machine learning in a distributed environment. Currently Mahout mainly supports
recommendation mining, clustering, and classification algorithms. For our project
we identified a set of clustering algorithms to evaluate: k-means, Canopy, fuzzy k-means,
streaming k-means, and spectral k-means, all available in the Mahout library. We have used
various collections of web pages and tweets as our data sets to evaluate clustering.
Since clustering is unsupervised, finding the appropriate number of
clusters a priori to categorize the data is a difficult problem. The most efficient
way to learn about the number of clusters is to learn from the data itself. We address
this challenge by estimating the number of clusters using methods like cross-validation
and semi-supervised learning. Figure 1.1 shows an overview of the project.
Figure 1.1: Project overview. Input tweets and web pages are converted from Avro to sequence files and loaded into HDFS; K-means clustering and label extraction are performed; the input is then divided based on these results for hierarchical clustering; and the results of the two stages are merged.

In this project we evaluate a flat clustering algorithm on the tweet and web page data sets
using Mahout K-means clustering. The algorithm is initialized with random centroid
points. We found, by empirical evaluation, that the best number of clusters is 5 for the
small data sets and 10 for the large data sets. For the divisive hierarchical clustering, we
performed two-layer clustering. The first layer corresponds to the clustered points output
by a flat clustering algorithm such as K-means. The second layer corresponds to further flat
clustering of the clustered points from layer 1. For labeling, we chose the top terms (based
on frequency of occurrence) among the clustered points closest to the centroids. These top
terms are identified from the K-means cluster dump results using Mahout tools. Future work
includes multiple layers of hierarchical clustering and advanced cluster labeling techniques.
The report is organized as follows: chapter 2 provides a brief literature review of existing
clustering algorithms, including flat clustering and hierarchical clustering, labeling
procedures, and open-source tools such as Apache Solr and Apache Mahout. In chapter 3,
we present our project requirements with pointers to relevant sections. In chapter 4, the
design and implementation of the project are discussed along with the tools and dependencies.
Project milestones and a brief timeline of weekly progress are presented in chapter 5.
In chapter 6, we discuss the techniques used in clustering evaluation, including Silhouette
scores, confusion matrices, and human judgement. Conclusions and future work are presented
in chapter 7. Three appendices are included: Appendix A provides detailed
instructions to reproduce the clustering results we have obtained and a user guide for
running the various scripts we have developed over the course of the project. In appendix B we
detail the implementation and evaluation of the clustering algorithms to aid future
developers who continue this project. In appendix C we present an inventory of the files
developed as part of this project and submitted to VTechWorks.
Chapter 2
Literature Review
Clustering objects into groups is usually based on a similarity metric between objects,
with the goal that objects within the same group are very similar, and objects in
different groups are less similar. In this review we focus on document clustering for web
pages and tweet data. Applications of text clustering can be either online or offline.
Online applications are considered to be more efficient than offline applications
in terms of cluster quality; however, they suffer from latency issues.
Text clustering algorithms may be classified into flat clustering and hierarchical
clustering. In the next two subsections we describe these algorithms in more detail.
Hierarchical clustering algorithms are further subdivided into two types: (1) agglomerative
methods, which generate a cluster hierarchy bottom-up by fusing objects into groups
and groups into higher-level clusters; and (2) divisive methods, which generate a cluster
hierarchy top-down by successively partitioning a single cluster encompassing all objects
into finer clusters. Agglomerative techniques are more commonly used [10].
Hierarchical clustering does not require a pre-specified number of clusters.
However, this advantage comes at the cost of algorithmic complexity: hierarchical
clustering algorithms have a complexity that is at least quadratic in the number of
documents, compared to the linear complexity of flat algorithms like k-means or EM [10].
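To make this cost comparison concrete, the standard asymptotic figures (textbook results for N documents, K clusters, and I iterations, not measurements from our experiments) are:

\[
T_{\mathrm{HAC}} = \Omega(N^{2}) \ \text{similarity computations}, \qquad
T_{k\text{-means}} = O(N \cdot K \cdot I) \ \text{distance computations}.
\]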
Figure 2.1: Hierarchical clustering
base clusters to form the final clusters. K-Means is a generic clustering algorithm that
can also be applied to clustering textual data. As opposed to Lingo and STC, bisecting
k-means creates non-overlapping clusters.
Carrot2 is suited for clustering small to medium collections of documents. It may work
for longer documents, but processing times will be too long for online search. The integration
between Solr and Carrot2 is implemented through APIs [20]. Learning about the Solr-Carrot2
integration will help us integrate our own clustering techniques with Solr.
Document clustering using Mahout
For Mahout, we need to generate sequence files from the cleaned data in HDFS and vectorize
them in a format understandable by Mahout. Once the vectors are generated, they are
input to a common clustering algorithm such as k-means. Because text data has
high-dimensional features, dimensionality reduction techniques may be used to transform
the feature vectors and improve cluster quality.
For clustering text data, vector generation can be improved by removing noise and us-
ing a good weighting technique. Mahout allows specifying custom Lucene analyzers to its
clustering sub-commands for this. Also, cluster quality depends on the measure used to
calculate similarity between two feature vectors. Mahout supplies a large number of Dis-
tance Measure implementations (Manhattan, Squared Euclidean, Euclidean, Weighted
Euclidean and Weighted Manhattan) and also allows the user to specify his/her own if
the defaults don’t suit the purpose. Within each dimension, points can be normalized to
remove the effect of outliers - the normalization p-norm should match the p-norm used
by the distance measure. Finally, if the dimensions are not comparable, then one should
normalize across dimensions, a process known as weighting (this should be done during
the vectorization process, which the user controls fully) [3].
Once the data is vectorized, the user invokes the appropriate clustering algorithm either
by calling the corresponding Mahout sub-command from the command line, or programmatically
by calling the corresponding driver's run() method. All algorithms require the initial
centroids to be provided, and the algorithm iteratively modifies the centroids until they
converge. The user can either seed them randomly or use Canopy clustering to generate the
initial centroids.
Finally, the output of the clustering algorithm (sequence files in binary format) can
be read using the Mahout clusterdump sub-command to get a human-readable format.
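As a minimal sketch of this pipeline (directory names in angle brackets are placeholders, not the paths we used; the exact invocations for our collections are given in Appendix B):

# raw documents (or converted Avro data) -> sequence files
$ mahout seqdirectory -i <hdfs text dir> -o <seqdir>
# sequence files -> sparse TF-IDF vectors
$ mahout seq2sparse -i <seqdir> -o <vectors> -wt tfidf --namedVector
# ... run a clustering sub-command here (k-means, fuzzy k-means, streaming k-means); examples follow ...
# binary clustering output -> human-readable text
$ mahout clusterdump -i <clusters>/clusters-*-final -d <vectors>/dictionary.file-0 -dt sequencefile -o clusterdump.txt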
K-Means Algorithm
The k-means clustering algorithm is known to be efficient in clustering large data sets.
This algorithm is one of the simplest and the best known unsupervised learning algo-
rithms, and it solves the well-known clustering problem. The K-means algorithm aims to
partition a set of objects, based on their attributes/features, into k clusters, where k is
a predefined constant. The algorithm defines k centroids, one for each cluster. The
centroid of a cluster is formed in such a way that it is closely related, in terms of similarity
(where similarity can be measured using different methods such as Euclidean distance
or extended Jaccard), to all objects in that cluster [9]. Technically, what k-means
minimizes is the variance: it assigns each object to the cluster for which the overall
variance is minimized. Conveniently, an object's contribution to the total variance, the sum
of its squared deviations from the centroid over all dimensions, is exactly the squared
Euclidean distance to that centroid.
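In symbols, with clusters $C_1,\dots,C_k$ and centroids $\mu_1,\dots,\mu_k$, k-means minimizes the within-cluster sum of squared Euclidean distances; this is the standard objective, stated here only to make the variance argument above explicit:

\[
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert_2^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .
\]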
Creating Vector Files:
• Unlike the Canopy algorithm, the k-means algorithm requires vector files as input;
therefore you have to create vector files.
• To generate vector files from the sequence file format, Mahout provides the seq2sparse
utility.
The K-means clustering job requires an input vector directory, an output clusters directory,
a distance measure, the maximum number of iterations to be carried out, and an integer value
representing the number of clusters into which the input data is to be divided.
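As an illustration of these parameters, a typical invocation might look as follows; the paths and values here are placeholders, not the exact ones from our runs (see Appendix B and the script in subsection B.0.17 for those):

$ mahout kmeans \
    -i tweets/tfidf-vectors \
    -c tweets/initial-centroids \
    -o tweets/kmeans-output \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 10 -k 10 -cl -ow
# -i: input vector directory; -c: centroid path (seeded randomly when -k is given);
# -o: output clusters directory; -dm: distance measure; -x: maximum iterations;
# -k: number of clusters; -cl: also assign points to the final clusters; -ow: overwrite output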
Fuzzy K-Means
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular
simple clustering technique. While K-Means discovers hard clusters (a point belongs to
only one cluster), Fuzzy K-Means is a more statistically formalized method and discovers
soft clusters, where a particular point can belong to more than one cluster with a certain
probability.
Like K-Means, Fuzzy K-Means works on objects that can be represented in an
n-dimensional vector space and for which a distance measure is defined. The algorithm is
similar to k-means:
• Initialize k clusters
• Until converged
Similar to K-Means, the program does not modify the input directories, and for every
iteration the cluster output is stored in a directory cluster-N. The code sets the number
of reduce tasks equal to the number of map tasks. FuzzyKMeansDriver is similar
to KMeansDriver: it iterates over the input points and cluster points for a specified number
of iterations or until convergence. During every iteration i, a new cluster-i directory
is created which contains the modified cluster centers obtained during that FuzzyKMeans
iteration; this is fed in as the input clusters for the next iteration. FuzzyKMeansMapper
reads the input clusters during its configure() method and then computes the cluster membership
probability of each point with respect to each cluster. FuzzyKMeansReducer: multiple reducers
receive certain keys and all values associated with those keys; each reducer sums the values to
produce a new centroid for the cluster, which is output.
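For reference, the textbook fuzzy c-means membership and centroid updates that the mapper and reducer steps above correspond to are given below, with fuzziness parameter $m > 1$ (the same parameter passed via -m to Mahout's fkmeans sub-command); Mahout's exact implementation may differ in details:

\[
u_{ij} = \left( \sum_{l=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^{\frac{2}{m-1}} \right)^{-1},
\qquad
c_j = \frac{\sum_{i} u_{ij}^{\,m}\, x_i}{\sum_{i} u_{ij}^{\,m}} .
\]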
of the documents within each cluster to find the most representative words for the cluster.
The work in [25] briefly describes a technique to label clusters based on how many
times a feature is used in a cluster. By utilizing this information, and also drawing
on knowledge of the code, short titles are manually selected for the obtained clusters.
Although labeling is performed manually, they point out that the automatically developed
feature summary of each cluster makes the labeling process much easier. Tzerpos
et al. [26] emphasize that a clustering algorithm should have certain features to make its
output easier to comprehend. These features include bounded cluster cardinality, which
ensures that no single cluster contains a very large number of entities, and
effective cluster naming. They use a pattern-based approach to recognize familiar
subsystem structures within large systems. The identified patterns are expected to occur in
large systems with around 100 source files. The same pattern-based approach is used for
cluster labeling. In 2003 Tonella et al. [27] describe the use of keywords within web pages
to cluster and label similar pages. Both single words and contiguous sequences of two
words (i.e., bigrams) are considered as representative keywords of a web page. Clustering
as well as cluster labeling are carried out on the basis of the keywords within a web page.
Cluster labels are ranked according to inverse keyword frequency.
Chapter 3
Requirements
The goal of our project is to improve the quality of document search by clustering
the documents and using the results to influence the search results. We propose to do the
clustering in an iterative manner such that a hierarchy is developed between the different
iterations. This would further improve the quality of the search results, since the
hierarchical results could feed a scoring mechanism with different weights for different
levels of the hierarchy.
Below, we summarize our tasks for the project:
• Hands-on with various big data technologies like Hadoop, Map Reduce, HDFS, Solr.
See subsection B.0.10.
• Flat clustering of tweets and webpages using K-means, Streaming K-means and/or
Fuzzy K-means. See subsection B.0.13, subsection B.0.14.
• Evaluating and optimizing clustering results using three metrics: Silhouette scores,
a confusion matrix with labelled data, and human judgement. See chapter 6.
• Cluster label extraction from the clustering results. See subsection B.0.18.
• Merging the results of the various levels of the hierarchy to help the scoring mechanism
improve the quality of the search results. See subsection B.0.15.
Chapter 4
Design
Offline document clustering is shown in Figure 4.1. For a major part of the project we
emphasize offline document clustering and optimizing the scores based on the clustering
results. The inputs to document clustering are tweets and web pages in Avro format in
HDFS. The input data is cleaned and pre-processed using various techniques discussed
in the Reducing Noise team's report. The yellow boxes in Figure 4.1 represent components
developed as part of this project, while the blue boxes represent the Mahout clustering
algorithms and tools that we leverage. For divisive hierarchical clustering we leveraged
Mahout's K-means clustering, calling it iteratively at the various levels of the hierarchy.
Finally, the output produced is in Avro format, with the schema presented in the Hadoop
team's report. The clustering output is further ingested into HBase, and the pipeline is
completed by indexing the clustering information into Solr fields.
Figure 4.1: Offline document clustering design. Tweets and web pages are extracted (Java/Python), clustered with K-means (Mahout), labels are extracted (Java/Python), hierarchical clustering is applied (Mahout), the results are merged (Java/Python) and evaluated, and the data is loaded into HBase.
4.0.8 Workflow
We adapt flat clustering methods for offline document clustering initially and then move
on to hierarchical clustering methods. The Apache Mahout project provides access to
scalable machine learning libraries, which include several clustering algorithms; thus,
in our project we leverage various clustering algorithms in the Apache Mahout library. The
input format to the clustering algorithms is an Avro file, and the output format of clustering
is also an Avro file. Since we do not need any metadata for clustering, the input
data (Avro) is converted to sequence files before being provided as input to the Mahout
clustering algorithms. For generating a human-readable representation of the sequence files
we use the clusterdump tool in Mahout. The information from the output of the
clusterdump tool is used to add two more fields to the Solr schema. For flat
clustering, the final output from our team is the addition of two fields, “Cluster ID” and
“Cluster label”. The overall workflow is presented in Figure 4.2.
Figure 4.2: Overall workflow. Data pre-processing: cleaned tweets and web pages (Avro) are converted to sequence files and TF-IDF vectors are generated. K-means clustering is run on the vectors (sequence files). Data post-processing: the cluster dump (top terms, clustered centroids, clustered points) feeds cluster label extraction, producing a mapping of cluster label, document ID, and cluster ID that is written out in Avro format.
4.0.9 Tools
For document clustering we use Apache Mahout clustering algorithms and relevant tools
provided as part of the Mahout library. The document collections are saved in HDFS
and the output of clustering is saved in HBase. The format of input and output files will
be sequence files. KEA is used for labeling the clustering results. KEA is implemented in
Java, is platform independent, and is open-source software.
4.0.11 Dependencies
The dependencies we had during our work progress and performance evaluation are listed
below.
• Feature Extraction team: This is a mutual dependency. We are dependent on the
Feature Extraction team to provide optimized feature vectors that are necessary for
improving clustering. Currently, the feature vectors extracted from sequence files
are fed directly to the Mahout algorithms; however, the cluster results can be improved
if important features are extracted and fed to the Mahout algorithms. We collaborate
with and leverage the results from the Feature Extraction team to optimize the document
clustering. We have used very basic feature extraction methods, such as pruning
high-frequency words, pruning very low-frequency words, stop word removal, and
stemming, to reduce the dictionary size.
• LDA team: As part of the output we provide appropriate cluster labels and evaluate
those labels with the LDA team to ensure that relevant topics are clustered
as expected. As an evaluation measure we have collaborated with the LDA team and
provided a cosine similarity matrix for the documents within each cluster for two
collections, the ebola S and Malaysia Airlines B tweet collections. This mutual evaluation
helps to compare statistical document clustering with the LDA team's probabilistic document
clustering based on topic modeling. More details on the evaluation
can be found in the LDA team's report for project IDEAL. To aid clustering evaluation
with respect to statistical measures, we have provided three evaluation metrics
discussed in detail in chapter 6.
• Reducing Noise team: We have obtained cleaned versions of the data from the Reducing
Noise team, which helped us improve the document quality and the quality of the cluster
results.
• Solr team: Provides feedback on scoring the Solr search results based on
offline clustering.
• Hadoop team: Since we work with large volumes of data, we seek their help to configure
the Hadoop environment and to resolve issues related to developing the map-reduce
programs used in clustering.
Chapter 5
Implementation
5.0.13 Timeline
The implementation timeline of the project and member contributions are listed in
Table 5.1 in chronological order.
Weekly Report | Task | Done by
1 | Installation of Solr on laptops | Sujit, Rubasri, Hanna
1 | Clustering literature review | Sujit, Rubasri, Hanna
2 | Understanding of the workflow | Sujit, Rubasri, Hanna
2 | Mahout setup and integration with Lucene vector packages | Sujit
3 | Carrot2 clustering in Solr | Sujit
3 | Solr and Mahout integration | Sujit
3 | Reorganization of the previous week's report according to the Report15 requirements | Sujit, Rubasri, Hanna
4 | Downloading of the web pages mentioned in tweets using the script provided by the TA | Rubasri
4 | Indexing of web pages and tweets in Solr | Sujit, Hanna
4 | Exploration of clustering options in Mahout (K-means, streaming K-means, fuzzy K-means) | Sujit
5 | Extraction of all tweets from the CSV file into a sequence file using a Python script, since Mahout requires sequence files as input | Rubasri
5 | Clustering of sample tweets using Mahout | Rubasri
5 | Clustering of sample web pages using Mahout | Rubasri
6 | Conversion of the Avro format to sequence files and extraction of the cluster results from Mahout | Sujit
6 | Identification of 100 relevant tweets for the training data set and 100 other random tweets for the test data set, provided to the Classification team | Sujit
6 | Crawling of web pages using Nutch on our local machine and clustering of small collections | Rubasri
8 | Implementation of K-means clustering in Mahout and association of the clustering results with the input data collection | Sujit, Rubasri
8 | Extraction of the cluster labels (using the top term) from the cluster results and association of the labels with each tweet | Rubasri
8 | Modification of the Avro schema to include the cluster ID and cluster label for each tweet; output files with the clustering results were produced in Avro format | Sujit
9 | Analysis of Mahout clustering algorithms (streaming K-means, fuzzy K-means, K-means) | Sujit
9 | Crawling of web pages using Nutch on our local machine and clustering of big collections | Sujit
9 | Automation of clustering using a bash script | Sujit
9 | Implementation of hierarchical clustering | Rubasri
10 | Merging of the results from different levels of hierarchical clustering | Rubasri
10 | Implementation of KEA labelling | Hanna
10 | Statistics for evaluation | Sujit
10 | Clustering of cleaned web pages | Sujit
10 | Automation of the hierarchical clustering process using a bash script | Rubasri
Final | Clustering evaluation using Silhouette scores, confusion matrix, and human judgement | Sujit
Final | Final project presentation and report | Sujit, Rubasri, Hanna
Table 5.1: Implementation timeline and member contributions
Chapter 6
Evaluation
We have chosen three metrics for evaluation: Silhouette scores, confusion matrices,
and human judgement. In the following sections we provide our results, which include
evaluations for the tweet and web page collections.
6.1 Silhouette Scores

For each document i, the Silhouette coefficient compares a(i), the mean distance from i to the other documents in its own cluster, with b(i), the mean distance from i to the documents in the nearest other cluster [28]:

\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{6.1}
\]
The Silhouette score for a document collection is the mean of the coefficients
computed for each of the data points. To efficiently compute the Silhouette score we
use the scikit-learn Python package. A Silhouette score close to +1 indicates that the
documents are clustered with high quality, while a score close to −1 indicates that the
documents are clustered with poor quality. Normally, the Silhouette score for text documents
will be close to zero due to the sparsity of the documents (around 99% sparse). For our
evaluation we consider any Silhouette score greater than zero to be a decent clustering result.
Tables 6.1, 6.2, and 6.3 provide Silhouette scores for the small tweet collections, big tweet
collections, and web page collections, respectively.
From the scores we conclude that for most of the collections the clustering results
are good enough, as the Silhouette scores are greater than zero. As future work,
we propose that the scores can be improved by performing more advanced feature
selection, dimensionality reduction, and other clustering procedures.
Data Set | Silhouette Score
classification small 00000 v2 (plane crash S) | 0.0239099
classification small 00001 v2 (plane crash S) | 0.296624
clustering large 00000 v1 (diabetes B) | 0.124263
clustering large 00001 v1 (diabetes B) | 0.0284772
clustering small 00000 v2 (ebola S) | 0.0407911
clustering small 00001 v2 (ebola S) | 0.0163434
hadoop small 00000 (egypt B) | 0.206282
hadoop small 00001 (egypt B) | 0.264068
ner small 00000 v2 (storm B) | 0.0237915
ner small 00000 v2 (storm B) | 0.219972
noise large 00000 v1 (shooting B) | 0.027601
noise large 00000 v1 (shooting B) | 0.0505734
noise large 00001 v1 (shooting B) | 0.0329083
noise small 00000 v2 (charlie hebdo S) | 0.0156003
social 00000 v2 (police) | 0.0139787
solr large 00000 v1 (tunisia B) | 0.467372
solr large 00001 v1 (tunisia B) | 0.0242648
solr small 00000 v2 (election S) | 0.0165125
solr small 00001 v2 (election S) | 0.0639537
Table 6.3: Silhouette Scores for Web page Collections
shooting B, storm B, tunisia B. The confusion matrix for the concatenated big data set
is shown in Figure 6.2.
Figure 6.2: Confusion Matrix for Concatenated Big Tweet Collection
Unlike a confusion matrix for classification, in the context of clustering we are interested
in homogeneous labeling rather than accurate labeling, i.e., we are interested
only in the groupings of documents rather than the exact class to which they belong.
Thus, instead of providing exact labels to the clusters obtained, we label the
clusters A through G in both figures.
An interesting insight from the confusion matrices is that in both data sets
approximately 4 out of 7 collections are placed in the same cluster. After manually
analyzing the data sets, we found that most of the collections that fall into a single
cluster have similar event types (bomb, shooting, Tunisia, etc.). Thus, we concluded that
the clustering output is reasonable.
In addition to the confusion matrix, we have calculated the Silhouette scores for
each of the concatenated data sets. Table 6.4 shows the scores obtained for the
collections. Even where there is misclassification as judged by the confusion matrix, we
still see Silhouette scores greater than zero, which further supports our expectation
that the tweets in those collections might be similar.
6.3 Human Judgement
A third metric we have chosen for evaluating the clustering results is to compare the
results of our clustering and labeling method with human judgement. Due to
time constraints we have not evaluated the final results for all of the document collections;
in this section, we provide an evaluation of the Ebola data set.
By comparing the cluster results and manually analyzing random samples of the
documents in each cluster, we present a summary in Table 6.6.
Human Label | Sample Tweets
Death | Ebola kills fourth victim in Nigeria The death toll from the Ebola outbreak in Nigeria has risen to four whi...; RT Dont be victim 827 Ebola death toll rises to 826; RT Ebola outbreak now believed to have infected 2127 people killed 1145 health officials say; RT Two people in have died after drinking salt water which was rumoured to be protective against...
Doctors | US doctor stricken with the deadly Ebola virus while in Liberia and brought to the US for treatment in a speci...; Moscow doctors suspect that a Nigerian man might have...
Politics | RT For real Obama orders Ebola screening of Mahama other African Leaders meeting him at USAfrica Summit; Patrick Sawyer was sent by powerful people to spread Ebola to Nigeria Fani Kayode has reacted; RT The Economist explains why Ebola wont become a pandemic View video via; Obama Calls Ellen Commits to fight amp WAfrica
Symptoms | How is this Ebola virus transmitted; RT Ebola symptoms can take 2 21 days to show It usually start in the form of malaria or cold followed by Fever Diarrhoea; Ebola virus forces Sierra Leone and Liberia to withdraw from Nanjing Youth Olympics
Drugs | Ebola FG Okays Experimental Drug Developed By Nigerian To Treat...; RT Drugs manufactured in tobacco plants being tested against Ebola other diseases; Tobacco plants prove useful in Ebola drug production; EBOLA Western drugs firms have not tried to find vaccine because virus only affects Africans
Table 6.6: Human judgement labels with sample tweets from the Ebola collection
Chapter 7
7.0.1 Conclusion
In our project, we performed flat clustering on the various input data sets using
Mahout K-means clustering. In the Mahout K-means algorithm, the initial centroids
of the clusters are chosen at random, and the number of clusters must be specified.
Using empirical analysis we found that the best number of clusters is 5 for the small
data sets and 10 for the large data sets. In order to further improve the search quality
results, we performed hierarchical clustering on the input data sets by further clustering
the results obtained from flat clustering. We chose the top terms present in a cluster as
the cluster label. These top terms are identified from the K-means cluster dump results
using Mahout tools. Although top terms are not the best way to label clusters, they work
well for tweet collections (short text).
In order to verify the effectiveness of the clustering, we evaluated the clustering
results using Silhouette scores, confusion matrices, and human judgement. Silhouette
scores measure how similar the documents are within each cluster and how dissimilar
the documents are across different clusters. We obtained positive Silhouette scores for
all of the data sets, which shows that the quality of the clustering is commendable.
In addition to Silhouette scores, a confusion matrix was also used to evaluate the
quality of the clustering, wherein we used the K-means algorithm with various tunable
parameters.
Due to the high sparsity of the data sets, the Silhouette scores are low (close to zero).
However, feature transformation methods like Latent Semantic Analysis (LSA) can
be applied to transform the data set into a lower-dimensional space, decreasing the
sparsity and increasing the Silhouette scores. We also found that the Silhouette
scores of web pages are higher than those of tweets, mainly because of the greater length
and lower sparsity index of web pages compared to tweets.
Using the top terms from the cluster dump results and searching Wikipedia for the best
possible titles that match those top terms could be an enhancement to this work.
Appendices
Appendix A
User manual
A.0.3 Pre-requisites
Users are recommended to have the following environment set up prior to performing
the data clustering described in the later subsections.
$ java -jar AvroToSequenceFilesCleanedTweetDocId.jar ./ebola_S/part-m-00000.avro ./ebola_S/part-m-00000.seq

$ hadoop fs -mkdir cleaned_tweets_docid
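Presumably the sequence file is then copied into that HDFS directory before running the clustering script; a step along these lines, where the target directory is assumed to be the one created above:

$ hadoop fs -put ./ebola_S/part-m-00000.seq cleaned_tweets_docid/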
$ ./clustering_cs5604s15.sh <input> <output> 2>&1 | tee logfile_name.log

# copy cluster output to local fs
$ hadoop fs -copyToLocal <output>/output-kmeans .

# execute cluster labeling
$ java -jar labelWithIDAvroOut.jar <output folder name> <output file name>

# final output will be located in:
$ ls -al <output folder name>/output-kmeans/<output file name>.avro
# Copy the folder to HDFS
$ hadoop fs -put <output of cluster results folder> <level 2 input folder>

# ensure clustering output is present
$ hadoop fs -ls <level 2 output folder>/<data set>/output-kmeans

# copy cluster output to local fs
$ hadoop fs -copyToLocal <level 2 output folder>/<data set>/output-kmeans .

# execute cluster labeling on each of the results generated
$ java -jar labelWithIDAvroOut.jar <level 2 output folder>/<data set> <data set>

# merge the results of both the levels of clustering
$ java -jar merge.jar <level 1 output folder> <level 2 output folder>
Appendix B
Developers manual
This appendix details the implementation of our project and provides further information
for a developer who is interested in extending this work.
In this manual it is assumed that the development environment is configured as
follows:
Operating System: Ubuntu 14.04 Linux
CPU Architecture: x86_64
CPU cores: 4
Memory: 4 GB
Java version: 1.7.0_75
JVM: 64-bit Server
IDE: Eclipse (Luna with Maven repositories indexed)
Figure B.1: Snapshot of Solr user interface
$ mahout lucene.vector \
    --dir /solr/collection1/data/index \
    --field content \
    --dictOut dict.txt \
    --output dict.out
In order to ensure that the term vectors can be extracted using the lucene.vector
package, the Solr schema.xml should be modified to enable term vectors for the indexed
content field.
The webpages collection is put into HDFS using the following command.
$ hadoop fs -put Webpages webpage/
Since Mahout accepts only sequence files as its input format, the extracted web pages,
which are in the form of text files, are converted to a sequence file using the following
command.
$ mahout seqdirectory -i webpages/Webpages/ \
    -o webpages/WebpagesSeq -xm sequential
“-xm sequential” is given to specify that the conversion has to be done sequentially.
If it is omitted, the conversion would be done in the form of mapreduce jobs.
The TFIDF vectors from the sequence file are generated using the following com-
mand.
$ mahout seq2sparse -i webpages/WebpagesSeq \
    -o webpages/WebpagesVectors
Canopy clustering is done before K-means clustering to guess the best value of K.
Output of this stage becomes the input of K-means clustering.
$ mahout canopy \
    -i webpages/WebpagesVectors/tfidf-vectors/ \
    -o webpages/WebpagesCentroids -t1 500 -t2 250
Each canopy cluster is represented by two circles. The radius of the outer circle is
T1 and the radius of the inner circle is T2. The options “-t1” and “-t2” specify
the T1 and T2 radius thresholds, respectively. The option “-dm” specifies which
distance measure to use (the default is SquaredEuclidean).
The results of canopy clustering can be dumped to a text file using the following
command. This step is optional.
$ mahout clusterdump -dt sequencefile -d \
    webpages/WebpagesVectors/dictionary.file-* \
    -i webpages/WebpagesCentroids/clusters-0-final \
    -o webpages_report.txt
The option “-cd” specifies the convergence delta. The default is 0.5. “-x” specifies
the maximum number of iterations and “-cl” specifies that K-means clustering has
to be done after the canopy clustering.
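A K-means invocation of roughly the following form ties these options together; this is a sketch reconstructed from the directory names and options described above, not necessarily the verbatim command used:

$ mahout kmeans \
    -i webpages/WebpagesVectors/tfidf-vectors/ \
    -c webpages/WebpagesCentroids/clusters-0-final \
    -o webpages/WebpagesClusters \
    -cd 0.5 -x 10 -cl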
Finally, the results of K-means clustering are dumped using the following command.
$ mahout clusterdump -dt sequencefile -d \
    webpages/WebpagesVectors/dictionary.file-* \
    -i webpages/WebpagesClusters/clusters-2-final \
    -o webpages_report_kmeans.txt
Figure B.2 shows an example of the fuzzy k-means cluster dump output. As Fuzzy
K-Means is a soft clustering algorithm, each document can belong to multiple clusters.
From the observed output we identified that the letter “i” appears as a
top term in the cluster output. This is expected since we are using uncleaned data;
this was reported to the Reducing Noise team.
Streaming K-Means:
Choosing approximate cluster centroids using Canopy clustering and then using
k-means to produce the cluster output is inefficient, and the Hadoop cluster might take a long
time to process the jobs. This is mainly due to the squared Euclidean distance
computations over every data point in the collection; when the collection is
very large, clustering takes a long time to converge. Streaming k-means overcomes
this disadvantage. It has two steps: (1) a streaming step and (2) a ball k-means step. In
the streaming step, the algorithm passes through all the data points once and computes
an approximate set of centroids. In the ball k-means step, an optimized algorithm
is used to compute the clusters more efficiently and accurately than the conventional
k-means algorithm.
The command used for streaming k-means has the following form.
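This is a restatement of the streamingkmeans step from the clustering script in subsection B.0.17; the input and output paths are placeholders rather than the exact values used there:

$ mahout streamingkmeans \
    -i <vectors>/tfidf-vectors/ \
    --tempDir <output>/tmp \
    -o <output>/output-streamingkmeans \
    -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -k 10 -km 130 -ow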
In our further evaluation, we have generated a tiny collection with just 1000 tweets
to ensure that our clustering flow is correct. We have extracted 1000 tweets from
the Avro file and generated a sequence file in the format understandable by Ma-
hout. Further, we have used canopy and k-means algorithms (similar to webpage
clustering) to perform clustering on this tiny data set.
In Figure B.3, we present brief statistics from running streaming K-means
on the cleaned ebola S tweet data set. The statistics are collected using the “qualcluster”
tool in the Mahout library. They indicate that the chosen tunable
parameters perform reasonably well with streaming k-means. The average
distance in all of the clusters is around 400, while the maximum distance from a centroid
to its farthest point is 2295. This difference is expected, as the k-means algorithm is
susceptible to outliers. In future evaluations, we will attempt to identify such outliers
and filter them out during the pre-processing stages.
for further evaluation - https://issues.apache.org/jira/browse/MAHOUT-1698.
Figure B.4 shows the clustering results. The first column is the tweet ID, the second
column is the cluster ID, and the third column is the cluster label. We used Python
to extract the cluster ID and the cluster label associated with each of the clusters
from the cluster dump output. Another Python program extracts the tweet
IDs and the cluster IDs associated with each of the tweets and uses the label output
obtained from the previous step to associate each tweet with the cluster
it belongs to. We then modified the Avro schema with the results obtained to
produce an Avro output file with the cluster results. The Avro output for the Ebola data set
can be found on the cluster at user/cs5604s15 cluster/clustered tweets/ebola S.
We iteratively applied the Mahout K-means clustering algorithm to each of these
clusters (sequence files) to obtain another level of hierarchy in the clustering of documents.
The results obtained are merged with the initial clustering output and then converted to
Avro format. We used Python to parse the clustering results obtained from Mahout and
create a text file with tweet IDs and cluster labels; the final merged output is also a
text file, which we then converted to Avro format.
As we can observe from the cluster labels, noise is still present, and words like “My”,
“We”, and “I” appear frequently, contributing heavily to the top terms. We have reported
this observation to the Reducing Noise team to further optimize the noise reduction.
In addition, we have used feature selection to reduce such occurrences of stop words.
package edu.vt.cs5604s15.clustering;

import java.io.File;
import java.net.URI;
// ... (remaining imports and the class declaration are elided in this copy of the listing)

    public int run(String[] args) throws Exception {
        // read the Avro input generated by Sqoop
        File file = new File("/home/sujit/workspace/cs5604/ebola_S_AVRO/part-m-00000.avro");
        DatumReader<sqoop_import_z_224> datumReader =
            new SpecificDatumReader<sqoop_import_z_224>(sqoop_import_z_224.class);
        DataFileReader<sqoop_import_z_224> dataFileReader =
            new DataFileReader<sqoop_import_z_224>(file, datumReader);
        sqoop_import_z_224 data = new sqoop_import_z_224();
        // ... (loop that reads each Avro record and writes it to a sequence file, elided)
        return 0;
    }

    public static void main(String[] args) {
        int result = 0;
        System.out.println("Starting Avro to Sequence file generation");
        try {
            result = ToolRunner.run(new Configuration(), new AvroToSequenceFiles(), args);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(result);
    }
}
The following code is used to run the clustering job on the Hadoop cluster. The input to
the clustering_cs5604s15.sh script is an HDFS sequence file generated from the Avro input
as in the code listing above. The output of clustering, as of now, is statistics in the case
of streaming k-means and a cluster dump with top terms in the case of fuzzy k-means.
Our implementation of cluster labeling is still in progress; once cluster
labeling is complete, the final output will be produced in Avro format.
#!/bin/bash

# Author: Sujit Thumma
# inspired from https://github.com/apache/mahout/blob/master/examples/bin/cluster-reuters.sh
# Clustering job for CS5604 class

# ... (MAHOUT driver detection and directory setup elided in this copy of the listing) ...
#  echo "Can't find mahout driver in $MAHOUT, cwd `pwd`, exiting.."
#  exit 1
#fi

algorithm=( streamingkmeans fuzzykmeans )
if [ -n "$3" ]; then
  choice=$1
else
  echo "Please select a number to choose the corresponding clustering algorithm"
  echo "1. ${algorithm[0]} clustering"
  echo "2. ${algorithm[1]} clustering"
  read -p "Enter your choice : " choice
fi

# ... (INPUT_DIR/OUTPUT_DIR handling and the start of the streaming k-means branch,
#      including its seq2sparse invocation, elided in this copy) ...
  -o ${OUTPUT_DIR}/output-seqdir-sparse-streamingkmeans -ow --maxDFPercent 85 --namedVector \
  && \
  echo "Step2: Mahout streamingkmeans" && \
  $MAHOUT streamingkmeans \
    -i ${OUTPUT_DIR}/output-seqdir-sparse-streamingkmeans/tfidf-vectors/ \
    --tempDir ${OUTPUT_DIR}/tmp \
    -o ${OUTPUT_DIR}/output-streamingkmeans \
    -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -k 10 -km 130 -ow \
  && \
  echo "Step3: qualcluster for streaming kmeans" && \
  $MAHOUT qualcluster \
    -i ${OUTPUT_DIR}/output-seqdir-sparse-streamingkmeans/tfidf-vectors/part-r-00000 \
    -c ${OUTPUT_DIR}/output-streamingkmeans/part-r-00000 \
    -o ${OUTPUT_DIR}/output-streamingkmeans-cluster-distance.csv \
  && \
  cat ${OUTPUT_DIR}/output-streamingkmeans-cluster-distance.csv
  echo "Check streamingkmeans results in $OUTPUT_DIR/output-streamingkmeans-cluster-distance.csv"
elif [ "x$clustertype" == "xfuzzykmeans" ]; then
  echo "Step1: Running seq2sparse for fuzzykmeans" && \
  $MAHOUT seq2sparse \
    -i ${INPUT_DIR}/input-seqdir/ \
    -o ${OUTPUT_DIR}/output-seqdir-sparse-fkmeans -ow --maxDFPercent 85 --namedVector \
  && \
  echo "Step2: Mahout fkmeans" && \
  $MAHOUT fkmeans \
    -i ${OUTPUT_DIR}/output-seqdir-sparse-fkmeans/tfidf-vectors/ \
    -c ${OUTPUT_DIR}/output-fkmeans-clusters \
    -o ${OUTPUT_DIR}/output-fkmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 10 -k 20 -ow -m 1.1 \
  && \
  echo "Step3: clusterdump for fkmeans" && \
  $MAHOUT clusterdump \
    -i ${OUTPUT_DIR}/output-fkmeans/clusters-*-final \
    -o ${OUTPUT_DIR}/output-fkmeans/clusterdump \
    -d ${OUTPUT_DIR}/output-seqdir-sparse-fkmeans/dictionary.file-0 \
    -dt sequencefile -b 100 -n 20 -sp 0 \
  && \
  # cat ${WORK_DIR}/reuters-fkmeans/clusterdump
  echo "check fkmeans results in $OUTPUT_DIR/output-fkmeans/clusterdump"
else
  echo "unknown cluster type: $clustertype"
fi
echo "Done!"
Keyword selection
Kea first needs to create a model that learns the extraction strategy from manually
indexed documents. This means that for each document in the input directory there
must be a file with the extension ".key" and the same name as the corresponding
document. This file should contain manually assigned keywords, one per line. Given
the list of candidate phrases, Kea marks those that were manually assigned as
positive examples and all the rest as negative examples. By analyzing the feature
values for positive and negative candidate phrases, a model is computed which
reflects the distribution of feature values for each phrase. As we cannot manually
assign keys (.key files) for the documents, we used the words from the first
line of each document as the .key file. Later we planned to enhance this method
by using the results from the clustering step (the Map-Reduce output), which gives the
most frequent words in each cluster, as input for the .key files for KEA. When extracting
keywords from new documents, KEA takes the model and the feature values for each
candidate phrase and computes its probability of being a keyword. Phrases with
the highest probabilities are selected into the final set of keywords. The user can
specify the number of keywords that need to be selected. We select the 10
highest-probability keywords for each cluster as the output from KEA.
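A minimal sketch of the .key bootstrapping step described above, assuming the training documents sit in a flat directory of .txt files (the directory name kea_docs is illustrative):

# create a .key file for each document from its first line
$ for f in kea_docs/*.txt; do head -n 1 "$f" > "${f%.txt}.key"; done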
KEA is set up by adding it to the Java classpath, e.g.:

$ export CLASSPATH=$CLASSPATH:$KEAHOME

Also add $KEAHOME/lib/*.jar to your CLASSPATH environment variable. The bundled TestKea example can then be compiled and run to verify the installation:

$ javac TestKea.java
$ java -Xmx526M TestKea
To extract keyphrases for some documents, put them into an empty directory and rename
them so that they end with the suffix ".txt". Then apply the keyphrase extraction model
by running KEAKeyphraseExtractor:
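The option names below are assumptions based on the KEA distribution's documented extractor interface rather than a command reproduced from our scripts, so they should be checked against the KEA README before use:

$ java -Xmx526M kea.main.KEAKeyphraseExtractor \
    -l <directory containing the .txt documents> \
    -m <model built from the .key training files> \
    -v none \
    -n 10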
Keyword ranking
The keywords are ranked by the number of documents in which they serve as important
terms: the terms are ranked by the sum of their TF-IDF scores over the documents in each
cluster. The top 20 terms form a label list, and the top 10 are used as cluster labels.
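Written out, the ranking score for a term $t$ within a document cluster $C$ is simply

\[
\mathrm{score}(t, C) = \sum_{d \in C} \mathrm{tfidf}(t, d),
\]

and terms are sorted by this score in decreasing order to form the label list.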
Appendix C
File Inventory
The document clustering project file inventory is provided in Table C.1.
File | Description
ClusteringReport.pdf | Final project technical report
ClusteringReport.zip | LaTeX source code for generating the report (includes Readme.txt to reproduce the report)
ClusteringPPT.pdf | Final project presentation in PDF format
ClusteringPPT.pptx | Final project presentation in editable PPTX format
ClusteringCodeFiles.zip | Project source code and binaries (see C.2 for more details)

File | Description
extract docid from points.sh | Extracts document IDs from clustered points
lda-team-eval.sh | Wrapper script to compute row similarity
rows-similarity-computation.sh | Row similarity computation using Mahout
cluster out to avro/cluster out to avro.py | Cluster output to Avro file conversion

File | Description
AvroToSequenceFilesCleanedTweetDocId.jar | Converts Avro input to sequence files with the tweet schema
AvroToSequenceFilesCleanedWebpages.jar | Converts Avro input to sequence files with the web page schema

File | Description
confusion matrix for clustering.py | Builds a confusion matrix for data sets with known labels
dictsize.sh | Dictionary size calculator for various data sets
silhoutte concat tweets.sh | Silhouette score calculator for concatenated tweet data sets
silhoutte evaluation clustering.py | Main script to calculate Silhouette scores
silhoutte.sh | Silhouette score calculator wrapper script
sparsity.py | Sparsity index calculator
sparsity.sh | Wrapper script to calculate the sparsity index
tweets.txt, webpages.txt | Data set names used as input to the above Bash scripts

File | Description
clustering/src/main/java/edu/vt/cs5604s15/clustering/ | Source code for Avro to sequence file conversion
shorturlexpansion/src/main/java/edu/vt/cs5604s15/shorturlexpansion/UrlExpander.java | Source code for short URL to long URL expansion

File | Description
Hierarchical clustering.sh | Bash script to perform hierarchical clustering
HClustering.zip | Source code and binary for hierarchical clustering
LabelCluster.zip | Source code and binary for cluster labeling
Merge.zip | Source code and binary for merging hierarchical cluster results
Readme.txt | Provides further information on how to execute the binaries
Acknowledgement
We would like to thank our sponsor, US National Science Foundation, for funding
this project through grant IIS - 1319578.
We would like to thank our classmates for their extensive evaluation of this report
through peer reviews and for the invigorating discussions in class that helped us greatly
in the completion of the project.
We would like to specifically thank our instructor, Prof. Edward A. Fox, and the teaching
assistants, Mohamed Magdy and Sunshin Lee, for their constant guidance and encouragement
throughout the course of the project.
References
[1] Jain, Anil K., M. Narasimha Murty, and Patrick J. Flynn. “Data clustering: a
review.” ACM Computing Surveys (CSUR) 31.3 (1999): 264-323.
[2] The Apache Mahout project’s goal is to build a scalable machine learning
library. http://mahout.apache.org/
[3] Learning Mahout: Clustering. http://sujitpal.blogspot.com/2012/09/
learning-mahout-clustering.html Last accessed: 02/19/2015
[4] Apache Hadoop. http://hadoop.apache.org/ Last accessed: 02/19/2015
[5] Jain, Anil K. Data clustering: 50 years beyond K-means. Pattern recognition
letters 31.8 (2010): 651-666.
[6] Berry, Michael W. Survey of text mining. Computing Reviews 45.9 (2004):
548.
[7] Beil, Florian, Martin Ester, and Xiaowei Xu. Frequent term-based text clus-
tering. Proceedings of the eighth ACM SIGKDD international conference on
Knowledge discovery and data mining. ACM, 2002.
[8] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schtze. Introduc-
tion to information retrieval. Vol. 1. Cambridge: Cambridge university press,
2008.
[9] Steinbach, Michael, George Karypis, and Vipin Kumar. A comparison of doc-
ument clustering techniques. KDD workshop on text mining. Vol. 400. No. 1.
2000.
[10] Zhao, Ying, and George Karypis. Evaluation of hierarchical clustering algo-
rithms for document datasets. Proceedings of the eleventh international con-
ference on Information and knowledge management. ACM, 2002.
[11] Dhillon, Inderjit S., Subramanyam Mallela, and Rahul Kumar. Enhanced word
clustering for hierarchical text classification. Proceedings of the eighth ACM
SIGKDD international conference on Knowledge discovery and data mining.
ACM, 2002.
[12] Rosa, Kevin Dela, et al. Topical clustering of tweets. Proceedings of the ACM
SIGIR: SWSM (2011).
[13] Zamir, Oren, and Oren Etzioni. Web document clustering: A feasibility demon-
stration. Proceedings of the 21st annual international ACM SIGIR conference
on Research and development in information retrieval. ACM, 1998.
[14] Strehl, Alexander, Joydeep Ghosh, and Raymond Mooney. Impact of similarity
measures on web-page clustering. Workshop on Artificial Intelligence for Web
Search (AAAI 2000). 2000.
[15] Cooley, Robert, Bamshad Mobasher, and Jaideep Srivastava. Data preparation
for mining world wide web browsing patterns. Knowledge and information
systems 1.1 pp:5-32 (1999).
[16] Python package scikit-learn: different clustering algorithm implemented
in python. http://scikit-learn.org/stable/modules/clustering.html#
clustering Last accessed: 02/19/2015
[17] Python package collective.solr 4.0.3: Solr integration for external index-
ing and searching. https://pypi.python.org/pypi/collective.solr/4.0.
3 Last accessed: 02/19/2015
[18] Python Package Pattern2.6: a web mining module for data mining (Google
+ Twitter + Wikipedia API, web crawler, HTML DOM parser), machine
learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM
classifiers) and network analysis (graph centrality and visualization). https:
//pypi.python.org/pypi/Pattern Last accessed: 02/19/2015
[19] Carrot2 is an Open Source Search Results Clustering Engine. It can automati-
cally organize small collections of documents. http://project.carrot2.org/
Last accessed: 02/19/2015
[20] Apache SOLR and Carrot2 integration strategies. http://carrot2.github.
io/solr-integration-strategies/ Last accessed: 02/19/2015
[21] Gruss, Richard; Morgado, Daniel; Craun, Nate; Shea-Blymyer, Colin,
OutbreakSum: Automatic Summarization of Texts Relating to Disease
Outbreaks. https://vtechworks.lib.vt.edu/handle/10919/51133 Last ac-
cessed: 02/19/2015
[22] Apache Solr: Quick Start Tutorial. http://lucene.apache.org/solr/
quickstart.html. Last accessed 02/13/2015
[23] Apache Solr: SchemaXML Wiki. http://wiki.apache.org/solr/SchemaXml
Last accessed: 02/13/2015
[24] Apache Maven: Downloads. http://maven.apache.org/download.cgi Last
accessed: 02/13/2015
[25] Schwanke, R. W., and Platoff, M. A. Cross References are Features. Proc.
ACM 2nd Intl. Workshop on Software Configuration Management, Princeton,
N. J., October 1989, 86-95.
[26] M. Shtern and V. Tzerpos, On the comparability of software clustering algo-
rithms, Intl Conf. on Program Compre., ICPC, pp. 64-67, 2010.
[27] Tonella, P., Ricca, F., Pianta, E., and Girardi, C. (2003, September). Using
keyword extraction for web site clustering. In Web Site Evolution, 2003. Theme:
Architecture. Proceedings. Fifth IEEE International Workshop on, 41-48.
[28] Rousseeuw, Peter J. “Silhouettes: a graphical aid to the interpretation and val-
idation of cluster analysis.” Journal of computational and applied mathematics
20 (1987): 53-65.