A Novel Visual Analytics Approach For Clustering Large-Scale Social Data

Abstract—Social data refers to data individuals create that is knowingly and voluntarily shared by them, and it offers an exciting avenue into gaining insight into interpersonal behaviors and interactions. However, such data is large, heterogeneous and often incomplete, properties that make its analysis extremely challenging. One common method of exploring such data is cluster analysis, which can enable analysts to find groups of related users, behaviors and interactions. This paper presents a novel visual analysis approach for detecting clusters within large-scale social networks by utilizing a divide-analyze-recombine scheme that sequentially performs data partitioning, subset clustering and result recombination within an integrated visual interface. A case study on a microblog messaging dataset (with 4.8 million users) is used to demonstrate the feasibility of this approach, and comparisons are also provided to illustrate its performance benefits with respect to existing solutions.

Keywords—Divide and Recombine; Cluster Analysis; K-means; Visual Analysis

I. INTRODUCTION

As humans have become more intrinsically connected to technology, details of their behaviors and interpersonal connections have become increasingly transparent. Activities and communications can automatically be collected during interpersonal activities such as mobile phone communications, e-commerce transactions and internet activity (Tweets, Facebook posts, blog entries), and such data can be classified into two components: behavioral data and personal data. Behavioral data can be thought of as a record of a user's activity within society; for example, a mobile phone call between individuals, a video conference, and sending a tweet all provide links between a single user and other individuals. Such data tends to reflect a user's daily patterns and behaviors [1], and we will refer to this type of data as user behavior data. Personal data can be thought of as being intrinsically tied to a single individual and would include things such as their age and gender. We will refer to this type of data as user tag data.

Given the ubiquity of such data, methods for exploring and analyzing social data links have become a critical research topic. By analyzing user tags and user behavior data, it is expected that insights into societal patterns can be gleaned and used to inform governmental policy and business decisions as well as influence human behavior [2]. For example, previous work [3] explores clustering users into groups to develop targeted marketing strategies based on demographics and behavioral patterns. However, our ability to extract knowledge from these large heterogeneous data sources is still limited due to the latent, subtle and unpredictable relationships that may be hidden within the data. These problems are further compounded by issues of data size and the non-linear computational complexity related to the data size.

One means of reducing the computational complexity of an analysis task is through the application of parallelization using distributed computing [4] and super-computing [5]. However, for analysis tasks that require frequent data transfer and interaction, parallelization is still problematic. Another means of reducing the computational complexity is through the use of divide-and-conquer schemes. A divide-and-conquer algorithm works by recursively breaking down a problem into sub-problems whose solutions can then be recombined to address the original problem.

Recent work has demonstrated the effectiveness of divide-and-conquer approaches for handling large data. For example, RHIPE [6] is built upon R and has been specifically developed to provide a divide-and-conquer solution for the statistical analysis of large data. RHIPE utilizes the MapReduce framework, and MapReduce itself has been used in large-scale data analysis (e.g., [5]). Furthermore, in the data mining community, it is believed that the divide-and-recombine scheme will be adapted to handle the scalability problems found in many complex machine learning tasks [7].
Figure 1. Conceptual overview of our approach. (a) Visually assisted adaptive data division; (b) Context-aware subset clustering; (c) Incremental recombination of clusters; (d) Visual exploration of clusters and unusual data points.
propagated into the other subsets one-by-one, and clustering is then done on each subset. In the third stage, the clusters of all subsets are merged with a hierarchical clustering process. An integrated visual interface is designed to allow analysts to explore the user tags and associated user behavior information of each cluster. Compared with existing divide-and-recombine approaches, our approach combines data mining techniques and interactive visuals, which enables the analyst to explore patterns in large-scale social data.

A. K-Means Clustering

One of the key parts of our data analysis process is the utilization of the K-means algorithm. The K-means algorithm is one of the most popular clustering methods [10], and is employed in our approach due to its simplicity, efficiency and robustness. For a set of points X = {x_i | x_i ∈ R^d}, K data points are randomly selected as the initial centroids of the intended K clusters. Two steps are then recursively performed until the algorithm converges:

• Step 1: Assign each data point to its closest centroid.
• Step 2: Update the centroids with the mean value of all the data points assigned to them.

Suppose that the resultant clusters are X_1, X_2, ..., X_K. The quality of the clustering subject to K can be measured by the sum of the distances between each point pair in the clusters:

D(K) = Σ_{r=1}^{K} Σ_{x_i, x_j ∈ X_r} ‖x_i − x_j‖    (1)

The clustering quality depends on the specification of the initial centroids [19]. Typical K-means solutions either repeat the computation of centroids several times using different initial seed points and then choose the optimal result based on the quality of the clusters, or apply specific optimizations, which can result in long runtimes and storage overheads when handling large-sized data. Our approach to improving the performance of K-means on large-scale data is to utilize a divide-and-recombine scheme to reduce the data complexity of K-means clustering. Furthermore, in order to adapt the K-means algorithm specifically to social data, there are several key aspects that need to be addressed:

• Data Division: How to determine the appropriate subset size so that the combination of the clustered results of the subsets can yield the same quality as the conventional K-means algorithm?
• Data Clustering: How to select K and a distance metric for multi-dimensional user attribute vectors to achieve optimal clustering results?
• Recombination and Analysis: How to evaluate the clustering quality and discover interesting users and patterns from the clustered results?
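The two-step iteration and the quality measure D(K) of Eq. (1) described above can be sketched as follows. This is an illustrative stand-alone version, not the paper's implementation; all class and method names are ours.

```java
// Minimal K-means sketch: the two-step iteration plus the pairwise
// quality measure D(K) from Eq. (1). Illustrative only.
public class KMeansSketch {

    // Repeats Step 1 and Step 2 until assignments stop changing.
    public static int[] cluster(double[][] points, double[][] centroids, int maxIter) {
        int[] labels = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Step 1: assign each data point to its closest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (dist(points[i], centroids[c]) < dist(points[i], centroids[best])) best = c;
                if (labels[i] != best) { labels[i] = best; changed = true; }
            }
            if (!changed) break;
            // Step 2: update each centroid to the mean of its assigned points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int i = 0; i < points.length; i++)
                    if (labels[i] == c) {
                        for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                        count++;
                    }
                if (count > 0)
                    for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
            }
        }
        return labels;
    }

    // Eq. (1): sum of distances over every point pair inside each cluster.
    public static double qualityD(double[][] points, int[] labels, int k) {
        double total = 0;
        for (int r = 0; r < k; r++)
            for (int i = 0; i < points.length; i++)
                for (int j = i + 1; j < points.length; j++)
                    if (labels[i] == r && labels[j] == r) total += dist(points[i], points[j]);
        return total;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        int[] labels = cluster(pts, new double[][]{{0, 0}, {10, 10}}, 100);
        System.out.println(labels[0] == labels[1] && labels[2] == labels[3]); // true
        System.out.println(qualityD(pts, labels, 2)); // 1.0 + 1.0 = 2.0
    }
}
```

A lower D(K) for the same K indicates tighter clusters, which is how the divided-and-recombined result is compared against standard K-means later in the paper.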
B. Adaptive Data Division

In terms of clustering, one criterion for dividing a massive dataset into smaller subsets is to preserve the statistical distribution in each subset [20], which poses a challenging problem in determining the initial splits of the divide-and-recombine process. We propose to randomly subdivide the entire set into subsets of uniform size [10] and to adaptively determine the subset size by considering both performance and quality.

Our solution is to employ a visually assisted adaptive data division scheme. For each subset, a pixel chart [21] is generated to show the statistical distribution of each attribute value of the user attribute vector. In particular, all subsets are organized into a 2D array, which is further represented with a set of 2D cells. The saturation of each cell encodes the percentage of users with the corresponding attribute value in each subset, as illustrated in Figure 2. Figure 3 shows a group of pixel charts with a subset size of 50.

Figure 3. The pixel charts with respect to 42 user attribute values with 50 subsets.

The algorithm for adaptive data division is summarized in Algorithm 1.
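The adaptive algorithm itself is not sketched here, but the random uniform subdivision it builds on, together with the per-subset attribute distribution that a pixel-chart cell's saturation encodes, can be illustrated as follows (class and method names are ours, not the paper's):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of random uniform subdivision: shuffle the records, slice them into
// subsets of (near-)uniform size, then tally the per-subset distribution of
// one attribute. A simplified stand-in for the paper's adaptive algorithm.
public class DataDivisionSketch {

    public static <T> List<List<T>> divide(List<T> data, int subsetSize, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        List<List<T>> subsets = new ArrayList<>();
        for (int start = 0; start < shuffled.size(); start += subsetSize)
            subsets.add(shuffled.subList(start, Math.min(start + subsetSize, shuffled.size())));
        return subsets;
    }

    // Fraction of users in a subset having each attribute value (0..numValues-1);
    // this is the quantity a pixel-chart cell's saturation would encode.
    public static double[] distribution(List<Integer> subset, int numValues) {
        double[] frac = new double[numValues];
        for (int v : subset) frac[v]++;
        for (int i = 0; i < numValues; i++) frac[i] /= subset.size();
        return frac;
    }

    public static void main(String[] args) {
        List<Integer> attr = new ArrayList<>();
        for (int i = 0; i < 1000; i++) attr.add(i % 4); // 4 attribute values, 25% each
        List<List<Integer>> subsets = divide(attr, 100, 42L);
        System.out.println(subsets.size()); // 10 subsets of 100 users
        // Random uniform division roughly preserves the 25% share per value.
        System.out.println(distribution(subsets.get(0), 4)[0]);
    }
}
```

Comparing such per-subset distributions against the whole-set distribution is what lets an analyst judge, visually, whether a candidate subset size preserves the statistics well enough.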
Figure 4. The relationship between the cluster number K (the x axis) and the sum of distances D(K) (the y axis).

space, and the choice of distance metric will greatly influence the results. In this case, we cluster the underlying dataset based on the user attribute vectors, where the problem is identical to specifying an appropriate weight for each attribute: a high weight can be used for a salient attribute, while an attribute has little influence on the clustering if its weight is low. For clarity, we define the weights of a user attribute vector x = (x_1, x_2, ..., x_d) as w = (w_1, w_2, ..., w_d). However, the influence of each attribute on the clustering result and the relations among attributes are difficult to model and represent. Furthermore, attributes may have

The set of {H(m, n), m = 1, 2, ..., M, n = 1, 2, ..., M} forms an M × M matrix. By normalizing and encoding each element of the matrix with a color, a sensitivity map associated with the underlying attribute is generated (Figure 5). This map implies the proximity among the clustered results under different weighting configurations: a low value of H(m, n) indicates that two weighting configurations yield similar clustered results, and vice versa. The overall distribution of the sensitivity map and its average value can be used to study the relevance of the attribute to the clustering.

A sensitivity map can be generated for each attribute. The set of maps characterizes the influence of a set of user attributes on the variation of the clustering process and can be used to guide the weight function. The weight of each attribute is specified to be proportional to the average value of its associated map. The user can manually adjust the sensitivity of a map by moving it vertically. The weights corresponding to the sensitivities are shown with a radial map.
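The construction of the M × M matrix and the weight derived from it can be sketched as follows. The exact definition of H(m, n) is not reproduced above, so as an assumption this sketch measures the dissimilarity of two clusterings as the fraction of point pairs on which they disagree (0 for identical groupings, larger for more dissimilar results); all names are illustrative.

```java
// Sketch of a sensitivity matrix for one attribute. H(m, n) is assumed here
// to be a pairwise-disagreement distance between the clusterings produced
// under weighting configurations m and n; this is our assumption, since the
// paper's exact definition of H is not shown in the text above.
public class SensitivityMapSketch {

    // Fraction of point pairs grouped together in one labeling but not the other.
    public static double pairDisagreement(int[] a, int[] b) {
        int pairs = 0, differ = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = i + 1; j < a.length; j++) {
                pairs++;
                boolean sameA = a[i] == a[j], sameB = b[i] == b[j];
                if (sameA != sameB) differ++;
            }
        return pairs == 0 ? 0 : (double) differ / pairs;
    }

    // labelings[m] holds the cluster labels obtained under configuration m.
    public static double[][] sensitivityMatrix(int[][] labelings) {
        int m = labelings.length;
        double[][] h = new double[m][m];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                h[i][j] = pairDisagreement(labelings[i], labelings[j]);
        return h;
    }

    // The attribute weight is taken proportional to the average matrix value.
    public static double averageValue(double[][] h) {
        double sum = 0;
        for (double[] row : h) for (double v : row) sum += v;
        return sum / (h.length * h.length);
    }

    public static void main(String[] args) {
        int[][] labelings = {{0, 0, 1, 1}, {0, 0, 1, 1}, {0, 1, 0, 1}};
        double[][] h = sensitivityMatrix(labelings);
        System.out.println(h[0][1]); // identical clusterings -> 0.0
        System.out.println(h[0][2] > 0); // different clusterings -> true
    }
}
```

An attribute whose matrix has a high average value changes the clustering a lot as its weight varies, which is why it earns a proportionally higher weight.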
all other subsets. Algorithm 2 briefly summarizes the details. The incremental data clustering scheme not only allows us
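The text above describes centroids being propagated into the other subsets one-by-one before each subset is clustered. A plausible sketch of that incremental scheme, assuming each subset's K-means run is seeded with the centroids produced on the previous subset (the K-means kernel here is a compact stand-in, not the paper's Algorithm 2):

```java
// Sketch of incremental subset clustering: the centroids obtained on one
// subset seed the K-means run on the next subset, instead of reseeding
// each subset randomly. Names and structure are illustrative.
public class IncrementalClusteringSketch {

    public static double[][] kmeans(double[][] pts, double[][] seeds, int iters) {
        int k = seeds.length, d = seeds[0].length;
        double[][] centroids = new double[k][d];
        for (int c = 0; c < k; c++) centroids[c] = seeds[c].clone();
        for (int it = 0; it < iters; it++) {
            double[][] sum = new double[k][d];
            int[] count = new int[k];
            for (double[] p : pts) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
                for (int j = 0; j < d; j++) sum[best][j] += p[j];
                count[best]++;
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0)
                    for (int j = 0; j < d; j++) centroids[c][j] = sum[c][j] / count[c];
        }
        return centroids;
    }

    // Cluster the subsets in turn, propagating centroids as the next seeds.
    public static double[][] clusterIncrementally(double[][][] subsets, double[][] initialSeeds) {
        double[][] seeds = initialSeeds;
        for (double[][] subset : subsets) seeds = kmeans(subset, seeds, 20);
        return seeds;
    }

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    public static void main(String[] args) {
        double[][][] subsets = {
            {{0, 0}, {1, 0}, {9, 9}, {10, 9}},
            {{0, 1}, {1, 1}, {9, 10}, {10, 10}},
        };
        double[][] c = clusterIncrementally(subsets, new double[][]{{0, 0}, {10, 10}});
        // The centroids settle near the two natural groups across both subsets.
        System.out.println(c[0][0] < 5 && c[1][0] > 5); // true
    }
}
```

Seeding each subset with already-converged centroids is what makes the per-subset runs cheap and keeps the cluster identities consistent across subsets for the later recombination stage.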
E. Result Recombination

After all the subsets are clustered, we utilize the standard hierarchical clustering method [10] to recombine the results of all subsets. Again, the cluster number K for clustering the entire set is set to be the same as that of the first subset. The distance between cluster A and cluster B is defined as:

Δ(A, B) = (n(A) n(B) / (n(A) + n(B))) ‖centroid(A) − centroid(B)‖²    (4)

where n(·) denotes the size of a cluster and centroid(·) denotes the cluster centroid. The key idea of the hierarchical clustering algorithm is to recursively combine all clusters (see Algorithm 3).

IV. CASE STUDY

Our clustering algorithm and visual interface were developed using Java. A parallelization of our clustering kernel is also implemented with the multi-threading features of Java. We also implemented the standard K-means algorithm. The three implementations are named DR, PDR and STD. The performance reports are made on a PC with an i7 3.40 GHz CPU, 16 GB of memory, and an Nvidia 680 video card.

We have conducted experiments on a Microblog Dataset consisting of 4.8 million users. It was collected from the largest social network (http://www.weibo.com) in China, i.e.,
Sina Weibo. The total data size is 399 MB.

Table I
SIX ATTRIBUTES OF THE MICROBLOG DATASET
Quality and Performance. Applying our approach to the dataset, we found that the optimal cluster number is six. Figure 7 (a) compares the clustering quality of our approach and the standard K-means algorithm. In particular, the summed values of the point-to-centroid distances with respect to different cluster numbers K are plotted.

Figure 7. Results for the Microblog dataset. (a) The clustering quality comparison of DR and STD; (b) The performance comparison of DR, PDR and STD.

To compare the performance, the timings of DR, PDR, and STD with respect to different cluster numbers K are displayed in Figure 7 (b). The comparison indicates that even the naive implementation of our approach achieves higher performance than the standard algorithm. The most time-consuming part is the incremental clustering of the subsets, due to the I/O operations involved in clustering each subset. This inefficiency is addressed in our parallel implementation, which yields a stable and high performance record.

Social Data Analysis. In addition to the six clusters, a group of outlier points is detected in the clustering process (Figure 8). By further exploring the views of user attributes, some interesting facts are discovered. In particular, the pixel charts with respect to different attributes enable us to identify two specific groups of user patterns. One group has very high attribute values, while the other has low values. Commonly, they have a high value of FollowerNumber but a low value of FriendNumber (almost zero). In addition, their BiFriendNumbers are zero, and their registered addresses are all overseas.

V. CONCLUSIONS

This paper presents an effective divide-and-recombine approach for clustering massive data, assisted with a visual analytics pipeline. The entire pipeline integrates a suite of visual analysis techniques to provide an effective workspace for the partition of a large collection of instances, the determination of clustering parameters, and the merging of multiple clusters. Experimental results verify that our approach outperforms conventional solutions in both quality and efficiency.

VI. ACKNOWLEDGEMENT

This paper is supported by NSFC (61232012, 61272302), the National High Technology Research and Development Program of China (2012AA12090), the Zhejiang Provincial Natural Science Foundation of China (LR13F020001), and the Doctoral Fund of the Ministry of Education of China (20120101110134).

REFERENCES

[1] J. Moody, D. McFarland, and S. Bender-deMoll, "Dynamic network visualization," American Journal of Sociology, pp. 1206–1241, 2005.

[2] J. Wei, Z. Shen, N. Sundaresan, and K.-L. Ma, "Visual cluster exploration of web clickstream data," in IEEE Conference on Visual Analytics Science and Technology, 2012, pp. 3–12.
[6] S. Guha, R. Hafen, J. Rounds, J. Xia, J. Li, B. Xi, and W. S. Cleveland, "Large Complex Data: Divide and Recombine with RHIPE," The ISI's Journal for the Rapid Dissemination of Statistics Research, 2012.

[11] Z. Ahmed and C. Weaver, "An adaptive parameter space-filling algorithm for highly interactive cluster exploration," in IEEE Conference on Visual Analytics Science and Technology, July 2012, pp. 13–22.

[20] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.