
2013 IEEE International Conference on Big Data

A Novel Visual Analytics Approach for Clustering Large-Scale Social Data

Zhangye Wang
State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China
[email protected]

Juanxia Zhou
State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China
[email protected]

Wei Chen
State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China
[email protected]

Chang Chen
College of Software Engineering, University of Science and Technology of China, Hefei, China
[email protected]

Jiyuan Liao
State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China
[email protected]

Ross Maciejewski
School of CIDSE, Arizona State University, USA
[email protected]

Abstract: Social data refers to data individuals create that is knowingly and voluntarily shared by them, and it is an exciting avenue for gaining insight into interpersonal behaviors and interactions. However, such data is large, heterogeneous and often incomplete, properties that make its analysis extremely challenging. One common method of exploring such data is cluster analysis, which can enable analysts to find groups of related users, behaviors and interactions. This paper presents a novel visual analysis approach for detecting clusters within large-scale social networks by utilizing a divide-analyze-recombine scheme that sequentially performs data partitioning, subset clustering and result recombination within an integrated visual interface. A case study on a microblog messaging dataset (with 4.8 million users) is used to demonstrate the feasibility of this approach, and comparisons are also provided to illustrate its performance benefits with respect to existing solutions.

Keywords: Divide and Recombine; Cluster Analysis; K-means; Visual Analysis

I. INTRODUCTION

As humans have become more intrinsically connected to technology, details of their behaviors and interpersonal connections have become increasingly transparent. Activities and communications can automatically be collected during interpersonal activities such as mobile phone communications, e-commerce transactions and internet activity (Tweets, Facebook posts, blog entries), and such data can be classified into two components: behavioral data and personal data. Behavioral data can be thought of as a record of a user's activity within society; for example, a mobile phone call between individuals, a video conference, and sending a tweet all provide links between a single user and other individuals. Such data tends to reflect a user's daily patterns and behaviors [1], and we will refer to this type of data as user behavior data. Personal data can be thought of as being intrinsically tied to a single individual and would include things such as their age and gender. We will refer to this type of data as user tag data.

Given the ubiquity of such data, methods for exploring and analyzing social data links have become a critical research topic. By analyzing user tags and user behavior data, it is expected that insights into societal patterns can be gleaned and used to inform governmental policy and business decisions as well as influence human behavior [2]. For example, previous work [3] explores clustering users into groups to develop targeted marketing strategies based on demographics and behavioral patterns. However, our ability to extract knowledge from these large heterogeneous data sources is still limited due to the latent, subtle and unpredictable relationships that may be hidden within the data. These problems are further compounded by issues of data size and the non-linear computational complexity related to the data size.

One means of reducing the computational complexity of an analysis task is the application of parallelization using distributed computing [4] and super-computing [5]. However, for analysis tasks that require frequent data transfer and interaction, parallelization is still problematic. Another means of reducing the computational complexity is the use of divide-and-conquer schemes. A divide-and-conquer algorithm works by recursively breaking a problem down into sub-problems whose solutions can then be recombined to address the original problem.

Recent work has demonstrated the effectiveness of divide-and-conquer approaches for handling large data. For example, RHIPE [6] is built upon R and has been specifically developed to provide a divide-and-conquer solution for the statistical analysis of large data. RHIPE utilizes the MapReduce framework, and MapReduce itself has been used in large-scale data analysis (e.g., [5]). Furthermore, in the data mining community, it is believed that the divide-and-recombine scheme will be adapted to handle the scalability problems found in many complex machine learning tasks [7].



This paper presents our proposed methodology for enhancing the visual analysis of large-scale social data utilizing the divide-and-recombine scheme. Our approach is specifically designed to handle a central problem for social network analysis: clustering users into groups by leveraging the user tags and user behavior information. The key idea is to sequentially perform data partitioning, subset clustering and result recombination utilizing an integrated visual interface. The entire pipeline integrates a suite of visual analysis techniques to provide an effective workspace for the partitioning of large collections of instances, determining cluster parameters and merging multiple clusters. Our main contributions include:

• The identification and systematic implementation of the divide-and-recombine scheme for clustering of social data in an integrated visual exploration process;
• An incremental data clustering technique that enhances the conventional K-means algorithm in both clustering quality and performance;
• A novel context-aware subset-clustering analysis.

II. RELATED WORK

In this section, we briefly review related work in three categories: data clustering, divide-and-conquer approaches for data analysis, and social data analysis.

A. Data Clustering

Recently, the application of interactive supervised learning has become a major focus in the visual analytics community. Pelekis et al. [8] designed a novel distance function as a similarity measurement for the analysis of movement data. Guo et al. [9] proposed an approach for steering the clustering process by building a hierarchical spatial cluster structure within the high-dimensional feature space.

Unsupervised learning algorithms can be divided into two major categories: hierarchical clustering and partitioning clustering [10]. Perhaps the most commonly used algorithm for partitioning data is the K-means algorithm [11]. Unfortunately, such partitioning algorithms tend to be non-linear in their runtime, despite efforts to improve such techniques [12]. As the computational complexity increases non-linearly with the data size, new solutions are necessary. There is also some pioneering work on parallel clustering algorithms. For instance, Zhao et al. [13] proposed a parallel K-means clustering algorithm based on the MapReduce framework. Yet, such parallel schemes remain problematic for analysis tasks that require frequent data transfer and interaction.

B. Divide-And-Conquer for Data Analysis

Divide-and-conquer has proven to be an efficient mechanism for reducing the runtime of complicated procedures such as sorting, the multiplication of large numbers and discrete Fourier transforms. Recently, it has been applied to address the underlying issues of computational complexity associated with big data. Several fundamental statistics problems, including matrix factorization and statistical inference, have been modified to utilize a divide-and-conquer methodology [7]. For example, the RHIPE system that is built on R utilizes an underlying Hadoop structure and has been applied to a variety of analytical domains [6]. Results from this work demonstrate the feasibility of utilizing divide-and-conquer for exploring large-scale data.

C. Social Data Analysis

Recently, many tools have been developed to explore the complex spatiotemporal relationships underlying such data. Work by Field et al. [14] explored the analysis of microblog data, and others like crime investigators [15] and urban planners [16] have also begun utilizing social data. MacEachren et al. [17] developed SensePlace as a tool for exploring the message density of actual or textually inferred Twitter message locations, and ScatterBlogs [18] presented a scalable system enabling analysts to work on quantitative findings within a large set of microblog messages. However, no previous work in this area (to our knowledge) utilizes a divide-and-recombine scheme for large-scale clustering of users and behaviors in social data analysis.

III. A VISUAL ANALYSIS APPROACH FOR CLUSTERING LARGE-SCALE SOCIAL DATA

The focus of this work is on large-scale (multi-million) user data sets. As input, our social data contains a set of user-provided information (age, gender, etc.) that we denote as user tags, and their associated behavioral information, which we define as user behavior. The user behavior information provides links between users.

Within such data, there are expected to be clusters of individuals and patterns which can be analyzed for marketing purposes, criminal activity or various other research questions. A key challenge in clustering such data is that the application of clustering algorithms to a set of user tags requires the formulation of user tag and behavior information into a point data set where each point is associated with multiple attributes. The set of attributes forms a user attribute vector. Note that the construction of the user attribute vectors varies with specific data and tasks, and details for formulating this vector for each dataset will be given in the case study. Our goal is to provide analysts with a means of clustering the user attribute vectors to find commonalities between users and behaviors. Conceptually, our approach consists of four stages (Figure 1): data division, data clustering, result recombination, and visual exploration.

In the first stage, a massive collection of user attribute vectors is generated from the social data. It is randomly subdivided into many subsets by means of a visually assisted data division process. In the second stage, a subset of the data is clustered using an improved K-means algorithm, in which a novel context-aware technique is leveraged to optimize the clustering parameters.

Figure 1. Conceptual overview of our approach. (a) Visually assisted adaptive data division; (b) context-aware subset clustering; (c) incremental recombination of clusters; (d) visual exploration of clusters and unusual data points.

Subsequently, the result is propagated into the other subsets one-by-one and clustering is then done on each subset. In the third stage, the clusters of all subsets are merged with a hierarchical clustering process. An integrated visual interface is designed to allow analysts to explore the user tags and associated user behavior information of each cluster. Compared with existing divide-and-recombine approaches, our approach combines data mining techniques and interactive visuals, which enables the analyst to explore patterns in large-scale social data.

A. K-Means Clustering

One of the key parts of our data analysis process is the utilization of the K-means algorithm. The K-means algorithm is one of the most popular clustering methods [10], and is employed in our approach due to its simplicity, efficiency and robustness. For a set of points X = {x_i | x_i ∈ R^d}, K data points are randomly selected as the initial centroids of the intended K clusters. Two steps are then recursively performed until the algorithm converges:

• Step 1: Assign each data point to its closest centroid.
• Step 2: Update the centroids with the mean value of all the data points assigned to them.

Suppose that the resultant clusters are X_1, X_2, ..., X_K. The quality of the clustering subject to K can be measured by the sum of the distances between each point pair in the clusters:

D(K) = \sum_{r=1}^{K} \sum_{x_i, x_j \in X_r} \| x_i - x_j \|    (1)

The clustering quality depends on the specification of the initial centroids [19]. Typical K-means solutions either repeat the computation of centroids several times using different initial seed points and then choose the optimal result based on the quality of the clusters, or apply specific optimizations which can result in long runtimes and storage overheads when handling large-sized data. Our approach to improving the performance of K-means on large-scale data is to utilize a divide-and-recombine scheme to reduce the data complexity of K-means clustering. Furthermore, in order to adapt the K-means algorithm specifically to social data, there are several key aspects that need to be addressed (a reference sketch of the baseline algorithm follows the list below):

• Data Division: How to determine the appropriate subset size so that the combination of the clustered results of the subsets can yield the same quality as the conventional K-means algorithm?
• Data Clustering: How to select K and a distance metric for multi-dimensional user attribute vectors to achieve the optimal clustering results?
• Recombination and Analysis: How to evaluate the clustering quality and discover interesting users and patterns from the clustered results?
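For concreteness, the following is a minimal sketch of the plain K-means loop and of the quality measure D(K) of Equation (1), written in Python with NumPy purely for illustration (the authors' system is implemented in Java, and all function names here are ours). The later sketches in this section reuse these two functions.

import numpy as np

def kmeans(X, K, iters=100, init=None, rng=None):
    # Plain K-means. If `init` is given, it is used as the initial centroids
    # (needed later for the incremental scheme); otherwise K random points
    # are chosen, as described in Section III-A.
    rng = np.random.default_rng(rng)
    centroids = X[rng.choice(len(X), K, replace=False)] if init is None else init.copy()
    for _ in range(iters):
        # Step 1: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == r].mean(axis=0) if np.any(labels == r)
                                  else centroids[r] for r in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

def quality_D(X, labels, K):
    # Equation (1): sum of pairwise distances within each cluster.
    # Quadratic in cluster size, so intended for modest subsets only.
    total = 0.0
    for r in range(K):
        Xr = X[labels == r]
        total += np.linalg.norm(Xr[:, None, :] - Xr[None, :, :], axis=2).sum()
    return total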
B. Adaptive Data Division

In terms of clustering, one criterion for dividing a massive dataset into smaller subsets is to preserve the statistical distribution in each subset [20], which poses a challenging problem in determining initial splits in the divide-and-recombine process. We propose to randomly subdivide the entire set into subsets of uniform size [10] and to adaptively determine the subset size by considering both performance and quality.

Our solution is to employ a visually assisted adaptive data division scheme. For each subset, a pixel chart [21] is generated to show the statistical distribution of each attribute value of the user attribute vector. In particular, all subsets are organized into a 2D array, which is further represented with a set of 2D cells. The saturation of each cell encodes the percentage of users with the corresponding attribute value in each subset, as illustrated in Figure 2.

Figure 2. A pixel chart visualizes the statistical distribution of a certain attribute with respect to a subset size. Here, each cell represents a subset and encodes the statistical value of the subset.

If the pixel chart of an attribute exhibits an approximately uniform appearance, one can assume that the subset has a statistical distribution that is similar to the overall distribution of the given attribute. If the subset size is appropriate for all attributes, then one can assume that it represents a reasonable statistical sample of the original dataset with respect to the specific user attribute vector. In this manner, an analyst can quickly assess the suggested subdivision and modify the current data division to try to improve the results. Certainly, statistical sampling techniques would work in these situations, but they may require more user interaction time. In order to enable such actions, our system provides simple user interactions in the pixel chart view: specifying the subset size, studying the distribution of a specific subset size, comparing the distributions of different subsets, and determining an optimal subset size. Figure 3 shows a group of pixel charts with a subset size of 50.

Figure 3. The pixel charts with respect to 42 user attribute values with 50 subsets.

The algorithm for adaptive data division is summarized as follows:

Algorithm 1 Visually Assisted Adaptive Data Division
Initialization: set a small subset size Ns (e.g., 5K for a set with 1M points) to generate a relatively large number of subsets (e.g., 200).
while TRUE do
  Generate a pixel chart for each attribute.
  Visually explore the pixel charts of all attributes.
  if all pixel charts show a similar appearance then
    Return the current subset size and subsets.
  else
    Ns++.
  end if
end while
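The quantity a pixel chart encodes, together with a crude automated stand-in for the visual uniformity check of Algorithm 1, might look as follows. This is a sketch under our own assumptions: the paper performs the check visually, and the tolerance threshold below is ours, not the authors'.

import numpy as np

def divide(X, subset_size, rng=None):
    # Randomly split the rows of X into subsets of uniform size.
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    return [X[idx[i:i + subset_size]] for i in range(0, len(X), subset_size)]

def pixel_chart(subsets, attr, values):
    # One row per subset: the fraction of users taking each value of the
    # given attribute -- the quantity a pixel chart cell encodes as saturation.
    return np.array([[np.mean(S[:, attr] == v) for v in values] for S in subsets])

def looks_uniform(chart, tol=0.05):
    # Stand-in for "all pixel charts show a similar appearance": every
    # subset's distribution stays within `tol` of the across-subset mean.
    return bool(np.all(np.abs(chart - chart.mean(axis=0)) < tol))

Algorithm 1 then amounts to growing the subset size until this check passes for every attribute, with the analyst's visual judgment in place of the fixed tolerance.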

C. Context-aware Subset Clustering

Once the analyst is satisfied with the data subset choice, the data subsets enter the clustering phase. In clustering the first subset, two parameters are essential to the clustering quality: the cluster number K, and the distance metric for the points.

Cluster Number K: There have been many methods to find an optimal cluster number K [22]. In our system, K is heuristically determined by computing and plotting the sum-of-distance function D(K) (Equation 1). Typically, the point where the curve becomes flat (shown in red in Figure 4) is identified as a reasonable choice for K [22].
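A sketch of this elbow-style selection, reusing kmeans and quality_D from the Section III-A sketch. The flatness test (relative drop in D(K) falling below a threshold) is our stand-in for the visual judgment made on the D(K) curve; the threshold is an assumption.

import numpy as np

def choose_K(X, candidate_Ks, flat_ratio=0.05):
    # Compute D(K) over candidate cluster numbers and return the first K
    # after which D(K) stops dropping appreciably (the "flat" point).
    D = []
    for K in candidate_Ks:
        _, labels = kmeans(X, K)          # sketch from Section III-A
        D.append(quality_D(X, labels, K))
    for i in range(1, len(D)):
        if (D[i - 1] - D[i]) / max(D[i - 1], 1e-12) < flat_ratio:
            return candidate_Ks[i - 1], D
    return candidate_Ks[-1], D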
Figure 4. The relationship between the cluster number K (the x axis) and the sum of distances D(K) (the y axis).

Attribute Weights: K-means partitions the points based on the distances between points in the high-dimensional space, and the choice of distance metric will greatly influence the results. In this case, we cluster the underlying dataset based on the user attribute vectors, where the problem is identical to specifying an appropriate weight for each attribute: a high weight can be used for a salient attribute, and an attribute has little influence on the clustering if its weight is low. For clarity, we define the weights of a user attribute vector x = (x_1, x_2, ..., x_d) as w = (w_1, w_2, ..., w_d).

However, the influence of each attribute on the clustering result and the relations among attributes are difficult to model and represent. Furthermore, attributes may have different data types (numerical, ordinal and categorical) and distinctive value ranges. Thus, it is desirable to discover the significantly relevant attributes and associated configurations by studying the influences of different weights on the clustering results, and to design an enhanced weight computation scheme.

In our approach, the weight of an attribute subject to a clustering process is determined by evaluating the sensitivity of the clustered result with respect to different weights. For the jth attribute, a sequence of weighting configurations {w(k), k = 1, 2, 3, ..., M} is generated:

w^l(k) = \begin{cases} 1/k, & \text{if } l = j \\ (1 - 1/k)/(M - 1), & \text{otherwise} \end{cases}    (2)

where M is an adjustable constant.

For each weighting configuration {w(k)}, a K-means clustering is performed by using the weights in the distance metric. We further compute the Hausdorff distance [23] between the centroid sets of the clustered results C(m) and C(n) associated with {w(m)} and {w(n)}:

H(m, n) = \max_{x \in C(m)} \min_{y \in C(n)} \| x - y \|    (3)

where x and y denote a centroid of the clustered results C(m) and C(n), respectively.

The set {H(m, n), m = 1, 2, ..., M, n = 1, 2, 3, ..., M} forms an M × M matrix. By normalizing and encoding each element of the matrix with a color, a sensitivity map associated with the underlying attribute is generated (Figure 5). This mapping implies the proximity among the clustered results under different weighting configurations. A low value of H(m, n) indicates that two weighting configurations yield similar clustered results, and vice versa. The overall distribution of the sensitivity map and its average value can be used to study the relevance of the attribute to the clustering.

A sensitivity map can be generated for each attribute. The set of maps characterizes the influence of a set of user attributes on the variation of the clustering process and can be used to guide the weight function. The weight of each attribute is specified to be proportional to the average value of its associated map. The user can manually adjust the sensitivity of a map by moving it vertically. The weights corresponding to the sensitivities are shown with a radial map.

Figure 5. A sensitivity map is generated for each attribute.
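The weighting sweep of Equation (2) and the sensitivity matrix of Equation (3) could be sketched as follows, again reusing the kmeans function from the Section III-A sketch. The (M − 1) divisor is kept exactly as Equation (2) prints it, and the final normalization is our assumption about how values are prepared for color mapping; none of this is the authors' code.

import numpy as np

def weight_configs(d, j, M):
    # Equation (2): M weighting configurations that put weight 1/k on
    # attribute j and spread the remaining (1 - 1/k) over the others,
    # dividing by (M - 1) as printed in the paper.
    W = np.zeros((M, d))
    for k in range(1, M + 1):
        W[k - 1, :] = (1 - 1 / k) / (M - 1)
        W[k - 1, j] = 1 / k
    return W

def weighted_kmeans(X, K, w, **kw):
    # Weighted Euclidean metric via rescaling: clustering X * sqrt(w) with
    # plain K-means uses the distance sum_l w_l (x_l - y_l)^2.
    return kmeans(X * np.sqrt(w), K, **kw)

def hausdorff(C1, C2):
    # Equation (3): directed Hausdorff distance between two centroid sets.
    d = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)
    return d.min(axis=1).max()

def sensitivity_map(X, K, j, M):
    # M x M matrix H(m, n) over the clusterings produced by the
    # configurations of Eq. (2), normalized to [0, 1] for color mapping.
    centroid_sets = [weighted_kmeans(X, K, w)[0]
                     for w in weight_configs(X.shape[1], j, M)]
    H = np.array([[hausdorff(cm, cn) for cn in centroid_sets]
                  for cm in centroid_sets])
    return H / max(H.max(), 1e-12)

The per-attribute weight would then be taken proportional to the mean of its map, as described above.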
D. Incremental Data Clustering

The first subset can either be randomly selected or be manually specified by the user with the help of the pixel charts of the subsets. After clustering the first subset, an incremental data clustering procedure is applied to cluster all other subsets. The clustering configuration (the weights, etc.) for the first subset is applied in the clustering process of all other subsets. Algorithm 2 briefly summarizes the details.

Algorithm 2 Incremental Data Clustering
for each subset do
  Set the clustered centroids of the first subset as the initial centroids.
  Set the cluster number and attribute weights of the first subset as the ones for the underlying subset.
  Perform the K-means clustering.
end for
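A minimal sketch of Algorithm 2, reusing the kmeans and weighted_kmeans functions from the earlier sketches (assumed definitions, not the authors' code). Because weighted_kmeans clusters the points scaled by sqrt(w), the seeded centroids already live in that scaled space.

import numpy as np

def incremental_clustering(subsets, K, w):
    # Algorithm 2: cluster the first subset from random seeds, then reuse
    # its centroids, cluster number and attribute weights for all others.
    results = []
    centroids, labels = weighted_kmeans(subsets[0], K, w)
    results.append((centroids, labels))
    for S in subsets[1:]:
        # Same configuration as the first subset; only the points change.
        c, l = kmeans(S * np.sqrt(w), K, init=centroids)
        results.append((c, l))
    return results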


The incremental data clustering scheme not only allows us to utilize the divide-cluster-recombine mechanism, but also achieves higher performance and quality compared with the standard K-means algorithm. Figure 6 (a) compares the sum of point-to-centroid distances of the cluster results of our approach and of the standard K-means algorithm. In principle, the incremental data clustering scheme achieves higher accuracy because the cluster centroids used in clustering each subset are determined when handling the first subset. With the standard K-means algorithm, the cluster centroids are randomly initialized when clustering each subset, leading to varied clustering results [24]. The benefit of the incremental scheme is also significant with respect to the running time. Figure 6 (b) reports the total running time of both approaches for the same dataset. In general, our approach runs faster with different cluster numbers K, and achieves an average 10% - 100% acceleration. More importantly, the clustering of the subsets is parallelizable because the subsets are independent. A concrete performance analysis will be given in the case study.

Figure 6. Comparing the sum of point-to-centroid distances (a) and the running time (b) of our incremental data clustering scheme (in dark blue) and the standard K-means algorithm (in light blue).

To allow for visual exploration of the clusters of each subset, our system employs the principal component analysis (PCA) algorithm to project all data points of a subset, or of the entire set, into 2D space. PCA is chosen because it is fast, with linear computational complexity, and can handle a large number of points.

E. Result Recombination

After all the subsets are clustered, we utilize the standard hierarchical clustering method [10] to recombine the results of all subsets. Again, the cluster number K for clustering the entire set is set to be the same as that of the first subset. The distance between cluster A and cluster B is defined as:

\Delta(A, B) = \frac{n(A)\, n(B)}{n(A) + n(B)} \| \mathrm{centroid}(A) - \mathrm{centroid}(B) \|^2    (4)

where n(·) denotes the size of a cluster and centroid(·) denotes the cluster centroid.

The key idea of the hierarchical clustering algorithm is to recursively combine all clusters (see Algorithm 3).

Algorithm 3 Hierarchical Clustering of the Clusters of All Subsets
Initialization: load all clusters of all subsets.
while TRUE do
  Compute the distances between each pair of clusters.
  Combine the two clusters whose distance is minimal.
  if there are only K clusters then
    Output the clusters and stop.
  end if
end while
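A sketch of the recombination step: Equation (4) as the merge distance and Algorithm 3 as a greedy agglomerative loop over the pooled clusters of all subsets. This is an illustrative quadratic-time version under our own naming, not the authors' implementation; each cluster is assumed to be given as the array of its member points.

import numpy as np

def merge_distance(A, B):
    # Equation (4): Ward-style distance between clusters A and B.
    nA, nB = len(A), len(B)
    gap = A.mean(axis=0) - B.mean(axis=0)
    return (nA * nB) / (nA + nB) * float(gap @ gap)

def recombine(clusters, K):
    # Algorithm 3: repeatedly merge the closest pair of clusters
    # (pooled over all subsets) until only K clusters remain.
    clusters = [np.asarray(c) for c in clusters]
    while len(clusters) > K:
        best, pair = np.inf, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = merge_distance(clusters[i], clusters[j])
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters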
IV. CASE STUDY

Our clustering algorithm and visual interface were developed using Java. A parallelization of our clustering kernel is also implemented with the multi-threading features of Java. We also implemented the standard K-means algorithm. The three implementations are named DR, PDR and STD. The performance reports are made on a PC with an i7 3.40 GHz CPU, 16 GB memory, and an Nvidia 680 video card.

We have conducted experiments on a Microblog Dataset consisting of 4.8 million users. It was collected from the largest social network (http://www.weibo.com) in China, i.e., Sina Weibo. The total data size is 399 MB.

In particular, the Microblog Dataset consists of 4,838,573 users and 18 user attributes. In our experiments, six attributes are used: OnlineTime, BiFriendNumber, FollowerNumber, FriendNumber, PostNumber and FavouritePostNumber. Table I lists the meaning of these attributes. The entire set is divided into 76 subsets, each of which contains 64,000 users.

Table I
SIX ATTRIBUTES OF THE MICROBLOG DATASET

Attribute Name (Abbr.)   Meaning
OnlineTime               The time the user spends on Microblog
BiFriendNumber           The number of people with whom the user mutually follows each other
FollowerNumber           The number of people who follow the user to see his/her updates
FriendNumber             The number of people the user follows
PostNumber               The number of messages the user sends out to his/her followers through Microblog
FavouritePostNumber      The number of microblogs the user collects

Figure 7. Results for the Microblog dataset. (a) The clustering quality comparison of DR and STD; (b) the performance comparison of DR, PDR and STD.

Quality and Performance: Applying our approach to the dataset, we found that the optimal cluster number is six. Figure 7 (a) compares the clustering quality of our approach and the standard K-means algorithm. In particular, the summed values of point-to-centroid distances with respect to different cluster numbers K are plotted.
To compare the performance, the timings of DR, PDR and STD with respect to different cluster numbers K are displayed in Figure 7 (b). The comparison indicates that even the naive implementation of our approach achieves higher performance than the standard algorithm. The most time-consuming part is the incremental clustering of the subsets, due to the I/O operations for clustering each subset. This inefficiency is addressed in our parallel implementation, which yields a stable and high performance record.

Social Data Analysis: In addition to the six clusters, a group of outlier points is detected in the clustering process (Figure 8). By further exploring the views of user attributes, some interesting facts are discovered. In particular, the pixel charts with respect to different attributes enable us to identify two specific groups of user patterns. One group has very high attribute values, while the other has low values. Commonly, they have a high value of FollowerNumber but a low value of FriendNumber (almost zero). In addition, their BiFriendNumbers are zero, and their registered addresses are all overseas.

Figure 8. A group of points within the red rectangle are specified as outliers.

V. CONCLUSIONS

This paper presents an effective divide-and-recombine approach for clustering massive data, assisted with a visual analytics pipeline. The entire pipeline integrates a suite of visual analysis techniques to provide an effective workspace for the partition of a large collection of instances, the determination of clustering parameters, and the merging of multiple clusters. Experimental results verify that our approach outperforms conventional solutions in both quality and efficiency.

VI. ACKNOWLEDGEMENT

This paper is supported by NSFC (61232012, 61272302), the National High Technology Research and Development Program of China (2012AA12090), the Zhejiang Provincial Natural Science Foundation of China (LR13F020001), and the Doctoral Fund of the Ministry of Education of China (20120101110134).

REFERENCES

[1] J. Moody, D. McFarland, and S. Bender-deMoll, "Dynamic network visualization," American Journal of Sociology, pp. 1206–1241, 2005.

[2] J. Wei, Z. Shen, N. Sundaresan, and K.-L. Ma, "Visual cluster exploration of web clickstream data," in IEEE Conference on Visual Analytics Science and Technology, 2012, pp. 3–12.
[3] Z. Shen and K.-L. Ma, "MobiVis: A visualization system for exploring mobile data," in IEEE Pacific Visualization, 2008, pp. 175–182.

[4] W. A. Pike, J. Bruce, B. Baddeley, D. Best, L. Franklin, R. May, D. M. Rice, R. Riensche, and K. Younkin, "The Scalable Reasoning System: Lightweight visualization for distributed analytics," Information Visualization, pp. 171–184, 2008.

[5] H. Vo, J. Bronson, B. Summa, J. Comba, J. Freire, B. Howe, V. Pascucci, and C. Silva, "Parallel visualization on large clusters using MapReduce," in IEEE Symposium on Large Data Analysis and Visualization (LDAV), 2011, pp. 27–34.

[6] S. Guha, R. Hafen, J. Rounds, J. Xia, J. Li, B. Xi, and W. S. Cleveland, "Large complex data: Divide and recombine with RHIPE," Stat, The ISI's Journal for the Rapid Dissemination of Statistics Research, 2012.

[7] M. I. Jordan, "Divide-and-conquer and statistical inference for big data," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.

[8] N. Pelekis, G. Andrienko, N. Andrienko, I. Kopanakis, G. Marketos, and Y. Theodoridis, "Visually exploring movement data via similarity-based analysis," Journal of Intelligent Information Systems, vol. 38, no. 2, pp. 343–391, 2012.

[9] D. Guo, D. Peuquet, and M. Gahegan, "ICEAGE: Interactive clustering and exploration of large and high-dimensional geodata," GeoInformatica, vol. 7, no. 3, pp. 229–253, 2003.

[10] P. Berkhin, "Survey of clustering data mining techniques," Technical Report, 2002.

[11] Z. Ahmed and C. Weaver, "An adaptive parameter space-filling algorithm for highly interactive cluster exploration," in IEEE Conference on Visual Analytics Science and Technology, July 2012, pp. 13–22.

[12] A. Vakali, J. Pokorny, and T. Dalamagas, "An overview of web data clustering practices," in Proceedings of EDBT Workshops, 2004, pp. 597–606.

[13] W. Zhao, H. Ma, and Q. He, "Parallel k-means clustering based on MapReduce," in Cloud Computing, Springer Berlin Heidelberg, 2009, pp. 674–679.

[14] K. Field and J. O'Brien, "Cartoblography: Experiments in using and organising the spatial context of micro-blogging," Transactions in GIS, vol. 14, pp. 5–23, 2010.

[15] R. E. Roth and J. White, "TwitterHitter: Geovisual analytics for harvesting insight from volunteered geographic information," in GIScience, 2010.

[16] S. Wakamiya, R. Lee, and K. Sumiya, "Crowd-based urban characterization: Extracting crowd behavioral patterns in urban areas from Twitter," in Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Location-Based Social Networks (LBSN '11), New York, NY, USA: ACM, 2011, pp. 77–84. [Online]. Available: http://doi.acm.org/10.1145/2063212.2063225

[17] A. MacEachren, A. Jaiswal, A. Robinson, S. Pezanowski, A. Savelyev, P. Mitra, X. Zhang, and J. Blanford, "SensePlace2: GeoTwitter analytics support for situational awareness," in IEEE Conference on Visual Analytics Science and Technology, 2011, pp. 181–190.

[18] H. Bosch, D. Thom, M. Worner, S. Koch, D. Puttmann, D. Jackle, and T. Ertl, "ScatterBlogs: Geo-spatial document analysis," in IEEE Conference on Visual Analytics Science and Technology, 2011, pp. 309–310.

[19] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[20] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, July 1967, pp. 281–297.

[21] D. A. Keim and H.-P. Kriegel, "VisDB: Database exploration using multidimensional visualization," IEEE Computer Graphics and Applications, vol. 14, no. 5, pp. 40–49, 1994.

[22] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027–1035.

[23] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, "Comparing images using the Hausdorff distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 850–863, 1993.

[24] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, June 2010.