A Novel Visual Analytics Approach For Clustering Large-Scale Social Data

Abstract—Social data refers to data individuals create that is knowingly and voluntarily shared by them, and it offers an exciting avenue into gaining insight into interpersonal behaviors and interactions. However, such data is large, heterogeneous and often incomplete, properties that make its analysis extremely challenging. One common method of exploring such data is cluster analysis, which can enable analysts to find groups of related users, behaviors and interactions. This paper presents a novel visual analysis approach for detecting clusters within large-scale social networks by utilizing a divide-analyze-recombine scheme that sequentially performs data partitioning, subset clustering and result recombination within an integrated visual interface. A case study on a microblog messaging dataset (with 4.8 million users) is used to demonstrate the feasibility of this approach, and comparisons are also provided to illustrate its performance benefits with respect to existing solutions.

Keywords—Divide and Recombine; Cluster Analysis; K-means; Visual Analysis

I. INTRODUCTION

As humans have become more intrinsically connected to technology, details of their behaviors and interpersonal connections have become increasingly transparent. Activities and communications can automatically be collected during interpersonal activities such as mobile phone communications, e-commerce transactions and internet activity (Tweets, Facebook posts, blog entries), and such data can be classified into two components: behavioral data and personal data. Behavioral data can be thought of as a record of a user's activity within society; for example, a mobile phone call between individuals, a video conference, and sending a tweet all provide links between a single user and other individuals. Such data tends to reflect a user's daily patterns and behaviors [1], and we will refer to this type of data as user behavior data. Personal data can be thought of as being intrinsically tied to a single individual and would include things such as their age and gender. We will refer to this type of data as user tag data.

Given the ubiquity of such data, methods for exploring and analyzing social data links have become a critical research topic. By analyzing user tags and user behavior data, it is expected that insights into societal patterns can be gleaned and used to inform governmental policy and business decisions as well as influence human behavior [2]. For example, previous work [3] explores clustering users into groups to develop targeted marketing strategies based on demographics and behavioral patterns. However, our ability to extract knowledge from these large heterogeneous data sources is still limited due to the latent, subtle and unpredictable relationships that may be hidden within the data. These problems are further compounded by issues of data size and the non-linear computational complexity related to the data size.

One means of reducing the computational complexity of an analysis task is through the application of parallelization using distributed computing [4] and super-computing [5]. However, for analysis tasks that require frequent data transfer and interaction, parallelization is still problematic. Another means of reducing the computational complexity is through the use of divide-and-conquer schemes. A divide-and-conquer algorithm works by recursively breaking down a problem into sub-problems whose solutions can then be recombined to address the original problem.

Recent work has demonstrated the effectiveness of divide-and-conquer approaches for handling large data. For example, RHIPE [6] is built upon R and has been specifically developed to provide a divide-and-conquer solution for the statistical analysis of large data. RHIPE utilizes the MapReduce framework, and MapReduce itself has been used in large-scale data analysis (e.g., [5]). Furthermore, in the data mining community, it is believed that the divide-and-recombine scheme will be adapted to handle the scalability problems found in many complex machine learning tasks [7].
Figure 1. Conceptual overview of our approach. (a) Visually assisted adaptive data division; (b) Context-aware subset clustering; (c) Incremental recombination of clusters; (d) Visual exploration of clusters and unusual data points.
propagated into the other subsets one-by-one, and clustering is then done on each subset. In the third stage, the clusters of all subsets are merged with a hierarchical clustering process. An integrated visual interface is designed to allow analysts to explore the user tags and associated user behavior information of each cluster. Compared with existing divide-and-recombine approaches, our approach combines data mining techniques and interactive visuals, which enables the analyst to explore patterns in large-scale social data.

A. K-Means Clustering

One of the key parts of our data analysis process is the utilization of the K-means algorithm. The K-means algorithm is one of the most popular clustering methods [10], and is employed in our approach due to its simplicity, efficiency and robustness. For a set of points X = {x_i | x_i ∈ R^d}, K data points are randomly selected as the initial centroids of the intended K clusters. Two steps are then recursively performed until the algorithm converges:

• Step 1: Assign each data point to its closest centroid.
• Step 2: Update the centroids with the mean value of all the data points assigned to them.

Suppose that the resultant clusters are X_1, X_2, ..., X_K. The quality of the clustering subject to K can be measured by the sum of the distances between each point pair in the clusters:

D(K) = Σ_{r=1}^{K} Σ_{x_i, x_j ∈ X_r} ‖x_i − x_j‖    (1)

The clustering quality depends on the specification of the initial centroids [19]. Typical K-means solutions either repeat the computation of centroids several times using different initial seed points and then choose the optimal result based on the quality of the clusters, or apply specific optimizations, which can result in long runtimes and storage overheads when handling large-sized data. Our approach to improving the performance of K-means on large-scale data is to utilize a divide-and-recombine scheme to reduce the data complexity of K-means clustering. Furthermore, in order to adapt the K-means algorithm specifically to social data, there are several key aspects that need to be addressed:

• Data Division: How to determine the appropriate subset size so that the combination of the clustered results of the subsets can yield the same quality as the conventional K-means algorithm?
• Data Clustering: How to select K and a distance metric for multi-dimensional user attribute vectors to achieve optimal clustering results?
• Recombination and Analysis: How to evaluate the clustering quality and discover interesting users and patterns from the clustered results?
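The two-step iteration and the quality measure D(K) of Eq. (1) described above can be sketched as follows. This is an illustrative stand-alone version, not the paper's implementation; all class and method names are ours.

```java
// Minimal K-means sketch: the two-step iteration plus the pairwise
// quality measure D(K) from Eq. (1). Illustrative only.
public class KMeansSketch {

    // Repeats Step 1 and Step 2 until assignments stop changing.
    public static int[] cluster(double[][] points, double[][] centroids, int maxIter) {
        int[] labels = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Step 1: assign each data point to its closest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (dist(points[i], centroids[c]) < dist(points[i], centroids[best])) best = c;
                if (labels[i] != best) { labels[i] = best; changed = true; }
            }
            if (!changed) break;
            // Step 2: update each centroid to the mean of its assigned points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int i = 0; i < points.length; i++)
                    if (labels[i] == c) {
                        for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                        count++;
                    }
                if (count > 0)
                    for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
            }
        }
        return labels;
    }

    // Eq. (1): sum of distances over every point pair inside each cluster.
    public static double qualityD(double[][] points, int[] labels, int k) {
        double total = 0;
        for (int r = 0; r < k; r++)
            for (int i = 0; i < points.length; i++)
                for (int j = i + 1; j < points.length; j++)
                    if (labels[i] == r && labels[j] == r) total += dist(points[i], points[j]);
        return total;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        int[] labels = cluster(pts, new double[][]{{0, 0}, {10, 10}}, 100);
        System.out.println(labels[0] == labels[1] && labels[2] == labels[3]); // true
        System.out.println(qualityD(pts, labels, 2)); // 1.0 + 1.0 = 2.0
    }
}
```

A lower D(K) for the same K indicates tighter clusters, which is how the divided-and-recombined result is compared against standard K-means later in the paper.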
B. Adaptive Data Division

In terms of clustering, one criterion for dividing a massive dataset into smaller subsets is to preserve the statistical distribution in each subset [20], which poses a challenging problem in determining the initial splits of the divide-and-recombine process. We propose to randomly subdivide the entire set into subsets of uniform size [10] and to adaptively determine the subset size by considering both performance and quality.

Our solution is to employ a visually assisted adaptive data division scheme. For each subset, a pixel chart [21] is generated to show the statistical distribution of each attribute value of the user attribute vector. In particular, all subsets are organized into a 2D array, which is further represented with a set of 2D cells. The saturation of each cell encodes the percentage of users with the corresponding attribute value in each subset, as illustrated in Figure 2. Figure 3 shows a group of pixel charts with a subset size of 50.

Figure 3. The pixel charts with respect to 42 user attribute values with 50 subsets.

The algorithm for adaptive data division is summarized in Algorithm 1.
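The adaptive algorithm itself is not sketched here, but the random uniform subdivision it builds on, together with the per-subset attribute distribution that a pixel-chart cell's saturation encodes, can be illustrated as follows (class and method names are ours, not the paper's):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of random uniform subdivision: shuffle the records, slice them into
// subsets of (near-)uniform size, then tally the per-subset distribution of
// one attribute. A simplified stand-in for the paper's adaptive algorithm.
public class DataDivisionSketch {

    public static <T> List<List<T>> divide(List<T> data, int subsetSize, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        List<List<T>> subsets = new ArrayList<>();
        for (int start = 0; start < shuffled.size(); start += subsetSize)
            subsets.add(shuffled.subList(start, Math.min(start + subsetSize, shuffled.size())));
        return subsets;
    }

    // Fraction of users in a subset having each attribute value (0..numValues-1);
    // this is the quantity a pixel-chart cell's saturation would encode.
    public static double[] distribution(List<Integer> subset, int numValues) {
        double[] frac = new double[numValues];
        for (int v : subset) frac[v]++;
        for (int i = 0; i < numValues; i++) frac[i] /= subset.size();
        return frac;
    }

    public static void main(String[] args) {
        List<Integer> attr = new ArrayList<>();
        for (int i = 0; i < 1000; i++) attr.add(i % 4); // 4 attribute values, 25% each
        List<List<Integer>> subsets = divide(attr, 100, 42L);
        System.out.println(subsets.size()); // 10 subsets of 100 users
        // Random uniform division roughly preserves the 25% share per value.
        System.out.println(distribution(subsets.get(0), 4)[0]);
    }
}
```

Comparing such per-subset distributions against the whole-set distribution is what lets an analyst judge, visually, whether a candidate subset size preserves the statistics well enough.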
Figure 4. The relationship between the cluster number K (the x axis) and the sum of distances D(K) (the y axis).

space, and the choice of distance metric will greatly influence the results. In this case, we cluster the underlying dataset based on the user attribute vectors, where the problem is identical to specifying an appropriate weight for each attribute: a high weight can be used for a salient attribute, while an attribute has little influence on the clustering if its weight is low. For clarity, we define the weights of a user attribute vector x = (x_1, x_2, ..., x_d) as w = (w_1, w_2, ..., w_d). However, the influence of each attribute on the clustering result and the relations among attributes are difficult to model and represent. Furthermore, attributes may have

The set of {H(m, n), m = 1, 2, ..., M, n = 1, 2, ..., M} forms an M × M matrix. By normalizing and encoding each element of the matrix with a color, a sensitivity map associated with the underlying attribute is generated (Figure 5). This map implies the proximity among the clustered results under different weighting configurations: a low value of H(m, n) indicates that two weighting configurations yield similar clustered results, and vice versa. The overall distribution of the sensitivity map and its average value can be used to study the relevance of the attribute to the clustering.

A sensitivity map can be generated for each attribute. The set of maps characterizes the influence of a set of user attributes on the variation of the clustering process and can be used to guide the weight function. The weight of each attribute is specified to be proportional to the average value of its associated map. The user can manually adjust the sensitivity of a map by moving it vertically. The weights corresponding to the sensitivities are shown with a radial map.
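The construction of the M × M matrix and the weight derived from it can be sketched as follows. The exact definition of H(m, n) is not reproduced above, so as an assumption this sketch measures the dissimilarity of two clusterings as the fraction of point pairs on which they disagree (0 for identical groupings, larger for more dissimilar results); all names are illustrative.

```java
// Sketch of a sensitivity matrix for one attribute. H(m, n) is assumed here
// to be a pairwise-disagreement distance between the clusterings produced
// under weighting configurations m and n; this is our assumption, since the
// paper's exact definition of H is not shown in the text above.
public class SensitivityMapSketch {

    // Fraction of point pairs grouped together in one labeling but not the other.
    public static double pairDisagreement(int[] a, int[] b) {
        int pairs = 0, differ = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = i + 1; j < a.length; j++) {
                pairs++;
                boolean sameA = a[i] == a[j], sameB = b[i] == b[j];
                if (sameA != sameB) differ++;
            }
        return pairs == 0 ? 0 : (double) differ / pairs;
    }

    // labelings[m] holds the cluster labels obtained under configuration m.
    public static double[][] sensitivityMatrix(int[][] labelings) {
        int m = labelings.length;
        double[][] h = new double[m][m];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                h[i][j] = pairDisagreement(labelings[i], labelings[j]);
        return h;
    }

    // The attribute weight is taken proportional to the average matrix value.
    public static double averageValue(double[][] h) {
        double sum = 0;
        for (double[] row : h) for (double v : row) sum += v;
        return sum / (h.length * h.length);
    }

    public static void main(String[] args) {
        int[][] labelings = {{0, 0, 1, 1}, {0, 0, 1, 1}, {0, 1, 0, 1}};
        double[][] h = sensitivityMatrix(labelings);
        System.out.println(h[0][1]); // identical clusterings -> 0.0
        System.out.println(h[0][2] > 0); // different clusterings -> true
    }
}
```

An attribute whose matrix has a high average value changes the clustering a lot as its weight varies, which is why it earns a proportionally higher weight.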
all other subsets. Algorithm 2 briefly summarizes the details. The incremental data clustering scheme not only allows us
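The text above describes centroids being propagated into the other subsets one-by-one before each subset is clustered. A plausible sketch of that incremental scheme, assuming each subset's K-means run is seeded with the centroids produced on the previous subset (the K-means kernel here is a compact stand-in, not the paper's Algorithm 2):

```java
// Sketch of incremental subset clustering: the centroids obtained on one
// subset seed the K-means run on the next subset, instead of reseeding
// each subset randomly. Names and structure are illustrative.
public class IncrementalClusteringSketch {

    public static double[][] kmeans(double[][] pts, double[][] seeds, int iters) {
        int k = seeds.length, d = seeds[0].length;
        double[][] centroids = new double[k][d];
        for (int c = 0; c < k; c++) centroids[c] = seeds[c].clone();
        for (int it = 0; it < iters; it++) {
            double[][] sum = new double[k][d];
            int[] count = new int[k];
            for (double[] p : pts) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
                for (int j = 0; j < d; j++) sum[best][j] += p[j];
                count[best]++;
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0)
                    for (int j = 0; j < d; j++) centroids[c][j] = sum[c][j] / count[c];
        }
        return centroids;
    }

    // Cluster the subsets in turn, propagating centroids as the next seeds.
    public static double[][] clusterIncrementally(double[][][] subsets, double[][] initialSeeds) {
        double[][] seeds = initialSeeds;
        for (double[][] subset : subsets) seeds = kmeans(subset, seeds, 20);
        return seeds;
    }

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    public static void main(String[] args) {
        double[][][] subsets = {
            {{0, 0}, {1, 0}, {9, 9}, {10, 9}},
            {{0, 1}, {1, 1}, {9, 10}, {10, 10}},
        };
        double[][] c = clusterIncrementally(subsets, new double[][]{{0, 0}, {10, 10}});
        // The centroids settle near the two natural groups across both subsets.
        System.out.println(c[0][0] < 5 && c[1][0] > 5); // true
    }
}
```

Seeding each subset with already-converged centroids is what makes the per-subset runs cheap and keeps the cluster identities consistent across subsets for the later recombination stage.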
E. Result Recombination

After all the subsets are clustered, we utilize the standard hierarchical clustering method [10] to recombine the results of all subsets. Again, the cluster number K for clustering the entire set is set to be the same as that of the first subset. The distance between cluster A and cluster B is defined as:

Δ(A, B) = (n(A) n(B) / (n(A) + n(B))) ‖centroid(A) − centroid(B)‖²    (4)

where n(·) denotes the size of a cluster and centroid(·) denotes the cluster centroid. The key idea of the hierarchical clustering algorithm is to recursively combine all clusters (see Algorithm 3).

IV. CASE STUDY

Our clustering algorithm and visual interface were developed using Java. A parallelization of our clustering kernel is also implemented with the multi-threading features of Java. We also implemented the standard K-means algorithm. The three implementations are named DR, PDR and STD. The performance reports are made on a PC with an i7 3.40 GHz CPU, 16 GB of memory, and an Nvidia 680 video card.

We have conducted experiments on a Microblog Dataset consisting of 4.8 million users. It was collected from the largest social network (http://www.weibo.com) in China, i.e.,
Sina Weibo. The total data size is 399 MB.

Table I
SIX ATTRIBUTES OF THE MICROBLOG DATASET
Quality and Performance. Applying our approach to the dataset, we found that the optimal cluster number is six. Figure 7 (a) compares the clustering quality of our approach and the standard K-means algorithm. In particular, the summed values of the point-to-centroid distances with respect to different cluster numbers K are plotted.

Figure 7. Results for the Microblog dataset. (a) The clustering quality comparison of DR and STD; (b) The performance comparison of DR, PDR and STD.

To compare the performance, the timings of DR, PDR, and STD with respect to different cluster numbers K are displayed in Figure 7 (b). The comparison indicates that even the naive implementation of our approach achieves higher performance than the standard algorithm. The most time-consuming part is the incremental clustering of the subsets, due to the I/O operations involved in clustering each subset. This inefficiency is addressed in our parallel implementation, which yields a stable and high performance record.

Social Data Analysis. In addition to the six clusters, a group of outlier points is detected in the clustering process (Figure 8). By further exploring the views of user attributes, some interesting facts are discovered. In particular, the pixel charts with respect to different attributes enable us to identify two specific groups of user patterns. One group has very high attribute values, while the other has low values. Commonly, they have a high value of FollowerNumber but a low value of FriendNumber (almost zero). In addition, their BiFriendNumbers are zero, and their registered addresses are all overseas.

V. CONCLUSIONS

This paper presents an effective divide-and-recombine approach for clustering massive data, assisted with a visual analytics pipeline. The entire pipeline integrates a suite of visual analysis techniques to provide an effective workspace for the partition of a large collection of instances, the determination of clustering parameters, and the merging of multiple clusters. Experimental results verify that our approach outperforms conventional solutions in both quality and efficiency.

VI. ACKNOWLEDGEMENT

This paper is supported by NSFC (61232012, 61272302), the National High Technology Research and Development Program of China (2012AA12090), the Zhejiang Provincial Natural Science Foundation of China (LR13F020001), and the Doctoral Fund of the Ministry of Education of China (20120101110134).

REFERENCES

[1] J. Moody, D. McFarland, and S. Bender-deMoll, "Dynamic network visualization," American Journal of Sociology, pp. 1206–1241, 2005.

[2] J. Wei, Z. Shen, N. Sundaresan, and K.-L. Ma, "Visual cluster exploration of web clickstream data," in IEEE Conference on Visual Analytics Science and Technology, 2012, pp. 3–12.
[6] S. Guha, R. Hafen, J. Rounds, J. Xia, J. Li, B. Xi, and W. S. Cleveland, "Large Complex Data: Divide and Recombine with RHIPE," The ISI's Journal for the Rapid Dissemination of Statistics Research, 2012.

[11] Z. Ahmed and C. Weaver, "An adaptive parameter space-filling algorithm for highly interactive cluster exploration," in IEEE Conference on Visual Analytics Science and Technology, July 2012, pp. 13–22.

[20] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.