Search | arXiv e-print repository

arXiv:2003.06508 [pdf, other]

DriftSurf: A Risk-competitive Learning Algorithm under Concept Drift

Authors: Ashraf Tahmasbi, Ellango Jothimurugesan, Srikanta Tirthapura, Phillip B. Gibbons

Abstract: When learning from streaming data, a change in the data distribution, also known as concept drift, can render a previously-learned model inaccurate and require training a new model. We present an adaptive learning algorithm that extends previous drift-detection-based methods by incorporating drift detection into a broader stable-state/reactive-state process. The advantage of our approach is that we can use aggressive drift detection in the stable state to achieve a high detection rate, but mitigate the false positive rate of standalone drift detection via a reactive state that reacts quickly to true drifts while eliminating most false positives. The algorithm is generic in its base learner and can be applied across a variety of supervised learning problems. Our theoretical analysis shows that the risk of the algorithm is competitive to an algorithm with oracle knowledge of when (abrupt) drifts occur. Experiments on synthetic and real datasets with concept drifts confirm our theoretical analysis.

Submitted 2 August, 2020; v1 submitted 13 March, 2020; originally announced March 2020.

Comments: 32 pages, 12 figures. Submitted to NeurIPS 2020. Replaced to include revision of Lemma 2 and additional experimental results

ACM Class: I.2.6

arXiv:2001.11433 [pdf, other]

Shared-Memory Parallel Maximal Clique Enumeration from Static and Dynamic Graphs

Authors: Apurba Das, Seyed-Vahid Sanei-Mehri, Srikanta Tirthapura

Abstract: Maximal Clique Enumeration (MCE) is a fundamental graph mining problem, and is useful as a primitive in identifying dense structures in a graph. Due to the high computational cost of MCE, parallel methods are imperative for dealing with large graphs. We present shared-memory parallel algorithms for MCE, with the following properties: (1) the parallel algorithms are provably work-efficient relative to a state-of-the-art sequential algorithm, (2) the algorithms have a provably small parallel depth, showing they can scale to a large number of processors, and (3) our implementations on a multicore machine show good speedup and scaling behavior with increasing number of cores, and are substantially faster than prior shared-memory parallel algorithms for MCE; for instance, on certain input graphs, while prior works either ran out of memory or did not complete in 5 hours, our implementation finished within a minute using 32 cores. We also present work-efficient parallel algorithms for maintaining the set of all maximal cliques in a dynamic graph that is changing through the addition of edges.

Submitted 30 January, 2020; originally announced January 2020.

Comments: This paper is accepted in ACM Transactions on Parallel Computing (TOPC). A preliminary version [arXiv:1807.09417] of this work appeared in the proceedings of the 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2018

arXiv:1909.02629 [pdf, other]

Random Sampling for Group-By Queries

Authors: Trong Duc Nguyen, Ming-Hung Shih, Sai Sree Parvathaneni, Bojian Xu, Divesh Srivastava, Srikanta Tirthapura

Abstract: Random sampling has been widely used in approximate query processing on large databases, due to its potential to significantly reduce resource usage and response times, at the cost of a small approximation error. We consider random sampling for answering the ubiquitous class of group-by queries, which first group data according to one or more attributes, and then aggregate within each group after filtering through a predicate. The challenge with group-by queries is that a sampling method cannot focus on optimizing the quality of a single answer (e.g. the mean of selected data), but must simultaneously optimize the quality of a set of answers (one per group). We present CVOPT, a query- and data-driven sampling framework for a set of group-by queries. To evaluate the quality of a sample, CVOPT defines a metric based on the norm (e.g. $\ell_2$ or $\ell_\infty$) of the coefficients of variation (CVs) of different answers, and constructs a stratified sample that provably optimizes the metric. CVOPT can handle group-by queries on data where groups have vastly different statistical characteristics, such as frequencies, means, or variances. CVOPT jointly optimizes for multiple aggregations and multiple group-by clauses, and provides a way to prioritize specific groups or aggregates. It can be tuned to cases when partial information about a query workload is known, such as a data warehouse where queries are scheduled to run periodically.
Our experimental results show that CVOPT outperforms the current state of the art on sample quality and estimation accuracy for group-by queries. On a set of queries on two real-world data sets, CVOPT yields relative errors that are 5x smaller than competing approaches, under the same space budget.

Submitted 12 September, 2019; v1 submitted 5 September, 2019; originally announced September 2019.

arXiv:1906.04120 [pdf, other]

Parallel Streaming Random Sampling

Authors: Kanat Tangwongsan, Srikanta Tirthapura

Abstract: This paper investigates parallel random sampling from a potentially-unending data stream whose elements are revealed in a series of element sequences (minibatches). While sampling from a stream was extensively studied sequentially, not much has been explored in the parallel context, with prior parallel random-sampling algorithms focusing on the static batch model. We present parallel algorithms for minibatch-stream sampling in two settings: (1) sliding window, which draws samples from a prespecified number of most-recently observed elements, and (2) infinite window, which draws samples from all the elements received. Our algorithms are computationally and memory efficient: their work matches the fastest sequential counterpart, their parallel depth is small (polylogarithmic), and their memory usage matches the best known.

Submitted 10 June, 2019; originally announced June 2019.

arXiv:1904.04126 [pdf, ps, other]

Weighted Reservoir Sampling from Distributed Streams

Authors: Rajesh Jayaram, Gokarna Sharma, Srikanta Tirthapura, David P. Woodruff

Abstract: We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied, and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream has only a few heavy items which may dominate a random sample when chosen with replacement. Weighted sampling without replacement (weighted SWOR) avoids this issue, since such heavy items can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking heavy hitters with residual error. Here the goal is to identify stream items that contribute significantly to the residual stream, once the heaviest items are removed. Residual heavy hitters generalize the notion of $\ell_1$ heavy hitters and are important in streams that have a skewed distribution of weights. In addition to the upper bound, we also provide a lower bound on the message complexity that is nearly tight up to a $\log(1/ε)$ factor.
Finally, we use our weighted sampling algorithm to improve the message complexity of distributed $L_1$ tracking, also known as count tracking, which is a widely studied problem in distributed streaming. We also derive a tight message lower bound, which closes the message complexity of this fundamental problem.

Submitted 8 April, 2019; originally announced April 2019.

Comments: To appear in PODS 2019
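
For orientation, weighted sampling without replacement on a single machine can be sketched with the well-known Efraimidis-Spirakis key technique. This is not the distributed, message-optimal algorithm of the paper above; the function name `weighted_swor` and its parameters are illustrative assumptions.

```python
import heapq
import random


def weighted_swor(stream, k, rng=None):
    """Sample up to k items without replacement, biased by weight.

    `stream` yields (item, weight) pairs with weight > 0. Each item is
    assigned the key u ** (1 / weight) for u uniform in (0, 1], and the
    k items with the largest keys form the sample.
    (Single-machine sketch only; hypothetical helper, not the paper's method.)
    """
    if k <= 0:
        return []
    rng = rng or random.Random()
    heap = []  # min-heap of (key, index, item); the smallest key is evicted first
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            raise ValueError("weights must be positive")
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids u == 0
        key = u ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, index, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, index, item))
    return [item for _, _, item in heap]
```

Because every item occupies at most one slot in the heap, a heavy item can appear in the sample at most once, which is the property the abstract highlights.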

arXiv:1903.12065 [pdf, ps, other]

Optimal Random Sampling from Distributed Streams Revisited

Authors: Srikanta Tirthapura, David P. Woodruff

Abstract: We give an improved algorithm for drawing a random sample from a large data stream when the input elements are distributed across multiple sites which communicate via a central coordinator. At any point in time the set of elements held by the coordinator represents a uniform random sample from the set of all the elements observed so far. When compared with prior work, our algorithms asymptotically improve the total number of messages sent in the system as well as the computation required of the coordinator. We also present a matching lower bound, showing that our protocol sends the optimal number of messages up to a constant factor with large probability. As a byproduct, we obtain an improved algorithm for finding the heavy hitters across multiple distributed sites.

Submitted 28 March, 2019; originally announced March 2019.

Comments: This writeup is a revised version of a paper with the same title and authors, which appeared in the Proceedings of the International Conference on Distributed Computing (DISC) 2011

Journal ref: DISC 2011: 283-297

arXiv:1812.03398 [pdf, other]

doi 10.1145/3357384.3357983

FLEET: Butterfly Estimation from a Bipartite Graph Stream

Authors: Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet Erdem Sariyuce, Srikanta Tirthapura

Abstract: We consider space-efficient single-pass estimation of the number of butterflies, a fundamental bipartite graph motif, from a massive bipartite graph stream where each edge represents a connection between entities in two different partitions. We present a space lower bound for any streaming algorithm that can estimate the number of butterflies accurately, as well as FLEET, a suite of algorithms for accurately estimating the number of butterflies in the graph stream. Estimates returned by the algorithms come with provable guarantees on the approximation error, and experiments show good tradeoffs between the space used and the accuracy of approximation. We also present space-efficient algorithms for estimating the number of butterflies within a sliding window of the most recent elements in the stream. While there is a significant body of work on counting subgraphs such as triangles in a unipartite graph stream, our work seems to be one of the few to tackle the case of bipartite graph streams.

Submitted 28 August, 2019; v1 submitted 8 December, 2018; originally announced December 2018.

Comments: This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet Erdem Sariyuce and Srikanta Tirthapura. "FLEET: Butterfly Estimation from a Bipartite Graph Stream". The 28th ACM International Conference on Information and Knowledge Management

arXiv:1808.09531 [pdf, other]

doi 10.1109/BigData.2018.8622352

Enumerating Top-k Quasi-Cliques

Authors: Seyed-Vahid Sanei-Mehri, Apurba Das, Srikanta Tirthapura

Abstract: Quasi-cliques are dense incomplete subgraphs of a graph that generalize the notion of cliques. Enumerating quasi-cliques from a graph is a robust way to detect densely connected structures with applications to bio-informatics and social network analysis. However, enumerating quasi-cliques in a graph is a challenging problem, even harder than the problem of enumerating cliques. We consider the enumeration of top-k degree-based quasi-cliques, and make the following contributions: (1) We show that even the problem of detecting if a given quasi-clique is maximal (i.e. not contained within another quasi-clique) is NP-hard. (2) We present a novel heuristic algorithm KernelQC to enumerate the k largest quasi-cliques in a graph. Our method is based on identifying kernels of extremely dense subgraphs within a graph, followed by growing subgraphs around these kernels, to arrive at quasi-cliques with the required densities. (3) Experimental results show that our algorithm accurately enumerates quasi-cliques from a graph, is much faster than current state-of-the-art methods for quasi-clique enumeration (often more than three orders of magnitude faster), and can scale to larger graphs than current methods.

Submitted 28 August, 2018; originally announced August 2018.

Comments: 10 pages

Journal ref: 2018 IEEE International Conference on Big Data (Big Data)

arXiv:1807.09417 [pdf, other]

Shared-Memory Parallel Maximal Clique Enumeration

Authors: Apurba Das, Seyed-Vahid Sanei-Mehri, Srikanta Tirthapura

Abstract: We present shared-memory parallel methods for Maximal Clique Enumeration (MCE) from a graph. MCE is a fundamental and well-studied graph analytics task, and is a widely used primitive for identifying dense structures in a graph. Due to its computationally intensive nature, parallel methods are imperative for dealing with large graphs. However, surprisingly, there do not yet exist scalable and parallel methods for MCE on a shared-memory parallel machine. In this work, we present efficient shared-memory parallel algorithms for MCE, with the following properties: (1) the parallel algorithms are provably work-efficient relative to a state-of-the-art sequential algorithm, (2) the algorithms have a provably small parallel depth, showing that they can scale to a large number of processors, and (3) our implementations on a multicore machine show good speedup and scaling behavior with increasing number of cores, and are substantially faster than prior shared-memory parallel algorithms for MCE.

Submitted 24 July, 2018; originally announced July 2018.

Comments: 10 pages, 3 figures, proceedings of the 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2018

arXiv:1801.09039 [pdf, other]

Variance-Optimal Offline and Streaming Stratified Random Sampling

Authors: Trong Duc Nguyen, Ming-Hung Shih, Divesh Srivastava, Srikanta Tirthapura, Bojian Xu

Abstract: Stratified random sampling (SRS) is a fundamental sampling technique that provides accurate estimates for aggregate queries using a small size sample, and has been used widely for approximate query processing. A key question in SRS is how to partition a target sample size among different strata. While Neyman allocation provides a solution that minimizes the variance of an estimate using this sample, it works under the assumption that each stratum is abundant, i.e., has a large number of data points to choose from. This assumption may not hold in general: one or more strata may be bounded, and may not contain a large number of data points, even though the total data size may be large. We first present VOILA, an offline method for allocating sample sizes to strata in a variance-optimal manner, even for the case when one or more strata may be bounded. We next consider SRS on streaming data that are continuously arriving. We show a lower bound, that any streaming algorithm for SRS must have (in the worst case) a variance that is Ω(r) factor away from the optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS that is locally variance-optimal in its allocation of sample sizes to different strata. Our results from experiments on real and synthetic data show that VOILA can have significantly (1.4 to 50.0 times) smaller variance than Neyman allocation. The streaming algorithm S-VOILA results in a variance that is typically close to VOILA, which was given the entire input beforehand.

Submitted 20 February, 2018; v1 submitted 27 January, 2018; originally announced January 2018.
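
As background for the entry above, classical Neyman allocation assigns each stratum a share of the sample proportional to its size times its standard deviation. The sketch below shows only that textbook baseline, not VOILA or S-VOILA; the function name and arguments are illustrative assumptions, and the result is left fractional rather than rounded.

```python
def neyman_allocation(sizes, stds, total):
    """Split `total` samples across strata in proportion to N_h * sigma_h.

    sizes[h] is the number of data points in stratum h, stds[h] its
    standard deviation. Returns fractional sample sizes, one per stratum.
    (Textbook baseline; hypothetical helper, not the paper's VOILA method.)
    """
    weights = [n * s for n, s in zip(sizes, stds)]
    norm = sum(weights)
    if norm == 0:
        raise ValueError("at least one stratum needs nonzero size and spread")
    return [total * w / norm for w in weights]
```

With `sizes=[5, 100]`, `stds=[10, 0.5]` and `total=40`, the formula asks for 20 samples from a stratum that holds only 5 points, which illustrates the bounded-stratum case that the abstract says Neyman allocation does not handle.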

arXiv:1801.07399 [pdf, other]

Onion Curve: A Space Filling Curve with Near-Optimal Clustering

Authors: Pan Xu, Cuong Nguyen, Srikanta Tirthapura

Abstract: Space filling curves (SFCs) are widely used in the design of indexes for spatial and temporal data. Clustering is a key metric for an SFC, that measures how well the curve preserves locality in moving from higher dimensions to a single dimension. We present the onion curve, an SFC whose clustering performance is provably close to optimal for cube and near-cube shaped query sets, irrespective of the side length of the query. We show that in contrast, the clustering performance of the widely used Hilbert curve can be far from optimal, even for cube-shaped queries. Since the clustering performance of an SFC is critical to the efficiency of multi-dimensional indexes based on the SFC, the onion curve can deliver improved performance for data structures involving multi-dimensional data.

Submitted 3 June, 2018; v1 submitted 23 January, 2018; originally announced January 2018.

Comments: The short version is published in ICDE 18

arXiv:1801.00338 [pdf, other]

Butterfly Counting in Bipartite Networks

Authors: Seyed-Vahid Sanei-Mehri, Ahmet Erdem Sariyuce, Srikanta Tirthapura

Abstract: We consider the problem of counting motifs in bipartite affiliation networks, such as author-paper, user-product, and actor-movie relations. We focus on counting the number of occurrences of a "butterfly", a complete $2 \times 2$ biclique, the simplest cohesive higher-order structure in a bipartite graph. Our main contribution is a suite of randomized algorithms that can quickly approximate the number of butterflies in a graph with a provable guarantee on accuracy. An experimental evaluation on large real-world networks shows that our algorithms return accurate estimates within a few seconds, even for networks with trillions of butterflies and hundreds of millions of edges.

Submitted 15 March, 2018; v1 submitted 31 December, 2017; originally announced January 2018.

Comments: 28 pages, 5 tables, 6 figures
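
To make the butterfly definition concrete, here is a small exact counter based on common neighbors. It is a brute-force illustration of the motif, not one of the randomized approximation algorithms from the paper; the function name and the edge-list input format are assumptions, and left-vertex labels are assumed to be mutually comparable.

```python
from collections import defaultdict
from itertools import combinations


def count_butterflies(edges):
    """Count 2x2 bicliques in a bipartite graph given as (left, right) edges.

    Exact, quadratic-per-neighborhood illustration only (hypothetical helper).
    """
    neighbors = defaultdict(set)  # right vertex -> set of its left neighbors
    for left, right in edges:
        neighbors[right].add(left)

    # common[(u, v)] = number of right vertices adjacent to both u and v
    common = defaultdict(int)
    for lefts in neighbors.values():
        for u, v in combinations(sorted(lefts), 2):
            common[(u, v)] += 1

    # A pair of left vertices with c common neighbors forms C(c, 2) butterflies.
    return sum(c * (c - 1) // 2 for c in common.values())
```

The cost grows with the number of left-vertex pairs sharing a neighbor, which is one reason sampling-based estimates are attractive on graphs with hundreds of millions of edges.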

arXiv:1710.02103 [pdf, other]

Learning Graphical Models from a Distributed Stream

Authors: Yu Zhang, Srikanta Tirthapura, Graham Cormode

Abstract: A current challenge for data management systems is to support the construction and maintenance of machine learning models over data that is large, multi-dimensional, and evolving. While systems that could support these tasks are emerging, the need to scale to distributed, streaming data requires new models and algorithms. In this setting, as well as computational scalability and model accuracy, we also need to minimize the amount of communication between distributed processors, which is the chief component of latency. We study Bayesian networks, the workhorse of graphical models, and present a communication-efficient method for continuously learning and maintaining a Bayesian network model over data that is arriving as a distributed stream partitioned across multiple processors. We show a strategy for maintaining model parameters that leads to an exponential reduction in communication when compared with baseline approaches to maintain the exact MLE (maximum likelihood estimation). Meanwhile, our strategy provides similar prediction errors for the target distribution and for classification tasks.

Submitted 5 October, 2017; originally announced October 2017.

arXiv:1707.08272 [pdf, other]

A Change-Sensitive Algorithm for Maintaining Maximal Bicliques in a Dynamic Bipartite Graph

Authors: Apurba Das, Srikanta Tirthapura

Abstract: We consider the maintenance of maximal bicliques from a dynamic bipartite graph that changes over time due to the addition or deletion of edges. When the set of edges in a graph changes, we are interested in knowing the change in the set of maximal bicliques (the "change"), rather than in knowing the set of maximal bicliques that remain unaffected. The challenge in an efficient algorithm is to enumerate the change without explicitly enumerating the set of all maximal bicliques. In this work, we present (1) near-tight bounds on the magnitude of change in the set of maximal bicliques of a graph, due to a change in the edge set, and (2) a "change-sensitive" algorithm for enumerating the change in the set of maximal bicliques, whose time complexity is proportional to the magnitude of change that actually occurred in the set of maximal bicliques in the graph. To our knowledge, these are the first algorithms for enumerating maximal bicliques in a dynamic graph, with such provable performance guarantees. Our algorithms are easy to implement, and experimental results show that their performance exceeds that of current baseline implementations by orders of magnitude.

Submitted 25 July, 2017; originally announced July 2017.

Comments: 12 pages, 9 figures

arXiv:1701.03826 [pdf, other]

Streaming k-Means Clustering with Fast Queries

Authors: Yu Zhang, Kanat Tangwongsan, Srikanta Tirthapura

Abstract: We present methods for k-means clustering on a stream with a focus on providing fast responses to clustering queries. Compared to the current state-of-the-art, our methods provide substantial improvement in the query time for cluster centers while retaining the desirable properties of provably small approximation error and low space usage. Our algorithms rely on a novel idea of "coreset caching" that systematically reuses coresets (summaries of data) computed for recent queries in answering the current clustering query. We present both theoretical analysis and detailed experiments demonstrating their correctness and efficiency.

Submitted 6 December, 2018; v1 submitted 13 January, 2017; originally announced January 2017.

arXiv:1602.05232 [pdf, other]

Work-Efficient Parallel and Incremental Graph Connectivity

Authors: Natcha Simsiri, Kanat Tangwongsan, Srikanta Tirthapura, Kun-Lung Wu

Abstract: On an evolving graph that is continuously updated by a high-velocity stream of edges, how can one efficiently maintain whether two vertices are connected? This is the connectivity problem, a fundamental and widely studied problem on graphs. We present the first shared-memory parallel algorithm for incremental graph connectivity that is both provably work-efficient and has polylogarithmic parallel depth. We also present a simpler algorithm with slightly worse theoretical properties, but which is easier to implement and has good practical performance. Our experiments show a throughput of hundreds of millions of edges per second on a 20-core machine.

Submitted 16 February, 2016; originally announced February 2016.

Comments: 18 pages
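
For readers unfamiliar with the problem in the entry above, incremental connectivity on a single thread is commonly handled with a union-find structure. The sketch below is that standard sequential structure, shown only to fix ideas; it is not the parallel algorithm of the paper, and the class and method names are illustrative assumptions.

```python
class IncrementalConnectivity:
    """Sequential union-find over an edge stream (hypothetical helper)."""

    def __init__(self):
        self.parent = {}  # vertex -> parent in its tree
        self.size = {}    # root -> number of vertices in its component

    def find(self, v):
        # Register unseen vertices as singleton components.
        if v not in self.parent:
            self.parent[v] = v
            self.size[v] = 1
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path at the root.
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def add_edge(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]

    def connected(self, u, v):
        return self.find(u) == self.find(v)
```

Edges are only ever added, never removed, which matches the incremental setting described in the abstract.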

arXiv:1601.06311 [pdf, other]

Incremental Maintenance of Maximal Cliques in a Dynamic Graph

Authors: Apurba Das, Michael Svendsen, Srikanta Tirthapura

Abstract: We consider the maintenance of the set of all maximal cliques in a dynamic graph that is changing through the addition or deletion of edges. We present nearly tight bounds on the magnitude of change in the set of maximal cliques, as well as the first change-sensitive algorithms for clique maintenance, whose runtime is proportional to the magnitude of the change in the set of maximal cliques. We present experimental results showing these algorithms are efficient in practice and are faster than prior work by two to three orders of magnitude.

Submitted 17 March, 2018; v1 submitted 23 January, 2016; originally announced January 2016.

Comments: 18 pages, 8 figures

arXiv:1404.4910 [pdf, ps, other]

Enumerating Maximal Bicliques from a Large Graph using MapReduce

Authors: Arko Provo Mukherjee, Srikanta Tirthapura

Abstract: We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many practical data mining problems in social network analysis and bioinformatics. We present novel parallel algorithms for the MapReduce platform, and an experimental evaluation using Hadoop MapReduce. Our algorithm is based on clustering the input graph into smaller sized subgraphs, followed by processing different subgraphs in parallel. Our algorithm uses two ideas that enable it to scale to large graphs: (1) the redundancy in work between different subgraph explorations is minimized through a careful pruning of the search space, and (2) the load on different reducers is balanced through the use of an appropriate total order among the vertices. Our evaluation shows that the algorithm scales to large graphs with millions of edges and tens of millions of maximal bicliques. To our knowledge, this is the first work on maximal biclique enumeration for graphs of this scale.

Submitted 18 April, 2014; originally announced April 2014.

Comments: A preliminary version of the paper was accepted at the Proceedings of the 3rd IEEE International Congress on Big Data 2014

arXiv:1310.6780 [pdf, ps, other]

Mining Maximal Cliques from an Uncertain Graph

Authors: Arko Provo Mukherjee, Pan Xu, Srikanta Tirthapura

Abstract: We consider mining dense substructures (maximal cliques) from an uncertain graph, which is a probability distribution on a set of deterministic graphs. For parameter 0 < α < 1, we present a precise definition of an α-maximal clique in an uncertain graph. We present matching upper and lower bounds on the number of α-maximal cliques possible within an uncertain graph. We present an algorithm to enumerate α-maximal cliques in an uncertain graph whose worst-case runtime is near-optimal, and an experimental evaluation showing the practical utility of the algorithm.

Submitted 22 October, 2014; v1 submitted 24 October, 2013; originally announced October 2013.

Comments: ICDE 2015

arXiv:1310.1161 [pdf, ps, other]

Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream

Authors: Bibudh Lahiri, Arko Provo Mukherjee, Srikanta Tirthapura

Abstract: We consider online mining of correlated heavy-hitters from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has almost exclusively focused on identifying heavy-hitters on a single-dimensional stream, and these methods yield little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties such as: what other items appear frequently along with a heavy-hitter, or what is the frequency distribution of items that appear along with the heavy-hitters. We consider queries of the following form: In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H. We call this problem Correlated Heavy-Hitters (CHH). We give an approximate formulation of CHH identification, and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace which is orders of magnitude smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.

Submitted 3 October, 2013; originally announced October 2013.
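
Note on the entry above: the CHH query form stated in the abstract can be illustrated with an exact, offline reference computation. The sketch below is not the paper's streaming algorithm, which works in small space and returns approximate answers; it stores the whole stream and counts exactly. The function name `exact_chh` and the two thresholds `phi1` and `phi2` are assumptions introduced here for illustration: an x value is treated as a heavy-hitter if it accounts for at least a `phi1` fraction of the stream, and a y value is reported for x if it accounts for at least a `phi2` fraction of the tuples carrying that x.

```python
from collections import Counter, defaultdict


def exact_chh(stream, phi1, phi2):
    """Exact, offline reference for the CHH query (illustrative only).

    phi1, phi2 are hypothetical thresholds, not taken from the paper:
      - x is a heavy-hitter if count(x) >= phi1 * n
      - y is correlated with x if count(x, y) >= phi2 * count(x)
    """
    stream = list(stream)
    n = len(stream)

    # Exact frequency of each x along the primary dimension.
    x_counts = Counter(x for x, _ in stream)

    # Exact frequency of each y within the substream of a given x.
    pair_counts = defaultdict(Counter)
    for x, y in stream:
        pair_counts[x][y] += 1

    result = {}
    for x, cx in x_counts.items():
        if cx >= phi1 * n:  # x belongs to the heavy-hitter set H
            result[x] = {
                y for y, cxy in pair_counts[x].items() if cxy >= phi2 * cx
            }
    return result


stream = [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("a", 1), ("c", 4)]
print(exact_chh(stream, phi1=0.5, phi2=0.5))  # {'a': {1}}
```

This reference uses memory proportional to the number of distinct (x, y) pairs, which is the cost a streaming algorithm for CHH is designed to avoid.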

arXiv:1308.2166 [pdf, other]

Parallel Triangle Counting in Massive Streaming Graphs

Authors: Kanat Tangwongsan, A. Pavan, Srikanta Tirthapura

Abstract: The number of triangles in a graph is a fundamental metric, used in social network analysis, link classification and recommendation, and more. Driven by these applications and the trend that modern graph datasets are both large and dynamic, we present the design and implementation of a fast and cache-efficient parallel algorithm for estimating the number of triangles in a massive undirected graph whose edges arrive as a stream. It brings together the benefits of streaming algorithms and parallel algorithms. By building on the streaming algorithms framework, the algorithm has a small memory footprint. By leveraging the parallel cache-oblivious framework, it makes efficient use of the memory hierarchy of modern multicore machines without needing to know its specific parameters. We prove theoretical bounds on accuracy, memory access cost, and parallel runtime complexity, and show empirically that the algorithm yields accurate results and substantial speedups compared to an optimized sequential implementation. (This is an expanded version of a CIKM'13 paper of the same title.)

Submitted 9 August, 2013; originally announced August 2013.

arXiv:1004.1569

A Streaming Approximation Algorithm for Klee's Measure Problem

Authors: Gokarna Sharma, Costas Busch, Srikanta Tirthapura

Abstract: The efficient estimation of frequency moments of a data stream in one pass using limited space and time per item is one of the most fundamental problems in data stream processing. An especially important estimation is to find the number of distinct elements in a data stream, which is generally referred to as the zeroth frequency moment and denoted by $F_0$. In this paper, we consider streams of rectangles defined over a discrete space, and the task is to compute the total number of distinct points covered by the rectangles. This is known as Klee's measure problem in 2 dimensions. We present and analyze a randomized streaming approximation algorithm which gives an $(ε, δ)$-approximation of $F_0$ for the total area of Klee's measure problem in 2 dimensions. Our algorithm achieves the following complexity bounds: (a) the amortized processing time per rectangle is $O(\frac{1}{ε^4}\log^3 n\log\frac{1}{δ})$; (b) the space complexity is $O(\frac{1}{ε^2}\log n \log\frac{1}{δ})$ bits; and (c) the time to answer a query for $F_0$ is $O(\log\frac{1}{δ})$. To our knowledge, this is the first streaming approximation for Klee's measure problem that achieves sub-polynomial bounds.

Submitted 28 October, 2010; v1 submitted 9 April, 2010; originally announced April 2010.

Comments: This paper has been withdrawn by the author due to a small technical error in Algorithm 3 and 4

Showing 1–22 of 22 results for author: Tirthapura, S