3
Divide and Recombine: Approach for Detailed
Analysis and Visualization of Large Complex Data
Ryan Hafen
CONTENTS
3.1 Introduction
3.2 Context: Deep Analysis of Large Complex Data
    3.2.1 Deep Analysis
    3.2.2 Large Complex Data
    3.2.3 What is Needed for Analysis of Large Complex Data?
3.3 Divide and Recombine
    3.3.1 Division Approaches
    3.3.2 Recombination Approaches
    3.3.3 Data Structures and Computation
    3.3.4 Research in D&R
3.4 Trelliscope
    3.4.1 Trellis Display/Small Multiples
    3.4.2 Scaling Trellis Display
    3.4.3 Trelliscope
3.5 Tessera: Computational Environment for D&R
    3.5.1 Front End
    3.5.2 Back Ends
        3.5.2.1 Small Scale: In-Memory Storage with R MapReduce
        3.5.2.2 Medium Scale: Local Disk Storage with Multicore R MapReduce
        3.5.2.3 Large Scale: HDFS and Hadoop MapReduce via RHIPE
        3.5.2.4 Large Scale in Memory: Spark
3.6 Discussion
References
3.1 Introduction
The amount of data being captured and stored is ever increasing, and the need to make
sense of it poses great statistical challenges in methodology, theory, and computation. In this
chapter, we present a framework for statistical analysis and visualization of large complex
data: divide and recombine (D&R).
In D&R, a large dataset is broken into pieces in a meaningful way, statistical or
visual methods are applied to each subset in an embarrassingly parallel fashion, and the
results of these computations are recombined in a manner that yields a statistically valid
result. We introduce D&R in Section 3.3 and discuss various division and recombination
schemes.
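To make the pattern concrete, the following is a minimal base-R sketch of one D&R pass on a small synthetic dataset. The data frame, its columns, and the choice of fitting a linear model per subset and averaging coefficients are illustrative assumptions only; in practice D&R runs against a distributed back end, as described later in this chapter.

```r
# Hypothetical data: daily records for a handful of banks (illustrative only).
set.seed(1)
banks <- data.frame(
  bank     = rep(c("A", "B", "C", "D"), each = 250),
  deposits = rnorm(1000),
  loans    = rnorm(1000)
)

# Divide: break the data into subsets, here by bank.
subsets <- split(banks, banks$bank)

# Apply: fit the same model independently (embarrassingly parallel) to each subset.
fits <- lapply(subsets, function(d) coef(lm(loans ~ deposits, data = d)))

# Recombine: one simple recombination averages the per-subset coefficients.
recombined <- Reduce(`+`, fits) / length(fits)
recombined
```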
D&R provides the foundation for Trelliscope, an approach to detailed visualization of
large complex data. Trelliscope is a multipanel display system based on the concepts of
Trellis display. In Trellis display, data are broken into subsets, a visualization method is
applied to each subset, and the resulting panels are arranged in a grid, facilitating meaningful
visual comparison between panels. Trelliscope extends Trellis by providing a multipanel
display system that can handle a very large number of panels and a paradigm for
interactively working with displays that have many panels.
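As a small illustration of the Trellis idea, the following uses the lattice package, R's implementation of Trellis display, on a built-in dataset; the dataset and variables are chosen purely for illustration.

```r
library(lattice)

# Trellis display: the data are split by number of cylinders, a scatterplot of
# mileage versus weight is drawn for each subset, and the panels are arranged
# in a grid with common scales to facilitate comparison.
xyplot(mpg ~ wt | factor(cyl), data = mtcars,
       layout = c(3, 1),
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```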
3.2 Context: Deep Analysis of Large Complex Data

3.2.1 Deep Analysis

An analyst with domain expertise might be able to provide insights, based on such
explorations, that vastly improve the quality of the data or help the analyst look at the data
from a new perspective. In the words of the father of exploratory data analysis, John Tukey:
This discussion of deep analysis is nothing new to the statistical practitioner, and to
such readers our discussion may feel a bit belabored. But in the domain of big data, its
practice severely lags behind other analytical approaches and is often ignored, and hence it
deserves attention.
3.2.3 What is Needed for Analysis of Large Complex Data?

Analysis of large complex data requires distributed storage and computing frameworks.
With big data, we need a statistical methodology that provides access to the thousands of
methods available in a language such as R without the need to reimplement them. Our
proposed approach is D&R, described in Section 3.3.
3.3 Divide and Recombine

FIGURE 3.1: Diagram of the D&R statistical and computational framework. (Data are divided into subsets; outputs computed from each subset feed a statistic recombination, an analytic recombination, or new data for analysis.)
3.3.1 Division Approaches

As an example of specifying a division based on subject matter, suppose we have 25
years of 90 daily financial variables for 100 banks in the United
States. If we wish to study the behavior of individual banks and then make comparisons
across banks, we would partition the data by bank. If we are interested in how all banks
behave together over the course of each year, we could partition by year. Other aspects
such as geography and type or size of bank might also be valid candidates for a division
specification.
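The two division choices can be sketched as follows; the data frame and its columns are hypothetical stand-ins for the financial data described above.

```r
# Hypothetical daily bank data: 100 banks over 25 years (columns reduced).
bank_data <- expand.grid(
  bank = paste0("bank_", 1:100),
  date = seq(as.Date("1990-01-01"), as.Date("2014-12-31"), by = "day")
)
bank_data$assets <- rnorm(nrow(bank_data))

# Division by bank: one subset per bank, to study and compare individual banks.
by_bank <- split(bank_data, bank_data$bank)

# Division by year: one subset per year, to study how all banks behave
# together over the course of each year.
by_year <- split(bank_data, format(bank_data$date, "%Y"))
```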
A critical consideration when specifying a division is to obtain subsets that are small
enough to be manageable when loaded into memory, so that each can be processed within
a single R process. Sometimes, a division driven by subject matter can lead to subsets that
are too large. In this case, the analyst must apply some creativity to break the subsets down
further.
The persistence of a division is important. Division is an expensive operation, as it can
require shuffling a large amount of data around on a cluster. A given partitioning of the data
is typically reused many times while we are iterating over different analytical methods. For
example, after partitioning financial data by bank, we will probably apply many different
analytical and visual methods to that partitioning scheme until we have a model we are
happy with. We do not want to incur the cost of division each time we want to try a new
method.
Keeping multiple persistent copies of data formatted in different ways for different
analysis purposes is a common practice with small data, and for a good reason. Having the
appropriate data structure for a given analysis task is critical, and the complexity of the
data often means that these structures will be very different depending on the task (e.g.,
not always tabular). Thus, it is not generally sufficient to simply have a single table that is
indexed in different ways for different analysis tasks. The notion of possibly creating
multiple copies of a large dataset may be alarming to a database engineer, but should not be
surprising to a statistical practitioner, as it is a standard practice with small datasets to
have different copies of the data for different purposes.
3.3.2 Recombination Approaches

When an analytic recombination is an approximation to the result that would be obtained
by applying a method to all of the data at once, the small loss in accuracy is often a small
price to pay for the simple, fast computation. A more
lengthy discussion of this can be found in [6], and some interesting research is discussed in
Section 3.3.4.
Another crucial recombination approach is visual recombination, which we discuss in
Section 3.4.
3.3.3 Data Structures and Computation

In addition to the methodological concerns of D&R, there are also computational concerns.
Here, we define the minimal conditions required for a D&R computational environment.
Data structures are the first important consideration. Division methods can result in
partitions whose subsets have nontabular data structures. A generic storage mechanism
for data with potentially arbitrary structure is a key-value store. In a key-value store, each
key-value pair constitutes a subset of the data; typically the key is a unique identifier
or object that describes the subset, and the value contains the data for the subset. When
the data are large, there are many distributed key-value store technologies that might be
used, such as the Hadoop Distributed File System (HDFS) [14]. It is important to be able
to store data in a persistent state on these systems, and useful to have fast random lookup
of any subset by key.
A dataset that has been partitioned into a collection of subsets stored as key-value
pairs, potentially distributed across the machines or disks of a cluster, is called a distributed
data object (DDO). When the data for each subset of a DDO is a slice of a data frame, we
can more specifically call such an object a distributed data frame.
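A local, in-memory sketch of this structure follows; a real DDO would spread the key-value pairs across the machines of a cluster, but the shape of the object is the same. The dataset and keys here are purely illustrative.

```r
# A toy "distributed data object": a list of key-value pairs, where each key
# identifies a subset and each value holds that subset's data. Because every
# value here is a slice of a data frame, this is also a distributed data frame.
ddo <- lapply(split(iris, iris$Species), function(x) {
  list(key = as.character(x$Species[1]), value = x)
})
names(ddo) <- vapply(ddo, `[[`, character(1), "key")

# Fast random lookup of a subset by key (a real key-value store indexes this).
head(ddo[["virginica"]]$value)

# Values need not be tabular: a key-value pair could just as well hold a
# fitted model or any other R object computed from the subset.
ddo_models <- lapply(ddo, function(kv) {
  list(key = kv$key, value = lm(Petal.Length ~ Sepal.Length, data = kv$value))
})
```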
For D&R computation, we need to compute on a distributed key-value store in parallel
and in a way that allows us to shuffle data around in a division step, apply analytic methods
to the subsets, and combine results in a recombination step. Environments that implement
MapReduce [4] are sufficient for this. In MapReduce, a map operation is applied to a
collection of input key-value pairs in parallel. The map step outputs a transformed set of
key-value pairs, which are then shuffled so that results are grouped by the map
output keys. Each collection of data for a unique map output key is then sent to a reduce
operation, again executed in parallel. All D&R operations can be carried out through this
approach. Systems such as Hadoop [8] that run MapReduce on HDFS are a natural fit for
D&R computation.
D&R is sometimes mistaken as being equivalent to MapReduce. This is not the case;
rather, D&R operations are carried out by MapReduce. For example, a division is typically
achieved by a single MapReduce operation: the map step reassigns records to new partitions,
the shuffle groups the map output by partition assignment, and the reduce collates each
partition. The application of an analytic method followed by recombination is a separate
MapReduce operation, where the analytic method is applied in the map and the
recombination is done in the shuffle and reduce. Recall that division is independent of recombination
and that typically a division persists and is used for many different recombinations.
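The following base-R sketch mimics these two jobs on a single machine, with split() standing in for the shuffle and lapply() for the parallel map and reduce steps; it illustrates the data flow only and makes no attempt to represent a real Hadoop job.

```r
# Input key-value pairs: arbitrary blocks of rows, as data might sit in storage.
blocks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

## Job 1: division.
# Map: tag each record group with its new partition key (here, cylinder count).
mapped  <- lapply(blocks, function(b) split(b, b$cyl))
map_out <- unlist(mapped, recursive = FALSE)   # names look like "<block>.<cyl>"
# Shuffle: group map output by partition key; reduce: collate each partition.
shuffled <- split(map_out, sub(".*\\.", "", names(map_out)))
division <- lapply(shuffled, function(parts) do.call(rbind, parts))

## Job 2: apply an analytic method, then recombine.
# Map: apply the method to each subset of the persisted division.
applied <- lapply(division, function(d) coef(lm(mpg ~ wt, data = d)))
# Reduce: recombine, here by averaging the per-subset coefficients.
recombined <- Reduce(`+`, applied) / length(applied)
recombined
```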
A common question that arises when discussing the use of systems such as Hadoop for
D&R computation concerns Hadoop being a batch processing system. As we discussed in
Section 3.2, this fits the purpose of the type of deep analysis we are doing, which is typically
offline analysis of historical data. Of course, the models and algorithms that result from a
deep analysis can be adapted and integrated into a real-time processing environment, but
that is not the use case for D&R, where we are doing the actual discovery and validation of
the algorithms. But speed is still important, and much work is being done to improve the
speed of these types of systems. Spark [23], for example, can keep the data in memory,
distributed across machines, avoiding the most expensive time cost present in Hadoop: reading
and writing data to disk multiple times throughout a MapReduce operation.
3.3.4 Research in D&R

Statistical research on recombination methods for D&R includes the bag of little
bootstraps [11], the alternating direction method of multipliers [2], and scatter matrix stability
weighting [10]. It is important to note that we seek methods that require a very minimal
amount of iteration, preferably none, as every MapReduce step can be very costly. There are
also many domains of statistics that remain to be studied, including spatial, time series, and
nonparametric statistics. There is a great opportunity for interesting research in this area.
3.4 Trelliscope
Visualization is crucial throughout the analysis process. This could not be more true than
in the case of large complex data. Typical approaches to visualizing large data are either
very specialized tools for data from a specific domain, or schemes that aggregate various
aspects of the data into a single plot. Specialized tools generally do not work for deep
analysis because of the flexibility required to make any imaginable plot we might deem
useful. Visualizations of summaries are indispensable, but alone are not enough. Summaries
can hide important information, particularly when the data are complex. We need flexible,
detailed visualization that scales. Trelliscope is an approach that addresses these needs.
3.4.2 Scaling Trellis Display

Scale is a problem here. Typically in D&R, we are dealing with very large datasets, and
divisions can result in thousands to hundreds of thousands of subsets. A multipanel display
would have as many panels. This can happen even with data of small to moderate size: it
is easy to generate thousands of plots, but it is not feasible to look at all of them.
The problem of having more visualizations than one can cognitively consume is one
that pioneering statistician John Tukey recognized decades ago. Since it is
impossible to view every display in a large collection, he put forth the idea of asking the
computer to sort out which ones to present by judging the relative interest of showing each
of them [18]. He proposed computing diagnostic quantities with the help of a computer to
determine which plots to show. In his words, “it seems natural to call such computer guiding
diagnostics cognostics. We must learn to choose them, calculate them, and use them. Else
we drown in a sea of many displays” [18]. Hence, we use the term cognostic to mean a single
metric computed about a single plot that captures a behavior of interest that can be used
by the computer to bring interesting plots to our attention. For any collection of plots, we
may be interested in several behaviors, and will therefore compute several cognostics.
There has been interesting research on cognostics for scatterplots, called scagnostics,
that has yielded metrics that quantify different shapes or behaviors that might present
themselves in a scatterplot [21,22]. Beyond scatterplots, there are many useful metrics that
might be computed to guide the selection of plots. Many such cognostics may be context-
dependent. For example, when dealing with displays of quantile plots, metrics such as the
median, first and third quartile, and range might be good cognostics. Or when dealing
with time series plots, calculations of slope, autocorrelation coefficient, variance of first-
order differences, and so on might be good cognostics. Often, choosing a useful cognostic
is heavily based on the subject matter of the data and the particular plot being made.
For example, consider a collection of quantile plots, one for each county in the United States,
showing the age distribution for males and females. For each quantile plot, a useful cognostic
might be the difference in median age between genders.
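Computing such a cognostic for every subset is itself an embarrassingly parallel map step. A sketch follows, using a synthetic stand-in for the county age data (that dataset is not part of this chapter):

```r
# Synthetic stand-in for per-county age data: columns county, sex, and age.
set.seed(2)
ages <- data.frame(
  county = rep(sprintf("county_%02d", 1:20), each = 200),
  sex    = sample(c("male", "female"), 4000, replace = TRUE),
  age    = round(runif(4000, 0, 90))
)

# One cognostic per subset (county): difference in median age between genders.
cogs <- do.call(rbind, lapply(split(ages, ages$county), function(d) {
  med <- tapply(d$age, d$sex, median)
  data.frame(county       = d$county[1],
             med_age_diff = unname(med["male"] - med["female"]))
}))

# Panels at the extremes of this cognostic are the first ones to look at.
head(cogs[order(-abs(cogs$med_age_diff)), ])
```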
How does the computer use cognostics to determine which plots we should view? There
are many possibilities. One is ranking or sorting the plots based on the cognostics. For
example, we can effectively understand the data at extremes of the median age difference
by gender by calling up panels with the largest or smallest absolute age difference cognostic.
Another possibility is sampling plots across the distribution of a set of cognostics, for
example, looking at a representative set of panels that spans the age difference distribution.
Another action is to filter panels of a display to only panels that have cognostics in a range
of interest. There are many possible effective ways to get a good representation of interesting
panels in a large display, particularly when combining cognostics. For example, if we want
to find interesting panels with respect to a collection of cognostics, we can
compute projections of the set of cognostics into two dimensions, plot the projection as a
scatterplot, and select panels based on interesting regions of the projection. Another
possibility is to apply a clustering algorithm to a set of cognostics to view panels representative
of each of the clusters.
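All of these actions reduce to simple operations on the table of cognostics. A small sketch follows, using a hypothetical cognostics table with one row per panel and one column per cognostic.

```r
# Hypothetical cognostics table for 5,000 time series panels.
set.seed(3)
cogs <- data.frame(panel    = 1:5000,
                   slope    = rnorm(5000),
                   acf1     = runif(5000, -1, 1),
                   var_diff = rexp(5000))

# Rank: call up the panels with the most extreme slope.
head(cogs[order(-abs(cogs$slope)), ], 10)

# Filter: only panels whose lag-1 autocorrelation lies in a range of interest.
subset(cogs, acf1 > 0.8)

# Sample: a representative set of panels spanning the slope distribution,
# here the panel nearest each decile of the slope cognostic.
deciles <- quantile(cogs$slope, probs = seq(0, 1, 0.1))
cogs[sapply(deciles, function(q) which.min(abs(cogs$slope - q))), ]

# Project: combine several cognostics into two dimensions (here via principal
# components) and select panels from interesting regions of the projection.
proj <- prcomp(cogs[, c("slope", "acf1", "var_diff")], scale. = TRUE)$x[, 1:2]
plot(proj)
```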
3.4.3 Trelliscope
Trelliscope is a computational system that implements the ideas of large-scale multipanel
display with cognostics for effective detailed display of large complex data [9]. In Trelliscope,
the analyst creates a division of the data, specifies a plotting function to be applied to each
subset, and also specifies a cognostics function that computes a set of metrics for each
subset. The cognostics are collected from every subset and are used in an interactive viewer
that allows the user to apply different actions to the cognostics (sort, filter, and sample)
across different dimensions, thereby arranging panels in a desired way.
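A generic sketch of that workflow in base R is shown below: one panel image and one row of cognostics per subset. It only illustrates the idea; the Trelliscope software provides these pieces at scale along with the interactive viewer, and the dataset, panel function, and cognostics function here are illustrative assumptions.

```r
subsets  <- split(iris, iris$Species)

# Plotting function applied to each subset to produce one panel.
panel_fn <- function(d) plot(d$Sepal.Length, d$Petal.Length,
                             xlab = "Sepal length", ylab = "Petal length")

# Cognostics function applied to each subset to produce one row of metrics.
cog_fn   <- function(d) data.frame(n   = nrow(d),
                                   cor = cor(d$Sepal.Length, d$Petal.Length))

dir.create("panels", showWarnings = FALSE)
cogs <- do.call(rbind, lapply(names(subsets), function(k) {
  png(file.path("panels", paste0(k, ".png")))   # render the panel for this subset
  panel_fn(subsets[[k]])
  dev.off()
  cbind(key = k, cog_fn(subsets[[k]]))          # cognostics indexed by subset key
}))

cogs  # a viewer would use this table to sort, filter, and sample the panels
```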
3.5 Tessera: Computational Environment for D&R

3.5.2 Back Ends

As discussed in Section 3.3.3, a back end for D&R needs to be able to run MapReduce over
a distributed key-value store. Additionally, there needs to be a mechanism that connects
R to these back ends, as we need to be able to run R code inside the map and reduce
steps. Tessera currently supports four different back ends, which we discuss in the following
sections: in-memory storage with R MapReduce (small), local disk storage with multicore R
MapReduce (medium), HDFS storage with Hadoop MapReduce via RHIPE (large), and
HDFS storage with Spark MapReduce via SparkR (large).
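As a rough sketch of how the front end stays the same while the back end changes, the following assumes the datadr front-end API, with a divide() call whose output goes to a connection created by localDiskConn() or hdfsConn(); exact argument names and behavior should be checked against the datadr documentation, and the HDFS variant additionally requires a Hadoop cluster with RHIPE installed, so it is shown commented out.

```r
library(datadr)

# Small scale: in-memory back end; no connection object is needed.
by_cyl_mem <- divide(mtcars, by = "cyl")

# Medium scale: the same division written to a local-disk key-value store
# (path is illustrative).
by_cyl_disk <- divide(mtcars, by = "cyl",
                      output = localDiskConn("/tmp/mtcars_by_cyl"))

# Large scale: the same call with an HDFS connection, computed with Hadoop
# MapReduce via RHIPE (requires a configured cluster, so not run here).
# by_cyl_hdfs <- divide(mtcars, by = "cyl",
#                       output = hdfsConn("/tmp/mtcars_by_cyl"))
```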
3.5.2.4 Large Scale in Memory: Spark

Spark offers a rich set of distributed operations beyond MapReduce; these additional
operations are a bonus in that the same data being used for D&R can be used for other
parallel computing purposes. Tessera connects to Spark using the SparkR package [19],
which exposes the Spark API in the R console.
Support in Tessera for Spark at the time of this writing is very experimental: it has
been implemented and works, and adding it to Tessera is a testament to Tessera's flexibility
in being back-end agnostic, but it has only been tested with rather small datasets.
3.6 Discussion
In this chapter, we have presented one point of view regarding methodology and
computational tools for deep statistical analysis and visualization of large complex data. D&R is
attractive because of its simplicity and its ability to make a wide array of methods available
without needing to implement scalable versions of them. D&R also builds on approaches
that are already very popular with small data, particularly implementations of the split-
apply-combine paradigm such as the plyr and dplyr R packages. D&R as implemented in
datadr is future-proof because of its design, enabling adoption of improved back-end
technology as it comes along. All of these factors give D&R a high chance of success. However,
there is a great need for more research and software development to extend D&R to more
statistical domains and make it easier to program.
References
1. Richard A. Becker, William S. Cleveland, and Ming-Jen Shyu. The visual design
and control of trellis display. Journal of Computational and Graphical Statistics,
5(2):123–155, 1996.
2. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed
optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
3. William S. Cleveland and Ryan Hafen. Divide and recombine (D&R): Data science
for large complex data. Statistical Analysis and Data Mining: The ASA Data Science
Journal, 7(6):425–433, 2014.
4. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large
clusters. Communications of the ACM, 51(1):107–113, 2008.
5. Saptarshi Guha. Computing environment for the statistical analysis of large and
complex data. PhD thesis, Department of Statistics, Purdue University, West Lafayette,
IN, 2010.
6. Saptarshi Guha, Ryan Hafen, Jeremiah Rounds, Jin Xia, Jianfu Li, Bowei Xi, and
William S. Cleveland. Large complex data: Divide and recombine (D&R) with RHIPE.
Stat, 1(1):53–67, 2012.
7. Saptarshi Guha, Paul Kidwell, Ryan Hafen, and William S. Cleveland. Visualization
databases for the analysis of large complex datasets. In International Conference on
Artificial Intelligence and Statistics, pp. 193–200, 2009.
9. Ryan Hafen, Luke Gosink, Jason McDermott, Karin Rodland, Kerstin Kleese-Van Dam,
and William S. Cleveland. Trelliscope: A system for detailed visualization in the deep
analysis of large complex data. In IEEE Symposium on Large-Scale Data Analysis and
Visualization (LDAV ), pp. 105–112. IEEE, Atlanta, GA, 2013.
10. Michael J. Kane. Scatter matrix concordance: A diagnostic for regressions on subsets
of data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2015.
11. Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I. Jordan. A scalable
bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 2014.
12. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria, 2012.
13. Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman,
Edward I. George, and Robert E. McCulloch. Bayes and big data: The consensus Monte
Carlo algorithm. In EFaB@Bayes 250 Conference, volume 16, 2013.
14. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop
distributed file system. In IEEE 26th Symposium on Mass Storage Systems and
Technologies, pp. 1–10. IEEE, Incline Village, NV, 2010.
15. Luke Tierney, Anthony Rossini, Na Li, and Han Sevcikova. SNOW: Simple Network of
Workstations. R package version 0.3-13, 2013.
16. Edward R. Tufte. Visual Explanations: Images and Quantities, Evidence and Narrative,
volume 36. Graphics Press, Cheshire, CT, 1997.
17. John W. Tukey. Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.
18. John W. Tukey and Paul A. Tukey. Computer graphics and exploratory data analysis:
An introduction. The Collected Works of John W. Tukey: Graphics: 1965–1985, 5:419,
1988.
19. Shivaram Venkataraman. SparkR: R frontend for Spark. R package version 0.1, 2013.
20. Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of
Statistical Software, 40(1):1–29, 2011.
21. Leland Wilkinson, Anushka Anand, and Robert L. Grossman. Graph-theoretic
scagnostics. In INFOVIS, volume 5, p. 21, 2005.
22. Leland Wilkinson and Graham Wills. Scagnostics distributions. Journal of
Computational and Graphical Statistics, 17(2):473–491, 2008.
23. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion
Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX
Conference on Hot Topics in Cloud Computing, pp. 10–10, 2010.