Data Visualization in The Age of Big Data
Rafael Lacerda
University of Toronto
Toronto, Canada

With this motivation, we seek to understand the changes or workarounds available to improve users' workflows when dealing with big data through the lens of data visualization. Although there are many commercial solutions available, this paper focuses exclusively on peer-reviewed solutions.

BACKGROUND LITERATURE

Dataset sizes were growing, along with the techniques for handling them, well before the term "Big Data" was coined. Thus, unless a technique is explicitly associated with the term, whether it counts as a Big Data technique may be up for debate. This background section discusses traditional approaches to visualizing large datasets, regardless of whether they were presented as Big Data techniques. The terms "large datasets" and "Big Data" are used interchangeably.

Data exploration and scalability

With cognitive and latency issues in mind, solutions and workarounds have been developed to mitigate them. A widespread class of techniques, known as data reduction methods, improves the perceptual and computational scalability of visualizations [19].

Filtering and sampling techniques allow standard visualization techniques to be applied to a subset of the data [19]. They typically involve representing data with centrality measures (mean, median) or spread measures (standard deviation), or selecting a subsample (specified by the user or drawn at random) from which visualizations are rendered [7]. However, the resulting data are degraded versions of the original and can miss structural information or outliers [7, 19].

Binned aggregation is a technique that separates data according to a desired category or time interval. Granular computing is a useful paradigm for dealing with complex hierarchies in data by using fuzzy sets, rough sets and random sets [22]. Interactive visualizations allow users to select the granularity of bins that suits their cognitive needs, such as hierarchical categories or hierarchical time intervals [19].

In 1991, Buja et al. proposed the concepts of focusing and linking as simple classifications for techniques aimed at exploring and interacting with large datasets. Focusing refers to selecting a subset of the data, for example by panning, zooming and slicing. It also includes dimensionality reduction such as projections and false coloring. Focusing allows viewing a small section of the data with finely grained information. Linking allows interactive control of multiple views at the same time: by selecting a point or section of data, multiple two- or three-dimensional visualizations can inform the user about distinct aspects of the selected data [4].

High dimensionality presents a cognitive hurdle for users [14, 19]. Dimensionality reduction techniques find lower-dimensional structures hidden in high-dimensional data and may be a crucial step in data visualization. Typically for visualization, the goal is to map a dataset into a two- or three-dimensional space that preserves its intrinsic structure as much as possible. Well-known methods include Principal Component Analysis (PCA), Independent Component Analysis and Multidimensional Scaling, though many others are available [12]. More recently, the use of neural networks as a generalization of PCA has been proposed, with better results [14].

For certain visualizations, precomputation is an option [19], most notably in mapping visualizations [11]. This solution trades computation ahead of time for computation at runtime, with the benefit of avoiding computing the same task twice. This approach is also related to the concept of granularity: the data needed for panning and zooming will already be computed, generating less latency for user tasks.

Displays

Shneiderman's definition of big data becomes relevant in this subsection. Traditional desktop displays have a maximum resolution of 4 Mpixel. At one data point per pixel, they cannot display a very large dataset at native resolution, and trade-offs such as those mentioned in the previous subsection must be made to visualize the entire dataset. While panning and zooming into subsets of data are a legitimate solution, operating these interfaces can also introduce cognitive hurdles by distracting the user from the actual data exploration, rendering traditional monitors ineffective for large data visualization [25].

A solution for interacting with large datasets came in 1992 under the name CAVE. The first version was a hollow ten-foot cube with interactive computer graphics projected on four of its surfaces, achieving a total resolution of 5 Mpixel, quite high for the time. A CAVE-like environment can be seen in Figure 1. By immersing the user in stereoscopic graphics, it was ideal for exploring three-dimensional visualizations. Although descendants of CAVE introduced better resolution and picture quality, the design itself required projectors and dimly lit spaces built specifically for this purpose, making it difficult to adopt in offices [25].
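
As a rough illustration of the pixel-budget argument made in the Displays discussion, the sketch below applies binned aggregation (described under "Data exploration and scalability" in this background section) so that a series with more points than available pixel columns is reduced to one summary per column. The function name `bin_to_pixel_budget`, the equal-width index bins, and the choice of keeping the minimum, mean and maximum of each bin are assumptions made for illustration only; they are not taken from any of the cited systems.

```python
from statistics import mean


def bin_to_pixel_budget(values, pixel_columns):
    """Reduce `values` to at most `pixel_columns` bins of (min, mean, max).

    Hypothetical helper for illustration: consecutive points are grouped
    into index ranges of near-equal size, one range per pixel column.
    Keeping min and max next to the mean retains extreme values that a
    mean-only summary would hide.
    """
    if pixel_columns <= 0:
        raise ValueError("pixel_columns must be positive")
    n = len(values)
    if n <= pixel_columns:
        # Already fits: one point per pixel column, no reduction needed.
        return [(v, v, v) for v in values]
    bins = []
    for col in range(pixel_columns):
        # Integer arithmetic spreads the n points evenly over the columns;
        # because n > pixel_columns, every chunk holds at least one point.
        start = col * n // pixel_columns
        end = (col + 1) * n // pixel_columns
        chunk = values[start:end]
        bins.append((min(chunk), mean(chunk), max(chunk)))
    return bins


# Assumed example: 100,000 points drawn on a 1,000-pixel-wide plot.
series = [i % 1000 for i in range(100_000)]
reduced = bin_to_pixel_budget(series, 1000)
print(len(reduced))
```

Storing the extremes alongside the mean is one possible way to soften the loss of outliers that the text attributes to data reduction; a real system would choose the aggregate to suit the chart type.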

Figure 1. VisBox VisCube, a CAVE-like display with four screens. Reprinted from "The effects of display fidelity, visual complexity, and task scope on spatial understanding of 3D graphs." by Bacim, F. et al. (2013) Proceedings of Graphics Interface 2013. Canadian Information Processing Society.

Display walls, built by tiling flat-screen monitors to form a contiguous surface, have seen wider adoption in recent years, allowing much higher resolution, better image quality and easier setup than CAVE. This solution has seen widespread adoption across offices due to higher quality and lower cost of setup [25].

Although background literature includes initial approaches to combining 2D and 3D visualizations to form what is known as a "Hybrid Reality" or sometimes "Mixed Reality" display, this remains an active topic of research and will be further discussed in the next section. In 2003, Kuester et al. developed VizClass, a hardware architecture for classroom visualization that combines 2D and 3D visualizations with touch-sensitive displays for the study of earthquake engineering [15, 16]. This environment was later used by Tresens and Kuester to present a method that combines 2D slices and 3D visuals for a better understanding of medical data [26].

CURRENT RESEARCH

Data exploration and scalability

Commonly used visualization tools are still not designed to handle large volumes of data [7]. In most cases, attempting to create visualizations from query results will still produce an unsatisfactory user experience, and typically, the burden of data reduction between the database management system (DBMS) and the visualization will be left to the user [3]. Other drawbacks of using isolated tools for querying and visualizing are duplication of functionality and missed opportunities to reuse data intelligently [27].

To address this, Battle et al. designed ScalaR, a visualization layer that dynamically performs resolution reduction when the expected number of rows is too large for satisfactory rendering performance [3].

ScalaR works as an intermediate layer between the front-end visualization system and the DBMS. When the expected query results exceed one of the limits imposed by ScalaR, such as the resources available to the visualization front-end, a trigger is activated to reduce the resolution of the query results. Data can be explored by panning and zooming, which also trigger dynamic queries to retrieve missing data. While the aggregation, sampling and filtering techniques used by ScalaR are not new, it uses the query plan provided by the DBMS to estimate query result size. If the estimated size violates predefined performance constraints, the user can choose between visualizing the query results at full or at reduced resolution. In the latter case, ScalaR modifies the original query to produce a reduced version. However, due to the basic data reduction algorithms used in the implementation, ScalaR has suboptimal performance on truly large query results, particularly when using aggregation techniques [3].

A later paper by Wu et al. proposes Ermac, an end-to-end Data Visualization Management System (DVMS) that takes advantage of visualization-specific optimization opportunities and supports a language dedicated to describing visualizations [27].

Optimizations mentioned include filtering out data occluded in the visualization, downsampling to match zoom level and canvas size, and dividing the rendering workload between client and server [27].

The core concept in Ermac, however, is its declarative language used to map raw data into visualizations. By compiling the visualization language into a relational algebra query, the necessary data can be fetched by the database. This approach relieves the user of the burden of considering how raw data could be transformed into visualizations, and of the indirect task of figuring out how to fetch, with a traditional relational algebra query, the data that can be transformed into a desired visualization. These optimizations allow interactive visual exploration to scale better, providing better rendering performance on very large datasets and greater user expression via Ermac's declarative language [27].

To tackle the issue of large dimensionality, the concept of Interactive Data Exploration (IDE) applications has been
presented by Cetintemel et al. to augment the human capability of exploring data. The authors further suggest that automated database navigators (DBNav) be integrated into the DBMS to "steer the user towards interesting trajectories through the data". Particularly relevant among its goals is providing a visual representation of the underlying data space, serving as an easier path to grasping the relevant patterns in the data, such as distributions, trends, outliers, relationships and others. In addition, the authors propose the concept of "query steering": sequential queries that guide the user in a meaningful way through the data [5].

With a similar goal, SeeDB is a solution that aims to assist users in pinpointing interesting aspects of data. Proposed by Parameswaran et al., it is a DBMS focused on providing the user with only the most interesting visualizations related to their query, similar to a visualization recommender engine. In a typical scenario, the analyst would use domain knowledge, intuition and exhaustive search to find interesting visualizations or relationships. In an example given by the authors, for a query about staplers, SeeDB should consider the entire space of possible visualizations and recommend the most interesting, such as increasing stapler sales despite a general trend of decreasing sales of other products [21].

SeeDB's innovation is using a heuristic to find aberrations in the data. It groups and aggregates an initial query's results to build what are called discriminating views, each of which is compared against a similar view of the initial ungrouped result; for example, the trend in stapler sales greatly differs from the trend in general sales. The deviation between these views is calculated with a distribution comparison function such as earth mover's distance, and a utility score is attributed to each view accordingly [21].

An interesting optimization used in SeeDB is taking advantage of the inexact nature of visualizations [21]. By using approximate results instead of exact ones when evaluating trends, the system can avoid extra latency [17, 21, 27].

Further improvements in scalability can be achieved by data indexing methods that adapt to the user's interests. Zoumpatianos et al. designed an adaptive indexing system that indexes data according to user querying behavior. At each executed query, the index is refined and subsequent query runtimes are reduced, both in simulated and real-world workflows [28].

Liu, Jiang and Heer have developed a system for real-time visual querying named imMens, focused on reducing latency for big data exploration. One of their methods takes an approach similar to precomputation of views, but instead precomputes projections corresponding to materialized database views at runtime, as needed; that is, it decomposes a data cube into a set of 3- or 4-dimensional projections. A second method implements a novel approach to binned aggregation, data representation and parallel processing that allows data to be loaded dynamically and enables scalable interactive visualizations in real time. In their research, they achieved nearly 50 frames per second while brushing and linking dozens of visualizations simultaneously [19].

Immersive displays

While CAVE excelled at 3D visualizations, display walls have better task performance for 2D visualizations. Reda et al. propose merging these solutions to form the "Hybrid Reality Environment" (HRE) known as CAVE2. By tiling stereoscopic displays, it is possible to retain the best of each of the prior solutions, allowing immersive virtual reality for 3D visualizations as well as the resolution for large 2D visualizations with a less demanding setup than CAVE. Novel features introduced in the new solution are head tracking and six-degree-of-freedom input devices. Head tracking allows a novel and intuitive form of interaction with data, by simply moving one's head to peek at different angles or viewpoints [25]. Research suggests that similar high-resolution displays can lead to faster location and comparison of targets with less frustration, and to increased user confidence in their findings. Virtual navigation such as panning and zooming in traditional setups appears to be responsible for a great amount of this frustration. In comparison, high-resolution displays encourage more physical navigation, which may contribute to the observed decrease in frustration [1].

Although CAVE2 has a lower cost than its predecessor, it remains a relatively expensive setup, requiring a dedicated, customized room, tracking hardware, projections and screens. Ponto et al. propose the DSCVR, built to a specification similar to that of CAVE2 but using exclusively commodity-grade hardware. Most notably, the user tracking component was replaced by the widely available Microsoft Kinect system. The resulting setup achieves comparable results with minor trade-offs such as distortions resulting from stereo image cross-talk and intrusive screen bezels, visible in Figure 3, neither of which is apparent in the CAVE2 system shown in Figure 2. The total cost is just over $40,000, or less than 5% of the cost of a CAVE2. While this is still not an accessible price point for most consumers, it could open the Hybrid Reality Environment market to small businesses [23].

Figure 2. Students using CAVE2 to interact with data. Reprinted from "Visualizing large, heterogeneous data in hybrid-reality environments." by Reda, K. et al. (2013) IEEE Computer Graphics and Applications 33.4, 38-48.

Figure 3. DSCVR System running a scene built using the Unity 3D game engine. Reprinted from "DSCVR: designing a commodity hybrid virtual reality system." by Ponto, K., Joe K. and Tredinnick, R. (2015) Virtual Reality 19.1, 57-70.

However, are immersive environments appropriate for manipulating large amounts of data? Recent studies have tested exactly which aspects of immersive environments affect user task performance and in what ways.

Bacim et al. researched the relationship between display fidelity, visual complexity, task scope and user spatial ability on immersive display systems. Results show that while overall user task performance improves progressively with display fidelity, visual complexity and task scope independently have an adverse effect on user performance speed. Tasks involving spatial search and finely grained distinction between objects greatly benefit from immersive environments, while tasks involving size comparison and sense of scale benefit very little, if at all [1].

Task performance on stereoscopic displays with head tracking was researched by Ragan et al. Participants were reported to make significantly fewer errors when using systems with head-tracked rendering. Users also had significantly faster task performance when using stereoscopic displays with head-tracked rendering enabled [24].

FUTURE DIRECTIONS

The feasibility of interactive visualization in Big Data is currently tied to the use of optimization techniques, especially due to latency. Heer and Kandel suggest that further improvements in data indexing, use of metadata and searching methods are an avenue for future research in data discovery [18].

Wu et al. suggest not only reducing the latency resulting from user interaction, but also masking high-latency queries from the user. Query cost analysis allows estimation of query runtimes [27], which, coupled with query steering towards more cost-effective views [5, 27], can be a valuable piece of information that lets users make the best use of their time [27].

Researchers envision an algorithm that automatically chooses the best visualization for a given dataset by applying machine learning techniques to existing visualizations. To further improve visualization performance, they suggest prefetching, using predictive algorithms to anticipate panning and zooming actions by the user as well as to predict general exploration trajectories and patterns [3, 17].

Rich contextual recommendations allow users to explore datasets more effectively. Virtual "tour guides" that assist in high-level analysis help users gain insights they might not have found on their own using only exhaustive search, domain knowledge and intuition, and they remain an active area of research [5, 27].

Heer and Kandel also suggest that the data discovery process itself should be studied and managed in a way that enables auditing, sharing and reuse [18], so we should consider how future visualizations might help us achieve these goals.

On the display front, low-cost Hybrid Reality Environments have achieved satisfactory quality and allow the entry of small businesses and research labs into the market [23]. Further development and research in applications for HREs is ripe with potential in both science and industry.

CONCLUSIONS

As Heer suggested, data sample sizes and dimensionalities continue to grow while our cognitive abilities remain relatively constant [18]. Virtual data exploration assistants seem to function as a prosthetic solution to augment our capacity to handle ever-growing
dimensionalities. Visualization should be coupled with the data exploration process to give us a better grasp of the most interesting characteristics of data.

Despite significant effort over the years to reduce latency through better interfaces, more efficient algorithms, aggregation, binning, sampling, filtering and suggesting alternate visualizations, latency remains a significant issue. It directly impacts the feasibility of big data visualization and exploration in workflows, so these tasks are fundamentally dependent on advances in latency-reducing optimizations.

Hybrid reality displays seem to be approaching a point at which they will become commonplace in research and analysis environments due to lower costs and greater flexibility. The literature has shown significantly fewer errors and greater speed when using immersive environments, although visual complexity and large task scope have a negative effect on performance. Further research and case studies on this topic are expected to become more common in the coming years.

REFERENCES

[1] Bacim, F., et al. (2013) "The effects of display fidelity, visual complexity, and task scope on spatial understanding of 3d graphs." Proceedings of Graphics Interface 2013. Canadian Information Processing Society.
[2] Ball, R., and Chris N. (2005) "Effects of tiled high-resolution display on basic visualization and navigation tasks." CHI'05 extended abstracts on Human factors in computing systems. ACM.
[3] Battle, L., Michael S., and Remco, C. (2013) "Dynamic reduction of query result sets for interactive visualization." Big Data, 2013 IEEE International Conference on. IEEE.
[4] Buja, A., et al. (1991) "Interactive data visualization using focusing and linking." Visualization, 1991. Visualization'91, Proceedings., IEEE Conference on. IEEE.
[5] Cetintemel, U., et al. (2013) "Query Steering for Interactive Data Exploration." CIDR.
[6] Chen, CL P., and Chun-Yang Z. (2014) "Data-intensive applications, challenges, techniques and technologies: A survey on Big Data." Information Sciences 275, 314-347.
[7] Choo, J., and Haesun P. (2013) "Customizing computational methods for visual analytics with big data." IEEE Computer Graphics and Applications 33.4, 22-28.
[8] Cox, N. J., and Kelvyn J. (1981) "Exploratory data analysis." Quantitative Geography, London: Routledge, 135-143.
[9] Diebold, F. X. (2003) "'Big Data' Dynamic factor models for macroeconomic measurement and forecasting." Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress of the Econometric Society (edited by M. Dewatripont, LP Hansen and S. Turnovsky).
[10] Fisher, D., et al. (2012) "Interactions with big data analytics." interactions 19.3, 50-59.
[11] Fisher, D. (2007) "Hotmap: Looking at geographic attention." IEEE Transactions on Visualization and Computer Graphics 13.6, 1184-1191.
[12] Geng, X., De-Chuan Z., and Zhi-Hua Z. (2005) "Supervised nonlinear dimensionality reduction for visualization and classification." IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 35.6, 1098-1107.
[13] Gray, W. D., and Boehm-Davis, D. A. (2000) "Milliseconds matter: An introduction to microstrategies and to their use in describing and predicting interactive behavior." Journal of Experimental Psychology: Applied 6.4, 322.
[14] Hinton, G. E., and Ruslan R. S. (2006) "Reducing the dimensionality of data with neural networks." Science 313.5786, 504-507.
[15] Hutchinson, T. C., and Kuester, F. (2004) "Hardware architecture for a visualization classroom: Vizclass." Computer Applications in Engineering Education 12.4, 232-241.
[16] Hutchinson, T. C., et al. (2005) "A hybrid reality environment and its application to the study of earthquake engineering." Virtual Reality 9.1, 17-33.
[17] Idreos, S., Papaemmanouil, O., and Chaudhuri, S. (2015) "Overview of data exploration techniques." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM.
[18] Heer, J., and Kandel, S. (2012) "Interactive analysis of Big Data." XRDS: Crossroads, the ACM Magazine for Students 19.1, 50-54.
[19] Liu, Z., Biye J., and Heer, J. (2013) "imMens: Real-time Visual Querying of Big Data." Computer Graphics Forum. Vol. 32. No. 3pt4. Blackwell Publishing Ltd.
[20] Michael, K., and Miller, K. W. (2013) "Big data: New opportunities and new challenges [guest editors' introduction]." Computer 46.6, 22-24.
[21] Parameswaran, A., Neoklis P., and Garcia-Molina, H. (2013) "Seedb: Visualizing database queries efficiently." Proceedings of the VLDB Endowment 7.4, 325-328.
[22] Pedrycz, W. (2007) "Granular computing: the emerging paradigm." Journal of Uncertain Systems 1.1, 38-61.
[23] Ponto, K., Joe K. and Tredinnick, R. (2015) "DSCVR: designing a commodity hybrid virtual reality system." Virtual Reality 19.1, 57-70.
[24] Ragan, E. D., et al. (2013) "Studying the effects of stereo, head tracking, and field of regard on a small-scale spatial judgment task." IEEE Transactions on Visualization and Computer Graphics 19.5, 886-896.
[25] Reda, K., et al. (2013) "Visualizing large, heterogeneous data in hybrid-reality environments." IEEE Computer Graphics and Applications 33.4, 38-48.
[26] Tresens, M. A., and Falko K. (2004) "Hybrid-Reality: Collaborative Biomedical Data Exploration Exploiting 2-D and 3-D." Medicine Meets Virtual Reality 12: Building a Better You: the Next Tools for Medical Education, Diagnosis, and Care 98, 22.
[27] Wu, E., Leilani B., and Samuel R. M. (2014) "The case for data visualization management systems: vision paper." Proceedings of the VLDB Endowment 7.10, 903-906.
[28] Zoumpatianos, K., Stratos I., and Themis P. (2014) "Indexing for interactive exploration of big data series." Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM.