High Dimensional - Visualizations KDD2001 Color
Institute for Visualization and Perception Research and AnVil Informatics, Inc.
University of Massachusetts Lowell
1 Department of Computer Science, University of Massachusetts Lowell, Lowell, MA 01854, USA. Email: {grinstein | mtrutsch | ucvek} @cs.uml.edu
2 AnVil Informatics, Inc., 600 Suffolk Street, Lowell, MA 01854, USA.

Abstract

In this paper we provide a brief background to data visualization and point to key references. We differentiate between high-dimensional data visualizations and high-dimensional visualizations and review the various high-dimensional visualization techniques. Our goal is to define metrics that identify how visualizations deal with n dimensions when displayed on the screen. We define intrinsic dimensionality metrics that assess these techniques and closely analyze selected high-dimensional visualizations’ display of data.

Keywords: visualization techniques overview, evaluation, high-dimensional data visualization, metrics

1 INTRODUCTION

A visualization is a visual representation of data. Data is mapped to some numerical form and translated into some graphical representation. The terms “high-dimensional data visualization” and “high-dimensional visualization” are often used interchangeably. However, a visualization of high-dimensional data is different from a high-dimensional visualization. In the first the term “high” refers to data whereas in the second it refers to visualization. This paper defines some simple metrics for high-dimensional visualization.

We assume the data is n-dimensional where n is an integer. In this paper we focus on high-dimensional data visualizations and more specifically visualizations that can present a large number of dimensions or parameters of the data. We attempt to identify what constitutes a high-dimensional visualization.

All visualizations basically still end up on a display surface (soft or hardcopy). There are a few 3D displays and much of what follows still applies to these. One interpretation therefore is that all visualizations project the n-dimensional data down to 2 dimensions. Although this is correct we wish to differentiate between the dimensionality of the physical medium (2 dimensions) and the logical representation of the data that may be higher. An example can be given by considering a 3D scatterplot. Here the data is n-dimensional, 3 axes are selected and laid out on the plane (the physical medium). The n-dimensional points are projected on the 2D surface. Hence this is a 3-dimensional visualization on the 2D surface. Note that we could also consider the dimensionality of the data represented. By using color and shape we could argue that a 3D scatterplot is a 5-dimensional representation of n-dimensional data on a 2D surface. In such a display there are perceptual ambiguities resulting from the occlusion of points. These can be resolved by providing various tools, including interactive ones. For example the user can rotate such a display to see hidden points.

We can thus classify visualizations based on the intrinsic dimensionality of the logical representation as well as its potential dimensionality by adding in additional data attributes. Since the additional data attributes can often be applied to most visualizations, we will only consider the intrinsic dimensionality.

2 VISUALIZATION BACKGROUND

Visualization is used increasingly in the data exploration process but still not to the extent possible. In its early years it was mostly, if not only, used to convey the results of statistical computation or data mining algorithms [7], [49], [10]. Over the last decade it has been used in the data massaging and cleansing process, and somewhat in the data management process. It is still not being used in the computational steering processes within the data exploration pipeline except for some research systems.

2.1 Visualization Taxonomies

There are numerous visualizations and a good number of valuable taxonomies [45].

Historically static displays, most of which have been extended to support probing and even more dynamic interactions, include histograms, scatterplots, and many of their extensions. These can be seen in most commercial graphics and statistical packages.

We focus on tables of numerical data (rows and columns) although many of the techniques apply to categorical data. Looking at the taxonomies the following stand out as high-dimensional visualizations:

• 2D and 3D scatterplots
• Matrix of scatterplots
• Heat maps
• Height maps
• Table lens
• Survey plots
• Iconographic displays
• Dimensional stacking (general logic diagrams)
• Parallel coordinates
• Line graph, multiple line graph
• Pixel techniques, circle segments
• Multi-dimensional scaling and Sammon plots
• Polar charts
• RadViz
• PolyViz
• Principal component and principal curve analysis
• Grand Tours
• Projection pursuit
• Kohonen self-organizing maps

Several of these are quite similar and related. We give a brief description and visualization for each, along with key references (see [23], [19], [13]). We use the Fisher Iris flower data set [15] or the car data set from the UC Irvine Machine Learning Repository whenever possible. The Iris flower data set contains 50 specimens from each of three species of Iris flowers: Iris setosa, I. versicolor, and I. virginica. The dimensions of the data set are sepal length, sepal width, petal length and petal width, measured in centimeters.

3 HIGH-DIMENSIONAL DATA VISUALIZATIONS

3.1 2D and 3D Scatterplots

A scatterplot is a point projection (usually affine) of the data into a 2D or 3D space represented on the screen in classic (X, Y) or (X, Y, Z) format. This is the most commonly utilized data visualization method. Numerous mappings or transformations can be applied to it. The displayed points can have numerous attributes such as color, size, shape, texture, motion and even sound (when interacted with). Interpreting a 3D projection requires interaction to resolve ambiguities, although other techniques, such as animation, have also been used. In its most general form this method is related to iconographic and pixel displays. Figure 3.1 displays the Iris Flower data set as 2D and 3D scatterplots.

3.2 Matrix of Scatterplots

A matrix of scatterplots is an array of scatterplots displaying all possible pairwise combinations of dimensions or coordinates. For n-dimensional data this yields n(n − 1)/2 scatterplots with shared scales, although most often n² scatterplots are displayed. The scatterplots can also be positioned in a non-array format (circular, hexagonal, etc.). One can visually link features of one scatterplot with features on another, which greatly increases its power.

This technique was in use long before its publication [3], [7].
Several variations on the theme of a matrix of scatterplots have
since been developed: the hyperslice [50], N-vision [14],
prosection [47], and hyperbox [2], to name a few. The hyperslice
is a matrix of panels where “slices” of a multivariate function are
shown at a certain focal point of interest. The method is similar to
N-vision, where the matrix panel accommodates interactive
exploration of a multivariate function. Prosection is a method
more suitable for data mining, since it does not project all points
onto the scatterplot matrix, but rather projects only points within a
certain range of each dimension, similar to brushing and dynamic
queries [1]. The hyperbox uses the same pairwise projections of
the data, but projects onto panels of an n-dimensional box. Each
of the panels has a different orientation and the dimensions can be
cut in order to show histograms on the panels, according to ranges
of the dimensions being cut.
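
The panel counts for a matrix of scatterplots (n(n − 1)/2 distinct pairs versus the full n × n layout) can be checked with a short sketch. The function name `scatterplot_pairs` is illustrative and not taken from any of the cited systems; the dimension names are those of the Iris data set.

```python
from itertools import combinations


def scatterplot_pairs(dimensions):
    """Return every unordered pair of dimensions, one per distinct scatterplot."""
    return list(combinations(dimensions, 2))


iris_dims = ["sepal length", "sepal width", "petal length", "petal width"]
pairs = scatterplot_pairs(iris_dims)

n = len(iris_dims)
print(len(pairs))   # 6, which equals n * (n - 1) / 2 for n = 4
print(n * n)        # 16 panels when the full n x n matrix is drawn
for x_dim, y_dim in pairs:
    print(f"{x_dim} vs. {y_dim}")
```

The full n × n layout repeats each pair twice (mirrored across the diagonal) and adds n diagonal panels, which is why it is larger than the number of distinct pairs.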
Figure 3.6: Table lens with selected rows of a sales data set
Source: Inxight Software, http://inxight.com
Figure 3.4: Heat map of the Iris data set
Figure 3.13: 2D and 3D Sammon Plots of the Iris data set
Figure 3.14: Polar line and polar glyph plot of the Iris data set
3.14 RadViz

RadViz is a display technique that places dimensional anchors (dimensions) around the perimeter of a circle [22]. Spring constants are utilized to represent relational values among points: one end of a spring is attached to a dimensional anchor, the other is attached to a data point. The values of each dimension are usually normalized to the range 0 to 1. Each data point is displayed at the point where the sum of all spring forces equals zero. The position of a data point depends largely on the arrangement of dimensions around the circle.

3.15 PolyViz

The PolyViz visualization extends the RadViz method with each of the dimensions anchored as a line, not just a point. Spring constants are utilized along the dimensional anchor (the line) that corresponds to all the values the dimension has. Each data point is positioned as in RadViz. The position of the point in the display depends, as in RadViz, on the arrangement of the dimensions. PolyViz provides more information than RadViz by giving insight into the distribution of the data for each dimension.

3.16 Principal Component and Principal Curves Analysis

Principal component analysis (PCA) is an analytic technique, often coupled with a visual representation, that identifies a lower dimensional space preserving variance (spread) in the data [24]. Numerous implementations exist, including neural networks [40], [9]. Self-organizing maps (described below) can produce a PCA. PCA does not handle non-linearity well since it identifies linear subspaces. If the data set is non-linear then extensions must be used.

Figure 3.17: 2D and 3D principal component analysis of the Iris data set
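
Returning to the RadViz placement rule of Section 3.14, the zero-net-force condition can be sketched in a few lines. The sketch assumes anchors spaced evenly on a unit circle and treats each normalized value as the spring constant of the spring to its anchor; both are assumptions made for illustration, and `radviz_position` is a hypothetical name. Setting the sum of forces v_j (a_j − p) to zero gives p as the value-weighted average of the anchor positions a_j.

```python
import math


def radviz_position(values):
    """Return the (x, y) equilibrium point for one record.

    `values` holds the record's dimensions, already normalized to [0, 1].
    Anchor j is assumed to sit at angle 2*pi*j/n on the unit circle.
    """
    n = len(values)
    anchors = [
        (math.cos(2 * math.pi * j / n), math.sin(2 * math.pi * j / n))
        for j in range(n)
    ]
    total = sum(values)
    if total == 0:
        # No spring pulls on the point; placing it at the center is a convention.
        return (0.0, 0.0)
    # Zero net force: sum_j v_j * (a_j - p) = 0  =>  p = sum_j v_j * a_j / sum_j v_j
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)


# Same values, different assignment to anchors: the plotted position changes.
print(radviz_position([1, 1, 0, 0]))  # pulled toward two adjacent anchors
print(radviz_position([1, 0, 1, 0]))  # pulled by two opposite anchors, lands near the center
```

The two calls differ only in which anchors receive the nonzero values, which illustrates why the arrangement of dimensions around the circle matters so much for the resulting picture.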
The self-organizing map (SOM) [29], [30], [31], [32] is a neural network algorithm that has been used to cluster in an unsupervised fashion and generate a visual representation of the clusters. SOMs both cluster and reduce the dimensionality of the data by projecting the clusters typically onto a 2-dimensional space. The Kohonen SOM is similar to a k-means clustering algorithm, extending it by providing a topological structure and placing similar objects in neighboring clusters. Numerous SOM algorithms and extensions have been developed in a multitude of fields which include engineering applications and neural networks (see [32], [38], [28], [51] and [53]).

We will consider two extreme cases: the set of n-dimensional unit vectors in ℜ^n, where one coordinate (dimension) is 1 and all others 0, and a set of n-dimensional binary vectors, where each coordinate is 1 or 0.

4.1 Intrinsic Dimension

Given an n-dimensional space, the intrinsic dimension (ID) of a visualization is defined to be the largest k, k ≤ n, for which a set of k unit vectors in that n-dimensional space can be uniquely identified (perceived) in the visualization.
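
One way to experiment with this definition is to render the n unit vectors and count how many end up at a position no other unit vector shares. This is only a rough operationalization: it assumes that "uniquely identified" means "not drawn on top of another unit vector" on a perfect display, and the two renderers below (`scatterplot_2d`, `pixel_display`) are simplified stand-ins invented for this sketch, not implementations from the paper.

```python
from collections import Counter


def unit_vectors(n):
    """The n unit vectors of an n-dimensional space, as tuples of 0s and 1s."""
    return [tuple(1 if i == j else 0 for j in range(n)) for i in range(n)]


def scatterplot_2d(vector, axes=(0, 1)):
    """Assumed renderer: keep only the two coordinates mapped to the X and Y axes."""
    return (vector[axes[0]], vector[axes[1]])


def pixel_display(vector):
    """Assumed renderer: every coordinate gets its own pixel, so nothing is lost."""
    return tuple(vector)


def count_identifiable(vectors, render):
    """Count vectors whose rendered position is not shared with any other vector."""
    positions = [render(v) for v in vectors]
    occupancy = Counter(positions)
    return sum(1 for p in positions if occupancy[p] == 1)


vectors = unit_vectors(10)
print(count_identifiable(vectors, scatterplot_2d))  # 2: the other eight collapse onto the origin
print(count_identifiable(vectors, pixel_display))   # 10: every unit vector stays distinct
```

In the scatterplot case only the two unit vectors aligned with the chosen axes remain apart; every other unit vector projects to the origin and cannot be told apart from its neighbors.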
5 ANALYSIS
There are a number of factors that can affect the result. Color, size
and shape of the points will make a difference. Perception is
dependent on the viewer and the environment. Screen resolution
and size have a significant bearing on the evaluation of intrinsic
dimensions since the metric involves perception of unique points
or values.
The intrinsic record ratio for the 10-dimensional binary dataset is therefore 4/1024 = 1/256 as a 2D scatterplot and 8/1024 = 1/128 as a 3D scatterplot.

The intrinsic coordinate dimension for the 10-dimensional and 100-dimensional data sets is 2.

Since we assume that we are dealing with a perfect display of unlimited resolution, these limitations do not affect the intrinsic dimension but will affect its perception.

In Figure 5.6 we display the 100 points (unit vectors). These are well distributed in the rendering space. The intrinsic dimension is 0 as points are not distinguishable. Here, too, the intrinsic record ratio is approximately 1, depending on the Sammon plot output and the number of points.

A pixel display of a 10-dimensional unit vector data set is shown in Figure 5.10.

The intrinsic dimension for this data set is 10 and for the 100 unit vectors it would be 100. The intrinsic record ratio is 1 and intrinsic coordinate dimension is d as each coordinate for a single record can be discerned.
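
The scatterplot record ratios quoted for the 10-dimensional binary data set (4/1024 and 8/1024) can be reproduced with a small sketch. It assumes that the intrinsic record ratio is the number of distinct rendered positions divided by the number of records, and that a scatterplot keeps only the coordinates mapped to its axes; `record_ratio` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product


def record_ratio(records, render):
    """Distinct rendered positions divided by the number of records (assumed definition)."""
    distinct_positions = {render(r) for r in records}
    return Fraction(len(distinct_positions), len(records))


# All 2**10 = 1024 binary vectors in 10 dimensions.
binary_10d = list(product((0, 1), repeat=10))

ratio_2d = record_ratio(binary_10d, lambda r: r[:2])  # 2D scatterplot on the first two axes
ratio_3d = record_ratio(binary_10d, lambda r: r[:3])  # 3D scatterplot on the first three axes

print(ratio_2d)  # 1/256
print(ratio_3d)  # 1/128
```

Only 2² = 4 (or 2³ = 8) positions exist once all but two (or three) binary coordinates are dropped, so most of the 1024 records are drawn on top of one another.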