High-Dimensional Visualizations

Georges Grinstein (1, 2), Marjan Trutschl (1), Urška Cvek (1)

(1) Institute for Visualization and Perception Research, Department of Computer Science, University of Massachusetts Lowell, Lowell, MA 01854, USA. Email: {grinstein | mtrutsch | ucvek}@cs.uml.edu
(2) AnVil Informatics, Inc., 600 Suffolk Street, Lowell, MA 01854, USA.

Abstract
In this paper we provide a brief background to data visualization and point to key references. We differentiate between high-dimensional data visualization and high-dimensional visualizations and review the various high-dimensional visualization techniques. Our goal is to define metrics that identify how visualizations deal with n dimensions when displayed on the screen. We define intrinsic dimensionality metrics that assess these techniques and closely analyze selected high-dimensional visualizations' display of data.

Keywords: visualization techniques overview, evaluation, high-dimensional data visualization, metrics

1 INTRODUCTION
A visualization is a visual representation of data. Data is mapped to some numerical form and translated into some graphical representation. The terms “high-dimensional data visualization” and “high-dimensional visualization” are often used interchangeably. However, a visualization of high-dimensional data is different from a high-dimensional visualization. In the first the term “high” refers to data whereas in the second it refers to visualization. This paper defines some simple metrics for high-dimensional visualization.

We assume the data is n-dimensional where n is an integer. In this paper we focus on high-dimensional data visualizations and more specifically visualizations that can present a large number of dimensions or parameters of the data. We attempt to identify what constitutes a high-dimensional visualization.

All visualizations basically still end up on a display surface (soft or hardcopy). There are a few 3D displays and much of what follows still applies to these. One interpretation therefore is that all visualizations project the n-dimensional data down to 2 dimensions. Although this is correct we wish to differentiate between the dimensionality of the physical medium (2 dimensions) and the logical representation of the data that may be higher. An example can be given by considering a 3D scatterplot. Here the data is n-dimensional, 3 axes are selected and laid out on the plane (the physical medium). The n-dimensional points are projected on the 2D surface. Hence this is a 3-dimensional visualization on the 2D surface. Note that we could also consider the dimensionality of the data represented. By using color and shape we could argue that a 3D scatterplot is a 5-dimensional representation of n-dimensional data on a 2D surface. In such a display there are perceptual ambiguities resulting from the occlusion of points. These can be resolved by providing various tools, including interactive ones. For example the user can rotate such a display to see hidden points.

We can thus classify visualizations based on the intrinsic dimensionality of the logical representation as well as its potential dimensionality by adding in additional data attributes. Since the additional data attributes can often be applied to most visualizations, we will only consider the intrinsic dimensionality.

2 VISUALIZATION BACKGROUND
Visualization is used increasingly in the data exploration process but still not to the extent possible. In its early years it was mostly, if not only, used to convey the results of statistical computation or data mining algorithms [7], [49], [10]. Over the last decade it has been used in the data massaging and cleansing process, and somewhat in the data management process. It is still not being used in the computational steering processes within the data exploration pipeline except for some research systems.

2.1 Visualization Taxonomies
There are numerous visualizations and a good number of valuable taxonomies [45].

Historically static displays, most of which have been extended to support probing and even more dynamic interactions, include histograms, scatterplots, and numerous extensions of them. These can be seen in most commercial graphics and statistical packages.

We focus on tables of numerical data (rows and columns) although many of the techniques apply to categorical data. Looking at the taxonomies the following stand out as high-dimensional visualizations:

• 2D and 3D scatterplots
• Matrix of scatterplots
• Heat maps
• Height maps
• Table lens
• Survey plots
• Iconographic displays
• Dimensional stacking (general logic diagrams)
• Parallel coordinates
• Line graph, multiple line graph
• Pixel techniques, circle segments
• Multi-dimensional scaling and Sammon plots
• Polar charts
• RadViz
• PolyViz
• Principal component and principal curve analysis
• Grand Tours
• Projection pursuit
• Kohonen self-organizing maps

Several of these are quite similar and related. We give a brief description and visualization for each, along with key references (see [23], [19], [13]). We use the Fisher Iris flower data set [15] or the car data set from the UC Irvine Machine Learning Repository, whenever possible. The Iris flower data set contains 50 specimens from each of the three species of Iris flowers: Iris setosa, I. versicolor, and I. virginica. The dimensions of the data set are sepal length, sepal width, petal length and petal width, measured in millimeters.

3 HIGH-DIMENSIONAL DATA VISUALIZATIONS

3.1 2D and 3D Scatterplots
A scatterplot is a point projection (usually affine) of the data into a 2D or 3D space represented on the screen in classic (X, Y) or (X, Y, Z) format. This is the most commonly utilized data visualization method. Numerous mappings or transformations can be applied to it. The displayed points can have numerous attributes such as color, size, shape, texture, motion and even sound (when interacted with). Interpreting a 3D projection requires interaction to resolve ambiguities, although other techniques (such as animation) have also been used. In its most general form this method is related to iconographic and pixel displays. Figure 3.1 displays the Iris flower data set as 2D and 3D scatterplots.

Figure 3.1: 2D and 3D scatterplots of the Iris data set

3.2 Matrix of Scatterplots
A matrix of scatterplots is an array of scatterplots displaying all possible pairwise combinations of dimensions or coordinates. For n-dimensional data this yields $n(n-1)/2$ scatterplots with shared scales, although most often $n^2$ scatterplots are displayed. The scatterplots can also be positioned in a non-array format (circular, hexagonal, etc.). One can visually link features of one scatterplot with features on another, which greatly increases its power.

Figure 3.2: Matrix of scatterplots

This technique was in use long before its publication [3], [7]. Several variations on the theme of a matrix of scatterplots have since been developed: the hyperslice [50], N-vision [14], prosection [47], hyperbox [2], just to name a few. The hyperslice is a matrix of panels where “slices” of a multivariate function are shown at a certain focal point of interest. The method is similar to N-vision, where the matrix panel accommodates interactive exploration of a multivariate function. Prosection is a method more suitable for data mining, since it does not project all points onto the scatterplot matrix, but rather projects only points within a certain range of each dimension, similar to brushing and dynamic queries [1]. The hyperbox uses the same pairwise projections of the data, but projects onto panels of an n-dimensional box. Each of the panels has a different orientation and the dimensions can be cut in order to show histograms on the panels, according to ranges of the dimensions being cut.

3.3 Heat Maps
This is an array of cells where each cell is colored based on some data value or function on the data. The method is a generalization of a scatterplot where the points are grid cells and each cell is colored. There are many named variants (clustered image map, heatmaps, patchgrid).
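
The cell-coloring step of a heat map can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `to_gray_levels`, the linear min-max ramp, the 0 to 255 gray scale and the sample values are all assumptions made for the example.

```python
def to_gray_levels(table):
    """Map each cell of a numeric table to a gray level in 0..255.

    Assumption: a linear min-max ramp over the whole table; real heat
    maps often normalize per column and use other color maps.
    """
    values = [v for row in table for v in row]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant table: every cell gets the same mid-gray level.
        return [[128 for _ in row] for row in table]
    return [[round(255 * (v - lo) / span) for v in row] for row in table]


# Two illustrative four-dimensional records (rows) of a small table.
cells = to_gray_levels([[5.1, 3.5, 1.4, 0.2],
                        [7.0, 3.2, 4.7, 1.4]])
```

The smallest value in the table maps to level 0 and the largest to level 255; every other cell falls in between, so the grid position carries the record and dimension while the gray level carries the value.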

Figure 3.3: Heat map of a random data set

Figure 3.4: Heat map of the Iris data set

3.4 Height Maps
A height map is a further extension of a heat map with the grid represented as a height field instead of by color. Making the cell size small can generate an almost continuous map. An example is ThemeView™ [54], where the topics or themes within a set of documents are shown as a relief map of natural terrain.

Figure 3.5: Height map of document themes
Source: Pacific Northwest National Laboratory

In Figure 3.5 the mountains indicate themes within the documents with the peak heights as the relative strengths of the topics. The layout of the themes depends on a similarity metric. This visualization is similar to self-organizing maps (SOMs), described below.

3.5 Table Lens
The table lens takes a spreadsheet and allows each cell to be displayed optionally using a line whose length depends on the numerical value of the cell and whose color can represent some other attribute of the data [42]. This provides for both a symbolic and graphical representation of data within a single table. This can be viewed as an intermediate view of data between a pure spreadsheet and a heat map where each item is represented as a number.

Figure 3.6: Table lens with selected rows of a sales data set
Source: Inxight Software, http://inxight.com

3.6 Survey Plots
A survey plot is a 2D or 3D point projection of the data [36] and generally consists of n rectangular areas, each representing one dimension in a data matrix. A point in a line graph (like a bar graph) is extended down to an axis. A line (or a rectangle, depending on the number of records and the size of the output area) is used to represent the data for each dimension, with its length proportional to the dimensional value it represents. The method gives insight into the correlation between any two variables (especially when the data is sorted by a dimension) and can find exact rules in a machine learning dataset.

Figure 3.7: Survey plot of atmospheric data
Source: Geophysical Institute, University of Alaska Fairbanks

3.7 Iconographic Displays
An iconographic display is a graphical representation visualizing high-dimensional data by letting each coordinate dimension of a record drive some parameter or attribute of an entity (pixel, icon or glyph) and displaying a number of these entities (records) at once on the screen. These displays integrate several dimensions at once and thus can represent high-dimensional data sets [3], [8], [41], [35].

There are two types of glyph and icon visualizations. The first are displays where certain dimensions of the n-dimensional data set are mapped to certain features of the glyph or icon. These include Chernoff faces [8], where data dimensions are mapped to facial features, and star glyphs (plots) [7], where the dimensions are represented as spokes at equal angles radiating from the center of a circle. The second type of glyph and icon visualizations have glyphs or icons packed together in a dense display, with textures representing features of the dataset [41]. Some other icon visualizations are shape coding [5], color icons [35], [27], [12] and tilebars [21].

Figure 3.8: An icon and an integrated iconographic display of 5 satellite images of the Great Lakes region

3.8 Dimensional Stacking (General Logic Diagrams)
Dimensional stacking is a 2D or 3D point projection of the data where dimensions are embedded within other dimensions. It was initially used only to visualize binary data [37]. The method was later extended to discrete categorical values and binned ordinal values, and used for general data exploration [52]. The stacking divides a 2D grid into sets of embedded rectangles, representing categorical dimensions or attributes of the data. Two outer dimensions are placed along the X and Y axes, and each additional pair of dimensions is embedded into the outer level rectangles, until all dimensions are incorporated.

Figure 3.9: Dimensional Stacking of the Iris data set

3.9 Parallel Coordinates
Parallel coordinates use parallel axes instead of perpendicular ones to represent dimensions of a multidimensional data set [25], [26]. A vertical line is used for the projection of each dimension or attribute, with the maximum and minimum values of each dimension usually scaled to the upper and lower boundaries on those vertical lines. A polyline made up of n − 1 line segments at the appropriate dimensional values connects the axes to represent an n-dimensional point.

Figure 3.10: Parallel coordinate display of the Iris data set

3.10 Line Graph, Multiple Line Graph
Line graphs display single-valued or piecewise continuous functions of one dimension. To accommodate multi-dimensional data sets, multiple line graphs are displayed in a multi-line graph. Often, the ordering of the data is correlated to one of the dimensions of the data, such as time. The dimensions are distinguished using different colored lines and/or different line styles (dashed, dotted).

Figure 3.11: Multiple line graph of the car data set

3.11 Pixel Techniques, Circle Segments
Pixel techniques represent a generalization of heat maps, extending them to very large multi-dimensional data sets. These visualizations arrange the data into an area, starting from some origin, according to the size and number of dimensions, using various techniques including recursive, spiral, and circle segments. The interpretation of the (X, Y) position of the cell depends on the mapping. In VisDB [27] the goal is to show similarities between attributes of the data. Various similarity functions may be used and their values represented as colors.

For circle segments each arc on the circle represents a data value of one dimension. Originally, the arc would represent many data values, one for each pixel in the arc, but variations now use straight lines.

Figure 3.12: Pixel display of an eight-dimensional data set of 1,000 records using 2D arrangement
Source: VisDB, [27]

3.12 Multi-Dimensional Scaling and Sammon Plots
An analytic or graphical representation that maps a data set into a space of lower dimensionality is considered a projection method. In most cases some invariants are preserved or closely preserved (such as distance). This is a classic technique, well over 50 years old [57], [48], [33], [11], [55], [56].

The goal of Multi-Dimensional Scaling (MDS) techniques is to identify meaningful underlying dimensions that could explain similarities or dissimilarities in the data. MDS typically attempts to preserve the distances between points. Most often the projection space is 2-dimensional. Other techniques attempt to preserve some degree of structure. The result is a 2D or 3D display in which points close to each other are close in the original n-dimensional space.

There are numerous variations and in all cases a dissimilarity matrix is built (based on the selected metric) with various cost functions and other parameters. Bentley and Ward presented extensions to MDS to enhance visualizations of high-dimensional data, such as animation, stochastic perturbation and flow visualization techniques [6]. The most frequently used variation of MDS is the Sammon plot, a non-linear MDS mapping [44].

Figure 3.13: 2D and 3D Sammon Plots of the Iris data set

3.13 Polar Charts
A polar chart is a circular graph for plotting polar coordinates. Polar coordinates map data onto a 2D surface using the angle and radius, creating a “wrap-around” version of a line graph. Polar charts overcome the limitation of line graphs, which are used only for displaying single-valued or piecewise continuous functions of one dimension. These can be considered circular representations of parallel coordinates and thus can reduce the limiting effect of a large number of dimensions. However, the size of the data point representations depends on the closeness to the center.

Figure 3.14: Polar line and polar glyph plot of the Iris data set
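
The angle-and-radius mapping of a polar chart can be sketched in a few lines. The sketch assumes one common convention that the paper does not spell out: the i-th dimension of a record is drawn at angle 2πi/n and its (already normalized) value is used as the radius. The function name `polar_vertices` and the `max_radius` parameter are hypothetical.

```python
import math


def polar_vertices(record, max_radius=1.0):
    """Return the (x, y) vertex for each dimension of one record.

    Assumptions: values are already normalized to [0, 1]; dimension i is
    placed at angle 2*pi*i/n and its value is used as the radius.
    """
    n = len(record)
    points = []
    for i, value in enumerate(record):
        angle = 2 * math.pi * i / n
        r = value * max_radius
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points


# One four-dimensional record; connecting the vertices gives its polar polyline.
vertices = polar_vertices([1.0, 0.5, 0.0, 0.5])
```

Values near zero all land close to the origin regardless of their angle, which illustrates why representations crowd together near the center of the chart.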

3.14 RadViz
RadViz is a display technique that places dimensional anchors (dimensions) around the perimeter of a circle [22]. Spring constants are utilized to represent relational values among points: one end of a spring is attached to a dimensional anchor, the other is attached to a data point. The values of each dimension are usually normalized to the range 0 to 1. Each data point is displayed at the point where the sum of all spring forces equals zero. The position of a data point depends largely on the arrangement of dimensions around the circle.

Figure 3.15: RadViz visualization of the Iris data set

3.15 PolyViz
The PolyViz visualization extends the RadViz method with each of the dimensions anchored as a line, not just a point. Spring constants are utilized along the dimensional anchor (the line) that corresponds to all the values the dimension has. Each data point is positioned as in RadViz. The position of the point in the display depends, as in RadViz, on the arrangement of the dimensions. PolyViz provides more information than RadViz by giving insight into the distribution of the data for each dimension.

Figure 3.16: PolyViz visualization of the Iris data set

3.16 Principal Component and Principal Curves Analysis
Principal component analysis (PCA) is an analytic technique often coupled with a visual representation that identifies a lower dimensional space preserving variance (spread) in the data [24]. Numerous implementations exist, including neural networks [40], [9]. Self-organizing Maps (described below) can produce a PCA. PCA does not handle non-linearity well since it identifies linear subspaces. If the data set is non-linear then extensions must be used.

Figure 3.17: 2D and 3D principal component analysis of the Iris data set

Principal curves analysis [20] identifies smooth curves which represent the mean of all projected data points [39], [43], generalizing linear principal component analysis.

3.17 Grand Tour and Projection Pursuit
Projections of the data using a scatterplot matrix (or any other static representation of data) do not necessarily guarantee the best insight into the data. The most insight might be gained by some projection that allows a linear discrimination of two or more classes of data. In the grand tour method [4], sequences of 2D or 3D projections are displayed. The grand tour is most often applied to a single 2D or 3D scatterplot with the coordinate axes, moving through a sequence of projections that cover almost all of the n-dimensional space. In the classic grand tour a step and space-filling curve are defined. A plane is moved along this curve and the data projected.
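
The idea of viewing the data through a sequence of projection planes can be sketched as below. This is a deliberate simplification and not the classic grand tour of [4]: it draws independent random planes instead of following a space-filling curve, and it does not interpolate between successive planes. The function names `random_frame`, `project` and `tour` are hypothetical.

```python
import math
import random


def random_frame(n, rng):
    """Return two orthonormal n-vectors spanning a random 2D projection plane."""
    u = [rng.gauss(0, 1) for _ in range(n)]
    v = [rng.gauss(0, 1) for _ in range(n)]
    norm_u = math.sqrt(sum(x * x for x in u))
    u = [x / norm_u for x in u]
    # Gram-Schmidt: remove the component of v along u, then normalize.
    dot = sum(a * b for a, b in zip(u, v))
    v = [b - dot * a for a, b in zip(u, v)]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    return u, v


def project(points, frame):
    """Project each n-dimensional point onto the plane spanned by the frame."""
    u, v = frame
    return [(sum(p * a for p, a in zip(pt, u)),
             sum(p * b for p, b in zip(pt, v))) for pt in points]


def tour(points, steps, seed=0):
    """Return `steps` successive 2D views of the points (one per random plane)."""
    rng = random.Random(seed)
    n = len(points[0])
    return [project(points, random_frame(n, rng)) for _ in range(steps)]


# Five 2D views of three 4-dimensional unit vectors.
views = tour([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], steps=5)
```

Each element of `views` is one scatterplot's worth of 2D coordinates; a real implementation would animate smoothly between planes so that the viewer can track points from one projection to the next.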
The grand tour can be interpreted as an unguided exploratory projection pursuit. After a particular goal is identified, a guided projection pursuit is utilized. This produces projections of the data where a particular goal drives the projections, such as discrimination of two data classes. Linear projections are selected which attempt to identify and bring out the data deviating from normal distribution as much as possible. Projection pursuit can handle some non-linearity but it too is not general enough [16], [17]. Depending on the utilized display techniques and when a useful projection is found, it is not always clear how to extract useful information from the linear combinations of dimensions.

3.18 Kohonen Self-Organizing Maps (SOM)
The Self-Organizing Map (SOM) combines an analytic and graphical technique to group data in order to reduce its size. It is a summarization technique that attempts to reduce the complexity of the data set by displaying clusters of the data in a grid.

The self-organizing map (SOM) [29], [30], [31], [32] is a neural network algorithm that has been used to cluster in an unsupervised fashion and generate a visual representation of the clusters. SOMs both cluster and reduce the dimensionality of the data by projecting the clusters typically onto a 2-dimensional space. The Kohonen SOM is similar to a k-means clustering algorithm, extending it by providing a topological structure and placing similar objects in neighboring clusters. Numerous SOM algorithms and extensions have been developed in a multitude of fields which include engineering applications and neural networks (see [32], [38], [28], [51] and [53]).

Figure 3.18: Self-organizing map of the Iris data set

3.19 Remarks
There are many systems incorporating a number of the techniques described above. Along with traditional static sorts of displays such as histograms, scatterplots, and parallel coordinates, most software packages provide interactive and dynamic querying of data. Currently, most PC or workstation-based tools are used to view multivariate data. These tools display 3D graphics on a traditional computer monitor. However, extensions using virtual reality devices offer the capability to display graphics in stereoscopic 3D, allowing the user to better perceive depth information. To date, there have been few commercial virtual reality data exploration environments.

A number of interactive techniques can also be provided to alter each of these visualizations. For example display transformations such as hyperbolic mappings and other distortion mappings can be applied to the resulting images to provide non-linear expansions of the data ([46], [18], [2], [34], [42]).

4 INTRINSIC DIMENSIONALITY
We now define intrinsic dimensionality precisely. The goal is to define metrics that identify how visualizations deal with n dimensions when displayed on the screen. The main problems are that points may overlap and that coordinate data may be lost in the projection. With probing one can get all the coordinate values of a single point.

We will consider two extreme cases: the set of n-dimensional unit vectors in $\mathbb{R}^n$, where one coordinate (dimension) is 1 and all others 0, and the set of n-dimensional binary vectors, where each coordinate is 1 or 0.

4.1 Intrinsic Dimension
Given an n-dimensional space, the intrinsic dimension (ID) of a visualization is defined to be the largest k, k ≤ n, for which a set of k unit vectors in that n-dimensional space can be uniquely identified (perceived) in the visualization.

The intrinsic dimension of a 2D scatterplot is 2: the n unit vectors project to 3 points, (1, 0), (0, 1) and (0, 0), and only the first two each come from a single unit vector.

4.2 Intrinsic Record Ratio
Given an n-dimensional space, the intrinsic record ratio (IRR) of a visualization is defined to be $k/2^n$, where k is the largest number of the $2^n$ binary vectors (all coordinates 0 or 1) in that n-dimensional space that can be uniquely identified (perceived) in the visualization. It represents the fraction of records that can be distinguished, if one had reasonably distributed records. We can more precisely define this ratio using Monte Carlo techniques.

We have $2^n$ points (binary vectors) that represent the values $[0, \ldots, 2^n - 1]$. If all are discernible then the intrinsic record ratio is 1. In a 2D scatterplot the $2^n$ binary vectors project to 4 points, (0, 0), (0, 1), (1, 0) and (1, 1), and the intrinsic record ratio is $4/2^n$. As n gets large, the intrinsic record ratio decreases and approaches 0.

4.3 Intrinsic Coordinate Dimension
Given an n-dimensional space, the intrinsic coordinate dimension (ICD) of a visualization is defined to be the largest k, k ≤ n, for which k coordinates of any vector in that n-dimensional space can be uniquely identified in the visualization.

The intrinsic dimension of a 2D scatterplot is 2 and its intrinsic
coordinate dimension is 2. The intrinsic dimension for the 3D
scatterplot is 3 whereas its intrinsic coordinate dimension is 2 (a
point on the screen may be the image of any of the 3D points lying
along one line). Note that in many cases the intrinsic coordinate
dimension is smaller than the intrinsic dimension: for the intrinsic
dimension we know the vectors being examined, whereas for the intrinsic
coordinate dimension we look to identify coordinates of any vector.
Using transformations such as rotate, pan and zoom, one can often
increase the number of coordinates determined.
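
The scatterplot values quoted above can be checked with a short sketch. It rests on two assumptions that go beyond the text: the scatterplot is taken to be axis-aligned and to keep simply the first d coordinates (the paper does not say which axes are selected), and the record ratio counts distinct projected points, which is how the $4/2^n$ example appears to be computed. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def unit_vectors(n):
    """The n unit vectors of an n-dimensional space."""
    return [tuple(1 if i == j else 0 for j in range(n)) for i in range(n)]


def binary_vectors(n):
    """All 2**n binary vectors (keep n small: the list grows exponentially)."""
    return list(product((0, 1), repeat=n))


def project(vectors, d):
    """Assumed axis-aligned scatterplot: keep the first d coordinates."""
    return [v[:d] for v in vectors]


def intrinsic_dimension(n, d):
    """Count unit vectors whose projected point is shared with no other unit vector."""
    counts = Counter(project(unit_vectors(n), d))
    return sum(1 for c in counts.values() if c == 1)


def intrinsic_record_ratio(n, d):
    """Distinct projected points divided by the 2**n binary vectors."""
    distinct = set(project(binary_vectors(n), d))
    return len(distinct) / 2 ** n


print(intrinsic_dimension(10, 2), intrinsic_dimension(10, 3))        # 2 3
print(intrinsic_record_ratio(10, 2), intrinsic_record_ratio(10, 3))  # 0.00390625 0.0078125
```

For n = 100 the unit-vector count is unchanged (2 and 3), since every additional unit vector also projects to the origin.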

5 ANALYSIS
There are a number of factors that can affect the result. Color, size and shape of the points will make a difference. Perception is dependent on the viewer and the environment. Screen resolution and size have a significant bearing on the evaluation of intrinsic dimensions since the metric involves perception of unique points or values.

For more than a certain number of records or dimensions the screen/dot ratio becomes the limiting factor. In all visualizations it is either the linear dimensions of the screen (e.g., the axes in a scatterplot) or the surface dimensions (e.g., the points in a scatterplot) that limit the perception of data.

In order to avoid all these perceptual problems and issues we look to the first two definitions as theoretical. That is, they describe the best that one could do with an arbitrarily large screen of infinite resolution. We look to the last definition to handle perceptual issues by permitting interaction to resolve size problems. The intrinsic dimension and coordinate dimension require being able to pull out coordinate interpretations whereas the intrinsic record ratio pulls out records. In all cases we assume that the selected color and shape of the points or tick marks is reasonable.

We now look at some examples of the visualizations described above and their intrinsic dimensions. In order to get a sense of the high dimensionality of the various visualizations we group the different visualizations into three classes as follows:

1. 10 - 100 intrinsic dimensions
2. 100 - 1000 intrinsic dimensions
3. 1000 or more intrinsic dimensions.

In this paper, we show several sample visualizations and their intrinsic properties for both 10- and 100-dimensional spaces (in some cases) and discuss the 1000-dimensional case.

5.1 2D and 3D Scatterplot
We project 10 and 100 unit vectors in 2D and 3D to produce scatterplots of 10- and 100-dimensional space (Figure 5.1). The scatterplots for the 100-dimensional unit and binary vectors are identical to the scatterplots of the 10-dimensional unit and binary vectors, respectively.

Figure 5.1: 2D and 3D scatterplots of the 10-dimensional and 100-dimensional unit vectors

Only two and three data records, respectively, are uniquely identifiable. Thus the intrinsic dimension is 2 for a 2D scatterplot and 3 for the 3D scatterplot.

Figure 5.2: 2D scatterplot of the 10-dimensional binary vectors

Figure 5.3: 3D scatterplot of the 10-dimensional binary vectors

The intrinsic record ratio for the 10-dimensional binary dataset is therefore 4/1024 = 1/256 as a 2D scatterplot and 8/1024 = 1/128 as a 3D scatterplot.

The intrinsic coordinate dimension for the 10-dimensional and 100-dimensional data sets is 2.

5.2 2D and 3D Sammon Plot
The Sammon plot representation of the 10- and 100-dimensional unit vectors is displayed in Figure 5.4 through Figure 5.6.

While the points are all visible, it is impossible to identify the values associated with them. Therefore, the intrinsic dimensions for both datasets are 0.

Figure 5.4: 2D Sammon plot of the 10-dimensional unit vector data set

Figure 5.5: 3D Sammon plot of the 10-dimensional unit vector data set

In Figure 5.6 we display the 100 points (unit vectors). These are well distributed in the rendering space. The intrinsic dimension is 0 as points are not distinguishable. Here, too, the intrinsic record ratio is approximately 1, depending on the Sammon plot output and the number of points.

Figure 5.6: 2D and 3D Sammon plot of the 100-dimensional unit vector data set
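
For reference, Sammon's mapping [44] is usually stated as minimizing the stress below. The symbols are notation introduced here rather than taken from the paper: $d^{*}_{ij}$ is the distance between records i and j in the original n-dimensional space and $d_{ij}$ is the distance between their images in the 2D or 3D layout.

```latex
E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{\left(d^{*}_{ij} - d_{ij}\right)^{2}}{d^{*}_{ij}}
```

For the unit vector test set with Euclidean distance, every pair of distinct records satisfies $d^{*}_{ij} = \sqrt{2}$, so the layout can only spread the points out evenly and carries no information about which coordinate of a record equals 1. This is consistent with the intrinsic dimension of 0 reported for the Sammon plots.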

Thus we find that the ID and IRR are dimension independent.


Figure 5.8: Parallel coordinates of the 10- and 100-dimensional
unit vectors

Figure 5.7: 2D and 3D Sammon plot of the 10-dimensional


binary vectors

When we display the 1024 10-dimensional points (binary vectors)


using the Sammon plot (Figure 5.7), we cannot easily determine
the number of points in the display and so the intrinsic record ratio
cannot be precisely determined visually. Estimate yields an
intrinsic record ratio of ≈ 0.2 (2D) and ≈ 0.9 (3D). Note that
repeated application of the Sammon plot algorithm may yield
different intrinsic record ratios.

5.3 Parallel Coordinates


Parallel coordinates representing the 10- and the 100-dimensional unit vector datasets are displayed in Figure 5.8. The specific unit vector (polyline) is identifiable by the coordinate with value equal to 1. The intrinsic dimensions thus are respectively 10 and 100.

Since we assume that we are dealing with a perfect display of unlimited resolution, these limitations do not affect the intrinsic dimension but will affect its perception.

The intrinsic record ratio under perfect conditions (unlimited resolution) is 0, as it is not possible to identify a single unique point in the display and thus we cannot determine the number of points. The intrinsic coordinate dimension is equal to the number of dimensions in the data set, as we can uniquely identify each of the coordinate values.

Figure 5.9: Parallel coordinates of the 10-dimensional binary vectors

5.4 Pixel Display

A pixel display of a 10-dimensional unit vector data set is shown in Figure 5.10.

Figure 5.10: Pixel display of the 10-dimensional unit vector data set

The intrinsic dimension is 10 as one can identify each coordinate directly from the multiple grids. For the 10-dimensional binary vectors data set, the pixel display would consist of 10 rectangles, each containing 2^n cells. Each record is uniquely identifiable and the intrinsic record ratio is 1.0. The intrinsic coordinate dimension is not precisely determinable as the coordinate value is represented by a color, which depends on the color map as well as the viewer's perceptual capabilities.

5.5 RadViz

RadViz displays of the 10- and 100-dimensional unit vector data sets are shown in Figure 5.11, followed by a display of the 10-dimensional binary vectors data set (Figure 5.12).

Figure 5.11: 10- and 100-dimensional unit vector data sets rendered using the RadViz algorithm

Figure 5.12: RadViz display of the 10-dimensional binary vectors data set

The intrinsic dimension is 10 and 100 respectively. The intrinsic record ratio is 1 and the intrinsic coordinate dimension is not determinable in general if the point is not on the boundary of the circle.

5.6 PolyViz

A PolyViz display of the 10-dimensional unit vector data set is shown in Figure 5.13.

Figure 5.13: PolyViz display of the 10-dimensional unit vectors data set colored by the first dimension

The intrinsic dimension for this data set is 10 and for the 100-dimensional unit vectors it would be 100. The intrinsic record ratio is 1 and the intrinsic coordinate dimension is d as each coordinate for a single record can be discerned.

5.7 Kohonen Self-Organizing Map (SOM)

Figure 5.14 and Figure 5.15 display a SOM of an arbitrary size for the 10- and 100-dimensional unit vector data sets.

The intrinsic dimension is 0.

Figure 5.14: 10x10 SOM of the 10-dimensional unit vectors
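
The record ratios quoted for the projections in this section can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation. It assumes that the intrinsic record ratio can be approximated as the number of distinct display positions divided by the number of records, and it uses the textbook RadViz placement (anchors evenly spaced on the unit circle, each record drawn at the coordinate-weighted average of the anchors). The function names and the rounding tolerance are hypothetical choices made for this example.

```python
import itertools
import math


def unit_vectors(d):
    """Return the d unit vectors of length d, one record per list."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def binary_vectors(d):
    """Return all 2**d binary vectors of length d, one record per list."""
    return [[float(b) for b in bits] for bits in itertools.product((0, 1), repeat=d)]


def radviz_project(records):
    """Textbook RadViz placement (an assumption, see the lead-in).

    Anchor j sits at angle 2*pi*j/d on the unit circle; a record is drawn
    at the average of the anchors weighted by its coordinate values.
    """
    d = len(records[0])
    anchors = [(math.cos(2 * math.pi * j / d), math.sin(2 * math.pi * j / d))
               for j in range(d)]
    points = []
    for rec in records:
        total = sum(rec)
        if total == 0:
            # An all-zero record exerts no pull and stays at the centre.
            points.append((0.0, 0.0))
            continue
        x = sum(v * ax for v, (ax, _) in zip(rec, anchors)) / total
        y = sum(v * ay for v, (_, ay) in zip(rec, anchors)) / total
        points.append((x, y))
    return points


def scatterplot_project(records, i=0, j=1):
    """2D scatterplot: keep only dimensions i and j of every record."""
    return [(rec[i], rec[j]) for rec in records]


def record_ratio(points, decimals=9):
    """Distinct display positions divided by the number of records.

    Rounding stands in for an ideal display; adding 0.0 folds -0.0 into 0.0.
    """
    distinct = {tuple(round(c, decimals) + 0.0 for c in p) for p in points}
    return len(distinct) / len(points)


if __name__ == "__main__":
    for d in (10, 100):
        # Every unit vector lands exactly on its own anchor: ratio 1.0.
        print(d, record_ratio(radviz_project(unit_vectors(d))))
    # A 2D scatterplot of the binary vectors shows only 4 of the 2**10 records.
    print(record_ratio(scatterplot_project(binary_vectors(10))))
```

Under these assumptions the unit vector data sets give a RadViz record ratio of 1.0, and the 2D scatterplot of the 10-dimensional binary vectors gives 4 distinct points out of 2^10 records. Techniques whose output depends on perceived separation, such as Sammon plots, would need a perceptual threshold in place of the fixed rounding used here.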
Figure 5.15: 10x10 SOM of the 100-dimensional unit vectors

Looking at Figure 5.16 we find that the intrinsic record ratio is 1.0 if the number of grids is large enough and that the intrinsic coordinate dimension is 0.

Figure 5.16: 10x10 SOM of the 10-dimensional binary vectors

6 SUMMARY

These visualizations are just a few of many possible examples. Table 1 provides a summary of intrinsic properties for the visualizations discussed above. Both the 10- and the 100-dimensional unit vector datasets were used for this task. Since an ideal display (of unlimited size and resolution) is used, there is no difference between the 10- and the 100-dimensional dataset.

Visualization      Intrinsic Dim.   Intrinsic Record Ratio   Intrinsic Coord. Dim.
2D Scatterplot     2                4/2^d                    2
3D Scatterplot     3                8/2^d                    2
2D Sammon Plot     0                ≈ 0.2                    0
3D Sammon Plot     0                ≈ 0.9                    0
Parallel Coord.    d                0.0                      d
Pixel Display      d                1.0                      Indeterminate
RadViz             d                1.0                      Indeterminate
PolyViz            d                1.0                      d
SOM                0                1.0                      0

Table 1: A summary of intrinsic properties for selected visualizations
d = dimensionality of the data set

It is clear that some of the computations of the intrinsic record ratio (IRR) require a precise determination of the number of distinguishable points; this applies to both Sammon plots (and to other visualization techniques not listed). Determining perceived separation automatically, using Monte Carlo techniques, is necessary.

These definitions are a first attempt to identify intrinsic metrics for high-dimensional visualizations. We see that several visualizations deal with high dimensions quite well. These include Pixel Displays, RadViz and PolyViz. Realistically, the limitations of screen resolution and color perception do have a bearing. These problems can be resolved through multiple linked visualizations or with interactions and tools that increase the intrinsic coordinate dimensions.

Acknowledgements: We thank Dr. Patrick Hoffman for his detailed critique and help in generating some of the images.
