DVT Unit-II
4.1 Introduction
The amount of data collected by organizations is constantly growing due to better data
collection methods, computerized transactions, and advancements in storage
technology. This growth results in high-dimensional data sets with many recorded
attributes. Large-scale information banks, like data warehouses, contain
interconnected data from various sources. New technologies, such as genomic and
proteomic technologies and sensor-based monitoring systems, contribute to these data
sets. To gain valuable insights, it is essential to visualize the data's structure and
identify patterns and relationships. This text focuses on visual exploration of data
through graphs, particularly addressing issues with categorical data and introducing a
mathematical framework for optimizing graph layouts.
Graphs are important tools as they show relationships between groups of objects.
They model complex systems, like computer networks, transportation systems, and
molecular structures, and help visualize relationships in social networks and database
diagrams. In statistics and data analysis, graphs appear as dendrograms, trees, and
path diagrams. They are also interesting in mathematics, with many studies focused
on their properties. Different visualizations can uncover hidden patterns and
relationships within the data.
The literature includes various sources on how to draw graphs and the problems
related to graph representation in different dimensions. A key interest is how data sets
can be represented through graphs, linking multivariate statistics and graph drawing.
Examples include the representation of a protein interaction network and the structure
of contingency tables and correlation matrices, where nodes represent categories or
variables, and lines indicate frequencies or correlations.
Another interesting data structure that can be represented successfully by a graph is
that corresponding to a multivariate categorical data set, as the following example
attests (Table 4.3). The data on 21 sleeping bags and their characteristics come from
Prediger (1997) and have also been discussed in Michailidis and de Leeuw (2001).
4.3 Graph Layout Techniques
Layout algorithms often follow aesthetic rules, which help in creating visually
appealing graphs. Common rules include evenly distributing nodes and edges,
keeping edge lengths similar, and minimizing edge crossings. These rules can lead to
optimization problems, some of which are difficult to solve, like the edge-crossing
minimization, which is known to be NP-hard. Basic layouts are usually generated
quickly and then refined to meet these aesthetic criteria, particularly useful for large
graphs. Graph drawing systems such as Nicheworks, GVF, and H3Viewer use these
strategies, with tools like Cytoscape allowing for manual adjustments.
The general goal is to represent graph vertices as points and edges as lines. Two main
approaches exist for creating these drawings: the metric or embedding approach,
which focuses on the distances between vertices, and the adjacency model, which
emphasizes the relationships between adjacent vertices. This paper primarily
discusses the adjacency model and how to measure the quality of the resulting graph
layout.
The n × p matrix X contains the coordinates of n vertices in R^p, with d_ij(X) denoting the distance
between points x_i and x_j. The weights a_ij come from the adjacency matrix A of graph G, while the
pushing weights B = {b_ij} can come from the adjacency matrix or from external constraints. The
functions ϕ(·) and ψ(·) apply aesthetic considerations to the layout: a convex ϕ stretches large
distances to highlight unique data features, while a concave one reduces the impact of isolated
vertices. This framework supports both simple and weighted graphs and underlies the force-directed
technique discussed by Di Battista et al. (1998).
The constraint term in the Lagrangian relates to the push component of Q(·).
Examples of η(X) include η(X) = trace(X′X) or η(X) = det(X′X). Other options
involve ensuring orthonormality or fixing some Xs. This method allows integrating
the metric approach in graph drawing, aiming to minimize the difference between
graph-theoretic and Euclidean distances.
The method involves creating a K-nearest neighbor graph for the data points and
calculating shortest path distances, followed by applying MDS. An example using
Swiss Roll data shows how the Isomap algorithm better captures the underlying
geometry than standard MDS, highlighting differences in how points are arranged
based on their progression on the roll.
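The pipeline just described (build a K-nearest-neighbor graph, replace Euclidean by shortest-path distances, then apply classical MDS) can be sketched in a few lines of NumPy. This is a minimal illustration of the Isomap idea, not the published implementation; the function names and the dense Floyd-Warshall step are our own simplifications and suit only small data sets.

```python
import numpy as np

def knn_graph_distances(X, k):
    """Geodesic distance matrix: build a symmetric K-nearest-neighbor
    graph, then run Floyd-Warshall shortest paths over its edges."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:    # k nearest neighbors of i
            G[i, j] = G[j, i] = D[i, j]
    for m in range(n):                          # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

def classical_mds(G, dim=2):
    """Embed a distance matrix with classical (Torgerson) MDS."""
    n = len(G)
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (G ** 2) @ J                # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]            # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

For points lying on a curved manifold such as the Swiss Roll, the shortest-path distances approximate geodesics along the manifold, which is what lets this embedding unroll the structure.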
In this model, the similarity of nodes is key. A simple graph considers only
connections, while a weighted graph gives more importance to edges with large
weights. The normalization constraint keeps points separate and prevents them from
collapsing to the origin. This model has been studied with various distance functions,
including squared Euclidean distances, which is significant for data visualization.
Some algebra shows that the objective function can be expressed in matrix form.
The graph Laplacian, expressed as L = D−A, involves a diagonal matrix D made from
the row sums of the adjacency matrix A. Minimizing a certain function leads to nodes
with many connections clustering together, while those with few connections stay on
the edges of the layout. In weighted graphs, stronger weights indicate closer bonds
between nodes, enhancing clustering. A helpful normalization constraint is X′DX =
Ip, making minimization easier and linking it to a generalized eigenvalue problem.
The solution is related to weighted Euclidean space. For example, in a protein
interactions network, proteins with few interactions are on the outskirts, whereas
central 'hub' proteins are located in the middle. Another example from the UCI
machine learning repository features handwritten numerals, where traditional methods
struggle with class separation. Additionally, large graphs from testing algorithms were
examined, illustrating structures with varying densities and holes through weighted
graphs based on nearest neighbors.
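The spectral layout described above can be sketched as follows (our own minimal code, assuming a connected simple graph stored as a dense NumPy adjacency matrix): minimizing the Laplacian quadratic form under the constraint X′DX = I_p reduces to a generalized eigenvalue problem, solved here through the symmetrically normalized Laplacian.

```python
import numpy as np

def laplacian_layout(A, dim=2):
    """Spectral layout under the constraint X'DX = I: solve the
    generalized eigenproblem L v = lambda D v with L = D - A by
    symmetrizing to D^{-1/2} L D^{-1/2} and keeping the eigenvectors
    of the smallest nonzero eigenvalues."""
    d = A.sum(axis=1)                     # degrees: row sums of A
    L = np.diag(d) - A                    # graph Laplacian L = D - A
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    w, U = np.linalg.eigh(Dinv_sqrt @ L @ Dinv_sqrt)  # ascending eigenvalues
    V = Dinv_sqrt @ U                     # back-transform: v = D^{-1/2} u
    return V[:, 1:dim + 1]                # drop the trivial constant vector
```

In such a layout, highly connected nodes cluster near the center while sparsely connected nodes drift to the periphery, matching the protein-network behavior described above.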
The graph representation of a contingency table and categorical data has unique
features, where the node set can be divided into two subsets. For a contingency table,
one subset includes the categories of one variable, and the other subset includes the
categories of the second variable. Connections only exist between these two subsets.
This also applies to categorical data, where one subset relates to objects, like sleeping
bags, and the other to variable categories. This leads to bipartite graphs, and a
modification of the Q(·) objective function creates interesting graph layouts for these
data sets. The objective function for squared Euclidean distances can be expressed
using the coordinates of both subsets.
The optimal solution follows the centroid principle, which states that category points
are at the center of gravity of their objects. This graph-drawing solution is known in
multivariate analysis as correspondence analysis for contingency tables and multiple
correspondence analysis for categorical data. The graph layout of the sleeping bags
data set shows patterns: high-quality, expensive sleeping bags filled with down and
cheaper, low-quality ones filled with synthetic fibers, along with some intermediate
options. The centroid principle helps interpret the layout, which is less uniform than
one produced by MDS, capturing data features better. The distance function and
normalization significantly impact the graph's visual quality.
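A minimal sketch of the centroid principle (a hypothetical helper of our own, not code from the text): given an indicator matrix of objects by categories, each category point is placed at the center of gravity of the coordinates of its objects.

```python
import numpy as np

def centroid_update(Z, X_objects):
    """Centroid principle for a bipartite layout: each category point
    sits at the centroid (center of gravity) of its objects.
    Z is the n x m indicator matrix (objects x categories);
    X_objects holds the n x 2 object coordinates."""
    counts = Z.sum(axis=0)                   # number of objects per category
    return (Z.T @ X_objects) / counts[:, None]
```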
Graph-theoretic Graphics
5.1 Introduction
This chapter covers the use of graphs for making graphs. The clash of terms is an
unfortunate historical accident that conflated the graph-of-a-function with the
graph-of-vertices-and-edges. Vertex-edge graphs are crucial for developing algorithms
and are also key for statistical graphics and visualizations.
The chapter addresses laying out graphs on a plane, using points for vertices and line
segments for edges. It adopts the grammar of graphics view, assuming that a graph
maps geometric forms to vertices and edges. Definitions of graph-theoretic terms are
provided and can be referred to later as a glossary.
5.2 Definitions
Ultrametric Trees
5.3 Graph Drawing
Suppose we have a recursive list of single parents and their children. Each child has
one parent, and each parent has one or more children. One node, the root, has no
parent.
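Such a recursive parent/child list has a very simple representation; the sketch below (our own illustration) recovers the root and a children map from a child-to-parent dictionary in which the root maps to None.

```python
def tree_from_parents(parent):
    """Build a children map and find the root from a child -> parent
    dict; the root is the one node whose parent is None."""
    children = {}
    root = None
    for child, p in parent.items():
        if p is None:
            root = child                       # the unique parentless node
        else:
            children.setdefault(p, []).append(child)
    return root, children
```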
A directed geometric tree has one root and many children, representing flow from the
root to the leaves. Examples include water and migration flows. Phan et al. (2005)
provide algorithms for rendering a flow tree using geographic data. An example with
Colorado migration data from 1995 to 2000 shows edge merging for smooth, distinct
flows.
->Spanning Trees
Figure 5.9 (Wilkinson 2005) illustrates a small website's data, where each page is a
node and the links between them are branches. The thickness of these branches shows
traffic between the pages, with the root node near the center due to a force-directed
algorithm that attracts adjacent nodes and repels non-adjacent ones. A companion figure
depicts a plant model in which branches should be short for efficient water distribution
and leaves should be spread out for sunlight exposure, resembling the website layout in
Figure 5.9.
Additive Trees
Additive trees need complex calculations. We have a distance matrix for n objects and
must create a spanning tree that closely matches the original distances. The article
mentions that edge angles are not important; the edges are arranged for easy path
tracing.
-> Networks
Networks are generally cyclic graphs. Force-directed layout methods often work well
on networks. The springs algorithm doesn't require a graph to be a tree. Subjects
were asked to produce a list of animal names. Names found adjacent in subjects’
lists were considered adjacent in a graph.
->Directed Graphs
Directed graphs are arranged with source nodes on top and sink nodes at the bottom.
To lay out a directed graph, a topological sort is needed. Cyclical edges are
temporarily inverted to create a directed acyclic graph (DAG) for identifying paths to
sink nodes. A topological sort produces a linear order of the DAG, with vertex u
above vertex v for each edge. Reducing edge crossings is difficult and involves
maximizing Kendall’s τ correlation between layers. Heuristic methods include direct
search, simulated annealing, or constrained optimization. Figure 5.14 shows the
evolution of the UNIX operating system as computed by a graph layout program from
AT&T.
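The topological sort step can be sketched with Kahn's algorithm; this is a generic illustration, not the code of any particular layout system.

```python
from collections import deque

def topological_sort(vertices, edges):
    """Kahn's algorithm: repeatedly emit a vertex with no remaining
    incoming edges, so every edge (u, v) ends up with u before v."""
    indeg = {v: 0 for v in vertices}
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in vertices if indeg[v] == 0)  # sources first
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(vertices):
        raise ValueError("graph has a cycle")
    return order
```

Layering the vertices by their position in this order is what puts sources at the top and sinks at the bottom of the drawing.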
->Treemaps
Treemaps are a way to divide a space into smaller parts. The easiest example is a
nested rectangular layout. To create a rectangular treemap from a binary tree, we start
at the tree's root and split a rectangle vertically. Each part represents one of the root's
children. We then split those parts horizontally and keep doing this until all tree nodes
are represented. We can also color the rectangles based on weights or resize them
according to those weights. An example shows this using color and size to visualize
news sources.
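The slice-and-dice construction just described can be sketched recursively; the node format and function names below are our own illustrative choices.

```python
def weight(node):
    """Leaf weight, or the sum of all descendant leaf weights."""
    kids = node.get('children', [])
    return node['weight'] if not kids else sum(weight(c) for c in kids)

def slice_and_dice(node, rect, vertical=True):
    """Slice-and-dice treemap: split rect = (x, y, w, h) among a
    node's children in proportion to their weights, alternating the
    split direction at each level of the tree.
    Nodes are dicts: {'name': str, 'weight': float, 'children': [...]}"""
    x, y, w, h = rect
    rects = {node['name']: rect}
    kids = node.get('children', [])
    if not kids:
        return rects
    total = sum(weight(c) for c in kids)
    offset = 0.0
    for c in kids:
        frac = weight(c) / total
        if vertical:                            # split left-to-right
            sub = (x + offset * w, y, frac * w, h)
        else:                                   # split top-to-bottom
            sub = (x, y + offset * h, w, frac * h)
        rects.update(slice_and_dice(c, sub, not vertical))
        offset += frac
    return rects
```

Coloring each rectangle by a second variable, as in the news-source example, then only requires a mapping from weights to a color scale.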
Geometric graphs are important for data mining and analysis due to their ability to
describe sets of points in a space. We will use some of these graphs for visual
analytics in the next section. This section includes examples using data from the Box–
Jenkins airline dataset.
Many geometric graphs have been created to show the "shape" of a set of points X on
a plane. Most of these are proximity graphs, which have edges based on an indicator
function determined by distances between points in a metric space. To define this
function, an open disk D is used. D touches a point if the point is on its boundary and
contains a point if it is inside D. The smallest disk touching two points is D₂, with a
radius of half the distance between them, and its center is halfway between the two
points. An open disk of fixed radius is called D(r), and one of fixed radius centered on
a point is D(p,r).
->Disk Exclusion
Several proximity graphs are defined by requiring that a disk touching a pair of points be empty of other points.
Delaunay Triangulation
In a Delaunay graph, an edge connects any two points that can be touched by an open
disk containing no other points. The Delaunay triangulation and its dual, the Voronoi
tessellation, are useful for describing point distributions. Although they can be
generalized to higher dimensions, they are mostly used in two dimensions. There are
several proximity graphs that are subsets of the Delaunay triangulation.
Convex Hull
A polygon is a closed shape with n vertices and n faces (edges). Its boundary can be
shown as a geometric graph with vertices as the polygon's points and edges as its faces.
The hull of a set of points X in 2D space is a group of one or more polygons that
include some points from X as vertices and contain all points in X. A polygon is
convex if it includes all straight lines between any two points inside it. The convex
hull of X is the smallest convex shape that contains X. There are various algorithms to
find the convex hull, which is related to the outer edges of the Delaunay triangulation,
allowing computation in O(n log n) time.
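One standard O(n log n) algorithm is Andrew's monotone chain, sketched below as an illustration (the text does not commit to a particular algorithm): sort the points, then build the lower and upper hulls with a cross-product turn test.

```python
def convex_hull(points):
    """Andrew's monotone chain, O(n log n): build lower and upper
    hulls of the sorted points using a cross-product turn test."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

    def half(seq):
        hull = []
        for p in seq:
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()                 # drop points inside the turn
            hull.append(p)
        return hull

    lower, upper = half(pts), half(reversed(pts))
    return lower[:-1] + upper[:-1]         # counterclockwise hull
```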
Nonconvex Hull
A nonconvex hull is a type of hull that isn't convex. It includes simple shapes like a
star convex or monotone convex hull, as well as complex shapes, space-filling
objects, and those with separate parts. We consider the hull of these shapes as the
outer edges of their complexes. In an alpha-shape graph, an edge connects two points
if an open disk D(α) containing no other points can touch both of them.
Complexes
There are several important subgraphs of the Delaunay triangulation used for understanding
point density, shape, and more. In a Gabriel graph, an edge connects two points if their
D2 disk contains no other points. A relative neighborhood graph connects points if their
lune region has no points. A beta skeleton graph is a mix of both, with size
determined by a parameter β. A minimum spanning tree is a part of a Gabriel graph.
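The Gabriel-graph test can be written directly from its definition: two points are joined when the open disk D2 whose diameter is the segment between them contains no other point. The brute-force O(n³) sketch below is for illustration only.

```python
def gabriel_graph(points):
    """Gabriel graph: connect points i and j iff the open disk whose
    diameter is the segment between them contains no other point."""
    def d2(a, b):
        return (a[0]-b[0])**2 + (a[1]-b[1])**2

    n = len(points)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            mid = ((points[i][0] + points[j][0]) / 2,
                   (points[i][1] + points[j][1]) / 2)
            r2 = d2(points[i], points[j]) / 4    # squared radius of D2
            if all(d2(points[k], mid) >= r2
                   for k in range(n) if k not in (i, j)):
                edges.append((i, j))
    return edges
```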
->Disk Inclusion
Several proximity graphs are defined by disk inclusion. Edges in these graphs exist when
disks of certain radii contain pairs of points, and such graphs are not usually subsets of
the Delaunay triangulation. In a k-nearest-neighbor graph (KNN), a directed edge is present
between points p and q if their distance is among the k smallest distances.
Applications often simplify KNN by removing self-loops and edge weights. If k = 1, it
is a subset of the MST. A distance graph connects points within a defined radius,
while a sphere-of-influence graph connects points based on nearest-neighbor
distances.
5.5 Graph-theoretic Analytics
Some graph-analysis methods are suitable for visualization.
-> Scagnostics
When there are many variables, scatterplot matrices can become difficult to use. The
clarity of the display diminishes, and finding patterns becomes impractical for more
than 25 variables due to the large number of scatterplots. To address this issue, Tukey and
Tukey developed a method that reduces the complexity of visual analysis by using a small
number of measures that describe the distributions of the 2D scatterplots. These measures
include the area and perimeter of convex hulls, kernel density contours, and other
statistics.
After calculating these measures, the Tukeys created a new scatterplot matrix using them. This
new matrix helped identify unusual patterns in the original data. Wilkinson and
colleagues later improved this method by using proximity graphs, increasing the
efficiency and allowing it to handle different types of variables. They defined nine
scagnostics measures to further analyze the data. An example showed how the
Outlying measure flagged significant cases and described their characteristics in the
scatterplots.
Comparing Sequences
Suppose we have two sequences of characters or objects and we wish to compare
them. If the sequences are of length n, we can create an n by n table of zeroes, placing
a 1 in a diagonal cell if the values match at that position. An identity matrix indicates
identical sequences, plotted as a square array of pixels. With real data, matching runs
of subsequences are often found off the diagonal. Figure 5.29 shows how
subsequences appear as diagonal runs, with longer bars indicating longer matching
subsequences.
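The matching table and its diagonal runs can be sketched as follows (our own minimal code); the run-length scan is a standard longest-common-substring dynamic program over the 0/1 matrix.

```python
def match_matrix(s, t):
    """Dot-plot comparison: cell (i, j) is 1 when s[i] == t[j], so
    matching subsequences show up as diagonal runs of 1s."""
    return [[1 if a == b else 0 for b in t] for a in s]

def longest_diagonal_run(M):
    """Length of the longest diagonal run of 1s, i.e. the longest
    common contiguous substring of the two sequences."""
    best = 0
    n, m = len(M), len(M[0])
    run = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if M[i][j]:
                # extend the diagonal run ending at the previous cell
                run[i][j] = 1 + (run[i-1][j-1] if i and j else 0)
                best = max(best, run[i][j])
    return best
```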
Critical Paths
Suppose we have a directed acyclic graph (DAG) where the vertices represent tasks
and an edge (u, v) means task u must be completed before task v. How do we
schedule tasks to minimize overall completion time? This job-scheduling problem has
many variants. One common variant weights the edges by the time needed to
complete tasks. We will discuss two aspects involving graphing: first, how to layout a
graph for the project by flipping it to a horizontal orientation, resulting in a CPM
(critical path method) graph. Second, how to identify and color the critical path,
which is easy without weighted edges through a breadth-first search. Finding the
shortest path in a weighted graph requires dynamic programming. Large project graph
layouts can be messy, so an alternative is a Gantt chart, which shows time on the
horizontal axis and task duration as bar lengths. Modern Gantt charts combine these
tasks with critical path information. Most computer project management packages
compute the critical path using graph algorithms and display them in Gantt charts.
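The critical path computation can be sketched as a longest-path dynamic program over a topological order. This generic illustration weights the vertices (task durations) rather than the edges, which is one of the variants mentioned above; the function and variable names are our own.

```python
from collections import deque

def critical_path(tasks, edges, duration):
    """Longest-duration path through a DAG of tasks, found by dynamic
    programming over a topological order (Kahn's algorithm).
    `duration` maps each task to its time; `edges` are (u, v) precedences."""
    indeg = {t: 0 for t in tasks}
    adj = {t: [] for t in tasks}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    order, queue = [], deque(t for t in tasks if indeg[t] == 0)
    while queue:                               # topological sort
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    finish = {t: duration[t] for t in tasks}   # earliest finish times
    pred = {t: None for t in tasks}
    for u in order:                            # relax in topological order
        for v in adj[u]:
            if finish[u] + duration[v] > finish[v]:
                finish[v] = finish[u] + duration[v]
                pred[v] = u
    end = max(finish, key=finish.get)
    path = []
    while end is not None:                     # walk predecessors back
        path.append(end)
        end = pred[end]
    return path[::-1], max(finish.values())
```

The returned path is the one a CPM graph or Gantt chart would highlight, and its total is the minimum overall completion time.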
->Graph Matching
Given two graphs, we can determine if they are isomorphic and identify isomorphic
subgraphs or calculate an overall measure of concordance if they are not. This topic is
important in biology, chemistry, image processing, computer vision, and search
engines. Graph matching helps in searching databases for specific graphs and
provides a way to index large databases of different materials, focusing on matching
2D geometric graphs in this chapter.
Recent techniques explore varied measures for concordance. Notably, the Google
search engine uses a graph spectral measure for similarity assessment. In shape
recognition, proximity graphs built from polygons have been applied. Klein et al.
(2001) introduced a method to match medial axis graphs using edit distance, which
counts the operations needed to convert one graph into another, allowing the accurate
characterization of 2D shapes from 3D shapes. Torsello (2004) expanded on these
ideas. Proximity graphs can address shape-recognition challenges using edit distance
or similarity measurements, as demonstrated by Gandhi (2002) with leaf shape
analysis through turning angles and dynamic time warping.
High-dimensional Data Visualization
6.1 Introduction
“If all you have is a hammer, every problem looks like a nail.” This saying applies to the
use of graphics as well. An expert in grand tours will likely include a categorical variable
in high-dimensional scatterplots, while a mosaic plot expert will fit a data problem
into a categorical framework. This chapter focuses on using different plots for high-
dimensional data analysis, highlighting their strengths and weaknesses.
1. Exploration: During this phase, analysts use various graphics, often unsuitable for
presentations, to uncover important features. The need for interaction is high, and
plots should be created quickly, allowing for instant modifications.
Mosaic plots require significant training for data analysts but are highly versatile
when fully utilized. This section will discuss their typical uses and trellis displays.
The study aims to determine if a person's brand choice affects their detergent
preference. It shows that the interaction between user and preference is significant,
while two other variables, Water Softness and Temperature, indicate that harder water
requires warmer temperatures for effective washing with a fixed detergent amount.
Mosaic plots facilitate the examination of how user and preference interact, especially
for different combinations of Water Softness and Temperature. Recommendations for
creating effective high-dimensional mosaic plots include placing the key interaction in
the last two positions and the conditioning variables first. To avoid clutter, variables
with fewer categories should be listed first. If there are empty cell combinations, it's
advised to position the variables causing emptiness higher in the plot. Highlighting
the last binary variable can reduce cell numbers, and interactive mosaic plots allow
for changing displayed variables to reveal potential interactions more clearly.
The infection risk is highest for cases with risk factors when antibiotics are not given.
Planned caesareans lower the infection risk by about half, and there were no
infections in unplanned caesareans without risk factors and antibiotics, though at least
three were expected. Figures help explore these results better than classic models, but
they should still check significance.
-> Models
Meyer et al. (2008, Chapter III.12) describe a way to show association models using
mosaic plots. Instead of just looking at observed values in log-linear models, expected
values can also be plotted, allowing visibility for empty cells in the modeled data.
Mosaic plots can visualize any continuous variable for categorical data crossings. The
text mentions interactions such as Water Softness and Temperature, and the M-user
and Preference. It also notes that certain models are hard to interpret for
nonstatisticians and discusses log-linear models in more detail in Theus and Lauer
(1999).
6.3 Trellis Displays
Trellis displays plot high-dimensional data using a grid-like structure based on certain
subgroups.
-> Definition
Trellis displays were created by Becker et al. in 1996 to visualize multivariate data.
They use a lattice-like setup to organize plots into panels, where each plot depends on
at least one other variable. To make comparisons easier across rows and columns, all
panel plots use the same scales.
A basic example of a trellis display is a boxplot that shows the gas mileage of cars
based on car type. This setup allows for easy comparison among different car types
since the scale remains consistent. Additional variables, especially binary ones, can be
included using highlighting. A trellis display can feature up to seven variables at once,
with five being categorical and two continuous. The panel plot at the center can
include various statistical graphics, though scatterplots are the most common choice. Up to three
categorical variables can dictate the rows, columns, and pages of the display, and each
panel is labeled to show its corresponding category.
Trellis displays also use shingling, which divides a continuous variable into
overlapping intervals, turning it into a discrete variable. This method differs from
using categorical variables. Although it offers benefits, reading the information from
the strip labels can be difficult. For example, one trellis display may show a boxplot
based on car type and gas mileage, while another could be a scatterplot comparing
MPG and weight, also showing car type and drive as conditioning variables.
However, a common issue in trellis displays is having empty panels or those with
very few observations.
The conditional framework in a trellis display acts like still images of interactive
statistical graphics. Each panel in a trellis display represents a specific part of the data
for a subgroup. An example can be seen with the cars data set. Interactions involve
selecting a subgroup in a barchart or mosaic plot and "brushing," which is moving an
indicator along one or two axes of a plot. Brushing helps select an interval of a
variable and can divide a continuous variable into multiple intervals. This technique
shows flexibility compared to the static view of a trellis display, which is easier to
print.
Wrap-up
Trellis displays are best for continuous axis variables, categorical conditioning
variables, and categorical adjunct variables. While shingling can be used sometimes,
it is usually better to avoid it for clarity. Trellis displays are easy to learn and support
static reproduction, but interactive graphics offer more flexibility for exploratory data
analysis. They allow for linking to other plots but lack a global overview.
This section will explore the main use of parallel coordinate plots in data analysis
applications. The key aspects include investigating groups/clusters, outliers, and
structures across many variables simultaneously. Three main uses in exploratory data
analysis can be identified.
- Overview
No other statistical graphic can plot so much information (cases and variables) at a
time. Thus, parallel coordinate plots are an ideal tool to get a first overview of a data
set. Figure 6.11 shows a parallel coordinate plot of almost 400 cars with 10 variables.
All axes have been scaled to min-max. Several features, like a few very expensive
cars, three very fuel-efficient cars, and the negative correlation between car size and
gas mileage, are immediately apparent.
- Profiles
Parallel coordinate plots can highlight the profile of a single case, not just for one case
but also for entire groups to compare with other data. They are especially useful when
the axes are ordered, like time. Figure 6.12 shows the highlighted profile of the most
fuel-efficient car.
- Monitor
When working on subsets of a data set, parallel coordinate plots can connect features
of a specific subset to the whole data set. For example, they can help identify major
axes in multidimensional scaling (MDS). The leftmost cases in MDS are hybrid cars
with high gas mileage, while the top right are heavy cars like pickups and SUVs.
Similar findings could also be achieved with biplots.
-> Limits
Parallel coordinates are often seen as overvalued for understanding multiple features
in a data set. Scatterplots are better for examining 2-D features, but scatterplot
matrices (SPLOMs) require much more space to display the same information as
parallel coordinate plots (PCPs). While PCPs do not typically help in identifying
multivariate outliers, they are very beneficial for interpreting results from multivariate
procedures like outlier detection, clustering, or classification.
PCPs can manage many variables, but they struggle when plotting more than a few
hundred lines due to overplotting. This issue is more significant in PCPs because they
use only one dimension for plotting, unlike scatterplots which utilize points. One way
to address overplotting is through the use of α-blending, which adjusts the opacity of
plotted lines to improve visibility in densely populated areas.
Figures demonstrate how α-blending enhances readability. The “Pollen” data set, for
instance, highlights a hidden word when using a lower alpha value. In another
example featuring real data on olive oil fatty acids, varying α-values can reveal the
group structure of regions. It's important to experiment with different α-blending
settings for optimal results, as its effectiveness varies based on the rendering system
and plot size.
Sorting
Sorting in parallel coordinate plots is important for understanding the data, as patterns
are often found among neighboring variables. In a plot of k variables, only k − 1
adjacencies can be seen in any one ordering, and the default order usually reflects
the sequence in the data file, which can be arbitrary. Since there are k(k − 1)/2 pairs
of variables, at least ⌈k/2⌉ different arrangements are needed to view all adjacencies.
When variables share the same scale,
sorting them by criteria such as statistics or multivariate results can help clarify the
plot. For larger data sets, sorting options should be available both manually and
automatically.
Scalings
Besides the standard way of plotting values across each axis from the minimum to
maximum of the variable, there are other scaling methods that can be helpful. The key
option is whether to scale the axes individually or to use a common scale for all.
Other scaling methods determine how the values are aligned, such as at the mean,
median, a specific case, or a specific value. For individual scales, using a 3σ scaling is
often effective for fitting the data into the plot area.
Consider a parallel coordinate plot of the individual stage times for 155 cyclists from the
2005 Tour de France. The upper plot displays individual scales, while the middle plot shows
a common scale, allowing for better comparability of times, though the spread is less
visible during certain stages. The lower plot aligns the axes at their median, making it
easier to see overall performance, particularly the time of the peloton.
For a comprehensive view of the race, examining cumulative times is more beneficial.
Plotting cumulative times on individual scales highlights the varying resolutions of the
data. A common scale is suggested, but it must be aligned at the median to illustrate
how different stages influenced overall performance. The impact of mountain stages
is evident, with the plot showing how cyclists from the "Discovery Channel" team
varied throughout the race.
A further figure compares these developments alongside two profile plots showing the
cumulative category of stage difficulty and the average speed of the stage winner,
revealing their negative correlation.
-> Wrap-up
Parallel coordinate plots need features like α-blending and scaling to be useful.
Examples in this chapter show how these additions provide insights into high-
dimensional data. Highlighting subgroups helps understand group structures and
outliers.
The grand tour is an advanced tool for exploring data, allowing for nonorthogonal
projections that can reveal details missed by traditional methods. An example using
the cars data set shows different projections colored by the number of cylinders.
Unlike parallel coordinate plots, which display univariate distributions and some
bivariate relationships, the grand tour focuses only on multivariate features, often
showing minimal results unless significant patterns are present. While it can illustrate
the relationships of variables beyond three dimensions, examples of structures in over
five dimensions are uncommon. There are few flexible implementations of the grand
tour, making it challenging to apply these methods effectively.
7.1 Introduction
In data visualization, a glyph is a visual way to represent data, where graphical traits
are based on data attributes. For instance, a box's size can reflect a student's exam
scores, while its color can show the student's gender. This broad definition includes
various visual elements like scatterplot markers, histogram bars, and line plots.
Glyphs help visualize multivariate data effectively, making it easier to see patterns
involving multiple dimensions compared to other techniques. They allow analysts to
detect and classify complex relationships between data records.
However, glyphs have limitations. They may not convey data accurately due to size
constraints and human visual perception limits. Also, visualizing too many data
records can cause overlaps or force glyphs to shrink, making patterns hard to see.
Therefore, glyphs work best for qualitative analysis of smaller data sets. This paper
discusses glyph generation, issues affecting glyph effectiveness, and offers ideas for
future research in visualization.
7.2 Data
Glyphs are often used to display multivariate data sets, which consist of items defined
by a vector of values. This data can be seen as a matrix where rows are records and
columns are variables or dimensions. For this paper, we will consider data items as
vectors of numeric values, although categorical and non-numeric values can also be
shown using glyphs after conversion to numeric format. A data set is made up of one
or more records, allowing for normalization through calculated minimum and
maximum values. Dimensions can be independent or dependent, suggesting the need
for grouping or consistent mapping based on data type.
7.3 Mappings
Many authors have created lists of graphical attributes for mapping data values,
including position, size, shape, orientation, material, line style, and dynamics. These
attributes allow for various mappings for data glyphs, classified as one-to-one
mappings, one-to-many mappings, and many-to-one mappings.
One-to-one mappings pair each data attribute with a different graphical attribute,
leveraging the user's knowledge for intuitive understanding. One-to-many mappings
use redundant mappings to improve interpretation, like mapping population to both
size and color for clearer analysis. Many-to-one mappings help compare values across
different dimensions for the same record. This paper mainly discusses one-to-one and
many-to-one mappings, though the principles also apply to other types.
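A one-to-one mapping can be sketched as a simple pairing of data dimensions with graphical attributes. The attribute names below are illustrative assumptions, not a standard glyph API.

```python
# Hypothetical sketch of a one-to-one mapping: each data dimension is
# paired with exactly one graphical attribute of a glyph.

def one_to_one_glyph(record, attributes=("width", "height", "color", "angle")):
    """Pair the i-th value of a normalized record with the i-th attribute."""
    if len(record) > len(attributes):
        raise ValueError("more dimensions than graphical attributes")
    return dict(zip(attributes, record))

glyph = one_to_one_glyph([0.2, 0.8, 0.5])
print(glyph)  # {'width': 0.2, 'height': 0.8, 'color': 0.5}
```

A one-to-many mapping would simply pair one dimension with several attributes (say, both size and color), and a many-to-one mapping would pair several dimensions with repeated uses of the same attribute.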
The following list (from Ward, 2002) includes some glyphs found in literature or
common use. Some are specific to applications like fluid flow visualization, while
others are general purpose. Later, we analyze these mappings to identify their
strengths and weaknesses.
The list above shows that many possible mappings exist, many of which have not yet
been proposed or assessed. The question is which mapping will best fit the task's purpose,
data characteristics, and the user's knowledge and perception skills. These issues are
explained in the sections below.
One common criticism of data glyphs is the implicit bias in mappings, where some
attributes are easier to perceive than others. For instance, in profile glyphs, adjacent
dimensions are easier to measure than separated ones, and in Chernoff faces, certain
attributes are perceived more accurately than others. This section categorizes these
biases, using previous studies and our own research, highlighting the need for more
work to measure and correct these biases in glyph design and data analysis.
Perception-based bias
Certain graphical attributes are easier to see and compare than others. Experiments
show that length along a common axis is measured more accurately than angle,
orientation, size, or color. Different mappings of the same data illustrate this, with
profile glyphs being the easiest and pie glyphs being the hardest to interpret.
Proximity-based bias
In most glyphs, it's easier to see and remember relationships between data dimensions
that are next to each other than those that are not. No experiments have quantified this
bias, but Chernoff and Rizvi (1975) reported a 25% variance due to data
rearrangement. The bias likely varies with the glyph type.
Grouping-based bias
Graphical attributes that are not adjacent but can be grouped may also introduce
bias. For instance, mapping two variables to ear size can reveal their relationship
more clearly than mapping one to eye shape and one to ear size.
Each dimension of a data set corresponds to a specific graphical feature. Changing the
order of these dimensions while keeping the mapping type fixed creates different views
of the data. For N dimensions there are N! possible orderings, each yielding a distinct
view, so it is crucial to identify which orderings best support the task. This section
discusses several dimension-ordering strategies that produce more informative views
than random ordering.
-> Correlation-driven
Many researchers suggest using correlation and similarity measures to better organize
dimensions for visualization. Bertin’s reorderable matrix demonstrated how
rearranging rows and columns in a table can reveal groups of related records. Ankerst et
al. used cross-correlation and a heuristic search to rearrange dimensions for clarity.
Friendly and Kwan proposed effect ordering, where the order of graphical objects is
based on observable trends. Borg and Staufenbiel compared traditional glyphs with
factorial suns, showing improved data interpretation for users.
-> Symmetry-driven
Gestalt principles show that people prefer simple shapes and are better at recognizing
symmetry. Peng et al. (2004) studied star glyphs' shapes based on two qualities:
monotonicity and symmetry. They found an ordering that produced more simple and
symmetric shapes, which users preferred. The idea is that simpler shapes are easier to
recognize and help in spotting small variations and outliers, but more formal
evaluations are needed to confirm this. See Fig. 7.3 for an example.
-> Data-driven
Another option is to base the order of the dimensions on the values of a single record
(base), using an ascending or descending sorting of the values to specify the global
dimension order. This allows users to see similarities and differences between the
base record and all other records. It is especially good for time-series data sets to
show the evolution of dimensions and their relationships over time. For example,
sorting the exchange rates of ten countries with the USA by their relative values in the
first year of the time series exposes a number of interesting trends, anomalies, and
periods of relative stability and instability. In fact, the original order is nearly reversed
at a point later in the time series.
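The data-driven strategy above can be sketched as follows: pick a base record, sort the dimension indices by its values, and apply that global order to every record. This is an illustrative sketch, not the text's implementation.

```python
# Sketch of data-driven dimension ordering: sort all dimensions by their
# values in a chosen base record, then apply that global order to every
# record (e.g. before drawing star or profile glyphs).

def order_by_base(matrix, base_index=0, descending=False):
    base = matrix[base_index]
    order = sorted(range(len(base)), key=lambda j: base[j], reverse=descending)
    return order, [[row[j] for j in order] for row in matrix]

rates = [[3.0, 1.0, 2.0],   # base record (e.g. first year of a time series)
         [6.0, 5.0, 4.0]]
order, reordered = order_by_base(rates)
print(order)      # [1, 2, 0] -- ascending order of the base record's values
print(reordered)  # [[1.0, 2.0, 3.0], [5.0, 4.0, 6.0]]
```

In the exchange-rate example, the near-reversal of the original order later in the series would show up as later records being sorted roughly opposite to the base record.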
-> User-driven
As a final strategy, we can let users use their knowledge of the data set to order and
group dimensions in various ways, such as by derivative relations, semantic
similarity, and importance. Derivative relations show that some dimensions may come
from combinations of others. Semantic similarities relate to dimensions with similar
meanings, even if their values don't correlate well. Lastly, some dimensions may be
more important for a specific task, so highlighting these can improve task
performance.
The position of glyphs can show various data attributes like values, order,
relationships, and derived aspects. This section describes a taxonomy of glyph layout
strategies based on several factors: whether placement is data-driven or structure-
driven, whether glyph overlaps are allowed, the balance between efficient screen use
and white space, and the possibility of adjusting glyph positions for better visibility.
Understanding the trade-offs between accuracy and clarity is crucial for interpreting
glyphs effectively.
The direct method in simulations has clear meanings for positions, helping to
highlight or replace data dimensions. Derived methods can enhance the display's
information and reveal hidden relationships in data. However, data-driven methods
often cause glyph overlap, leading to misunderstandings and unnoticed patterns.
Various techniques exist to resolve this issue by distorting position information.
Random jitter is frequently used for data with limited values. Other approaches, like
spring methods, aim to reduce overlaps and displacement. Woodruff et al. introduced
a relocation algorithm for consistent display density. Users should control distortion
levels through maximum displacement settings or animations showing glyph
movement.
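Random jitter, the simplest of the overlap-reduction techniques mentioned above, can be sketched as below; the bounded maximum displacement corresponds to the user-controlled distortion level. The function and parameter names are assumptions.

```python
# Sketch of random jitter for glyph placement: a small random offset is
# added to each position so coincident glyphs separate, bounded by a
# maximum displacement the user can control.

import random

def jitter(positions, max_disp=0.1, seed=None):
    rng = random.Random(seed)
    return [(x + rng.uniform(-max_disp, max_disp),
             y + rng.uniform(-max_disp, max_disp)) for x, y in positions]

pts = [(1.0, 1.0), (1.0, 1.0), (2.0, 1.0)]   # two coincident glyphs
for (x, y), (jx, jy) in zip(pts, jitter(pts, max_disp=0.05, seed=42)):
    assert abs(jx - x) <= 0.05 and abs(jy - y) <= 0.05  # displacement is bounded
```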
Placement strategies based on structure can vary in overlap. A grid layout can avoid
overlaps in ordered data, while tree and graph layouts in dense datasets might lead to
significant overlap. To address overlap, distortion methods help maintain structure
visibility even with movement of glyphs. Nonlinear distortion techniques allow users
to focus on specific data areas without occlusion, enhancing the separation of data
groups as well. This shows a blend of structure and data-driven approaches.
Linked views have been proposed to overcome the limitations of flat 2-D displays. In
static displays, identical plot symbols and colors keep track of the same cases across
plots. This idea was first used in 1982 to link observations across scatterplots, and
"scatterplot brushing" is now a well-known method for connecting data across scatterplots.
Linked views offer benefits such as simple graphical displays and quick ways to explore
different aspects of the data, which are crucial in early data analysis. For instance,
combining a barchart with a histogram allows for comparisons across groups without
altering the original data. The dataset discussed comes from an international survey
assessing the math and science performance of 13- and 14-year-old students in
Germany, encompassing various continuous and categorical variables. Linked views
also work well with complex data, especially in geographic contexts.
Anselin (1999), Wills (1992), and Roberts (2004) discuss the importance of linked
displays in exploring spatial data. These displays support the statistical exploration of
datasets by allowing users to investigate distributional characteristics, identify
unusual behavior, and detect patterns and relationships. Linked views are
particularly beneficial for categorical data and make conditional views easy to obtain.
For instance, a spine plot can reveal male students' reading habits, highlighting that
they are underrepresented in the medium reading categories. While flexibility in data
visualization is essential, it is equally important to have a stabilizing element that
ensures the patterns observed are genuine features of the data. The subsequent sections
outline a systematic approach to linked views, focusing on the characteristics essential
for effective dataset exploration.
Linking views means that two or more plots share and exchange information. To do
this, a linking procedure must create a relationship between the plots. Once this
relationship is set up, it is important to decide what information is shared and how it
is shared. To explore different linking schemes, we look at data displays as defined by
Wilhelm (2005): a data analysis display consists of a frame, a type, a set of graphical
elements, and scales, along with a model and a sample population.
There are four main types of linking structures: linking frames, linking types, linking
models, and linking sample populations. These can be divided into data linking and
scale linking. Information can be shared in two ways: directly from one layer to
another, or through an internal process involving the sample population layer, with
sample population linking being the most common method.
Sample population linking connects two displays and serves as a platform for user
interactions. It defines a mapping that links elements of one sample space to another.
This method is used to create subsets of data and analyze conditional distributions,
ensuring a joint sample space in which these distributions are properly defined.
Identity Linking
The easiest and most common case of sample population linking, known as empirical
linking, uses the identity mapping id: Ω → Ω. This linking scheme aims to show the
connection between observations taken from the same individual or case, exploiting
the natural connection between features observed for the same set of cases.
Identity linking is built into common data matrices, where each row is a case and each
column is a measured variable. It is not limited to identical sample populations: any
two variables of the same length can be combined in one data matrix, leading
software programs to treat them as if they were observed on the same individuals.
Care must be taken when interpreting such artificially linked variables.
Hierarchical Linking
Databases for analysis come from different sources and use different units. However,
when analyzed together, they usually show some connection between the sample
populations. Often, there is a hierarchy among these populations that ranges from
individual persons to various social groups, and even to different societies. This is
similar for spatial data measured on different geographical levels. It is useful to
visualize these connections through hierarchical aggregation displays. A relation must
be established to map elements between different sample population spaces.
Models, as discussed by Wilhelm (2005), are symbols that represent variable terms
and identify the data to be shown in displays. They are essential for defining data
visualization and specify the information to be presented. For instance, a histogram
for a quantitative variable uses a categorization model defined by a vector C =
(C0, . . . , Cc), which segments the variable's range. It counts the frequency of
observations in each segment. The histogram's scale includes the categorization
vector, the order of values in C, and the maximum counts per bin, with the vertical
axis starting at zero.
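The categorization model above can be made concrete: given the cut vector C = (C0, ..., Cc), the model counts how many observations fall into each of the c segments. The following sketch is illustrative, with half-open bins and a closed last bin as an assumption.

```python
# Sketch of the categorization model: a vector C = (C0, ..., Cc) of cut
# points segments the variable's range, and the model counts how many
# observations fall into each bin [C_{k-1}, C_k); the last bin is closed
# on the right so the maximum value is counted.

def bin_counts(values, cuts):
    counts = [0] * (len(cuts) - 1)
    for v in values:
        for k in range(len(cuts) - 1):
            if cuts[k] <= v < cuts[k + 1] or (k == len(cuts) - 2 and v == cuts[-1]):
                counts[k] += 1
                break
    return counts

print(bin_counts([1, 2, 2, 5, 9, 10], [0, 3, 6, 10]))  # [3, 1, 2]
```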
A model linking for the example can be created through the set of observations or the
scale. Linking scales can lead to three cases: linking the categorization vector, linking
the order of categorization values, and linking the maximum count for one bin. If the
categorization operator model is shown as a histogram, the third case involves linking
the vertical axis scales, while the other two cases link the horizontal axis scales,
focusing on bin width and anchor point.
In MANET, histogram scales can be linked, as illustrated in a figure where two
histogram scales are aligned. The left plot is the active one, propagating its scale to
the plot on the right. MANET defines a histogram with five parameters: the two
horizontal scale limits, the bin width, the number of bins, and the maximum bin
height. Any two of the first four
parameters combined with the fifth are enough to define a histogram fully, allowing
parameters to be shared. It's important to also use the same frame size for accurate
comparison, not just the same scales.
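The parameter redundancy noted above follows from simple arithmetic: fixing any two of the scale limits, bin width, and bin count determines the rest. A small illustrative sketch (the function names are assumptions):

```python
# Sketch of the histogram parameter redundancy: from the horizontal
# scale limits and the bin width, the number of bins follows, and
# conversely the bin width follows from the limits and the bin count.

import math

def n_bins(lo, hi, bin_width):
    return math.ceil((hi - lo) / bin_width)

def width_for(lo, hi, bins):
    return (hi - lo) / bins

print(n_bins(0.0, 10.0, 2.5))   # 4
print(width_for(0.0, 10.0, 4))  # 2.5
```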
Examples show three histograms for the same variable, with the left plot being active,
the bottom right plot unlinked, and the top right plot sharing scales but differing in
frame size. Linking scale information is essential, notably in the form of sliders,
which are 1-D graphical representations of model parameters that users can adjust.
Moving a slider changes the underlying model, which updates all related plots. Sliders
assist in dynamic queries, helping filter and analyze data visually.
The order of categorization values matters less for continuous data in histograms but
is crucial for nominal categories lacking a clear order. Linking this scale is common
in bar charts and mosaic plots. The categorization vector is part of both the
observation and scale components, and linking models generally means that plots
share the same variables. All plots in a specific example represent the same variable,
contributing to a comprehensive view of the dataset. Systems have been developed
that combine various views of the same dataset, offering multiple perspectives and
allowing for effective exploration of related variables.
The model layer of a data display is adaptable, encompassing complex models such as
regression, grand tours, and principal components. A basic model link involves
interconnected plots showing raw observations, models, and residual information.
Young et al. introduced this connected structure using grand tour plots, showcasing a
spread plot with a rotating plot and scatterplots for residual information, updated
promptly with any model changes.
The type layer in graphical displays represents the model as closely as possible, but
not all models can be shown without losing information due to limited space and
resolution. The relationship between the type level and model level is strong, meaning
similarities in two displays usually come from aligned models. For instance,
histograms that use the same categories will have the same bin widths. Direct links
between type levels of displays without corresponding model links are rare. Color and
size are key attributes of graphical elements that can be aligned, often without linking
to the model. In pie charts, different colors for slices help distinguish categories and
can be assigned arbitrarily. If slices are ordered meaningfully, such as alphabetically
or by size, the color can reflect model information. Using a consistent color scheme
can reduce misinterpretation.
Aligning axis information means that all displays use the same parameters, typically
reflecting model scales. Different scales can lead to ineffective plot use since some
space remains empty, which might not be noticed if scales are matched. The same
axes can highlight varying observation ranges. Properly linking type information is
crucial for comparing plots effectively. Incorrect visual representations often arise
from closely adjusting axis parameters to scale parameters without considering their
visual differences.
The frame level controls the basic shape and size of a plot window, which is
important for saving screen space and making accurate graphical comparisons.
Different frame sizes can confuse analysts and result in incorrect conclusions. While
attributes like background color are less critical, setting them consistently helps in
data analysis.
The lining paradigm supports sharing information between displays. This occurs
when a new display is created using data from existing displays. Interactive
environments also share information when users modify plots. Roberts et al. (2000)
identify three strategies for exploration: replacement, overlay, and replication.
-> Replacement
In the replacement mode, old information is lost and replaced by new data. This
method works for plot parameters but not for subsetting and conditioning, as it loses
important information on marginal distributions. It is effective in scatterplots with
individual plot symbols where user interaction changes some attributes. However,
users cannot compare the current plot with previous versions directly, only with a
mental copy, which can distort comparisons. Keeping track of changing scenarios and
versions is useful, and a history system that records plot changes can be very helpful.
-> Overlaying
In direct manipulation graphics, overlaying is a common method for showing
conditional distributions in area plots. A histogram can be overlaid on a barchart to
represent data points of selected classes, which helps compare conditional and
marginal distributions. This approach has limitations, such as restricting parameter
choices since the new plot inherits from the original. It can also cause occlusion,
where part of the original display is hidden due to the overlay, especially when the
data varies significantly between subsets and the total sample. This issue is critical in
complex plots like boxplots.
-> Repetition
Repetition is the third strategy for visualizing linked interactions. This strategy allows
users to see repeated and different views of the same data at once. The advantage is
that users get a complete overview of the data, seeing how changes in parameters and
user interactions affect the visualizations. However, the disadvantage is that users
might feel overwhelmed by the many slightly changed views. This strategy needs a
simple way to track these changes and an effective system to organize the displays on
the screen. A condensed form of repetition, called juxtaposition, places a plot for a
selected subgroup next to the original plot rather than on top of it. This keeps
important features of the original plot visible while allowing easy comparison
between the two. Juxtaposition is well-known for static plots but hasn't been widely
used in interactive systems due to challenges in rearranging plots and redrawing them
after each interaction. However, modern computer capabilities can support this
process and allow for a better view of user interactions. Juxtaposition can also be
applied to statistical models, enabling comparisons between results for different
subsets of data.
Different problems arise when the linking scheme is not a straightforward 1-to-1 linking
but a more complicated form such as m-to-1 linking, which occurs in hierarchical linking.
Consider a hierarchy with two levels: the macro level, such as a group of counties,
and the micro level, which consists of towns within those counties. When some towns
in a county are selected, it's helpful to show this partial selection of the county
through partial highlighting. If the macro level is represented with regular shapes,
partial highlighting can occur by dividing the shape into selected and nonselected
parts. A broader approach is to use varying intensities of the filling color in graphical
elements to depict the different proportions selected. This method is effective for
graphical elements with nonrectangular layouts and is generally easier to understand.
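The proportional highlighting described above can be sketched as follows: each micro-level element maps to one macro-level element, and the macro element's highlighting intensity is the proportion of its selected members. The names and data are illustrative assumptions.

```python
# Sketch of m-to-1 hierarchical linking: each town (micro level) maps to
# a county (macro level); a county's degree of highlighting is the
# proportion of its towns that are selected.

def county_proportions(town_to_county, selected_towns):
    totals, hits = {}, {}
    for town, county in town_to_county.items():
        totals[county] = totals.get(county, 0) + 1
        if town in selected_towns:
            hits[county] = hits.get(county, 0) + 1
    return {c: hits.get(c, 0) / n for c, n in totals.items()}

mapping = {"A-town": "A", "A-ville": "A", "B-town": "B"}
print(county_proportions(mapping, {"A-town"}))  # {'A': 0.5, 'B': 0.0}
```

The resulting proportion would drive the fill intensity of each county's graphical element.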
A “data view” refers to any way of viewing data to understand it better. While it is
often associated with charts like bar graphs or scatterplots, it also includes other forms
like regression analysis results, neural network predictions, or geographic information
like maps. Additionally, a family tree displaying relationships is a type of data view.
The term encompasses various forms, such as graphs, diagrams, and visualizations,
but for clarity, we stick to calling it a “data view.”
A linked data view is one that changes in response to modifications made in another
view. An example is a scroll bar in a text editor that adjusts to show which part of the
document is being viewed when it’s moved. This concept is common in user
interfaces and software involving data analysis. A figure in the text demonstrates this
idea using baseball statistics from 1871 to 2004, showing the relationship between
players' salaries and their batting averages in a scatterplot while also displaying a
histogram of the years.
Linking views allows users to select parts of one view, affecting the other views to
highlight data connections. In the example provided, black shows selected data while
gray indicates unselected data. This linking illustrates salary trends over years but
does not seem to change the connection between batting average and salary.
A key question when using visualization is: “Why should I use this?” Analysts may
explore data further if they notice interesting patterns or relationships, like changes in
salary and years played. They often consider other views or add variables to explain
findings better. For instance, they can create additional scatterplots or use advanced
techniques to explore data in higher dimensions.
Despite its benefits, this approach has significant issues that limit its effectiveness.
The main problems are as follows:
As plots get more complex, they are harder to understand. While simple 1-D plots are
easy for most people, multi-dimensional visualizations like 3-D point clouds and
multivariate projections are less intuitive.
Furthermore, monolithic data views struggle to handle different types of data. High-
dimensional techniques often assume variables are numeric, making it difficult to add
a numeric variable to a table of two categorical variables without switching to a
different display type, like a trellis.
Some data types, specific to certain domains, cannot be directly integrated. Analyzing
relationships in multivariate data from geographic locations, graphs, or text
documents is challenging, often requiring the use of separate software tools that
complicate the analysis process.
The linked views approach can address these challenges by creating several simpler
views that are interconnected. When a user interacts with one view, the others update
automatically, making interpretation easier and allowing for more specialized data
integration.
However, linked data views are not always better than a single complex view. In some
cases, a unified multivariate technique is essential to identify specific features, and
presenting results from an interactive exploration can be harder. Still, in numerous
situations, especially those focused on conditional distributions, linked data views are
very effective.
For example, the histogram now shows years in the league, resembling a Poisson
distribution. Players with five or more years of experience not only earn higher
salaries but also show a stronger linear relationship between batting average and
log(salary). For younger players, performance may not significantly influence pay
unless their batting average is above average.
In Sect. 9.1, examples of the linked views paradigm were discussed. This section
defines it more precisely and explains how to implement a linked views system. An
interactive environment is necessary for such a system, as linking involves interaction
with graphical representations of the data. A linked views environment should have
multiple views that meet certain conditions.
First, at least one view must detect user interaction and translate it into a degree of
interest in the displayed data, distinguishing between different data subsets based on
that interaction. Second, there must be a mechanism to share the degree of interest
from the first view to other views. Third, another view must be able to respond to the
interest measure by changing its appearance to reflect the degree of interest
concerning the data it displays.
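The three conditions above suggest an observer-style structure: one view translates interaction into a degree of interest, a shared mechanism broadcasts it, and the other views respond. The following is a minimal sketch under that assumption, not the text's implementation.

```python
# Minimal sketch of a linked views environment: a hub broadcasts the
# degree of interest produced by one view to all other registered views.

class LinkedViews:
    def __init__(self):
        self.views = []

    def register(self, view):
        self.views.append(view)

    def broadcast(self, interest, source):
        for view in self.views:
            if view is not source:
                view.update(interest)   # each view redraws for the new interest

class View:
    def __init__(self, hub):
        self.interest = {}
        self.hub = hub
        hub.register(self)

    def select(self, rows):              # user interaction -> degree of interest
        self.interest = {r: 1.0 for r in rows}
        self.hub.broadcast(self.interest, self)

    def update(self, interest):          # respond to another view's selection
        self.interest = interest

hub = LinkedViews()
a, b = View(hub), View(hub)
a.select([3, 7])
print(b.interest)  # {3: 1.0, 7: 1.0} -- the second view received the selection
```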
The “degree of interest” concept measures how interesting a user finds specific data
subsets. For example, if a user selects bars for five or more years in a league, they
indicate interest in those data rows. Each subset must have a numerical interest
measure so that results can be aggregated. An aggregated view represents multiple
data cases with a single graphic item, like a histogram. In contrast, an unaggregated
view shows each data row individually, as seen in scatterplots. Data cases and data
rows refer to single data observations, and graphic items are visually distinct units that
can be identified separately from others.
A simpler version of the degree of interest can be used, where each data case is
assigned a degree of interest value, typically ranging from 0 to 1. A value of 0 means
no interest, while 1 indicates maximum interest. For a subset of data, the average of
these values can represent the interest measure. However, other summary functions
might be useful depending on the context. For instance, a maximum summary
function helps identify outliers since they have high interest, which can be missed by
average measures.
Another simplification states that any view defining a degree of interest must assign it
as either 0 or 1, separating selected cases from unselected ones. This binary system is
common, but more complex scenarios will be discussed later.
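The choice of summary function can be made concrete with a small sketch: the mean of a subset's degree-of-interest values shows the selected proportion, while the maximum keeps an interesting outlier visible even in a large bin. This is illustrative code, not from the text.

```python
# Sketch of summary functions for an aggregated view: each graphic item
# (e.g. a histogram bar) represents many cases, so their individual
# degree-of-interest values must be summarized.

def summarize(doi_values, how="mean"):
    if how == "mean":
        return sum(doi_values) / len(doi_values)
    if how == "max":
        return max(doi_values)
    raise ValueError(how)

bar = [0.0, 0.0, 0.0, 1.0]          # one highly interesting case in the bar
print(summarize(bar))               # 0.25 -- a quarter of the bar is selected
print(summarize(bar, how="max"))    # 1.0  -- the outlier is not averaged away
```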
To meet the requirement of indicating user interest, there are various methods,
including brushing, where users drag a shape over data to select it, setting the degree
of interest to 1 for those items. Rectangle or rubber-band selection allows users to
click and drag to create a shape, selecting items inside it. Lassoing enables users to
create a polygon selection by clicking to define the shape and selecting the
intersecting items.
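Rectangle (rubber-band) selection, the simplest of these methods, reduces to a containment test: items inside the dragged rectangle get degree of interest 1, all others 0. A sketch under that assumption:

```python
# Sketch of rectangle selection: points inside the dragged rectangle get
# degree of interest 1.0, all others 0.0. The corner arguments may be
# given in any order, as when dragging a rubber band.

def rectangle_select(points, x0, y0, x1, y1):
    lo_x, hi_x = min(x0, x1), max(x0, x1)
    lo_y, hi_y = min(y0, y1), max(y0, y1)
    return [1.0 if lo_x <= x <= hi_x and lo_y <= y <= hi_y else 0.0
            for x, y in points]

pts = [(0.5, 0.5), (2.0, 2.0), (1.0, 0.9)]
print(rectangle_select(pts, 0.0, 0.0, 1.5, 1.0))  # [1.0, 0.0, 1.0]
```

Lassoing would replace the containment test with a point-in-polygon test but otherwise follow the same pattern.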
To fulfill the requirement for displaying degrees of interest, a view must represent the
main data alongside the interest value. This value can be treated like other variables in
graphic design. Different types of variables can be used to define a view's layout and
appearance.
In the provided figures, different methods of showing degrees of interest are
illustrated. For one method, a 3D barchart uses interest as a continuous variable along
the z-axis. Additionally, each bar could be split to show the proportions of selected
and unselected data.
Showing the interest degree through a brightness scale could also be implemented.
The last view in one figure divides data into selected and unselected subsets, fitting
into established faceting schemes. This demonstrates how binary selection can be
integrated into existing structures, as seen in a baseball context with varying leagues
that illustrate different clusters of data points over time. Further investigation into this
topic will be provided later.
One of the first widely recognized techniques for data visualization is the scatterplot
brushing method by Becker, Cleveland, and Wilks from 1987. This technique
arranges scatterplots of multiple variables in a matrix format, allowing quick
comparison of how one variable relates to others. The method is enhanced by using a
brush, which highlights data points in different colors across the matrix when
selected. This creates a visual connection among the scatterplots.
The effectiveness of this technique comes from the clear link between data points and
their graphical representation, allowing flexible use of colors and symbols for each
data item. Unlike aggregated data, where multiple values are combined, each data row
corresponds directly to a specific graphical element in the scatterplot. This is
illustrated in the related figures, showing how selecting data points can change their
display without losing clarity.
Although scatterplots are an obvious choice for this method, other graphic tools, like
XGobi, also provide linked views using brushes. GGobi is the latest version of this
software. Unaggregated views, including raw data tables, can also utilize linked
selection, allowing users to focus on selected data points in a refined table view,
commonly known as a "drill-down" view.
Parallel coordinates views, introduced by Inselberg in 1985, represent high-
dimensional data as lines in a 2-D space. Though they work best with smaller
datasets, they are suitable for linked views as they show the distinctions between
selected and unselected lines, even if crowded.
Brushing utilizes different modes, such as transient, which resets selections when the
brush moves away; additive, keeping selections active; and subtractive, which
deselects items. There are various combinations of these modes, although some are
less practical. Effective usage of these brushing modes enhances interaction and data
analysis.
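The three brushing modes can be expressed as set operations on the current selection: transient replaces it, additive unions the brushed items in, and subtractive removes them. A minimal sketch (the function name is an assumption):

```python
# Sketch of the three brushing modes applied to a selection set.

def apply_brush(selection, brushed, mode="transient"):
    if mode == "transient":
        return set(brushed)              # selection resets when the brush moves
    if mode == "additive":
        return selection | set(brushed)  # brushed items stay selected
    if mode == "subtractive":
        return selection - set(brushed)  # brushed items are deselected
    raise ValueError(mode)

sel = apply_brush(set(), [1, 2], "transient")      # {1, 2}
sel = apply_brush(sel, [3], "additive")            # {1, 2, 3}
sel = apply_brush(sel, [2], "subtractive")
print(sorted(sel))                                 # [1, 3]
```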
The unaggregated approach struggles when dealing with large datasets. For example,
with tens of thousands of data points, using aggregated views like bar charts and
histograms is more effective. In this chapter, a dataset with around 16,000 players and
86,000 player seasons is examined to show how summary views are easier to
understand. The chapter highlights techniques for linking these aggregated views.
One of the first tools that featured linking between views was the Data Desk software,
initially created for teaching statistical analysis but now a full-featured package. It
offers linked views of both aggregated and unaggregated data, allowing users to
explore unusual cases and modify models quickly. This section focuses specifically on
how Data Desk links different aggregated views.
Figure 9.6 illustrates linking from a scatterplot to a bar chart. Two methods show the
data: one divides bars into sections for selected and non-selected data, while the other
uses brightness to show the data for each case. LispStat, another tool, allows users to
create their own functions for linking views thanks to its interpreted language.
In the visual display, one technique stacks sections of bars to show selected items or
uses summary statistics for a single bar value. Both methods are important for
displaying linked views clearly. Figures 9.7 and 9.8 present these techniques with
extensive data and help to illustrate complex relationships among multiple variables.
The text also touches upon designated hitters in baseball, which have a specific role
and can impact player statistics. The document emphasizes how different graphical
methods can highlight various aspects of the data, particularly focusing on
distinguishing between selected and unselected subsets. It concludes that different
visualization approaches serve different analytical purposes.
An example of this is called dynamic queries in the FilmFinder, which allows each
variable to keep its own selection state. Memory-based linking, as shown in the
baseball data, enables detailed exploration. The memory-based system allows easy
multi-variable queries and is tolerant of mistakes, while a memoryless system
provides power and adaptability.
REGARD also advanced view linking for networks, which was further explored in
NicheWorks. This system focused on nodes and links in a graph and used the linking
mechanism to examine relationships between these data sets. Distance-based linking
was applied, with distance defined by graph-theoretic connections.
Another significant area discussed is modeling results. In earlier sections, Data Desk
included text descriptions of models within the linking paradigm. Developing model-
specific views, like hierarchical clustering, can be beneficial. Such clustering uses
similarities and can be shown in tree representations. Each leaf node has a degree of
interest, indicated by selections from parallel coordinates. Different representations of
clustering trees can be shown, including a treemap design that divides a rectangle
based on the size of children nodes, providing clear visibility of the data.
Visualizing Trees and Forests
10.1 Introduction
Tree-based models are a strong alternative to traditional models for various reasons.
They are easy to understand, can work with both continuous and categorical data,
handle missing values, perform variable selection, and model interactions effectively.
Common types of tree-based models include classification, regression, and survival
trees.
Visualization is key for tree models because they can be interpreted easily. Decision
trees, shown as decision rules, are intuitive to understand. They provide insights into
the data, such as cut point quality and prediction reliability. This chapter introduces
tree models and offers visualization techniques, including hierarchical views and
treemaps. It also discusses split stability and tree ensembles, and ways to visualize
multiple tree models.
The basic principle of all tree-based methods is to recursively divide the data space to
create subgroups for prediction. This process starts with the full dataset, using a rule
to split the data into separate parts. It continues until no further splitting rules exist.
Classification and regression trees use simple decision rules, evaluating one data
variable at a time. For continuous variables, splits create two partitions based on a
constant. Categorical variables split based on assigned categories.
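As a minimal sketch, the recursive partitioning principle can be written in a few lines of Python. This is an illustration only, not any particular package's implementation: the nested-dict tree format, the `find_split` hook, and the toy midpoint rule are all hypothetical.

```python
# Minimal sketch of recursive binary partitioning with univariate rules.
# A split is (variable_index, threshold): cases with x[var] <= threshold go left.

def partition(rows, find_split, min_size=5):
    """Recursively split `rows` until no admissible split remains.

    `find_split(rows)` returns (var, threshold) or None; it is a placeholder
    for whatever split-search rule the algorithm uses. Returns a nested dict:
    inner nodes carry the rule, leaves carry the cases of their partition.
    """
    split = find_split(rows) if len(rows) >= min_size else None
    if split is None:
        return {"leaf": rows}                      # terminal partition
    var, t = split
    left = [r for r in rows if r[var] <= t]
    right = [r for r in rows if r[var] > t]
    if not left or not right:                      # degenerate split: stop
        return {"leaf": rows}
    return {"split": (var, t),
            "left": partition(left, find_split, min_size),
            "right": partition(right, find_split, min_size)}

# Toy rule for illustration: split variable 0 at the midpoint of its range.
def midpoint_once(rows):
    vals = sorted(r[0] for r in rows)
    if vals[0] == vals[-1]:
        return None                                # nothing left to separate
    return (0, (vals[0] + vals[-1]) / 2)

tree = partition([(1,), (2,), (8,), (9,)], midpoint_once, min_size=2)
```

The recursion stops either when a node falls below the minimum size or when no split can separate the remaining cases.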
This partitioning is represented by a tree, where the root node is the first split, and the
leaves are the final partitions. Each partition has a prediction, with classification trees
predicting classes and regression trees predicting constants. A decision tree consists of
rules in inner nodes, regardless of the prediction type in the leaves.
An example tree model is based on data from Italian olive oils, which includes
different fatty acids and their concentrations and categorizes the oils by region. The
model aims to show how olive oil composition varies across Italy, specifically across
five regions: Sicily, Calabria, Sardinia, Apulia, and North. The visualizations of this
tree vary, even with the same underlying model.
There are several tasks involved in visualizing a decision tree: placing nodes, visually
representing nodes and edges, and adding annotations for clarity. Different
representations can convey additional information. For instance, some plots may use
simple tick marks for nodes, while others might use rectangles sized based on data
cases. Colors may indicate class proportions within nodes too.
Placement of nodes has generated much discussion in graph visualization. For small
trees, straightforward methods are effective, but larger trees can complicate layouts.
In many cases, the information contained in nodes is more crucial than the exact
layout. The use of interactive tools allows users to explore large trees more easily.
The basic node placement divides available space equidistantly, but some approaches
consider the quality of splits for node positioning. Comparisons of class proportions
can also be enhanced by rotating the tree for clarity.
Edge representation is usually limited to lines, but it can also include polygons that
reflect the flow of data through a tree. Annotations can add textual or symbolic
elements along nodes and edges, helping clarify predictions and rules. However,
excessive annotations can create clutter, distracting from key points and hindering
readability.
To maintain clarity without losing important information, extra tools like zooming
and toggling are necessary, especially for more in-depth analyses. There are two ways
to provide additional information: integrating it into the tree visualization or using
linked external graphics. Integrating data into the tree makes it immediately relevant
and easier to use, but is limited by screen space. Accordingly, displaying directly
related information is often preferred.
Conversely, external linked graphics offer more flexibility as only one graphic
represents multiple data points, avoiding crowding. Such graphics must be interpreted
more carefully since they are not directly visible in the tree structure. There are no
strict rules about what information should be included in or outside the tree, but a
guideline suggests that complex graphics work better externally, while simpler, tree-
specific information should be included directly in the visualization.
Sectioned Scatterplots
Splitting rules are defined in the covariate space, so a tree model can be visualized
together with its partition boundaries. For univariate partitioning rules, these
boundaries lie on hyperplanes parallel to the covariate axes. A simple scatterplot then
serves as a 2-D projection, clearly showing all splits on the two plotted variables. A
sectioned scatterplot with the first two split variables is displayed along with the tree
model, where different regions are colored and partition boundaries are indicated.
The first tree split uses an eicosenoic variable to distinguish oils from northern Italy
and Sardinia from other regions. This split is very clear in the scatterplot. The next
two inner nodes use a linoleic variable to further differentiate oils from Sardinia and
the northern parts of Italy, and then from Apulian oils to Sicily and Calabria. Further
splits may not be visible in this projection as they involve other variables, but they can
be analyzed through interactive techniques using a series of sectioned scatterplots.
These scatterplots should use variables that are closely linked in the tree and be
limited to data from nodes nearer to the root. Some cut points may show clear
separations, while others might be noisy. In an interactive setting, users can drill down
between scatterplots. Extensions to these plots include changing the opacity of
partition lines based on depth and shading the background for depth or predicted
value. Scatterplots work best with continuous variables, while categorical variables
may benefit from local treemaps, which group categories within the same node.
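As an illustration of how such partition boundaries can be collected, the sketch below walks a hypothetical nested-dict tree and records every split on the two plotted variables as a vertical or horizontal line, together with its depth (which could drive line opacity). The tree format, variable names, and cut values are made up for illustration, not taken from the actual olive oil model.

```python
# Sketch: collect partition boundaries of a tree for a 2-D sectioned
# scatterplot. Splits on the x-variable become vertical lines, splits on
# the y-variable horizontal ones; splits on other variables are invisible
# in this projection and are skipped.

def boundaries(node, xvar, yvar, depth=0, out=None):
    out = [] if out is None else out
    if "split" in node:
        var, t = node["split"]
        if var == xvar:
            out.append(("v", t, depth))   # vertical line at x = t
        elif var == yvar:
            out.append(("h", t, depth))   # horizontal line at y = t
        boundaries(node["left"], xvar, yvar, depth + 1, out)
        boundaries(node["right"], xvar, yvar, depth + 1, out)
    return out

# Illustrative toy tree (values are not from the real data set).
tree = {"split": ("eicosenoic", 7),
        "left": {"split": ("linoleic", 1053),
                 "left": {"leaf": "Sardinia"}, "right": {"leaf": "North"}},
        "right": {"leaf": "South"}}

lines = boundaries(tree, xvar="eicosenoic", yvar="linoleic")
```

Deeper splits come back with larger depth values, so a plotting routine could fade them out relative to splits near the root.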
Treemaps
One way to show all partitions is through area-based plots, where each terminal node
is a rectangle. Treemaps are a type of these plots. The main idea is to divide the
available rectangular plot area according to how the tree model divides data. The
treemap’s area represents the full dataset, starting with a horizontal partition based on
the number of cases for each child node. Next, each partition is split vertically
according to the proportions for its child nodes. This process continues recursively,
alternating between horizontal and vertical splits until reaching terminal nodes. Each
rectangle in the final plot corresponds to a terminal node, with the area reflecting the
number of cases in that node. It helps to adjust spaces between partitions to indicate
their depth, showing larger gaps for splits closer to the root.
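The recursive, alternating subdivision described above can be sketched as follows; the nested-list tree encoding (leaves as case counts) is a hypothetical stand-in chosen to keep the example short, and gaps between partitions are omitted.

```python
# Sketch of the treemap layout: the plot rectangle is divided among a
# node's children in proportion to their case counts, alternating
# horizontal and vertical cuts down to the terminal nodes.

def sum_sizes(node):
    return node if isinstance(node, int) else sum(sum_sizes(c) for c in node)

def treemap(node, x, y, w, h, horizontal=True, out=None):
    out = [] if out is None else out
    if isinstance(node, int):              # terminal node: emit its rectangle
        out.append((x, y, w, h))
        return out
    total = sum_sizes(node)
    offset = 0.0
    for child in node:
        frac = sum_sizes(child) / total    # share of cases in this child
        if horizontal:                     # cut along x, then recurse vertically
            treemap(child, x + offset * w, y, frac * w, h, False, out)
        else:                              # cut along y, then recurse horizontally
            treemap(child, x, y + offset * h, w, frac * h, True, out)
        offset += frac
    return out

# Root splits 60/40 horizontally; the left child splits 30/30 vertically.
rects = treemap([[30, 30], 40], 0.0, 0.0, 1.0, 1.0)
```

Each returned rectangle has an area proportional to its terminal node's case count, so the three leaves above cover 30%, 30%, and 40% of the unit square.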
Treemaps are good for assessing tree model balance and can reveal when trees create
large terminal nodes in noisy scenarios. They also allow highlighting for comparing
groups within terminal nodes. Treemaps extend the treemaps known from information
visualization and are related to mosaic plots. Their key advantage is efficient use of
space, enabling comparisons while keeping context, although they do not show
splitting criteria or allow direct comparisons within nodes.
Spineplots of Leaves
Another useful plot for tree model visualization is the spineplot of leaves (SPOL).
Instead of alternating the partitioning direction like treemaps, SPOL uses horizontal
partitioning, showing all terminal nodes in one row. This fixed height allows for a
visual comparison of the sizes of the terminal nodes, which are proportional to the
width of the bars. Relative proportions of groups can also be easily compared using
highlighting or brushing. A sample SPOL is shown in Fig. 10.7, where each bar
represents a leaf, and its width is proportional to the number of cases in that node. The
plot allows for clear visibility of group proportions within each node and can include
annotations like a dendrogram of the tree.
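The geometry of a SPOL is simple enough to sketch directly: one row of bars, widths proportional to node sizes, with the highlighted fraction of each node recorded per bar. The leaf counts below are illustrative, and the gap size is an arbitrary assumption.

```python
# Sketch of a spineplot-of-leaves layout: all terminal nodes in one row,
# bar width proportional to node size, plus the highlighted fraction
# (e.g. one class under brushing) within each bar.

def spol(leaves, width=1.0, gap=0.02):
    """leaves: list of (n_cases, n_highlighted) per terminal node.
    Returns bars as (x, bar_width, highlight_fraction)."""
    total = sum(n for n, _ in leaves)
    usable = width - gap * (len(leaves) - 1)   # reserve space for gaps
    bars, x = [], 0.0
    for n, hi in leaves:
        w = usable * n / total
        bars.append((x, w, hi / n))
        x += w + gap
    return bars

# Three leaves with 50, 25, and 25 cases; 10, 20, and 5 highlighted.
bars = spol([(50, 10), (25, 20), (25, 5)])
```

Because all bars share the same height, the highlighted fractions (0.2, 0.8, 0.2 here) are directly comparable across leaves, which is the main point of the plot.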
SPOLs are especially useful for comparing group proportions in terminal nodes,
similar to spineplots, but with differences in how categories and gaps are handled.
This section has discussed several techniques for visualizing tree models based on
recursive partitioning, focusing on visualization of splits and data application. All
techniques can be applied to various data subsets, including training and test data,
allowing for a comparison of model adaptability and stability. The next section will
address tree model construction and visualization methods that reflect split quality.
To fit tree models, several methods have been suggested. The most common
algorithm is CART (Classification and Regression Trees), introduced by Breiman et
al. in 1984. This algorithm uses a greedy approach: for each node, it looks at all
possible splits and picks the one that most reduces the impurity of the child nodes
compared to the parent node using an impurity criterion. The split is made, and the
process repeats for each child node. Growth stops if certain rules are met, often
related to the minimum number of cases in a node or the required impurity decrease.
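The greedy search at a single node can be sketched as below. This is a simplified illustration of the idea, not the full CART algorithm: it uses Gini impurity, exhaustively tries every cut point on every continuous variable, and omits stopping rules, categorical splits, surrogate splits, and pruning.

```python
# Sketch of the greedy split search: evaluate every candidate cut on each
# variable and keep the one with the largest impurity decrease.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """X: list of feature tuples, y: class labels.
    Returns (impurity_decrease, variable_index, threshold)."""
    parent = gini(y)
    best = (0.0, None, None)
    n = len(y)
    for var in range(len(X[0])):
        for t in sorted({x[var] for x in X})[:-1]:   # candidate cut points
            left = [yi for x, yi in zip(X, y) if x[var] <= t]
            right = [yi for x, yi in zip(X, y) if x[var] > t]
            # weighted child impurity compared against the parent node
            dec = parent - (len(left) / n) * gini(left) \
                         - (len(right) / n) * gini(right)
            if dec > best[0]:
                best = (dec, var, t)
    return best

# Toy two-class data: variable 0 separates the classes perfectly at 2.
dec, var, t = best_split([(1, 9), (2, 8), (8, 2), (9, 1)], ["a", "a", "b", "b"])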
While pruning is a common practice in tree building, we will not discuss it here.
However, visualization can help with pruning, especially when parameters can be
adjusted interactively.
Impurity measures can be any convex functions, but common ones are entropy and
Gini index, both having theoretical support. It's important to note that this method
looks for local optima only, without considering multiple splits. This approach is
computationally lighter than a complete search and performs well in practice.
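Written as functions of the class proportions in a node, the two standard impurity measures look as follows; both are zero for a pure node and maximal for an even class mix.

```python
# Entropy and Gini index as functions of class proportions p = (p1, ..., pk).
from math import log2

def gini(p):
    return 1.0 - sum(pi * pi for pi in p)

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

# A pure node has no impurity under either measure; an even two-class mix
# gives Gini = 0.5 and entropy = 1 bit.
pure = (gini([1.0]), entropy([1.0]))
mixed = (gini([0.5, 0.5]), entropy([0.5, 0.5]))
```

In practice the two measures usually pick very similar splits; the choice between them rarely changes the resulting tree much.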
Focusing on local optima may make the model unstable. Small changes to training
data can lead to different splits, affecting the entire tree model. We aim to present a
visualization technique to understand decisions made at each node during tree fitting.
Mountain plots visualize impurity decrease across the full range of the split variable.
Using a binary classification example, one can see that there are multiple splits close
to the optimal cut chosen. The competition for best splits is not limited to a single
variable but can involve multiple variables. By comparing mountain plots of different
variables, we can evaluate the stability of a split. If one variable clearly stands out, the
split will be stable. Conversely, competing splits within the optimal range indicate
instability. Mountain plots help assess the quality of splits and can guide the
construction of tree models.
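The quantity behind a mountain plot is just the impurity decrease evaluated at every cut point of one variable, not only at the optimum. A minimal sketch (again using Gini impurity on illustrative data):

```python
# Sketch of a mountain-plot curve: impurity decrease at every candidate
# cut point of a single variable.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def mountain(x, y):
    """Return (cut, impurity_decrease) at every mid-point between
    consecutive distinct values of x."""
    parent, n = gini(y), len(y)
    xs = sorted(set(x))
    curve = []
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        curve.append((t, parent - (len(left) / n) * gini(left)
                                - (len(right) / n) * gini(right)))
    return curve

curve = mountain([1, 2, 3, 4], ["a", "a", "b", "b"])
peak = max(curve, key=lambda c: c[1])
```

Plotting `curve` for each candidate variable and comparing the heights and shapes of the resulting "mountains" is what lets one judge whether the chosen split clearly dominates or merely edges out close competitors.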
Bootstrapping models help analyze the features of fitted models and give insight into
the data. One key advantage of tree models is their implicit variable selection ability.
When a dataset is evaluated, the tree-growing algorithm creates a structure of splits,
and only the variables used in these splits are considered, effectively dropping others.
This analysis focuses on Wisconsin breast cancer data, using 20 trees generated
through bootstrapping with the CART algorithm.
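The bootstrap step itself can be sketched in a few lines; `fit_tree` below is a placeholder standing in for the actual CART fitting routine, and the data and seed are illustrative.

```python
# Sketch of growing a small forest of bootstrapped trees: each tree is fit
# on a sample drawn with replacement from the data.
import random

def bootstrap(data, rng):
    return [rng.choice(data) for _ in data]   # n cases, with replacement

def fit_tree(sample):
    # Placeholder for a real tree-fitting call (e.g. CART on `sample`).
    return {"split_var": 0}

rng = random.Random(0)                        # seeded for reproducibility
data = list(range(100))
forest = [fit_tree(bootstrap(data, rng)) for _ in range(20)]
```

Because each sample omits roughly a third of the cases and duplicates others, the 20 fitted trees can differ in which variables they split on, which is exactly what the variable-usage visualizations that follow examine.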
The first visualization shows a global overview of variables used in the models. UCS
is the most frequently used variable, appearing 20 times, while Mts is the least used,
showing up only once. With a small number of variables available, none are left out
entirely. However, simply knowing how often a variable is used does not indicate its
importance, as deeper splits may involve fewer cases. Therefore, it is crucial to assess
the contribution of each split through a cumulative statistic like impurity decrease.
The second visualization shows the cumulative impurity decrease for each variable
across the 20 trees, ordered by importance. UCS stands out as the most influential,
followed by UCH and BNi. Care must be exercised when drawing conclusions from
this data, as variables that are correlated can affect results. The CART algorithm may
randomly select one of two correlated variables to use, which could leave out the less
significant one that might perform well alone.
To further study these behaviors and the differences among models, it is essential to
examine both the variables and individual trees. Two-dimensional diagrams of trees
and split variables are provided to reveal patterns and comparisons. Notably, four
model groups are identified based on gains, with UCS being the leading variable in
most models. The analysis underscores variable masking effects, confirming the
variability and instability of tree models in this dataset. Finally, for extensive trees,
parallel coordinate plots can be used to visualize the results, emphasizing the need for
careful ordering of the axes based on influential variables.
Data View
The importance and use of variables in splits is a key part of tree models. In Section
10.2.2, another way to visualize trees was discussed: sectioned scatterplots. These
scatterplots can also show forests using semitransparent partition boundaries. An
example shows a forest with the olive oil data divided into nine regions, displaying
variables linoleic vs. palmitoleic and partition boundaries from 100 bootstrapped
trees. This technique aims to visualize all trees and their splits at once, unlike
individual trees that allow for detailed analysis.
The text discusses a method for visualizing classification trees through trace plots,
which consist of a grid with split variables as columns and node depths as rows. Each
cell in the grid represents a tree node, and glyphs within the cells show possible split
points. Continuous variables are indicated by a horizontal axis with tic marks for
splits, while categorical variables are shown as boxes for split combinations. In the
example tree, the root node features a split on palmitoleic, with child nodes splitting
on linoleic and oleic.
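The bookkeeping behind such a grid can be sketched as follows: each tree node lands in a cell indexed by its split variable and depth. The nested-dict tree encoding and the split values are hypothetical illustrations, not the actual fitted model.

```python
# Sketch of trace-plot bookkeeping: collect each node's cut point into a
# grid cell keyed by (split_variable, node_depth).

def trace_cells(node, depth=0, cells=None):
    cells = {} if cells is None else cells
    if "split" in node:
        var, t = node["split"]
        cells.setdefault((var, depth), []).append(t)  # cut point in this cell
        trace_cells(node["left"], depth + 1, cells)
        trace_cells(node["right"], depth + 1, cells)
    return cells

# Illustrative toy tree mirroring the example's structure (values made up).
tree = {"split": ("palmitoleic", 1020),
        "left": {"split": ("linoleic", 10535),
                 "left": {"leaf": 1}, "right": {"leaf": 2}},
        "right": {"split": ("oleic", 7895),
                  "left": {"leaf": 3}, "right": {"leaf": 4}}}

cells = trace_cells(tree)
```

Running several trees through the same `cells` dictionary would overlay their cut points on one grid, which is how a trace plot displays many bootstrapped models at once.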
The trace plot allows for the reconstruction of tree splits and their hierarchical
structure, eliminating ambiguity present in hierarchical views, as the order of child
nodes is fixed in the grid. An advantage of trace plots is the ability to display multiple
tree models on the same grid, as seen in a plot of 100 bootstrapped classification trees.
To avoid overplotting, semitransparent edges are used, making popular paths clearer.
The first split consistently uses palmitoleic, but subsequent splits show various
alternatives, indicating stable subgroups that can be reached in different ways. This
particular example shows some instability because it is a multiclass problem,
resulting in varying sequences of class separations.