DVT Unit-II

This document discusses the importance of data visualization through graph representations, emphasizing how graphs can reveal relationships within high-dimensional data sets. It covers various graph layout techniques, including force-directed methods and multidimensional scaling, to optimize visual representations of complex data. Additionally, it explores the use of bipartite graphs and hierarchical trees for organizing and interpreting categorical data effectively.


UNIT-II

Data Visualization Through Their Graph Representations

4.1 Introduction

The amount of data collected by organizations is constantly growing due to better data
collection methods, computerized transactions, and advancements in storage
technology. This growth results in high-dimensional data sets with many recorded
attributes. Large-scale information banks, like data warehouses, contain
interconnected data from various sources. New technologies, such as genomic and
proteomic technologies and sensor-based monitoring systems, contribute to these data
sets. To gain valuable insights, it is essential to visualize the data's structure and
identify patterns and relationships. This text focuses on visual exploration of data
through graphs, particularly addressing issues with categorical data and introducing a
mathematical framework for optimizing graph layouts.

4.2 Data and Graphs

Graphs are important tools as they show relationships between groups of objects.
They model complex systems, like computer networks, transportation systems, and
molecular structures, and help visualize relationships in social networks and database
diagrams. In statistics and data analysis, graphs appear as dendrograms, trees, and
path diagrams. They are also interesting in mathematics, with many studies focused
on their properties. Different visualizations can uncover hidden patterns and
relationships within the data.

The literature includes various sources on how to draw graphs and the problems
related to graph representation in different dimensions. A key interest is how data sets
can be represented through graphs, linking multivariate statistics and graph drawing.
Examples include the representation of a protein interaction network and the structure
of contingency tables and correlation matrices, where nodes represent categories or
variables, and lines indicate frequencies or correlations.
Another interesting data structure that can be represented successfully by a graph is that corresponding to a multivariate categorical data set, as the following example attests (Table 4.3). The data on 21 sleeping bags and their characteristics come from Prediger (1997) and have also been discussed in Michailidis and de Leeuw (2001).

4.3 Graph Layout Techniques

The problem of graph drawing or layout is important in various scientific fields. It involves arranging a set of connected nodes in space and determining how to draw the
lines (or curves) that connect them. Two main choices must be made: the selection of
space and the type of curves to use. Grid layouts place nodes at integer coordinates,
while hyperbolic layouts position them on a sphere. Most techniques use straight
lines, although some may utilize curved lines.

Layout algorithms often follow aesthetic rules, which help in creating visually
appealing graphs. Common rules include evenly distributing nodes and edges,
keeping edge lengths similar, and minimizing edge crossings. These rules can lead to
optimization problems, some of which are difficult to solve, like the edge-crossing
minimization, which is known to be NP-hard. Basic layouts are usually generated
quickly and then refined to meet these aesthetic criteria, particularly useful for large
graphs. Graph drawing systems such as Nicheworks, GVF, and H3Viewer use these
strategies, with tools like Cytoscape allowing for manual adjustments.

The general goal is to represent graph vertices as points and edges as lines. Two main
approaches exist for creating these drawings: the metric or embedding approach,
which focuses on the distances between vertices, and the adjacency model, which
emphasizes the relationships between adjacent vertices. This paper primarily
discusses the adjacency model and how to measure the quality of the resulting graph
layout.

-> Force-directed Techniques


The most useful graph-drawing techniques for data visualization are force-directed
techniques. These techniques use an analogy from physics, treating vertices as masses
that attract and repel each other with forces, resulting in an optimal graph layout in
equilibrium. An objective function reflecting this analogy is provided next:

The n × p matrix X includes the coordinates of n vertices in R^p, with d_ij(X) representing the distances between points x_i and x_j. The weights a_ij relate to the adjacency matrix A of graph G, while pushing weights B = {b_ij} can come from the adjacency matrix or external constraints. The objective function takes the form

Q(X) = Σ_{i<j} a_ij ϕ(d_ij(X)) − Σ_{i<j} b_ij ψ(d_ij(X)),

where the functions ϕ(·) and ψ(·) apply aesthetic considerations to the layout. A convex ϕ function amplifies large distances to highlight unique data features, while a concave one reduces the impact of isolated vertices. This framework supports both simple and weighted graphs and is used in a force-directed technique discussed by Di Battista et al. (1998).

The constraint term in the Lagrangian relates to the push component of Q(·). Examples of η(X) include η(X) = trace(X′X) or η(X) = det(X′X). Other options
involve ensuring orthonormality or fixing some Xs. This method allows integrating
the metric approach in graph drawing, aiming to minimize the difference between
graph-theoretic and Euclidean distances.

W = {w_ij} represents a set of weights. The distances δ_ij are based on graph G, and the transformation η is often the identity, square, or logarithm. Although ρ(d(X)) does not increase or reach zero, expanding the square shows it is like minimizing Q(X, W) with specified functions and parameters. All points gather, but those with large distances separate. The text then explores the metric or embedding approach and the pulling under constraints model, useful for visualizing data-derived graphs.

-> Multidimensional Scaling

The metric approach discussed is a version of multidimensional scaling (MDS), which approximates a set of distances in low-dimensional Euclidean space. The aim is to
find coordinates for points in R^p so that the distances between them align closely
with the given distances. The specific distances correspond to shortest path distances
on a graph. MDS as a graph-drawing technique includes other options beyond
Euclidean space. The least-squares-loss function, known as Stress, was introduced by
Kruskal in 1964.
The main focus of the text is on loss functions used in force-directed techniques and
the methods for minimizing them. It references several studies, including McGee
(1966) and Sammon (1969), which discuss different weight choices for loss functions
related to stretching elastic springs from one length to another. The minimization can
be achieved using iterative majorization algorithms or steepest descent methods, with
the latter being implemented in the GGobi visualization system.
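As a concrete sketch, the Stress loss and its minimization by plain steepest descent can be written as follows. This is an illustration only, under the simple weighted-least-squares form of Stress; it is not GGobi's actual implementation, and the step size and iteration count are arbitrary choices:

```python
import numpy as np

def stress(X, delta, w):
    """Kruskal-type stress: weighted squared differences between target
    distances delta[i, j] and layout distances d_ij(X), summed over pairs."""
    n = len(X)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])
            s += w[i, j] * (delta[i, j] - d) ** 2
    return s

def mds_steepest_descent(delta, w, p=2, steps=2000, lr=0.01, seed=0):
    """Minimize stress by gradient descent on the vertex coordinates."""
    rng = np.random.default_rng(seed)
    n = delta.shape[0]
    X = rng.normal(size=(n, p))
    for _ in range(steps):
        G = np.zeros_like(X)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                diff = X[i] - X[j]
                d = np.linalg.norm(diff) + 1e-12
                # Gradient of the (i, j) stress term with respect to X[i].
                G[i] += -2.0 * w[i, j] * (delta[i, j] - d) * diff / d
        X -= lr * G
    return X
```

For a graph, delta would hold shortest-path distances; iterative majorization would be a more robust minimizer than the raw gradient steps used here.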

An example of a two-dimensional MDS solution is provided using sleeping bag data, illustrating how the data is spread evenly to avoid edge crossings. The text then
discusses a recent application of MDS in cases where data exhibit nonlinearities. It
introduces the Isomap algorithm as a popular nonlinear embedding technique that
approximates local geometry.

The method involves creating a K-nearest neighbor graph for the data points and
calculating shortest path distances, followed by applying MDS. An example using
Swiss Roll data shows how the Isomap algorithm better captures the underlying
geometry than standard MDS, highlighting differences in how points are arranged
based on their progression on the roll.
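The three Isomap steps just described (k-NN graph, shortest-path distances, MDS) can be sketched as follows. This minimal illustration uses Floyd–Warshall for the shortest paths and classical (Torgerson) MDS; the original Isomap implementation differs in details:

```python
import numpy as np

def isomap(X, k=5, p=2):
    """Minimal Isomap sketch: k-NN graph -> shortest-path (geodesic)
    distances -> classical MDS embedding in R^p."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Symmetric k-nearest-neighbor graph; inf marks a missing edge.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:
            G[i, j] = G[j, i] = D[i, j]
    # Floyd-Warshall: shortest paths approximate geodesic distances.
    for m in range(n):
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    # Classical MDS on the geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:p]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```

On Swiss Roll-like data this recovers the ordering of points along the roll, which standard MDS on raw Euclidean distances does not.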

->The Pulling Under Constraints Model

In this model, the similarity of nodes is key. A simple graph considers only
connections, while a weighted graph gives more importance to edges with large
weights. The normalization constraint keeps points separate and prevents them from
collapsing to the origin. This model has been studied with various distance functions,
including squared Euclidean distances, which is significant for data visualization.
Some algebra shows that the objective function can be expressed in matrix algebra
form.
The graph Laplacian, expressed as L = D−A, involves a diagonal matrix D made from
the row sums of the adjacency matrix A. Minimizing a certain function leads to nodes
with many connections clustering together, while those with few connections stay on
the edges of the layout. In weighted graphs, stronger weights indicate closer bonds
between nodes, enhancing clustering. A helpful normalization constraint is X′DX =
Ip, making minimization easier and linking it to a generalized eigenvalue problem.
The solution is related to weighted Euclidean space. For example, in a protein
interactions network, proteins with few interactions are on the outskirts, whereas
central 'hub' proteins are located in the middle. Another example from the UCI
machine learning repository features handwritten numerals, where traditional methods
struggle with class separation. Additionally, large graphs from testing algorithms were
examined, illustrating structures with varying densities and holes through weighted
graphs based on nearest neighbors.
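A minimal sketch of this spectral layout, assuming a symmetric (weighted) adjacency matrix of a connected graph: minimizing the pulling objective subject to X′DX = I_p amounts to a generalized eigenproblem L x = λ D x, solved here by whitening with D^{-1/2}:

```python
import numpy as np

def laplacian_layout(A, p=2):
    """Pulling-under-constraints layout sketch: the p generalized
    eigenvectors of L = D - A (skipping the trivial constant one)
    give coordinates where highly connected nodes cluster centrally."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    # Whitening turns L x = lambda D x into a standard symmetric problem.
    Dh = np.diag(1.0 / np.sqrt(np.diag(D)))
    vals, vecs = np.linalg.eigh(Dh @ L @ Dh)
    # Skip the eigenvalue-0 constant eigenvector.
    idx = np.argsort(vals)[1:p + 1]
    return Dh @ vecs[:, idx]
```

On a star graph this places the hub at the origin with the leaves spread around it, matching the description of hub proteins sitting in the middle of the layout.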

-> Bipartite Graphs

The graph representation of a contingency table and categorical data has unique
features, where the node set can be divided into two subsets. For a contingency table,
one subset includes the categories of one variable, and the other subset includes the
categories of the second variable. Connections only exist between these two subsets.
This also applies to categorical data, where one subset relates to objects, like sleeping
bags, and the other to variable categories. This leads to bipartite graphs, and a
modification of the Q(ċ) objective function creates interesting graph layouts for these
data sets. The objective function for squared Euclidean distances can be expressed
using the coordinates of both subsets.

DY is a diagonal matrix with column sums of A, and DZ is a diagonal matrix with row sums of A. For a contingency table, DY and DZ show the marginal frequencies,
while for multivariate categorical data, DY shows univariate marginals, and DZ is a
multiple of the identity matrix. A modified normalization constraint leads to a
solution obtained through a block relaxation algorithm.

The optimal solution follows the centroid principle, which states that category points
are at the center of gravity of their objects. This graph-drawing solution is known in
multivariate analysis as correspondence analysis for contingency tables and multiple
correspondence analysis for categorical data. The graph layout of the sleeping bags
data set shows patterns: high-quality, expensive sleeping bags filled with down and
cheaper, low-quality ones filled with synthetic fibers, along with some intermediate
options. The centroid principle helps interpret the layout, which is less uniform than
one produced by MDS, capturing data features better. The distance function and
normalization significantly impact the graph's visual quality.
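The centroid principle lends itself to a one-line computation. In this hypothetical sketch, A is an objects-by-categories indicator matrix and Z holds the object coordinates; each category point is then the center of gravity of its objects:

```python
import numpy as np

def category_centroids(A, Z):
    """Centroid principle: category point = mean of the coordinates of
    the objects belonging to that category."""
    counts = A.sum(axis=0)            # number of objects per category
    return (A.T @ Z) / counts[:, None]
```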

Graph-theoretic Graphics
5.1 Introduction
This chapter will cover the uses of graphs for making graphs. The mixing of terms is
an unfortunate historical issue that combined graph-of-a-function with graph-of-
vertices-and-edges. Vertex-edge graphs are crucial for developing algorithms and are
also key for statistical graphics and visualizations.

The chapter addresses laying out graphs on a plane, using points for vertices and line
segments for edges. It adopts the grammar of graphics view, assuming that a graph
maps geometric forms to vertices and edges. Definitions of graph-theoretic terms are
provided and can be referred to later as a glossary.

5.2 Definitions

A graph is a set V together with a relation on V. We usually express this by saying that a graph G = (V, E) is a pair of sets, where V is a set of vertices (sometimes called nodes) and E is a set of edges (sometimes called arcs or links). An edge e = (u, v), with e ∈ E and u, v ∈ V, is a pair of vertices.
We usually assume the relation on V induced by E is symmetric; we call such
a graph undirected. If the pair of vertices in an edge is ordered, we call G a directed
graph, or digraph. We denote direction by saying, with respect to a node, that an edge
is incoming or outgoing.
A graph is weighted if each of its edges is associated with a real number. We
consider an unweighted graph to be equivalent to a weighted graph whose edges all
have a weight of 1.
A graph is complete if there exists an edge for every pair of vertices. A complete graph on n vertices has n(n − 1)/2 edges.
A loop is an edge with u = v. A simple graph is a graph with no loops. Two edges (u, v) and (s, t) are adjacent if u = s or u = t or v = s or v = t. Likewise, a vertex v is adjacent to an edge (u, v) or an edge (v, u).
A path is a list of successively adjacent, distinct edges. Let e_1, ..., e_k be a sequence of edges in a graph. This sequence is called a path if there are vertices v_0, ..., v_k such that e_i = (v_{i−1}, v_i) for i = 1, ..., k.
Two vertices u, v of a graph are called connected if there exists a path from vertex
u to vertex v. If every pair of vertices of the graph is connected, the graph is called
connected.
A path is cyclic if a node appears more than once in its corresponding list of edges.
A graph is cyclic if any path in the graph is cyclic. We often call a directed acyclic graph a DAG.
A topological sort of the vertices of a DAG is a sequence of distinct vertices (v_1, ..., v_n) such that for every pair of vertices v_i, v_j in this sequence, if (v_i, v_j) is an edge, then i < j.
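The topological-sort definition can be realized with Kahn's algorithm, sketched here as an illustration (not from the chapter): repeatedly emit a vertex with no remaining incoming edges.

```python
from collections import deque

def topological_sort(vertices, edges):
    """Kahn's algorithm. Returns an order where i < j whenever
    (v_i, v_j) is an edge; raises ValueError if the graph is cyclic."""
    indeg = {v: 0 for v in vertices}
    out = {v: [] for v in vertices}
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in vertices if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(vertices):
        raise ValueError("graph is cyclic: no topological sort exists")
    return order
```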
A linear graph is a graph based on a list of n vertices; its n − 1 edges connect vertices that are adjacent in the list. A linear graph has only one path.
Two graphs G1 = (V1, E1) and G2 = (V2, E2) are isomorphic if there exists a bijective mapping between the vertices in V1 and V2 such that there is an edge between two vertices of one graph if and only if there is an edge between the two corresponding vertices in the other graph.
A graph G1 = (V1, E1) is a subgraph of a graph G2 = (V2, E2) if V1 ⊆ V2 and E1 ⊆ E2.
-> Trees

Ultrametric Trees
5.3 Graph Drawing

A graph is considered embeddable on a surface if it can be drawn there without edges crossing except at the vertices. A graph is planar if it can be drawn in the plane (equivalently, on a sphere) without edge crossings.
Euler's theorem can show that a graph is not planar, but to prove a graph is planar, it
must be drawn without edge crossings. Drawing graphs is important, especially for
representing electrical circuits in the semiconductor industry and for modeling
metabolic pathways and transportation networks. The graph-drawing problem asks
how to create a layout for a planar graph or minimize edge crossings for non-planar
graphs. Different graphs need different algorithms for layouts, starting with trees, then
networks and directed cyclic graphs. Input data usually includes lists of vertices and
edges.

-> Hierarchical Trees

Suppose we have a recursive list of single parents and their children. Each child has
one parent, and each parent has one or more children. One node, the root, has no
parent.

A common example of a list is the directory structure of a hierarchical file system, displayed using a tree browser. Creating this display is simple by starting at the root
and indenting children relative to their parents. The primitive vertical layout is
effective compared to more complex designs.
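The indented tree-browser display described above can be sketched in a few lines; the `parent_of` mapping and the directory names in the usage below are illustrative assumptions:

```python
def print_tree(parent_of, root, indent=0, out=None):
    """Render a child -> parent hierarchy as an indented listing,
    the way a file-system tree browser does."""
    lines = [] if out is None else out
    lines.append("  " * indent + root)
    children = [c for c, p in parent_of.items() if p == root]
    for child in sorted(children):
        print_tree(parent_of, child, indent + 1, lines)
    return lines
```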

To layout a tree from a list of edges, we need to identify parent-child relationships, locate leaves, and assign layer values based on paths to leaves. Then, we group children by parent and align them above the middle child. If nodes are ordered by an external variable, they can be placed on a scale instead of using parentage for ordering.
Data includes FBI-reported murder rates for US states in 1970, analyzed through a single-linkage cluster analysis that arranged leaves by murder rates, creating a tree structure. This example is notable because clustering typically involves multiple variables, but here it applies to a single variable. Hierarchical clustering reveals dense observation intervals, showing clusters of southern and midwestern states with similar murder rates. The mode tree is another way to represent one-dimensional data.
Moreover, sorting the tree's leaves by murder values results in a topological sort.
Larger hierarchical trees may be complicated in rectangular layouts, so polar
coordinates or circular layouts are preferred for clarity.

Classification and regression trees organize a set of objects hierarchically. Wilkinson (1999) created a tree display called a mobile, which shows data about bank employees
with dot histograms at each node. This model emphasizes the importance of splits but
is not efficient in space and doesn't work well in a polar arrangement.

A directed geometric tree has one root and many children, representing flow from the
root to the leaves. Examples include water and migration flows. Phan et al. (2005)
provide algorithms for rendering a flow tree using geographic data. An example with
Colorado migration data from 1995 to 2000 shows edge merging for smooth, distinct
flows.

->Spanning Trees

We can layout a spanning tree effectively by using Euclidean distance to approximate graph-theoretic distance. This method keeps adjacent vertices close and separates
distant ones. The springs algorithm, a version of multidimensional scaling, uses
springs to represent edges and a loss function for total energy, reducing it through
steepest descent iterations.

Laying out a Simple Tree

Figure 5.9 (Wilkinson 2005) illustrates a small website's data, where each page is a node and the links between them are branches. The thickness of these branches shows traffic between the pages, with the root node near the center due to a force-directed algorithm that attracts adjacent nodes and repels non-adjacent ones. A companion figure depicts a plant model where branches should be short for better water distribution, and leaves should be spread out for sunlight exposure, resembling the website layout in Figure 5.9.

Laying out Large Trees


Laying out large spanning trees presents challenges, as they can fill the display area
and the springs algorithm is costly. Graham Wills developed a hexagon layout to
address these issues.

Additive Trees
Additive trees need complex calculations. We have a distance matrix for n objects and
must create a spanning tree that closely matches the original distances. The article
mentions that edge angles are not important; the edges are arranged for easy path
tracing.

-> Networks
Networks are generally cyclic graphs. Force-directed layout methods often work well
on networks. The springs algorithm doesn't require a graph to be a tree. Subjects
were asked to produce a list of animal names. Names found adjacent in subjects’
lists were considered adjacent in a graph.

->Directed Graphs
Directed graphs are arranged with source nodes on top and sink nodes at the bottom.
To lay out a directed graph, a topological sort is needed. Cyclical edges are
temporarily inverted to create a directed acyclic graph (DAG) for identifying paths to
sink nodes. A topological sort produces a linear order of the DAG, with vertex u
above vertex v for each edge. Reducing edge crossings is difficult and involves
maximizing Kendall’s τ correlation between layers. Heuristic methods include direct
search, simulated annealing, or constrained optimization. Figure 5.14 shows the evolution of the UNIX operating system as computed by a graph layout program from AT&T.

->Treemaps
Treemaps are a way to divide a space into smaller parts. The easiest example is a
nested rectangular layout. To create a rectangular treemap from a binary tree, we start
at the tree's root and split a rectangle vertically. Each part represents one of the root's
children. We then split those parts horizontally and keep doing this until all tree nodes
are represented. We can also color the rectangles based on weights or resize them
according to those weights. An example shows this using color and size to visualize
news sources.
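The recursive splitting just described is the classic slice-and-dice scheme. A sketch follows; the `(name, weight, children)` node format is an assumption of this example, not the text's:

```python
def treemap(node, rect, depth=0):
    """Slice-and-dice treemap: split the rectangle among the children
    in proportion to their weights, alternating vertical and horizontal
    cuts by depth. A node is a (name, weight, children) triple; rect is
    (x, y, width, height). Returns a dict of name -> rectangle."""
    name, weight, children = node
    rects = {name: rect}
    if not children:
        return rects
    x, y, w, h = rect
    total = sum(c[1] for c in children)
    offset = 0.0
    for child in children:
        frac = child[1] / total
        if depth % 2 == 0:      # vertical cut: split along x
            sub = (x + offset * w, y, frac * w, h)
        else:                   # horizontal cut: split along y
            sub = (x, y + offset * h, w, frac * h)
        offset += frac
        rects.update(treemap(child, sub, depth + 1))
    return rects
```

Coloring or re-weighting the rectangles, as in the news-source example, would be applied on top of the rectangles this returns.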

5.4 Geometric Graphs

Geometric graphs are important for data mining and analysis due to their ability to
describe sets of points in a space. We will use some of these graphs for visual
analytics in the next section. This section includes examples using data from the Box–
Jenkins airline dataset.

Many geometric graphs have been created to show the "shape" of a set of points X on
a plane. Most of these are proximity graphs, which have edges based on an indicator
function determined by distances between points in a metric space. To define this
function, an open disk D is used. D touches a point if the point is on its boundary and
contains a point if it is inside D. The smallest disk touching two points is D₂, with a
radius of half the distance between them, and its center is halfway between the two
points. An open disk of fixed radius is called D(r), and one of fixed radius centered on
a point is D(p,r).

->Disk Exclusion
Several proximity graphs exist when pairs of points have empty disks.

Delaunay Triangulation
In a Delaunay graph, an edge connects any two points that can be touched by an open
disk containing no other points. The Delaunay triangulation and its dual, the Voronoi
tessellation, are useful for describing point distributions. Although they can be
generalized to higher dimensions, they are mostly used in two dimensions. There are
several proximity graphs that are subsets of the Delaunay triangulation.

Convex Hull

A polygon is a closed shape with n vertices and n edges (its sides). Its boundary can be shown as a geometric graph with vertices as polygon points and edges as its sides.
The hull of a set of points X in 2D space is a group of one or more polygons that
include some points from X as vertices and contain all points in X. A polygon is
convex if it includes all straight lines between any two points inside it. The convex
hull of X is the smallest convex shape that contains X. There are various algorithms to
find the convex hull, which is related to the outer edges of the Delaunay triangulation,
allowing computation in O(n log n) time.
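One standard O(n log n) convex-hull algorithm is Andrew's monotone chain, sketched here as an illustration (the chapter does not prescribe a particular algorithm):

```python
def convex_hull(points):
    """Andrew's monotone chain: convex hull of 2D points in O(n log n),
    returned in counterclockwise order without collinear points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b turns counterclockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]
```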

Nonconvex Hull

A nonconvex hull is a type of hull that isn't convex. It includes simple shapes like a
star convex or monotone convex hull, as well as complex shapes, space-filling
objects, and those with separate parts. We consider the hull of these shapes as the
outer edges of their complexes. In an alpha-shape graph, an edge connects two points if an open disk D(α) containing no other points can touch them.

Complexes
There are several important subgraphs of the Delaunay triangulation used for understanding point density, shape, and more.
any points in their D2 area. A relative neighborhood graph connects points if their
lune region has no points. A beta skeleton graph is a mix of both, with size
determined by a parameter β. A minimum spanning tree is a part of a Gabriel graph.

->Disk Inclusion
Several proximity graphs are defined by disk inclusion rather than exclusion. Edges in these graphs exist when disks of given radii contain pairs of points, and such graphs are not usually subsets of the Delaunay triangulation. In a k-nearest-neighbor graph (KNN), a directed edge runs from point p to point q if the distance from p to q is among the k smallest distances from p to any other point.
Applications often simplify KNN by removing self-loops and edge weights. If k = 1, it
is a subset of the MST. A distance graph connects points within a defined radius,
while a sphere-of-influence graph connects points based on nearest-neighbor
distances.
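A KNN graph of the kind described can be sketched directly from a pairwise distance matrix (illustrative only; real applications would use a spatial index rather than the full matrix):

```python
import numpy as np

def knn_graph(X, k):
    """Directed k-nearest-neighbor graph: an edge (p, q) whenever q is
    among the k closest points to p, self-loops excluded."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    edges = set()
    for i in range(len(X)):
        # argsort puts point i itself first (distance 0); skip it.
        for j in np.argsort(D[i])[1:k + 1]:
            edges.add((i, int(j)))
    return edges
```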
5.5 Graph-theoretic Analytics
Some graph-analysis methods are suitable for visualization.

-> Scagnostics

A scatterplot matrix, also known as a SPLOM, is a square arrangement of scatterplots that show the relationships between pairs of variables. Each off-diagonal cell in the matrix represents a scatterplot instead of a single number. This concept was introduced by John Hartigan in 1975 and became well known through the work of the Tukeys and colleagues at Bell Laboratories.

When there are many variables, scatterplot matrices can become difficult to use. The clarity of the display diminishes, and finding patterns becomes impractical for more than 25 variables due to the large number of scatterplots. To address this issue, the Tukeys developed a method that reduces the complexity of visual analysis by using fewer measures that describe the distributions of the 2D scatterplots. These measures include the area and perimeter of convex hulls, kernel density contours, and other statistics.

After calculating these measures, the Tukeys created a new scatterplot matrix using them. This
new matrix helped identify unusual patterns in the original data. Wilkinson and
colleagues later improved this method by using proximity graphs, increasing the
efficiency and allowing it to handle different types of variables. They defined nine
scagnostics measures to further analyze the data. An example showed how the
Outlying measure flagged significant cases and described their characteristics in the
scatterplots.

-> Sequence Analysis

A sequence is a list of objects where the order is defined by a relation. In sequence analysis, objects are often shown as tokens, and sequences as strings of tokens. For
example, in biosequencing, the letters A, C, T, and G represent the four bases in
DNA. To find the most frequently occurring substrings of a given length m in a string
of length n, a basic algorithm generates candidate substrings and checks them against
the target string. Starting with substrings of length one, the algorithm builds longer
subsequences and counts their frequency. This process continues until substrings of
length m are tested or all counts are zero. Visualizing this analysis with a graph can
simplify understanding these subsequences.
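The level-wise candidate-generation algorithm described above can be sketched as follows (an illustrative reconstruction, not the text's code): start from single characters, extend only substrings that actually occur, and stop at length m.

```python
def frequent_substrings(s, m):
    """Count occurrences of length-m substrings of s by level-wise
    candidate generation: extend each surviving substring by one
    character at a time, dropping candidates with zero count."""
    alphabet = sorted(set(s))
    current = {c: s.count(c) for c in alphabet}
    for _ in range(m - 1):
        nxt = {}
        for sub in current:
            for c in alphabet:
                cand = sub + c
                # Count (possibly overlapping) occurrences of cand in s.
                n = sum(1 for i in range(len(s) - len(cand) + 1)
                        if s.startswith(cand, i))
                if n > 0:
                    nxt[cand] = n
        current = nxt
        if not current:
            break
    return current
```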

Comparing Sequences
Suppose we have two sequences of characters or objects and we wish to compare
them. If the sequences are of length n, we can create an n by n table of zeroes, placing
a 1 in a diagonal cell if the values match at that position. An identity matrix indicates
identical sequences, plotted as a square array of pixels. With real data, matching runs
of subsequences are often found off the diagonal. Figure 5.29 shows how
subsequences appear as diagonal runs, with longer bars indicating longer matching
subsequences.
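The match-matrix construction is easily sketched; diagonal runs of ones mark matching subsequences, and two identical sequences produce the identity pattern on the main diagonal:

```python
def dot_plot(a, b):
    """Sequence-comparison matrix: cell (i, j) is 1 when a[i] == b[j]."""
    return [[1 if x == y else 0 for y in b] for x in a]
```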

Critical Paths
Suppose we have a directed acyclic graph (DAG) where the vertices represent tasks
and an edge (u, v) means task u must be completed before task v. How do we
schedule tasks to minimize overall completion time? This job-scheduling problem has
many variants. One common variant weights the edges by the time needed to
complete tasks. We will discuss two aspects involving graphing: first, how to layout a
graph for the project by flipping it to a horizontal orientation, resulting in a CPM
(critical path method) graph. Second, how to identify and color the critical path,
which is easy without weighted edges through a breadth-first search. Finding the
shortest path in a weighted graph requires dynamic programming. Large project graph
layouts can be messy, so an alternative is a Gantt chart, which shows time on the
horizontal axis and task duration as bar lengths. Modern Gantt charts combine these
tasks with critical path information. Most computer project management packages
compute the critical path using graph algorithms and display them in Gantt charts.
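With weighted edges, the critical path is the longest path in the DAG, found by dynamic programming over a topological order. A sketch follows; the task names and durations in the usage are made up for illustration:

```python
def critical_path(vertices, edges):
    """Longest (critical) path in a DAG whose edges carry durations:
    process vertices in topological order, keeping each vertex's best
    predecessor. edges are (u, v, weight). Returns (length, path)."""
    out = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, v, w in edges:
        out[u].append((v, w))
        indeg[v] += 1
    # Topological order via Kahn's algorithm.
    order, stack = [], [v for v in vertices if indeg[v] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, _ in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    dist = {v: 0 for v in vertices}
    prev = {v: None for v in vertices}
    for u in order:
        for v, w in out[u]:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                prev[v] = u
    end = max(dist, key=dist.get)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = prev[node]
    return dist[end], path[::-1]
```

The vertices on the returned path are the ones a CPM chart would color as critical.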

->Graph Matching
Given two graphs, we can determine if they are isomorphic and identify isomorphic
subgraphs or calculate an overall measure of concordance if they are not. This topic is
important in biology, chemistry, image processing, computer vision, and search
engines. Graph matching helps in searching databases for specific graphs and
provides a way to index large databases of different materials, focusing on matching
2D geometric graphs in this chapter.

Exact Graph Matching


Exact graph matching involves finding isomorphisms between two graphs, requiring a
corresponding vertex and edge in both graphs. If both graphs are connected, matching
the edges is enough to prove isomorphism. This problem has polynomial complexity.
However, finding isomorphisms with vertex relabeling has unknown complexity, but
for planar graphs, it is linear time, as shown by Hopcroft and Wong. Siena and Shasha
et al. further explore this and review graph matching software.

Approximate Graph Matching


Approximate graph matching aims to find the best agreement between two graphs
through relabeling. Numerous indices have been developed for this purpose. Earlier
methods focused on simple graph-theoretic measures, leading to distance or
correlation coefficients, like the cophenetic correlation, which compares distances in a
hierarchical clustering tree. This method creates a matching index based on
ultrametric distances from different trees.

Recent techniques explore varied measures for concordance. Notably, the Google
search engine uses a graph spectral measure for similarity assessment. In shape
recognition, proximity graphs built from polygons have been applied. Klein et al.
(2001) introduced a method to match medial axis graphs using edit distance, which
counts the operations needed to convert one graph into another, allowing the accurate
characterization of 2D shapes from 3D shapes. Torsello (2004) expanded on these
ideas. Proximity graphs can address shape-recognition challenges using edit distance
or similarity measurements, as demonstrated by Gandhi (2002) with leaf shape
analysis through turning angles and dynamic time warping.
High-dimensional Data Visualization
6.1 Introduction
“If all you have is a hammer, every problem looks like a nail.” This saying applies to the use of graphics as well. An expert in grand tours will likely include a categorical variable
in high-dimensional scatterplots, while a mosaic plot expert will fit a data problem
into a categorical framework. This chapter focuses on using different plots for high-
dimensional data analysis, highlighting their strengths and weaknesses.

Data visualization serves two main purposes:

1. Exploration: During this phase, analysts use various graphics, often unsuitable for
presentations, to uncover important features. The need for interaction is high, and
plots should be created quickly, allowing for instant modifications.

2. Presentation: After exploring key findings, these must be presented to a wider audience. Presentation graphics are usually not interactive and must be suitable for
print. Some high-dimensional graphics are complex and may not be easily understood
by those without statistical training.
Interactivity is a key factor distinguishing exploratory from presentation graphics.
Interactive linked highlighting helps convey multivariate contexts effectively. The
chapter emphasizes the importance of reproduction quality, noting that printed
graphics may lack clarity in black and white, with better versions available online.

6.2 Mosaic Plots

Mosaic plots require significant training for data analysts but are highly versatile
when fully utilized. This section will discuss their typical uses and trellis displays.

-> Associations in High-dimensional Data

Meyer et al. introduced techniques for visualizing association structures of categorical


variables using mosaic plots. In high-dimensional problems, interactions are often
more complex. While statistical theory assumes equal importance of variables, real
problems may differ. Proper variable order in mosaic plots can reflect their different
roles.

Example: Detergent data


For an illustration of mosaicplots and their applications, we chose to look at the
4-D problem of the detergent data set (cf. Cox and Snell). In this data set we look at
the following four variables:

1. Water softness (soft, medium, hard)
2. Temperature (low, high)
3. M-user (person used brand M before study) (yes, no)
4. Preference (brand person prefers after test) (X, M)

The study aims to determine if a person's brand choice affects their detergent
preference. It shows that the interaction between user and preference is significant,
while two other variables, Water Softness and Temperature, indicate that harder water
requires warmer temperatures for effective washing with a fixed detergent amount.

Mosaic plots facilitate the examination of how user and preference interact, especially
for different combinations of Water Softness and Temperature. Recommendations for
creating effective high-dimensional mosaic plots include placing the key interaction in
the last two positions and the conditioning variables first. To avoid clutter, variables
with fewer categories should be listed first. If there are empty cell combinations, it's
advised to position the variables causing emptiness higher in the plot. Highlighting
the last binary variable can reduce cell numbers, and interactive mosaic plots allow
for changing displayed variables to reveal potential interactions more clearly.
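The nested subdivision behind a two-variable mosaic plot can be sketched as follows (a minimal illustration with a hypothetical helper, not the layout code of any particular package): the first variable splits the x-axis by its marginal proportions, and the second splits each strip vertically by the conditional proportions.

```python
def mosaic_rects(table):
    """Unit-square rectangles for a 2-D mosaic plot.
    `table` maps (row_cat, col_cat) -> count. Row categories split the
    x-axis by marginal proportions; within each strip, column categories
    split the y-axis by conditional proportions.
    Returns {(row, col): (x, y, width, height)}."""
    rows = sorted({r for r, _ in table})
    cols = sorted({c for _, c in table})
    total = sum(table.values())
    rects, x = {}, 0.0
    for r in rows:
        row_sum = sum(table.get((r, c), 0) for c in cols)
        w = row_sum / total
        y = 0.0
        for c in cols:
            h = table.get((r, c), 0) / row_sum if row_sum else 0.0
            rects[(r, c)] = (x, y, w, h)
            y += h
        x += w
    return rects
```

Higher-dimensional mosaic plots apply the same alternating subdivision recursively, which is why variable order matters so much.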

-> Response Models


In many data sets, there is one main outcome that is categorical and several
influencing factors that are also categorical. The best way to show this is with a
mosaic plot for the influencing factors and a bar chart for the outcome. This example
is illustrated using Caesarean data, which includes three factors: Antibiotics, Risk
Factor, and Planned, with the dependent outcome being Infection for 251 cases. The
goal is to identify which factors may lead to a higher infection rate. A specific case
where no caesarean was planned, a risk factor was present, and no antibiotics were
given shows that 23 out of 26 cases resulted in infection, which is nearly 88.5%.

The infection risk is highest for cases with risk factors when antibiotics are not given.
Planned caesareans lower the infection risk by about half, and there were no
infections in unplanned caesareans without risk factors and antibiotics, though at least
three were expected. Figures like these make such results easier to explore than
classical models do, but the findings should still be checked for statistical significance.

-> Models

Meyer et al. (2008, Chapter III.12) describe a way to show association models using
mosaicplots. Instead of just looking at observed values in log-linear models, expected
values can also be plotted, allowing visibility for empty cells in the modeled data.
Mosaicplots can visualize any continuous variable for categorical data crossings. The
text mentions interactions such as Water Softness and Temperature, and the M-user
and Preference. It also notes that certain models are hard to interpret for
nonstatisticians; log-linear models are discussed in more detail in Theus and Lauer
(1999).
6.3 Trellis Displays
Trellis displays plot high-dimensional data using a grid-like structure based on certain
subgroups.

-> Definition
Trellis displays were created by Becker et al. in 1996 to visualize multivariate data.
They use a lattice-like setup to organize plots into panels, where each plot depends on
at least one other variable. To make comparisons easier across rows and columns, all
panel plots use the same scales.

A basic example of a trellis display is a boxplot that shows the gas mileage of cars
based on car type. This setup allows for easy comparison among different car types
since the scale remains consistent. Additional variables, especially binary ones, can be
included using highlighting. A trellis display can feature up to seven variables at once,
with five being categorical and two continuous. The panel plot at the center can be
almost any statistical graphic, though scatterplots are the most common choice. Up to three
categorical variables can dictate the rows, columns, and pages of the display, and each
panel is labeled to show its corresponding category.

Trellis displays also use shingling, which divides a continuous variable into
overlapping intervals, turning it into a discrete variable. This method differs from
using categorical variables. Although it offers benefits, reading the information from
the strip labels can be difficult. For example, one trellis display may show a boxplot
based on car type and gas mileage, while another could be a scatterplot comparing
MPG and weight, also showing car type and drive as conditioning variables.
However, a common issue in trellis displays is having empty panels or those with
very few observations.
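Shingling can be sketched in a few lines (a simplified version using equal-length intervals; real trellis software typically shingles by roughly equal counts, and the function name here is our own):

```python
def shingle(values, k, overlap=0.5):
    """Split the range of a continuous variable into k equal-length
    intervals overlapping by the given fraction, as in trellis shingling.
    Returns a list of (lo, hi) intervals covering [min, max]."""
    lo, hi = min(values), max(values)
    # choose interval length L so that k intervals with the given
    # overlap exactly span the range: span = k*L - (k-1)*overlap*L
    length = (hi - lo) / (k - (k - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(k)]
```

Because the intervals overlap, a case can appear in more than one panel, which is exactly what distinguishes a shingle from an ordinary categorical conditioning variable.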

-> Trellis Display vs. Mosaic Plots


Trellis displays and mosaicplots are quite different, as shown when comparing Figs.
6.1 and 6.6. The panel plot is not a 2-D mosaicplot, making it hard to compare.
Current trellis displays in R do not allow for mosaicplots as panel plots, and the
interaction structure in Fig. 6.6 is harder to see than in a mosaicplot. A mosaicplot
shows independence with straight gaps between categories, while independence in a
trellis display means that the pairs of levels of two variables show identical barcharts,
differing only by a scaling factor. Comparing ratios becomes challenging with more
variables or large differences in cell counts. Variations of mosaicplots exist, like same
bin size and multiple barcharts, which use an equal-sized grid. Flexible mosaicplot
implementations can be found in Mondrian and MANET.

-> Trellis Displays and Interactivity

The conditional framework in a trellis display acts like still images of interactive
statistical graphics. Each panel in a trellis display represents a specific part of the data
for a subgroup. An example can be seen with the cars data set. Interactions involve
selecting a subgroup in a barchart or mosaic plot and "brushing," which is moving an
indicator along one or two axes of a plot. Brushing helps select an interval of a
variable and can divide a continuous variable into multiple intervals. This technique
shows flexibility compared to the static view of a trellis display, which is easier to
print.

-> Visualization of Models


The main advantage of trellis displays is that they allow for consistent comparison
among all plot panels. This makes it easier to analyze how well a model fits different
data conditions. Trellis displays are especially useful for model diagnostics because
they help identify when a model works well and when it does not. Each panel can
show fitted curves or confidence intervals for specific subgroups. However, a
challenge is that it can be difficult to determine the number of cases represented in
each panel. Having confidence bands for smoother plots would help assess variability
across the panels.

-> Wrap-up
Trellis displays are best for continuous axis variables, categorical conditioning
variables, and categorical adjunct variables. While shingling can be used sometimes,
it is usually better to avoid it for clarity. Trellis displays are easy to learn and support
static reproduction, but interactive graphics offer more flexibility for exploratory data
analysis. They allow for linking to other plots but lack a global overview.

6.4 Parallel Coordinate Plots

Parallel coordinate plots, introduced by Inselberg (1985), allow analysis of many


variables simultaneously by using parallel coordinate axes, as explained by Inselberg
(2008) and Wegman (1990).

-> Geometrical Aspects vs. Data Analysis Aspects

This section will explore the main use of parallel coordinate plots in data analysis
applications. The key aspects include investigating groups/clusters, outliers, and
structures across many variables simultaneously. Three main uses in exploratory data
analysis can be identified.

- Overview
No other statistical graphic can plot so much information (cases and variables) at a
time. Thus, parallel coordinate plots are an ideal tool to get a first overview of a data
set. Figure 6.11 shows a parallel coordinate plot of almost 400 cars with 10 variables.
All axes have been scaled to min-max. Several features, like a few very expensive
cars, three very fuel-efficient cars, and the negative correlation between car size and
gas mileage, are immediately apparent.

- Profiles
Parallel coordinate plots can highlight the profile of a single case, not just for one case
but also for entire groups to compare with other data. They are especially useful when
the axes are ordered, like time. Figure 6.12 shows the highlighted profile of the most
fuel-efficient car.
- Monitor

When working on subsets of a data set, parallel coordinate plots can connect features
of a specific subset to the whole data set. For example, they can help identify major
axes in multidimensional scaling (MDS). The leftmost cases in MDS are hybrid cars
with high gas mileage, while the top right are heavy cars like pickups and SUVs.
Similar findings could also be achieved with biplots.

-> Limits
Parallel coordinates are often seen as overvalued for understanding multiple features
in a data set. Scatterplots are better for examining 2-D features, but scatterplot
matrices (SPLOMs) require much more space to display the same information as
parallel coordinate plots (PCPs). While PCPs do not typically help in identifying
multivariate outliers, they are very beneficial for interpreting results from multivariate
procedures like outlier detection, clustering, or classification.

PCPs can manage many variables, but they struggle when plotting more than a few
hundred lines due to overplotting. This issue is more significant in PCPs because they
use only one dimension for plotting, unlike scatterplots which utilize points. One way
to address overplotting is through the use of α-blending, which adjusts the opacity of
plotted lines to improve visibility in densely populated areas.

Figures demonstrate how α-blending enhances readability. The “Pollen” data set, for
instance, highlights a hidden word when using a lower alpha value. In another
example featuring real data on olive oil fatty acids, varying α-values can reveal the
group structure of regions. It's important to experiment with different α-blending
settings for optimal results, as its effectiveness varies based on the rendering system
and plot size.
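The effect of α-blending can be quantified with a back-of-the-envelope model (assuming standard "over" compositing; as noted above, actual rendering systems differ): n overlapping lines of opacity α accumulate to 1 - (1 - α)^n, so α can be chosen so that a target number of overlaps reaches near-full ink.

```python
def overplot_opacity(alpha, n):
    """Cumulative opacity where n lines of opacity alpha are drawn
    on top of each other under 'over' compositing."""
    return 1.0 - (1.0 - alpha) ** n

def alpha_for(n_overlaps, target=0.95):
    """Choose a per-line opacity so that n_overlaps overlapping
    lines reach the target cumulative opacity."""
    return 1.0 - (1.0 - target) ** (1.0 / n_overlaps)
```

With α = 0.1, a single line is faint, but fifty coincident lines are essentially opaque, which is why dense bundles stand out while isolated outliers fade.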

-> Sorting and Scaling Issues


Parallel coordinate plots are useful for ordered variables like time or those with a
common scale. Scaling and sorting are crucial for effective data exploration.

Sorting

Sorting in parallel coordinate plots is important for understanding the data, as patterns
are often found among neighboring variables. In a plot of k variables, only k - 1 of
the k(k - 1)/2 possible adjacencies can be analyzed without changing the order. The
default order usually reflects the sequence in the data file, which can be arbitrary.
Only about k/2 suitably chosen arrangements are needed to view all adjacencies
(Wegman, 1990). When variables share the same scale,
sorting them by criteria such as statistics or multivariate results can help clarify the
plot. For larger data sets, sorting options should be available both manually and
automatically.
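A Wegman-style construction of axis orderings that together show every pairwise adjacency can be sketched as follows (our own illustration of the idea; variables are labeled 1..k): a zigzag base ordering is cyclically shifted so that every pair of variables becomes adjacent in some ordering.

```python
def wegman_orders(k):
    """Generate ceil(k/2) axis orderings of k variables such that every
    pair of variables is adjacent in at least one ordering."""
    # zigzag base ordering: 1, k, 2, k-1, 3, ...
    front, back = 1, k
    base = []
    while front <= back:
        base.append(front)
        if front != back:
            base.append(back)
        front += 1
        back -= 1
    # cyclic shifts of the base give the remaining orderings
    orders = []
    for j in range((k + 1) // 2):
        orders.append([(x - 1 + j) % k + 1 for x in base])
    return orders
```

For k = 4 this yields the two orderings [1, 4, 2, 3] and [2, 1, 3, 4], whose adjacent pairs together cover all six variable pairs.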

Scalings
Besides the standard way of plotting values across each axis from the minimum to
maximum of the variable, there are other scaling methods that can be helpful. The key
option is whether to scale the axes individually or to use a common scale for all.
Other scaling methods determine how the values are aligned, such as at the mean,
median, a specific case, or a specific value. For individual scales, using a 3σ scaling is
often effective for fitting the data into the plot area.
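These scaling options can be sketched in a few lines (an illustration with a hypothetical helper name, covering per-axis min-max, a common scale, and median alignment; 3σ scaling would simply replace the min-max limits with mean ± 3σ):

```python
def scale_axes(data, mode="minmax"):
    """Rescale the columns of `data` (a list of rows) to [0, 1] for a
    parallel coordinate plot. Returns one list of scaled values per axis.
    Modes: per-axis 'minmax', 'common' (one scale for all axes), or
    'median' (axes aligned at their medians)."""
    cols = [list(c) for c in zip(*data)]
    if mode == "common":
        lo = min(min(c) for c in cols)
        hi = max(max(c) for c in cols)
        return [[(v - lo) / (hi - lo) for v in c] for c in cols]
    if mode == "median":
        # center each axis at its median, then share one half-range
        meds = [sorted(c)[len(c) // 2] for c in cols]
        centered = [[v - m for v in c] for c, m in zip(cols, meds)]
        half = max(abs(v) for c in centered for v in c)
        return [[0.5 + v / (2 * half) for v in c] for c in centered]
    # default: per-axis min-max
    out = []
    for c in cols:
        lo, hi = min(c), max(c)
        out.append([(v - lo) / (hi - lo) for v in c])
    return out
```

The choice of mode changes what is comparable: per-axis scaling maximizes resolution on each axis, a common scale makes levels comparable across axes, and median alignment emphasizes deviations from the typical case.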

Consider a parallel coordinate plot of individual stage times for 155 cyclists from the
2005 Tour de France. The upper plot displays individual scales, while the middle plot shows a
common scale, allowing for better comparability of times, though the spread is less
visible during certain stages. The lower plot aligns the axes at their median, making it
easier to see overall performance, particularly the time of the peloton.

For a comprehensive view of the race, examining cumulative times is more beneficial.
Plotted on individual scales, cumulative times highlight the varying resolution of the
data. A common scale is preferable, but it must be aligned at the median to illustrate
how different stages influenced overall performance. The impact of mountain stages
is evident, with the plot showing how cyclists from the "Discovery Channel" team
varied throughout the race.

A further comparison places these developments alongside two profile plots showing the
cumulative category of stage difficulty and the average speed of the stage winner,
revealing their negative correlation.

-> Wrap-up

Parallel coordinate plots need features like α-blending and scaling to be useful.
Examples in this chapter show how these additions provide insights into high-
dimensional data. Highlighting subgroups helps understand group structures and
outliers.

6.5 Projection Pursuit and the Grand Tour

The grand tour is an interactive technique for visualizing high-dimensional data


through continuous projections. It creates a series of d-dimensional projections from
p-dimensional data, where time typically represents the parameter. For a 3-D rotating
plot, p equals 3, and d equals 2. Unlike classical rotations, the grand tour uses
randomly selected projections, resulting in a smooth pseudorotation that reveals data
structures like groups, gaps, and dependencies. While scatterplots commonly
represent projections, other types like histograms can also be used. Projection pursuit
enhances this process by selecting new projection planes based on optimizing indices
that highlight specific features in the data.
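One step of this process — drawing a random target plane and projecting the data onto it — can be sketched as follows (a bare-bones illustration with our own helper names; a real grand tour additionally interpolates smoothly between successive planes):

```python
import math
import random

def random_plane(p, rng=None):
    """A random 2-D projection plane in R^p: two Gaussian vectors
    made orthonormal by Gram-Schmidt."""
    rng = rng or random.Random(0)
    u = [rng.gauss(0, 1) for _ in range(p)]
    v = [rng.gauss(0, 1) for _ in range(p)]
    nu = math.sqrt(sum(x * x for x in u))
    u = [x / nu for x in u]
    duv = sum(a * b for a, b in zip(u, v))
    v = [b - duv * a for a, b in zip(u, v)]  # remove component along u
    nv = math.sqrt(sum(x * x for x in v))
    v = [x / nv for x in v]
    return u, v

def project(rows, plane):
    """Project p-dimensional rows onto the plane, giving 2-D points."""
    u, v = plane
    return [(sum(a * b for a, b in zip(r, u)),
             sum(a * b for a, b in zip(r, v))) for r in rows]
```

Animating through a sequence of such planes, with interpolation in between, produces the smooth pseudorotation described above.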

-> Grand Tour vs. Parallel Coordinate Plots

The grand tour is an advanced tool for exploring data, allowing for nonorthogonal
projections that can reveal details missed by traditional methods. An example using
the cars data set shows different projections colored by the number of cylinders.
Unlike parallel coordinate plots, which display univariate distributions and some
bivariate relationships, the grand tour focuses only on multivariate features, often
showing minimal results unless significant patterns are present. While it can illustrate
the relationships of variables beyond three dimensions, examples of structures in over
five dimensions are uncommon. There are few flexible implementations of the grand
tour, making it challenging to apply these methods effectively.

Multivariate Data Glyphs: Principles and Practice

7.1 Introduction

In data visualization, a glyph is a visual way to represent data, where graphical traits
are based on data attributes. For instance, a box's size can reflect a student's exam
scores, while its color can show the student's gender. This broad definition includes
various visual elements like scatterplot markers, histogram bars, and line plots.
Glyphs help visualize multivariate data effectively, making it easier to see patterns
involving multiple dimensions compared to other techniques. They allow analysts to
detect and classify complex relationships between data records.

However, glyphs have limitations. They may not convey data accurately due to size
constraints and human visual perception limits. Also, visualizing too many data
records can cause overlaps or force glyphs to shrink, making patterns hard to see.
Therefore, glyphs work best for qualitative analysis of smaller data sets. This paper
discusses glyph generation, issues affecting glyph effectiveness, and offers ideas for
future research in visualization.

7.2 Data

Glyphs are often used to display multivariate data sets, which consist of items defined
by a vector of values. This data can be seen as a matrix where rows are records and
columns are variables or dimensions. For this paper, we will consider data items as
vectors of numeric values, although categorical and non-numeric values can also be
shown using glyphs after conversion to numeric format. A data set is made up of one
or more records, allowing for normalization through calculated minimum and
maximum values. Dimensions can be independent or dependent, suggesting the need
for grouping or consistent mapping based on data type.

7.3 Mappings

Many authors have created lists of graphical attributes for mapping data values,
including position, size, shape, orientation, material, line style, and dynamics. These
attributes allow for various mappings for data glyphs, classified as one-to-one
mappings, one-to-many mappings, and many-to-one mappings.

One-to-one mappings pair each data attribute with a different graphical attribute,
leveraging the user's knowledge for intuitive understanding. One-to-many mappings
use redundant mappings to improve interpretation, like mapping population to both
size and color for clearer analysis. Many-to-one mappings help compare values across
different dimensions for the same record. This paper mainly discusses one-to-one and
many-to-one mappings, though the principles also apply to other types.

7.4 Examples of Existing Glyphs

The following list (from Ward, 2002) includes some glyphs found in literature or
common use. Some are specific to applications like fluid flow visualization, while
others are general purpose. Later, we analyze these mappings to identify their
strengths and weaknesses.
The list above shows that many possible mappings exist, many of which are not yet
suggested or assessed. The question is which mapping will best fit the task's purpose,
data characteristics, and the user's knowledge and perception skills. These issues are
explained in the sections below.
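As a concrete instance of such a mapping, the widely used star glyph turns each data value into the length of a ray at a fixed angle; connecting the ray tips gives the glyph outline. A minimal sketch (our own helper; values are assumed to be prescaled to [0, 1]):

```python
import math

def star_glyph(record, cx=0.0, cy=0.0, radius=1.0):
    """Vertex coordinates of a star glyph centered at (cx, cy):
    dimension i becomes a ray at angle 2*pi*i/k whose length is the
    normalized value times `radius`."""
    k = len(record)
    pts = []
    for i, v in enumerate(record):
        a = 2 * math.pi * i / k
        pts.append((cx + radius * v * math.cos(a),
                    cy + radius * v * math.sin(a)))
    return pts
```

This is a one-to-one mapping: each dimension owns one ray, so reordering the dimensions changes the glyph's shape — the issue taken up in Sect. 7.6.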

7.5 Biases in Glyph Mappings

One common criticism of data glyphs is the implicit bias in mappings, where some
attributes are easier to perceive than others. For instance, in profile glyphs, adjacent
dimensions are easier to measure than separated ones, and in Chernoff faces, certain
attributes are perceived more accurately than others. This section categorizes these
biases, using previous studies and our own research, highlighting the need for more
work to measure and correct these biases in glyph design and data analysis.

Perception-based bias
Certain graphical attributes are easier to see and compare than others. Experiments
show that length along a common axis is measured more accurately than angle,
orientation, size, or color. Different mappings of the same data illustrate this, with
profile glyphs being the easiest and pie glyphs being the hardest to interpret.

Proximity-based bias
In most glyphs, it's easier to see and remember relationships between data dimensions
that are next to each other than those that are not. No experiments have quantified this
bias, but Chernoff and Rizvi (1975) reported a 25% variance due to data
rearrangement. The bias likely varies with the glyph type.

Grouping-based bias
Graphical attributes that are not next to each other but can be grouped may introduce
bias. For instance, mapping two variables to ear size can reveal relationships clearer
than mapping one to eye shape and one to ear size.

7.6 Ordering of Data Dimensions/Variables

Each dimension of a data set corresponds to a specific graphical feature. Changing the
order of these dimensions while keeping the mapping type can create different data
views. There are N! possible orderings, leading to unique views. It is crucial to
identify which orderings best support the task. This section will discuss several
dimension-ordering strategies that can help create informative views compared to
random ordering.

-> Correlation-driven

Many researchers suggest using correlation and similarity measures to better organize
dimensions for visualization. Bertin’s reorderable matrix demonstrated how
rearranging rows and columns in a table can reveal groups of related records. Ankerst
et al. used cross-correlation and a heuristic search to rearrange dimensions for clarity.
Friendly and Kwan proposed effect ordering, where the order of graphical objects is
based on observable trends. Borg and Staufenbiel compared traditional glyphs with
factorial suns, showing improved data interpretation for users.
-> Symmetry-driven

Gestalt principles show that people prefer simple shapes and are better at recognizing
symmetry. Peng et al. (2004) studied star glyphs' shapes based on two qualities:
monotonicity and symmetry. They found an ordering that produced more simple and
symmetric shapes, which users preferred. The idea is that simpler shapes are easier to
recognize and help in spotting small variations and outliers, but more formal
evaluations are needed to confirm this. See Fig. 7.3 for an example.

-> Data-driven
Another option is to base the order of the dimensions on the values of a single record
(base), using an ascending or descending sorting of the values to specify the global
dimension order. This allows users to see similarities and differences between the
base record and all other records. It is especially good for time-series data sets to
show the evolution of dimensions and their relationships over time. For example,
sorting the exchange rates of ten countries with the USA by their relative values in the
first year of the time series exposes a number of interesting trends, anomalies, and
periods of relative stability and instability. In fact, the original order is nearly reversed
at a point later in the time series.

-> User-driven

As a final strategy, we can let users use their knowledge of the data set to order and
group dimensions in various ways, such as by derivative relations, semantic
similarity, and importance. Derivative relations show that some dimensions may come
from combinations of others. Semantic similarities relate to dimensions with similar
meanings, even if their values don't correlate well. Lastly, some dimensions may be
more important for a specific task, so highlighting these can improve task
performance.

7.7 Glyph Layout Options

The position of glyphs can show various data attributes like values, order,
relationships, and derived aspects. This section describes a taxonomy of glyph layout
strategies based on several factors: whether placement is data-driven or structure-
driven, whether glyph overlaps are allowed, the balance between efficient screen use
and white space, and the possibility of adjusting glyph positions for better visibility.
Understanding the trade-offs between accuracy and clarity is crucial for interpreting
glyphs effectively.

-> Data-driven Placement

Data-driven glyph placement positions a glyph based on data values linked to a


record. There are two types: one uses original data values directly, like placing
markers in a scatterplot, and the other derives positions through computations, such as
using PCA for coordinates. Some researchers have also used advanced methods for
placing glyphs in fluid dynamics flow fields.

The direct method in simulations has clear meanings for positions, helping to
highlight or replace data dimensions. Derived methods can enhance the display's
information and reveal hidden relationships in data. However, data-driven methods
often cause glyph overlap, leading to misunderstandings and unnoticed patterns.
Various techniques exist to resolve this issue by distorting position information.
Random jitter is frequently used for data with limited values. Other approaches, like
spring methods, aim to reduce overlaps and displacement. Woodruff et al. introduced
a relocation algorithm for consistent display density. Users should control distortion
levels through maximum displacement settings or animations showing glyph
movement.
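Random jitter, the simplest of these distortion techniques, can be sketched as follows (an illustrative helper; `max_disp` plays the role of the user-controlled maximum displacement mentioned above):

```python
import random

def jitter(points, max_disp, rng=None):
    """Displace each (x, y) glyph anchor by at most `max_disp` per
    coordinate to reduce overplotting of identical positions."""
    rng = rng or random.Random(42)
    return [(x + rng.uniform(-max_disp, max_disp),
             y + rng.uniform(-max_disp, max_disp)) for x, y in points]
```

Bounding the displacement keeps the distorted positions interpretable, which is why exposing this limit to the user matters.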

-> Structure-driven Placement


Structure-driven glyph placement relies on certain structural traits in the data to
determine positioning. One common structure is the ordering relationship found in
time-series or spatial data, where the order is used to generate positions. Another
structure is cyclic relationships, where each glyph connects to both adjacent glyphs
and those in previous or subsequent cycles. Hierarchical or tree-based structures can
also influence positioning. These structures can derive from fixed data attributes like
computer file systems or be created through hierarchical clustering algorithms.
Different techniques exist for calculating positions based on these hierarchical
structures.

Placement strategies based on structure can vary in overlap. A grid layout can avoid
overlaps in ordered data, while tree and graph layouts in dense datasets might lead to
significant overlap. To address overlap, distortion methods help maintain structure
visibility even with movement of glyphs. Nonlinear distortion techniques allow users
to focus on specific data areas without occlusion, enhancing the separation of data
groups as well. This shows a blend of structure and data-driven approaches.

Linked Views for Visual Exploration


8.1 Visual Exploration by Linked Views
The main issue in data visualization is the limitation of presenting information in two
dimensions, like on paper or computer screens. There are four main ways to tackle
this problem:

1. Create a virtual or pseudo-3D environment to display higher-dimensional data in a


3D space.
2. Use data-reduction methods like principal component analysis to represent high-
dimensional data on a 2D system.
3. Implement nonorthogonal coordinate systems, such as parallel coordinates, which
are less confined by two dimensions.
4. Use linked low-dimensional displays.

Linked views have been proposed to overcome flat 2-D limitations. Identical plot
symbols and colors keep track of similar cases in static displays. This idea was first
used in 1982 to link observations across scatterplots. Currently, "scatterplot brushing"
is a well-known method to connect data in scatterplots.

Linked views offer benefits like easy graphical displays and quick ways to explore
different data aspects, which are crucial in early data analysis. For instance,
combining a barchart with a histogram allows for comparisons across groups without
altering the original data. The dataset discussed comes from an international survey
assessing the math and science performance of 13- and 14-year-old students in
Germany, encompassing various continuous and categorical variables. Linked views
also work well with complex data, especially in geographic contexts.

Anselin (1999), Wills (1992), and Roberts (2004) discuss the importance of linked
displays in exploring spatial data. These displays help in the statistical exploration of
datasets by allowing users to investigate distributional characteristics, identify
unusual behaviors, and detect patterns and relationships. Linked views are
particularly beneficial for categorical data and offer easy conditional views. For
instance, a spine plot can reveal male students' reading habits, highlighting that they
are underrepresented in medium reading categories. While flexibility in data
visualization is essential, it is equally important to have a stabilizing element that
ensures patterns observed are genuine data features. The subsequent sections will
outline a systematic approach to linked views, focusing on essential characteristics for
effective dataset exploration.

8.2 Theoretical Structures for Linked Views

Linking views means that two or more plots share and exchange information. To do
this, a linking procedure must create a relationship between the plots. Once this
relationship is set up, it's important to decide what information is shared and how it is
shared. To explore different linking schemes, we look at data displays as suggested by
Wilhelm (2005). A data analysis display consists of a frame, a type, a set of graphical
elements, and scales, along with a model and a sample population.

According to this definition, a data analysis display D consists of a frame F, a type
with its associated set of graphical elements G and their scale-representing axes sG,
a model X with its scale sX, and a sample population Ω, i.e., D = (F, (G, sG), (X, sX), Ω),
where ((X, sX), Ω) is the data part and (F, (G, sG)) is the plotting part.
For linked views to work, there must be a communication scheme between plots. The
external linking structure manages the sharing of information. Generally, one plot is
labeled "active" and the others "passive." The active plot sends a message, while
passive plots respond. The concept of linking allows us to see relations among different
components of displays, but only relations between similar layers are typically useful.

There are four main types of linking structures: linking frames, linking types, linking
models, and linking sample populations. These can be divided into data linking and
scale linking. Information can be shared in two ways: directly from one layer to
another or through an internal process involving the sample population layer, with
sample population linking being the most common method.

-> Linking Sample Populations

Sample population linking connects two displays and serves as a platform for user
interactions. It defines a mapping that links elements of one sample space to another.
This method is used to create subsets of data and analyze conditional distributions,
ensuring a joint sample space for proper definition of these distributions.

Identity Linking
The easiest and most common case of sample population linking, known as empirical
linking, uses the identity mapping id: Ω → Ω. This linking scheme aims to show the
connection between observations taken from the same individual or case. It helps to
utilize the natural connection between features observed for the same set of cases.
Identity linking is built into common data matrices where each row is a case and each
column is a measured variable. It is not limited to identical sample populations, as
any two variables of the same length can be combined in one data matrix, leading
software programs to treat them as if observed with the same individuals. Care must
be taken when interpreting these artificially linked variables.
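Identity linking is what makes simple linked highlighting work: a selection over the shared case index of the active plot can be applied to every passive view. A toy sketch (hypothetical helper, not the mechanism of any particular software):

```python
def linked_highlight(mask, *views):
    """Propagate a case selection (boolean mask over the shared sample
    population) to every linked view; each view holds one value per case.
    Returns the selected subset of each view."""
    return [[v for v, keep in zip(view, mask) if keep] for view in views]
```

For example, selecting the male students in a barchart of gender and passing the mask to a histogram view yields the scores of exactly those cases.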

Hierarchical Linking
Databases for analysis come from different sources and use different units. However,
when analyzed together, they usually show some connection between the sample
populations. Often, there is a hierarchy among these populations that ranges from
individual persons to various social groups, and even to different societies. This is
similar for spatial data measured on different geographical levels. It is useful to
visualize these connections through hierarchical aggregation displays. A relation must
be established to map elements between different sample population spaces.

Neighborhood or Distance Linking


A special case arises when we work with geographic data, where quite often the most
important display is a (chorochromatic) map and the focus is on investigating local
effects. It is thus often desirable to see differences between one location and its
various neighbors. Here the linking scheme points back to the same display
and establishes a self-reference to its sample population. A variety of neighborhood
definitions are used in spatial data analysis; each definition of neighborhood or
distance leads to a new variant of the linking relation, but the main principles
remain the same.
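One way to sketch such a self-referential linking relation is as an adjacency list: selecting a set of regions highlights their neighbors within the same map display. The region names and adjacency structure below are invented for illustration:

```python
# Hypothetical adjacency list for regions on a map. The linking relation is
# self-referential: a selection in the map highlights neighbors in the same map.
neighbors = {
    "R1": {"R2", "R3"},
    "R2": {"R1"},
    "R3": {"R1", "R4"},
    "R4": {"R3"},
}

def linked_regions(selection):
    """Regions related to the selection under the neighborhood relation."""
    out = set()
    for r in selection:
        out |= neighbors[r]
    return out - selection  # highlight the neighbors, not the selection itself

print(linked_regions({"R1"}))
```

Swapping in a different neighborhood definition (shared border, distance threshold, k nearest) changes only the contents of `neighbors`, not the linking mechanism.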

-> Linking Models

Models, as discussed by Wilhelm (2005), are symbols that represent variable terms
and identify the data to be shown in displays. They are essential for defining data
visualization and specify the information to be presented. For instance, a histogram
for a quantitative variable uses a categorization model defined by a vector C =
(C0, . . . , Cc), which segments the variable's range. It counts the frequency of
observations in each segment. The histogram's scale includes the categorization
vector, the order of values in C, and the maximum counts per bin, with the vertical
axis starting at zero.
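The categorization model can be sketched directly: given a categorization vector C of bin boundaries, the model counts observations per segment, and the vertical scale runs from zero to the maximum bin count. The data values here are invented:

```python
import numpy as np

values = np.array([1.2, 2.5, 2.7, 3.1, 4.8, 4.9, 5.5])

# Hypothetical categorization vector C = (C0, ..., Cc): the segment boundaries.
C = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# The model counts the frequency of observations in each segment.
counts, _ = np.histogram(values, bins=C)
print(list(counts))

# The histogram's vertical scale starts at zero and ends at the maximum count.
print(counts.max())
```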

A model linking for the example can be created through the set of observations or the
scale. Linking scales can lead to three cases: linking the categorization vector, linking
the order of categorization values, and linking the maximum count for one bin. If the
categorization operator model is shown as a histogram, the third case involves linking
the vertical axis scales, while the other two cases link the horizontal axis scales,
focusing on bin width and anchor point.
In Manet, histogram scales can be linked, as illustrated in a figure where two
histogram scales are aligned. The left plot is the active one, propagating its scale to
the plot on the right. Manet defines a histogram with five parameters: horizontal scale
limits, bin width, number of bins, and maximum bin height. Any two of the first four
parameters combined with the fifth are enough to define a histogram fully, allowing
parameters to be shared. It's important to also use the same frame size for accurate
comparison, not just the same scales.

Examples show three histograms for the same variable, with the left plot being active,
the bottom right plot unlinked, and the top right plot sharing scales but differing in
frame size. Linking scale information is essential, notably in the form of sliders,
which are 1-D graphical representations of model parameters that users can adjust.
Moving a slider changes the underlying model, which updates all related plots. Sliders
assist in dynamic queries, helping filter and analyze data visually.

The order of categorization values matters less for continuous data in histograms but
is crucial for nominal categories lacking a clear order. Linking this scale is common
in bar charts and mosaic plots. The categorization vector is part of both the
observation and scale components, and linking models generally means that plots
share the same variables. All plots in a specific example represent the same variable,
contributing to a comprehensive view of the dataset. Systems have been developed
that combine various views of the same dataset, offering multiple perspectives and
allowing for effective exploration of related variables.

The model layer of a data display is adaptable, encompassing complex models such as
regression, grand tours, and principal components. A basic model link involves
interconnected plots showing raw observations, models, and residual information.
Young et al. introduced this connected structure using grand tour plots, showcasing a
spread plot with a rotating plot and scatterplots for residual information, updated
promptly with any model changes.

-> Linking Types

The type layer in graphical displays represents the model as closely as possible, but
not all models can be shown without losing information due to limited space and
resolution. The relationship between the type level and model level is strong, meaning
similarities in two displays usually come from aligned models. For instance,
histograms that use the same categories will have the same bin widths. Direct links
between type levels of displays without corresponding model links are rare. Color and
size are key attributes of graphical elements that can be aligned, often without linking
to the model. In pie charts, different colors for slices help distinguish categories and
can be assigned arbitrarily. If slices are ordered meaningfully, such as alphabetically
or by size, the color can reflect model information. Using a consistent color scheme
can reduce misinterpretation.
Aligning axis information means that all displays use the same parameters, typically
reflecting model scales. Different scales can lead to ineffective plot use since some
space remains empty, which might not be noticed if scales are matched. The same
axes can highlight varying observation ranges. Properly linking type information is
crucial for comparing plots effectively. Incorrect visual representations often arise
from closely adjusting axis parameters to scale parameters without considering their
visual differences.

-> Linking Frames

The frame level controls the basic shape and size of a plot window, which is
important for saving screen space and making accurate graphical comparisons.
Different frame sizes can confuse analysts and result in incorrect conclusions. While
attributes like background color are less critical, setting them consistently helps in
data analysis.

8.3 Visualization Techniques for Linked Views

The linking paradigm supports sharing information between displays. This occurs
when a new display is created using data from existing displays. Interactive
environments also share information when users modify plots. Roberts et al. (2000)
identify three strategies for exploration: replacement, overlay, and replication.

-> Replacement
In the replacement mode, old information is lost and replaced by new data. This
method works for plot parameters but not for subsetting and conditioning, as it loses
important information on marginal distributions. It is effective in scatterplots with
individual plot symbols where user interaction changes some attributes. However,
users cannot compare the current plot with previous versions directly, only with a
mental copy, which can distort comparisons. Keeping track of changing scenarios and
versions is useful, and a history system that records plot changes can be very helpful.

-> Overlaying
In direct manipulation graphics, overlaying is a common method for showing
conditional distributions in area plots. A histogram can be overlaid on a barchart to
represent data points of selected classes, which helps compare conditional and
marginal distributions. This approach has limitations, such as restricting parameter
choices since the new plot inherits from the original. It can also cause occlusion,
where part of the original display is hidden due to the overlay, especially when the
data varies significantly between subsets and the total sample. This issue is critical in
complex plots like boxplots.

-> Repetition

Repetition is the third strategy for visualizing linked interactions. This strategy allows
users to see repeated and different views of the same data at once. The advantage is
that users get a complete overview of the data, seeing how changes in parameters and
user interactions affect the visualizations. However, the disadvantage is that users
might feel overwhelmed by the many slightly changed views. This strategy needs a
simple way to track these changes and an effective system to organize the displays on
the screen. A condensed form of repetition, called juxtaposition, places a plot for a
selected subgroup next to the original plot rather than on top of it. This keeps
important features of the original plot visible while allowing easy comparison
between the two. Juxtaposition is well-known for static plots but hasn't been widely
used in interactive systems due to challenges in rearranging plots and redrawing them
after each interaction. However, modern computer capabilities can support this
process and allow for a better view of user interactions. Juxtaposition can also be
applied to statistical models, enabling comparisons between results for different
subsets of data.

-> Special Forms of Linked Highlighting

Different problems arise when the linking scheme is not a straightforward 1-to-1 linking
but a more complicated form like m-to-1 linking, which occurs in hierarchical linking.
Consider a hierarchy with two levels: the macro level, such as a group of counties,
and the micro level, which consists of towns within those counties. When some towns
in a county are selected, it's helpful to show this partial selection of the county
through partial highlighting. If the macro level is represented with regular shapes,
partial highlighting can occur by dividing the shape into selected and nonselected
parts. A broader approach is to use varying intensities of the filling color in graphical
elements to depict the different proportions selected. This method is effective for
graphical elements with nonrectangular layouts and is generally easier to understand.

Although this paper discusses unidirectional linking schemes, it also mentions
bidirectional linking, which allows for the exchange of information between plots. This
would be beneficial for adjusting plot parameters that affect the overall layout and
size. In a unidirectional setup, one plot accepts axis limits from another, which could
lead to misrepresentation if the limits are too small. Ideally, both plots should adjust
parameters to represent their models accurately. In the Manet system, a combination
of 1-to-n and m-to-1 linking is implemented, where selections in either plot highlight
related points in the other.
Linked Data Views

9.1 Motivation: Why Use Linked Views?

A “data view” refers to any way of viewing data to understand it better. While it is
often associated with charts like bar graphs or scatterplots, it also includes other forms
like regression analysis results, neural network predictions, or geographic information
like maps. Additionally, a family tree displaying relationships is a type of data view.
The term encompasses various forms, such as graphs, diagrams, and visualizations,
but for clarity, we stick to calling it a “data view.”

A linked data view is one that changes in response to modifications made in another
view. An example is a scroll bar in a text editor that adjusts to show which part of the
document is being viewed when it’s moved. This concept is common in user
interfaces and software involving data analysis. A figure in the text demonstrates this
idea using baseball statistics from 1871 to 2004, showing the relationship between
players' salaries and their batting averages in a scatterplot while also displaying a
histogram of the years.

Linking views allows users to select parts of one view, affecting the other views to
highlight data connections. In the example provided, black shows selected data while
gray indicates unselected data. This linking illustrates salary trends over years but
does not seem to change the connection between batting average and salary.

A key question when using visualization is: “Why should I use this?” Analysts may
explore data further if they notice interesting patterns or relationships, like changes in
salary and years played. They often consider other views or add variables to explain
findings better. For instance, they can create additional scatterplots or use advanced
techniques to explore data in higher dimensions.
Despite its benefits, this approach has significant issues that limit its effectiveness.
The main problems are as follows:

As plots get more complex, they are harder to understand. While simple 1-D plots are
easy for most people, multi-dimensional visualizations like 3-D point clouds and
multivariate projections are less intuitive.

Furthermore, monolithic data views struggle to handle different types of data. High-
dimensional techniques often assume variables are numeric, making it difficult to add
a numeric variable to a table of two categorical variables without switching to a
different display type, like a trellis.

Some data types, specific to certain domains, cannot be directly integrated. Analyzing
relationships in multivariate data from geographic locations, graphs, or text
documents is challenging, often requiring the use of separate software tools that
complicate the analysis process.

The linked views approach can address these challenges by creating several simpler
views that are interconnected. When a user interacts with one view, the others update
automatically, making interpretation easier and allowing for more specialized data
integration.

However, linked data views are not always better than single complex views. In some
cases, a unified multivariate technique is essential to identify specific features, and
presenting results from an interactive exploration can be harder. Still, for numerous
situations, especially those focused on conditional distributions, linked data views are
very effective.

For example, the histogram now shows years in the league, resembling a Poisson
distribution. Players with five or more years of experience not only earn higher
salaries but also show a stronger linear relationship between batting average and
log(salary). For younger players, performance may not significantly influence pay
unless their batting average is above average.

9.2 The Linked Views Paradigm

In Sect. 9.1, examples of the linked views paradigm were discussed. This section will
define it more precisely and explain how to implement a linked views system. An
interactive environment is necessary for this system, as linking involves interaction
with graphical data representations. A linked views environment should have multiple
views that meet certain conditions.

First, at least one view must detect user interaction and translate it into a degree of
interest in the displayed data, distinguishing between different data subsets based on
that interaction. Second, there must be a mechanism to share the degree of interest
from the first view to other views. Third, another view must be able to respond to the
interest measure by changing its appearance to reflect the degree of interest
concerning the data it displays.

The “degree of interest” concept measures how interesting a user finds specific data
subsets. For example, if a user selects bars for five or more years in a league, they
indicate interest in those data rows. Each subset must have a numerical interest
measure so that results can be aggregated. An aggregated view represents multiple
data cases with a single graphic item, like a histogram. In contrast, an unaggregated
view shows each data row individually, as seen in scatterplots. Data cases and data
rows refer to single data observations, and graphic items are visually distinct units that
can be identified separately from others.

A simpler version of the degree of interest can be used, where each data case is
assigned a degree of interest value, typically ranging from 0 to 1. A value of 0 means
no interest, while 1 indicates maximum interest. For a subset of data, the average of
these values can represent the interest measure. However, other summary functions
might be useful depending on the context. For instance, a maximum summary
function helps identify outliers since they have high interest, which can be missed by
average measures.
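The two summary functions behave quite differently on the same bar. A minimal sketch with invented per-case interest values shows why the maximum can flag an outlier that the average washes out:

```python
import numpy as np

# Hypothetical per-case degrees of interest in [0, 1] for the cases
# aggregated into one histogram bar.
doi = np.array([0.0, 0.0, 1.0, 0.2])

# Average summary: the bar's overall interest (useful for partial highlighting).
print(doi.mean())

# Maximum summary: flags the bar if it contains even one highly interesting
# case, e.g. an outlier that the average measure would miss.
print(doi.max())
```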

Another simplification states that any view defining a degree of interest must assign it
as either 0 or 1, separating selected cases from unselected ones. This binary system is
common, but more complex scenarios will be discussed later.

To meet the requirement of indicating user interest, there are various methods,
including brushing, where users drag a shape over data to select it, setting the degree
of interest to 1 for those items. Rectangle or rubber-band selection allows users to
click and drag to create a shape, selecting items inside it. Lassoing enables users to
create a polygon selection by clicking to define the shape and selecting the
intersecting items.
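Rectangle selection reduces to a pair of interval tests per point, with the degree of interest set to 1 inside the dragged shape and 0 outside. A sketch with invented point coordinates:

```python
import numpy as np

# Hypothetical scatterplot coordinates, one entry per data case.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 4.0])

def rectangle_select(x, y, x0, x1, y0, y1):
    """Set degree of interest to 1 for points inside the dragged rectangle."""
    inside = (x >= x0) & (x <= x1) & (y >= y0) & (y <= y1)
    return inside.astype(float)

doi = rectangle_select(x, y, 1.5, 3.5, 1.5, 3.5)
print(list(doi))
```

Lassoing works the same way, only with a point-in-polygon test replacing the interval test.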

To fulfill the requirement for displaying degrees of interest, a view must represent the
main data alongside the interest value. This value can be treated like other variables in
graphic design. Different types of variables can be used to define a view's layout and
appearance.
In the provided figures, different methods of showing degrees of interest are
illustrated. For one method, a 3D barchart uses interest as a continuous variable along
the z-axis. Additionally, each bar could be split to show the proportions of selected
and unselected data.

Showing the interest degree through a brightness scale could also be implemented.
The last view in one figure divides data into selected and unselected subsets, fitting
into established faceting schemes. This demonstrates how binary selection can be
integrated into existing structures, as seen in a baseball context with varying leagues
that illustrate different clusters of data points over time. Further investigation into this
topic will be provided later.

9.3 Brushing Scatterplot Matrices and Other Nonaggregated Views

One of the first widely recognized techniques for data visualization is the scatterplot
brushing method by Becker, Cleveland, and Wilks from 1987. This technique
arranges scatterplots of multiple variables in a matrix format, allowing quick
comparison of how one variable relates to others. The method is enhanced by using a
brush, which highlights data points in different colors across the matrix when
selected. This creates a visual connection among the scatterplots.

The effectiveness of this technique comes from the clear link between data points and
their graphical representation, allowing flexible use of colors and symbols for each
data item. Unlike aggregated data, where multiple values are combined, each data row
corresponds directly to a specific graphical element in the scatterplot. This is
illustrated in the related figures, showing how selecting data points can change their
display without losing clarity.

Although scatterplots are an obvious choice for this method, other graphic tools, like
XGobi, also provide linked views using brushes. GGobi is the latest version of this
software. Unaggregated views, including raw data tables, can also utilize linked
selection, allowing users to focus on selected data points in a refined table view,
commonly known as a "drill-down" view.
Parallel coordinates views, introduced by Inselberg in 1985, represent high-
dimensional data as lines in a 2-D space. Though they work best with smaller
datasets, they are suitable for linked views as they show the distinctions between
selected and unselected lines, even if crowded.

Brushing utilizes different modes, such as transient, which resets selections when the
brush moves away; additive, keeping selections active; and subtractive, which
deselects items. There are various combinations of these modes, although some are
less practical. Effective usage of these brushing modes enhances interaction and data
analysis.
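The three modes differ only in how the set under the brush is combined with the existing selection, which a few set operations capture. A sketch (item identifiers are arbitrary):

```python
def apply_brush(current, brushed, mode):
    """Combine an existing binary selection with the items under the brush."""
    if mode == "transient":    # selection follows the brush; old one is dropped
        return set(brushed)
    if mode == "additive":     # brushed items join the selection and stay
        return current | set(brushed)
    if mode == "subtractive":  # brushed items are deselected
        return current - set(brushed)
    raise ValueError(f"unknown brush mode: {mode}")

sel = apply_brush(set(), {1, 2}, "transient")
sel = apply_brush(sel, {3}, "additive")
sel = apply_brush(sel, {2}, "subtractive")
print(sel)
```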

9.4 Generalizing to Aggregated Views

The unaggregated approach struggles when dealing with large datasets. For example,
with tens of thousands of data points, using aggregated views like bar charts and
histograms is more effective. In this chapter, a dataset with around 16,000 players and
86,000 player seasons is examined to show how summary views are easier to
understand. The chapter highlights techniques for linking these aggregated views.

One of the first tools that featured linking in views was the Data Desk software,
initially created for teaching statistical analysis but now a full-featured package. It
offers linked views of both aggregated and unaggregated data, allowing users to
explore unusual cases and modify models quickly. The section focuses specifically on
how Data Desk links different aggregated views.
Figure 9.6 illustrates linking from a scatterplot to a bar chart. Two methods show
data: one divides bars into sections for selected and non-selected data, while another
uses brightness to show data for each case. LispStat, another tool, allows users to
create their functions for linking views due to its interpreted language.

In the visual display, one technique stacks sections of bars to show selected items or
uses summary statistics for a single bar value. Both methods are important for
displaying linked views clearly. Figures 9.7 and 9.8 present these techniques with
extensive data and help to illustrate complex relationships among multiple variables.
The text also touches upon designated hitters in baseball, which have a specific role
and can impact player statistics. The document emphasizes how different graphical
methods can highlight various aspects of the data, particularly focusing on
distinguishing between selected and unselected subsets. It concludes that different
visualization approaches serve different analytical purposes.

9.5 Distance-based Linking

In Section 9.2, a simplification was proposed for implementing a degree-of-interest
value, which can be either zero or one. While this is common, other methods exist that
allow for more flexibility. One interesting method is distance-based linking. This
approach focuses on a location in the data display rather than a predefined region. It
measures how close each item is to that point. For example, an analyst looks at the
relationship between body shape (height and weight) and fielding position, creating a
chart of two fielding statistics aligned with a height/weight scatterplot. Distance linking
is used to connect these, utilizing a brightness scale to show interest levels. Other
variations of distance-based linking are possible, including using data-space distance
for selection. The transfer function, which converts distance into degree of interest, is
crucial and should assign maximum interest at zero distance and decrease as distance
increases. Various functions can be used, though no optimal choice has been
established.
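One possible transfer function, meeting the two stated requirements (maximum interest at zero distance, decreasing thereafter), is a Gaussian-shaped decay. The shape and the `scale` parameter are arbitrary choices, since the text notes no optimal function has been established:

```python
import math

def transfer(distance, scale=1.0):
    """Map a distance to a degree of interest in (0, 1]: interest is 1 at the
    focus point and decays smoothly as distance grows. The Gaussian shape
    is one choice among many possible transfer functions."""
    return math.exp(-((distance / scale) ** 2))

print(transfer(0.0))                  # maximum interest at the focus point
print(transfer(1.0) > transfer(2.0))  # interest decreases with distance
```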

9.6 Linking from Multiple Views

In Section 9.3, we talked about combining an existing degree-of-interest measure
with a new selection, focusing on a binary degree of interest. The aim was to let users
change the degree of interest through repeated interactions with the same view. This
principle can also link multiple views together, in what we call a memoryless system.
In a memoryless system, no history of previous selections is kept; only the current
degree of interest is tracked, with no knowledge of how it was achieved. Therefore,
when a selection is made, only the previous degree of interest and the current
selection are used to create the new degree of interest. In contrast, a memory system
remembers each selection operation, so changing one selection affects the combination
of all previous selections.
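The memoryless update can be sketched with boolean masks: the new degree of interest is a function of the previous mask and the current selection only, for instance intersecting (AND) or extending (OR) the selection. The example data are invented:

```python
import numpy as np

# Previous binary degree of interest and a freshly made selection.
previous = np.array([1, 1, 0, 0], dtype=bool)
new_sel  = np.array([1, 0, 1, 0], dtype=bool)

# Memoryless combination: only `previous` and `new_sel` matter; no record
# of how `previous` was built up is kept.
print(list(previous & new_sel))  # AND: intersect with the new selection
print(list(previous | new_sel))  # OR: extend by the new selection
```

A memory system would instead store each selection (e.g. one per view, as in FilmFinder-style dynamic queries) and recompute the combination whenever any one of them changes.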

An example of this is called dynamic queries in the FilmFinder, which allows each
variable to keep its own selection state. Memory-based linking, as shown in the
baseball data, enables detailed exploration. The memory-based system allows easy
multi-variable queries and is tolerant of mistakes, while a memoryless system
provides power and adaptability.

However, memory-based systems can be less intuitive, making it challenging to
coordinate selections across different views. Observations indicate that memory-based
systems are ideal for directed queries, while memoryless systems excel at discovering
data structure. Finally, Figure 9.12 illustrates a complex example with a nonbinary
degree of interest that reveals additional insights that would otherwise be missed.
9.7 Linking to Domain-specific Views
One of the attractions of the linked views paradigm is that it allows for easy integration
of views useful for specific data types into a general system. A view only needs to
meet certain requirements to be included. For spatial data, Unwin and Wills created a
system that merged statistical views with geographical views. REGARD enabled
users to manipulate maps with layers of geographical data, such as towns, rivers, and
countries. These layers contained entities with statistical data, which let users create
data views on various variables and connect the views through the linking system.

A simple way to add a geographic view to a linked views system is by coding
selections by brightness to create a choropleth map. This method works for different
levels of interest and provides an intuitive view. There is a strong correlation between
marriage patterns and location, which readers are invited to interpret by guessing the
state identified by higher marriage rates across all age groups.

REGARD also advanced view linking in networks, which was further explored in
NicheWorks. This system focused on nodes and links in a graph and used the linking
mechanism to examine relationships between these data sets. Distance-based linking
was applied, defining distance by graph-theoretic connections.

Another significant area discussed is modeling results. In earlier sections, Data Desk
included text descriptions of models within the linking paradigm. Developing
model-specific views, like hierarchical clustering, can be beneficial. Such clustering uses
interest, indicated by selections from parallel coordinates. Different representations of
clustering trees can be shown, including a treemap design that divides a rectangle
based on the size of children nodes, providing clear visibility of the data.
Visualizing Trees and Forests
10.1 Introduction

Tree-based models are a strong alternative to traditional models for various reasons.
They are easy to understand, can work with both continuous and categorical data,
handle missing values, perform variable selection, and model interactions effectively.
Common types of tree-based models include classification, regression, and survival
trees.

Visualization is key for tree models because they can be interpreted easily. Decision
trees, shown as decision rules, are intuitive to understand. They provide insights into
the data, such as cut point quality and prediction reliability. This chapter introduces
tree models and offers visualization techniques, including hierarchical views and
treemaps. It also discusses split stability and tree ensembles, and ways to visualize
multiple tree models.

10.2 Individual Trees

The basic principle of all tree-based methods is to recursively divide the data space to
create subgroups for prediction. This process starts with the full dataset, using a rule
to split the data into separate parts. It continues until no further splitting rules exist.

Classification and regression trees use simple decision rules, evaluating one data
variable at a time. For continuous variables, splits create two partitions based on a
constant. Categorical variables split based on assigned categories.

This partitioning is represented by a tree, where the root node is the first split, and the
leaves are the final partitions. Each partition has a prediction, with classification trees
predicting classes and regression trees predicting constants. A decision tree consists of
rules in inner nodes, regardless of the prediction type in the leaves.
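This structure — univariate rules in inner nodes, predictions in the leaves — can be sketched as a tiny hand-built tree. The variable names and cut points below are invented for illustration (loosely echoing the olive-oil example that follows), not taken from a fitted model:

```python
# A tiny hand-built decision tree: inner nodes hold a univariate splitting
# rule (variable, cut point, left subtree, right subtree); leaves hold a
# class prediction. All values are illustrative, not from a fitted model.
tree = ("eicosenoic", 6.5,
        ("linoleic", 1050, "North", "Sardinia"),  # left subtree
        "South")                                  # right leaf

def predict(node, case):
    """Follow the splitting rules down to a terminal node."""
    if isinstance(node, str):          # terminal node: return its prediction
        return node
    var, cut, left, right = node
    return predict(left if case[var] <= cut else right, case)

print(predict(tree, {"eicosenoic": 2, "linoleic": 900}))
print(predict(tree, {"eicosenoic": 40, "linoleic": 900}))
```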

-> Hierarchical Views

To visualize a tree model effectively, we need to demonstrate its hierarchical structure
clearly. A tree, as defined in graph theory, is a set of nodes and edges that connect
them. Decision trees are a specific type of tree that is both connected and acyclic,
meaning they have a single root node without any parent nodes. Each node carries
information: inner nodes show how to split data, while terminal nodes provide
predictions. Visual representations of tree models aim to make this information clear
while displaying the tree's structure.

An example tree model is based on data from Italian olive oils, which includes
different fatty acids and their concentrations and categorizes the oils by region. The
model aims to show how olive oil composition varies across Italy, specifically across
five regions: Sicily, Calabria, Sardinia, Apulia, and North. The visualizations of this
tree vary, even with the same underlying model.

There are several tasks involved in visualizing a decision tree: placing nodes, visually
representing nodes and edges, and adding annotations for clarity. Different
representations can convey additional information. For instance, some plots may use
simple tick marks for nodes, while others might use rectangles sized based on data
cases. Colors may indicate class proportions within nodes too.

Advanced visualization techniques can enhance hierarchical views, such as using
censored zooming for tree node sizes. Nodes represent data, with roots containing all
data and partitions occurring until terminal nodes are reached. To help users see small
nodes better, a sizing factor can be applied, while still maintaining a maximum size to
prevent distortion.

Placement of nodes has generated much discussion in graph visualization. For small
trees, straightforward methods are effective, but larger trees can complicate layouts.
In many cases, the information contained in nodes is more crucial than the exact
layout. The use of interactive tools allows users to explore large trees more easily.
The basic node placement divides available space equidistantly, but some approaches
consider the quality of splits for node positioning. Comparisons of class proportions
can also be enhanced by rotating the tree for clarity.

Edge representation is usually limited to lines, but it can also include polygons that
reflect the flow of data through a tree. Annotations can add textual or symbolic
elements along nodes and edges, helping clarify predictions and rules. However,
excessive annotations can create clutter, distracting from key points and hindering
readability.

To maintain clarity without losing important information, extra tools like zooming
and toggling are necessary, especially for more in-depth analyses. There are two ways
to provide additional information: integrating it into the tree visualization or using
linked external graphics. Integrating data into the tree makes it immediately relevant
and easier to use, but is limited by screen space. Accordingly, displaying directly
related information is often preferred.

Conversely, external linked graphics offer more flexibility as only one graphic
represents multiple data points, avoiding crowding. Such graphics must be interpreted
more carefully since they are not directly visible in the tree structure. There are no
strict rules about what information should be included in or outside the tree, but a
guideline suggests that complex graphics work better externally, while simpler, tree-
specific information should be included directly in the visualization.

-> Recursive Views

In the introduction, we explained tree models as recursive partitioning methods. Thus,
it makes sense to show partitions created by the tree, offering a different way to
visualize tree models. Next, we will describe visualization methods that focus on the
partitioning aspect of the models rather than their hierarchical structure.

Sectioned Scatterplots

Splitting rules are set in the covariate space, making a tree model visualizable along
with its partition boundaries. For univariate partitioning rules, these boundaries appear
on hyperplanes that run parallel to the covariate axes. A simple scatterplot can serve
as a 2-D projection, clearly showing all splits for the two plotted variables. A
sectioned scatterplot with the first two split variables is displayed along with the tree
model, where different regions are colored and partition boundaries are indicated.

The first tree split uses the eicosenoic variable to distinguish oils from northern Italy
and Sardinia from those of the other regions. This split is very clear in the scatterplot.
The next two inner nodes use the linoleic variable, first to separate oils from Sardinia
from those of northern Italy, and then Apulian oils from those of Sicily and Calabria. Further
splits may not be visible in this projection as they involve other variables, but they can
be analyzed through interactive techniques using a series of sectioned scatterplots.

These scatterplots should use variables that are closely linked in the tree and be
limited to data from nodes nearer to the root. Some cut points may show clear
separations, while others might be noisy. In an interactive setting, users can drill down
between scatterplots. Extensions to these plots include changing the opacity of
partition lines based on depth and shading the background for depth or predicted
value. Scatterplots work best with continuous variables, while categorical variables
may benefit from local treemaps, which group categories within the same node.
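The region assignment implied by the first two splits can be sketched as a pair of nested univariate rules. The cut points and the three-way labels below are illustrative assumptions, not the fitted thresholds from the olive oil data:

```python
# Sketch: assign each observation to a region using two univariate splits,
# mimicking the sectioned scatterplot. Cut points are invented for
# illustration, not taken from the fitted tree.

def assign_region(eicosenoic, linoleic, cut1=6.5, cut2=1053.0):
    """Root split on eicosenoic; second split on linoleic."""
    if eicosenoic > cut1:       # root split: southern oils vs. the rest
        return "South"
    elif linoleic > cut2:       # second split: Sardinia vs. northern Italy
        return "Sardinia"
    else:
        return "North"

points = [(10.0, 900.0), (2.0, 1200.0), (1.0, 800.0)]
print([assign_region(e, l) for e, l in points])  # ['South', 'Sardinia', 'North']
```

In a sectioned scatterplot these two rules appear as one vertical and one horizontal partition line in the plane spanned by the two split variables.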
Treemaps

One way to show all partitions is through area-based plots, where each terminal node
is a rectangle. Treemaps are a type of these plots. The main idea is to divide the
available rectangular plot area according to how the tree model divides data. The
treemap’s area represents the full dataset, starting with a horizontal partition based on
the number of cases for each child node. Next, each partition is split vertically
according to the proportions for its child nodes. This process continues recursively,
alternating between horizontal and vertical splits until reaching terminal nodes. Each
rectangle in the final plot corresponds to a terminal node, with the area reflecting the
number of cases in that node. It helps to adjust spaces between partitions to indicate
their depth, showing larger gaps for splits closer to the root.

Treemaps are good for assessing tree model balance and can reveal when trees create
large terminal nodes in noisy scenarios. They also allow highlighting for comparing
groups within terminal nodes. These treemaps extend the treemaps known from
information visualization and are closely related to mosaic plots. Their key advantage
is an efficient use of space, enabling comparisons while keeping context, although they
do not show the splitting criteria or allow direct comparisons within nodes.
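The recursive, alternating subdivision described above can be sketched in a few lines. The nested-list tree encoding (a leaf is a case count, an internal node is a list of children) is a simplifying assumption for illustration:

```python
# Sketch of the treemap layout: recursively divide a unit rectangle among
# child nodes in proportion to their case counts, alternating between
# horizontal and vertical splits.

def treemap(node, x=0.0, y=0.0, w=1.0, h=1.0, horizontal=True):
    """Return (leaf_count, rectangle) pairs for all terminal nodes."""
    if not isinstance(node, list):            # terminal node
        return [(node, (x, y, w, h))]
    total = sum(size(c) for c in node)
    rects, offset = [], 0.0
    for child in node:
        frac = size(child) / total
        if horizontal:                        # split the width
            rects += treemap(child, x + offset * w, y, w * frac, h, False)
        else:                                 # split the height
            rects += treemap(child, x, y + offset * h, w, h * frac, True)
        offset += frac
    return rects

def size(node):
    return sum(size(c) for c in node) if isinstance(node, list) else node

# A root with two children; the left child splits again into two leaves.
layout = treemap([[30, 10], 60])
for n, (x, y, w, h) in layout:
    print(n, round(w * h, 2))   # each area equals the node's share of cases
```

With 100 cases in total, the leaf with 60 cases occupies 60% of the plot area, directly reflecting the node sizes the treemap is meant to convey.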
Spineplots of Leaves

Another useful plot for tree model visualization is the spineplot of leaves (SPOL).
Instead of alternating the partitioning direction like treemaps, SPOL uses horizontal
partitioning, showing all terminal nodes in one row. This fixed height allows for a
visual comparison of the sizes of the terminal nodes, which are proportional to the
width of the bars. Relative proportions of groups can also be easily compared using
highlighting or brushing. A sample SPOL is shown in Fig. 10.7, where each bar
represents a leaf, and its width is proportional to the number of cases in that node. The
plot allows for clear visibility of group proportions within each node and can include
annotations like a dendrogram of the tree.
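The geometry of a SPOL is straightforward to compute: one bar per leaf, with width proportional to the leaf's case count and a stacked highlighted fraction inside each bar. The gap size and the toy leaf counts below are illustrative assumptions:

```python
# Sketch: compute SPOL bar geometry from (n_cases, n_highlighted) per leaf.
# Returns (x_position, width, highlighted_fraction) for each bar.

def spol_bars(leaves, gap=0.01):
    total = sum(n for n, _ in leaves)
    bars, x = [], 0.0
    for n, h in leaves:
        width = n / total                  # bar width ~ leaf size
        bars.append((round(x, 3), round(width, 3), round(h / n, 3)))
        x += width + gap                   # small gap between leaves
    return bars

print(spol_bars([(50, 45), (30, 3), (20, 10)]))
```

Because all bars share the same height, the highlighted fractions (e.g. the share of one class in each terminal node) can be compared directly across leaves, which is exactly the strength of this display.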

SPOLs are especially useful for comparing group proportions in terminal nodes,
similar to spineplots, but with differences in how categories and gaps are handled.
This section has discussed several techniques for visualizing tree models based on
recursive partitioning, focusing on visualization of splits and data application. All
techniques can be applied to various data subsets, including training and test data,
allowing for a comparison of model adaptability and stability. The next section will
address tree model construction and visualization methods that reflect split quality.

-> Fitting Tree Models


So far, we have talked about ways to visualize tree models and the data they use.
However, there is more information at each node that can be visualized. To better
understand tree models, we need to learn about how they are created. While tree
models are easy to interpret, building them is not simple. Ideally, we would evaluate
all possible tree models and choose the one that fits the data best based on some loss
function. However, this is often impossible, especially as tree size increases, because
the computational cost grows rapidly.

To fit tree models, several methods have been suggested. The most common
algorithm is CART (Classification and Regression Trees), introduced by Breiman et
al. in 1984. This algorithm uses a greedy approach: for each node, it looks at all
possible splits and picks the one that most reduces the impurity of the child nodes
compared to the parent node using an impurity criterion. The split is made, and the
process repeats for each child node. Growth stops if certain rules are met, often
related to the minimum number of cases in a node or the required impurity decrease.
While pruning is a common practice in tree building, we will not discuss it here.
However, visualization can help with pruning, especially when parameters can be
adjusted interactively.

Impurity measures can in principle be any concave function; the most common choices
are entropy and the Gini index, both of which have theoretical support. It is important
to note that this method searches for a local optimum only, one split at a time, without
looking ahead at combinations of splits. This approach is computationally far cheaper
than a complete search and performs well in practice.
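The greedy split search can be sketched as follows, using the Gini index on a single continuous variable. The midpoint candidate cuts and the toy data are illustrative assumptions, not the book's implementation:

```python
# Sketch of CART's greedy split search: try every candidate cut point of one
# variable and keep the one with the largest impurity (Gini) decrease.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(x, y):
    """Exhaustively evaluate midpoints between consecutive sorted x values."""
    pairs = sorted(zip(x, y))
    parent = gini(y)
    best = (0.0, None)                       # (impurity decrease, cut point)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                         # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        dec = parent - (len(left) * gini(left)
                        + len(right) * gini(right)) / len(y)
        if dec > best[0]:
            best = (dec, cut)
    return best

x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = ["a", "a", "a", "b", "b", "b"]
print(best_split(x, y))   # (0.5, 6.5): the perfectly separating cut
```

In a full implementation this search runs over every candidate variable at every node, which is exactly why the greedy strategy is so much cheaper than evaluating all possible trees.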

Focusing on local optima may make the model unstable. Small changes to training
data can lead to different splits, affecting the entire tree model. We aim to present a
visualization technique to understand decisions made at each node during tree fitting.

Mountain plots visualize impurity decrease across the full range of the split variable.
Using a binary classification example, one can see that there are multiple splits close
to the optimal cut chosen. The competition for best splits is not limited to a single
variable but can involve multiple variables. By comparing mountain plots of different
variables, we can evaluate the stability of a split. If one variable clearly stands out, the
split will be stable. Conversely, competing splits within the optimal range indicate
instability. Mountain plots help assess the quality of splits and can guide the
construction of tree models.
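The profile that a mountain plot draws can be computed directly by evaluating the impurity decrease at every candidate cut of one variable. The use of entropy and the toy data below are illustrative assumptions:

```python
# Sketch of the data behind a mountain plot: the impurity decrease (entropy
# here) evaluated at every candidate cut of a single split variable.
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def mountain(x, y):
    pairs = sorted(zip(x, y))
    parent = entropy(y)
    profile = []
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # skip ties
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        dec = parent - (len(left) * entropy(left)
                        + len(right) * entropy(right)) / len(y)
        profile.append((cut, round(dec, 3)))
    return profile

x = [1, 2, 3, 4, 5, 6]
y = ["a", "a", "a", "b", "b", "b"]
print(mountain(x, y))   # peaks at the perfectly separating cut 3.5
```

Plotting this profile for several variables side by side shows at a glance whether one split clearly dominates (a stable choice) or several cuts compete near the optimum (a sign of instability).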

10.3 Visualizing Forests


We have been discussing how to visualize individual tree models. We have shown
that the choice of splits can change, affecting model stability. Thus, it is helpful to
grow multiple trees. Next, we will introduce tree ensemble methods and present
visualization methods for forests made up of multiple tree models.

Model ensemble methods compensate for the weaknesses of individual models and
enhance prediction accuracy by combining multiple models. Bagging creates many tree
models from bootstrap samples and aggregates their predictions through majority voting
for classification and averaging for regression. Random forests add further randomness
by selecting the candidate split variables from a different random subset at each node.
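The aggregation step of bagging can be sketched with a toy dataset and a one-split "stump" standing in for a full tree; both are simplifying assumptions for illustration:

```python
# Sketch of bagging: bootstrap resampling plus majority voting over many
# simple classifiers (one-split "stumps" instead of full trees).
import random
from collections import Counter

random.seed(1)  # make the bootstrap samples reproducible

def bootstrap(data):
    return [random.choice(data) for _ in data]

def fit_stump(sample, cut=5.0):
    """A one-split 'tree': the majority class on each side of the cut."""
    def majority(labels):
        return Counter(labels).most_common(1)[0][0] if labels else None
    left = [lab for v, lab in sample if v <= cut]
    right = [lab for v, lab in sample if v > cut]
    return majority(left), majority(right)

def bagged_predict(stumps, v, cut=5.0):
    votes = [left if v <= cut else right for left, right in stumps]
    return Counter(votes).most_common(1)[0][0]   # majority vote

data = [(1, "a"), (2, "a"), (3, "a"), (8, "b"), (9, "b"), (10, "b")]
stumps = [fit_stump(bootstrap(data)) for _ in range(25)]
print(bagged_predict(stumps, 2.0), bagged_predict(stumps, 9.0))
```

A random forest would additionally restrict each stump to a random subset of candidate variables; here there is only one variable, so the sketch shows the bagging part only.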
-> Split Variables

Bootstrapped models help analyze the features of fitted models and give insight into
the data. One key advantage of tree models is their implicit variable selection: when a
dataset is evaluated, the tree-growing algorithm creates a structure of splits, and only
the variables used in these splits enter the model, effectively dropping all others.
The following analysis focuses on the Wisconsin breast cancer data, using 20 trees
generated through bootstrapping with the CART algorithm.

The first visualization shows a global overview of variables used in the models. UCS
is the most frequently used variable, appearing 20 times, while Mts is the least used,
showing up only once. With a small number of variables available, none are left out
entirely. However, simply knowing how often a variable is used does not indicate its
importance, as deeper splits may involve fewer cases. Therefore, it is crucial to assess
the contribution of each split through a cumulative statistic like impurity decrease.

The second visualization shows the cumulative impurity decrease for each variable
across the 20 trees, ordered by importance. UCS stands out as the most influential,
followed by UCH and BNi. Care must be exercised when drawing conclusions from
this data, as variables that are correlated can affect results. The CART algorithm may
randomly select one of two correlated variables to use, which could leave out the less
significant one that might perform well alone.
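Aggregating split-variable usage counts and cumulative impurity decrease over a forest might look like the sketch below. The per-node (variable, impurity_decrease) encoding and all numbers are invented for illustration; they are not the actual Wisconsin breast cancer results, although the variable names follow the text:

```python
# Sketch: aggregate split-variable usage and cumulative impurity decrease
# across a forest. Each tree is assumed to be a list of
# (variable, impurity_decrease) pairs, one per internal node.
from collections import Counter, defaultdict

forest = [                                    # three hypothetical trees
    [("UCS", 0.30), ("UCH", 0.08)],
    [("UCS", 0.28), ("BNi", 0.05), ("UCH", 0.03)],
    [("UCS", 0.31), ("Mts", 0.01)],
]

usage = Counter(var for tree in forest for var, _ in tree)
gain = defaultdict(float)
for tree in forest:
    for var, dec in tree:
        gain[var] += dec

for var in sorted(gain, key=gain.get, reverse=True):
    print(var, usage[var], round(gain[var], 2))
```

Sorting by cumulative gain rather than by raw usage count implements the point made above: a variable used often in deep splits may still contribute less than one used once near the root.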

To further study these behaviors and the differences among models, it is essential to
examine both the variables and individual trees. Two-dimensional diagrams of trees
and split variables are provided to reveal patterns and comparisons. Notably, four
model groups are identified based on gains, with UCS being the leading variable in
most models. The analysis underscores variable masking effects, confirming the
variability and instability of tree models in this dataset. Finally, for extensive trees,
parallel coordinate plots can be used to visualize the results, emphasizing the need for
careful ordering of the axes based on influential variables.

-> Data View

The importance and use of variables in splits are key aspects of tree models. In Section
10.2.2, another way to visualize trees was discussed: sectioned scatterplots. These
scatterplots can also show forests using semitransparent partition boundaries. An
example shows a forest with the olive oil data divided into nine regions, displaying
variables linoleic vs. palmitoleic and partition boundaries from 100 bootstrapped
trees. This technique aims to visualize all trees and their splits at once, unlike
individual trees that allow for detailed analysis.

-> Trace Plot

Trace plots are another method for visualizing classification trees: they consist of a
grid with the split variables as columns and the node depths as rows. Each cell in the
grid represents a tree node, and glyphs within the cells show possible split points.
Continuous variables are indicated by a horizontal axis with tick marks for the splits,
while categorical variables are shown as boxes for the split combinations. In the
example tree, the root node splits on palmitoleic, with its child nodes splitting
on linoleic and oleic.

The trace plot allows for the reconstruction of tree splits and their hierarchical
structure, eliminating ambiguity present in hierarchical views, as the order of child
nodes is fixed in the grid. An advantage of trace plots is the ability to display multiple
tree models on the same grid, as seen in a plot of 100 bootstrapped classification trees.
To avoid overplotting, semitransparent edges are used, making popular paths clearer.
The first split consistently uses palmitoleic, but subsequent splits show various
alternatives, indicating stable subgroups that can be reached in different ways. The
specific example exhibits some instability because it is a multiclass problem,
resulting in varying sequences in which the classes are separated.
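The mapping of tree nodes onto the trace-plot grid can be sketched as follows. The nested-tuple tree encoding and the cut values are illustrative assumptions; only the split-variable names follow the example in the text:

```python
# Sketch: place each internal node of a tree on the trace-plot grid, i.e.
# in the cell (column = split variable, row = node depth). A tree is
# encoded as (variable, cut, left_child, right_child), None for leaves.

def trace_cells(node, depth=0, cells=None):
    if cells is None:
        cells = []
    if node is None:                  # terminal node: nothing to place
        return cells
    var, cut, left, right = node
    cells.append((var, depth, cut))   # one grid cell per internal node
    trace_cells(left, depth + 1, cells)
    trace_cells(right, depth + 1, cells)
    return cells

# Root splits on palmitoleic; its children split on linoleic and oleic
# (cut values are invented for illustration).
tree = ("palmitoleic", 80,
        ("linoleic", 1053, None, None),
        ("oleic", 7250, None, None))
print(trace_cells(tree))
```

Because each node's column and row are fixed by its split variable and depth, overlaying many trees on the same grid (as in the 100-tree example) only requires drawing their edges with some transparency.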
