Polaris: A System For Query, Analysis, and Visualization of Multidimensional Relational Databases
1 INTRODUCTION
In the last several years, large databases have become
common in a variety of applications. Corporations are
creating large data warehouses of historical data on key
aspects of their operations. International research pro-
jects such as the Human Genome Project [20] and
Digital Sky Survey [31] are generating massive data-
bases of scientific data.
A major challenge with these databases is to extract
meaning from the data they contain: to discover structure,
find patterns, and derive causal relationships. The analysis
and exploration necessary to uncover this hidden informa-
tion places significant demands on the human-computer
interfaces to these databases. The exploratory analysis
process is one of hypothesis, experiment, and discovery.
The path of exploration is unpredictable and the analysts
need to be able to rapidly change both what data they are
viewing and how they are viewing that data.
The current trend is to treat multidimensional databases
as n-dimensional data cubes [16]. Each dimension in these
data cubes corresponds to one dimension in the relational
schema. Perhaps the most popular interface to multi-
dimensional databases is the Pivot Table [15]. Pivot Tables
allow the data cube to be rotated, or pivoted, so that
different dimensions of the dataset may be encoded as rows
or columns of the table. The remaining dimensions are
aggregated and displayed as numbers in the cells of the
table. Cross-tabulations and summaries are then added to
the resulting table of numbers. Finally, graphs may be
generated from the resulting tables. Visual Insights recently
released a new interface for visually exploring projections
of data cubes using linked views of bar charts, scatterplots,
and parallel coordinate displays [14].
In this paper, we present Polaris, an interface for the
exploration of multidimensional databases that extends the
Pivot Table interface to directly generate a rich, expressive
set of graphical displays. Polaris builds tables using an
algebraic formalism involving the fields of the database.
Each table consists of layers and panes and each pane may
contain a different graphic. The use of tables to organize
multiple graphs on a display is a technique often used by
statisticians in their analysis of data [5], [11], [38].
The Polaris interface is simple and expressive because
it is built upon a formalism for constructing graphs and
building data transformations. We interpret the state of
the interface as a visual specification of the analysis task
and automatically compile it into data and graphical
transformations. This allows us to combine statistical
analysis and visualization. Furthermore, all intermediate
specifications that can be created in the visual language
are valid and can be interpreted to create visualizations.
Therefore, analysts can incrementally construct complex
queries, receiving visual feedback as they assemble and
alter the specifications.
2 RELATED WORK
Work related to Polaris can be divided into three
categories: formal graphical specifications, table-based data
displays, and database exploration tools.
2.1 Formal Graphical Specifications
Bertin's Semiology of Graphics [6] is one of the earliest
attempts at formalizing graphing techniques. Bertin
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 8, NO. 1, JANUARY-MARCH 2002 1
. The authors are with the Computer Science Department, Stanford
University, Stanford, CA 94305.
E-mail: {cstolte, hanrahan}@graphics.stanford.edu, [email protected].
Manuscript received 17 Apr. 2001; accepted 10 July 2001.
For information on obtaining reprints of this article, please send e-mail to:
[email protected], and reference IEEECS Log Number 114503.
1077-2626/02/$17.00 2002 IEEE
developed a vocabulary for describing data and the techni-
ques for encoding data in a graphic. One of his important
contributions is the identification of the retinal variables
(position, color, size, etc.) in which data can be encoded.
Cleveland [11], [12] used theoretical and experimental results
to determine how well people can use these different retinal
properties to compare quantitative variations.
Mackinlay's APT system [26] is one of the first applica-
tions of formal graphical specifications to computer-
generated displays. APT uses a set of graphical languages
and composition rules to automatically generate 2D dis-
plays of relational data. The Sage system [29] extends the
concepts of APT, providing a richer set of data character-
izations and generating a wider range of displays.
Livny et al. [25] describe a visualization model that
provides a foundation for database-style processing of
visual queries. Within this model, the relational queries
and graphical mappings necessary to generate visualiza-
tions are defined by a set of relational operators. The Rivet
visualization environment [9] applies similar concepts to
provide a flexible database visualization tool.
Wilkinson [41] recently developed a comprehensive
language for describing traditional statistical graphs and
proposed a simple interface for generating a subset of the
specifications expressible within his language. We have
extended Wilkinson's ideas to develop a specification that
can be directly mapped to an interactive interface and that
is tightly integrated with the relational data model. The
differences between our work and Wilkinson's will be
further discussed in Section 8.
2.2 Table-Based Displays
Another area of related work is visualization systems that
use table-based displays. Static table displays, such as
scatterplot matrices [18] and Trellis [3] displays, have been
used extensively in statistical data analysis. Recently,
several interactive table displays have been developed.
Pivot Tables [15] allow analysts to explore different
projections of large multidimensional datasets by interac-
tively specifying assignments of fields to the table axes, but
are limited to text-based displays. Systems such as the Table
Lens [27] and FOCUS [32] visualization systems provide
table displays that present data in a relational table view,
using simple graphics in the cells to communicate quanti-
tative values.
2.3 Database Exploration Tools
The final area of related work is visual query and database
exploration tools. Projects such as VQE [13], Visage [30],
DEVise [25], and Tioga-2 [2] have focused on developing
visualization environments that directly support interactive
database exploration through visual queries. Users can
construct queries and visualizations directly through their
interactions with the visualization system interface. These
systems have flexible mechanisms for mapping query
results to graphs and all of the systems support mapping
database records to retinal properties of the marks in the
graphs. However, none of these systems leverages table-
based organizations of their visualizations.
Other existing systems, such as XmdvTool [40], Spotfire
[33], and XGobi [10], have taken the approach of providing
a set of predefined visualizations, such as scatterplots and
parallel coordinates, for exploring multidimensional data
sets. These views are augmented with extensive interaction
techniques, such as brushing and zooming, that can be used
to refine the queries. We feel that this approach is much
more limiting than providing the user with a set of building
blocks that can be used to interactively construct and refine
a wide range of displays to suit an analysis task.
Another significant database visualization system is
VisDB [22], which focuses on displaying as many data
items as possible to provide feedback as users refine their
queries. This system even displays tuples that do not meet
the query and indicates their distance from the query
criteria using spatial encodings and color. This approach
helps the user avoid missing important data points just
outside of the selected query parameters. In contrast,
Polaris provides extensive ability to drill-down and roll-
up data, allowing the analyst to get a complete overview of
the data set before focusing on detailed portions of the
database.
3 OVERVIEW
Polaris has been designed to support the interactive
exploration of large multidimensional relational databases.
Relational databases organize data into tables where each
row in a table corresponds to a basic entity or fact and each
column represents a property of that entity [36]. We refer to
a row in a relational table as a tuple or record and a column
in the table as a field. A single relational database will
contain many heterogeneous but interrelated tables.
We can characterize fields in a database as nominal,
ordinal, quantitative, or interval [6], [34]. Polaris reduces
this categorization to ordinal and quantitative by treating
intervals as quantitative and assigning an ordering to the
nominal fields and subsequently treating them as ordinal.
The fields within a relational table can also be partitioned
into two types: dimensions and measures. Dimensions and
measures are similar to independent and dependent
variables in traditional analysis. For example, a product
name or type would be a dimension of a product and the
product price or size would be a measure. The current
implementation of Polaris treats all ordinal fields as
dimensions and all quantitative and interval fields as
measures.
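The reduction of field categories described above can be sketched in a few lines of Python. This is only an illustration: the names `FieldType`, `reduce_type`, and `role` are hypothetical and do not come from the Polaris implementation.

```python
from enum import Enum


class FieldType(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    QUANTITATIVE = "quantitative"
    INTERVAL = "interval"


def reduce_type(field_type: FieldType) -> FieldType:
    """Collapse the four categories to ordinal or quantitative.

    Nominal fields are given an ordering and treated as ordinal;
    interval fields are treated as quantitative.
    """
    if field_type in (FieldType.NOMINAL, FieldType.ORDINAL):
        return FieldType.ORDINAL
    return FieldType.QUANTITATIVE


def role(field_type: FieldType) -> str:
    """Ordinal fields act as dimensions, quantitative fields as measures."""
    if reduce_type(field_type) is FieldType.ORDINAL:
        return "dimension"
    return "measure"
```

Under these assumptions, a nominal field such as a product name is classified as a dimension, and an interval or quantitative field such as a price is classified as a measure.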
In many important business and scientific databases,
there are often many dimensions identifying a single entity.
For example, a transaction within a store may be identified
by the time of the sale, the location of the store, the type of
product, and the customer. In most data warehouses, these
multidimensional databases are structured as n-dimen-
sional data cubes [36]. Each dimension in the data cube
corresponds to one dimension in the relational schema.
To effectively support the analysis process in large
multidimensional databases, an analysis tool must meet
several demands:
. Data-dense displays: The databases typically con-
tain a large number of records and dimensions.
Analysts need to be able to create visualizations that
will simultaneously display many dimensions of
large subsets of the data.
. Multiple display types: Analysis consists of many
different tasks such as discovering correlations
between variables, finding patterns in the data,
locating outliers, and uncovering structure. An
analysis tool must be able to generate displays
suited to each of these tasks.
. Exploratory interface: The analysis process is often
an unpredictable exploration of the data. Analysts
must be able to rapidly change what data they are
viewing and how they are viewing that data.
Polaris addresses these demands by providing an inter-
face for rapidly and incrementally generating table-based
displays. In Polaris, a table consists of a number of rows,
columns, and layers. Each table axis may contain multiple
nested dimensions. Each table entry, or pane, contains a set
of records that are visually encoded as a set of marks to
create a graphic.
Several characteristics of tables make them particularly
effective for displaying multidimensional data:
. Multivariate: Multiple dimensions of the data can be
explicitly encoded in the structure of the table,
enabling the display of high-dimensional data.
. Comparative: Tables generate small-multiple displays
of information, which, as Tufte [38] explains, are
easily compared, exposing patterns and trends
across dimensions of the data.
. Familiar: Table-based displays have an extensive
history. Statisticians are accustomed to using tabular
displays of graphs, such as scatterplot matrices and
Trellis displays, for analysis. Pivot Tables are a
common interface to large data warehouses.
Fig. 1 shows the user interface presented by Polaris. In
this example, the analyst has constructed a matrix of
scatterplots showing sales versus profit for different
product types in different quarters. The primary interaction
technique is to drag-and-drop fields from the database
schema onto shelves throughout the display. We call a
given configuration of fields on shelves a visual specification.
The specification determines the analysis and visualization
operations to be performed by the system, defining:
. The mapping of data sources to layers. Multiple data
sources may be combined in a single Polaris
visualization. Each data source maps to a separate
layer or set of layers.
. The number of rows, columns, and layers in the
table and their relative orders (left to right as well as
back to front). The database dimensions assigned to
rows are specified by the fields on the y shelf,
columns by fields on the x shelf, and layers by fields
on the layer (z) shelf. Multiple fields may be dragged
onto each shelf to show categorical relationships.
. The selection of records from the database and the
partitioning of records into different layers and
panes.
STOLTE ET AL.: POLARIS: A SYSTEM FOR QUERY, ANALYSIS, AND VISUALIZATION OF MULTIDIMENSIONAL RELATIONAL DATABASES 3
Fig. 1. The Polaris user interface. Analysts construct table-based displays of relational data by dragging fields from the database schema onto
shelves throughout the display. A given configuration of fields on shelves is called a visual specification. The specification unambiguously defines the
analysis and visualization operations to be performed by the system to generate the display.
. The grouping of data within a pane and the
computation of statistical properties, aggregates,
and other derived fields. Records may also be sorted
into a given drawing order.
. The type of graphic displayed in each pane of the
table. Each graphic consists of a set of marks, one
mark per record in that pane.
. The mapping of data fields to retinal properties of
the marks in the graphics. The mappings used for
any given visualization are shown in a set of
automatically generated legends.
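One way to picture a visual specification is as a small record holding the contents of each shelf. The Python sketch below is purely illustrative: the class name, attribute names, default mark, and the example shelf assignment are all assumptions, not the data structures used inside Polaris.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VisualSpecification:
    """Hypothetical container for the state of the shelves."""

    x_shelf: List[str] = field(default_factory=list)      # fields defining columns
    y_shelf: List[str] = field(default_factory=list)      # fields defining rows
    layer_shelf: List[str] = field(default_factory=list)  # fields defining layers (z)
    mark: str = "circle"                                   # mark type used in each pane
    # Maps a retinal property (e.g. "color") to the field it encodes.
    encodings: Dict[str, str] = field(default_factory=dict)


# An assumed arrangement for a matrix of sales-versus-profit scatterplots
# split by quarter and product type; the actual shelf layout may differ.
spec = VisualSpecification(
    x_shelf=["Quarter", "Profit"],
    y_shelf=["ProductType", "Sales"],
    mark="circle",
)
```

Because every field of the record has a default, a partially filled specification is still a well-formed value, which mirrors the idea that analysts can build up a display incrementally.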
Analysts can interact with the resulting visualizations in
several ways. Selecting a single mark in a graphic by
clicking on it pops up a detail window that displays user-
specified field values for the tuples corresponding to that
mark. Analysts can draw rubberbands around a set of
marks to brush records. Brushing, discussed in more detail
in Section 5.3, can be performed within a single table or
between multiple Polaris displays.
In the next section, we describe how the visual
specification is used to generate graphics. In Section 5, we
describe the support Polaris provides for visually querying
and transforming the data through filters, sorting, and data
transformations. Then, in Section 6, we discuss how the
visual specification is used to generate the database queries
and statistical analysis.
4 GENERATING GRAPHICS
The visual specification consists of three components: 1) the
specification of the different table configurations, 2) the type
of graphic inside each pane, and 3) the details of the visual
encodings. We discuss each of these in turn.
4.1 Table Algebra
We need a formal mechanism to specify table configura-
tions and we have defined an algebra for this purpose.
When the analysts place fields on the axis shelves, as shown
in Fig. 1, they are implicitly creating expressions in this
algebra.
A complete table configuration consists of three separate
expressions in this table algebra. Two of the expressions
define the configuration of the x and y axes of the table,
partitioning the table into rows and columns. The third
expression defines the z axis of the table, which partitions
the display into layers.
The operands in this table algebra are the names of the
ordinal and quantitative fields of the database. We will use
A, B, and C to represent ordinal fields and P, Q, and R to
represent quantitative fields. We assign ordered sets to each
field symbol in the following manner: To ordinal fields, we
assign the members of the ordered domain of the field and,
to quantitative fields, we assign the single element set
containing the field name.
$$A = \mathrm{domain}(A) = \{a_1, \ldots, a_n\}$$
$$P = \{P\}.$$
This assignment of sets to symbols reflects the difference
in how the two types of fields will be encoded in the
structure of the tables. Ordinal fields will partition the table
into rows and columns, whereas quantitative fields will be
spatially encoded as axes within the panes.
A valid expression in our algebra is an ordered sequence
of one or more symbols with operators between each pair of
adjacent symbols and with parentheses used to alter the
precedence of the operators. The operators in the algebra
are cross (×), nest (/), and concatenation (+), listed in order
of precedence. The precise semantics of each operator is
defined in terms of its effects on the operand sets.
4.1.1 Concatenation
The concatenation operator performs an ordered union of
the sets of the two symbols:
$$
\begin{aligned}
A + B &= \{a_1, \ldots, a_n\} + \{b_1, \ldots, b_m\} = \{a_1, \ldots, a_n, b_1, \ldots, b_m\} \\
A + P &= \{a_1, \ldots, a_n\} + \{P\} = \{a_1, \ldots, a_n, P\} \\
P + Q &= \{P\} + \{Q\} = \{P, Q\}.
\end{aligned}
$$
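The ordered union can be sketched in Python by representing each operand set as an ordered list of tuples. This representation, the helper names `ordinal`, `quantitative`, and `concat`, and the choice to drop duplicate entries are assumptions made for illustration only.

```python
def ordinal(domain):
    """Set assigned to an ordinal field: one entry per domain value, in order."""
    return [(value,) for value in domain]


def quantitative(name):
    """Set assigned to a quantitative field: a single entry holding the field name."""
    return [(name,)]


def concat(left, right):
    """Ordered union: entries of `left` followed by entries of `right`.

    Dropping entries already present is an assumption of this sketch.
    """
    result = list(left)
    for entry in right:
        if entry not in result:
            result.append(entry)
    return result


A = ordinal(["a1", "a2"])
B = ordinal(["b1", "b2", "b3"])
P = quantitative("P")
Q = quantitative("Q")
```

With these definitions, `concat(A, P)` yields the entries of `A` followed by the single entry for `P`, matching the second line of the definition.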
4.1.2 Cross
The cross operator performs a Cartesian product of the sets
of the two symbols:
$$
\begin{aligned}
A \times B &= \{a_1, \ldots, a_n\} \times \{b_1, \ldots, b_m\} \\
&= \{a_1 b_1, \ldots, a_1 b_m,\; a_2 b_1, \ldots, a_2 b_m,\; \ldots,\; a_n b_1, \ldots, a_n b_m\} \\
A \times P &= \{a_1, \ldots, a_n\} \times \{P\} = \{a_1 P, \ldots, a_n P\}.
\end{aligned}
$$
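As a rough illustration of the cross operator, the following Python sketch again treats each operand set as an ordered list of tuples and joins every left entry with every right entry. The list-of-tuples representation and the function name `cross` are assumptions for this example.

```python
from itertools import product


def cross(left, right):
    """Cartesian product of two ordered sets of entries.

    Each entry is a tuple of symbols; the result keeps the left operand
    as the slower-varying component, so a1 b1, ..., a1 bm come first.
    """
    return [l + r for l, r in product(left, right)]


A = [("a1",), ("a2",)]
B = [("b1",), ("b2",), ("b3",)]
P = [("P",)]  # a quantitative field contributes a single entry

a_cross_b = cross(A, B)
a_cross_p = cross(A, P)
```

Here `a_cross_b` has `len(A) * len(B)` entries, while `a_cross_p` has one entry per value of `A`, each paired with the field name `P`.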
4.1.3 Nest
The nest operator is similar to the cross operator, but it only
creates set entries for which there exist records with those
domain values. If we define R to be the dataset being
analyzed, r to be a record, and A(r) to be the value of the
field A for the record r, then we can define the nest operator
as follows:
$$A / B = \{\, a_i b_j \mid \exists\, r \in R \ \text{s.t.}\ A(r) = a_i \ \&\ B(r) = b_j \,\}.$$
The intuitive interpretation of the nest operator is
B within A. For example, given the fields quarter and
month, the expression quarter/month would be interpreted as
those months within each quarter, resulting in three entries
for each quarter. In contrast, quarter × month would result in
12 entries for each quarter.
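The quarter and month example can be checked with a short Python sketch. The record layout (a list of dictionaries), the function name `nest`, and the sample dataset with one record per month are assumptions for illustration.

```python
def nest(field_a, field_b, domain_a, domain_b, records):
    """Return entries (a, b), in domain order, for which at least one
    record has field_a == a and field_b == b."""
    present = {(r[field_a], r[field_b]) for r in records}
    return [(a, b) for a in domain_a for b in domain_b if (a, b) in present]


quarters = ["Q1", "Q2", "Q3", "Q4"]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical dataset: one record per month, tagged with its quarter.
records = [{"quarter": quarters[i // 3], "month": m}
           for i, m in enumerate(months)]

nested = nest("quarter", "month", quarters, months, records)

# The cross product, for comparison, pairs every quarter with every month.
crossed = [(q, m) for q in quarters for m in months]
```

In this sketch, `nested` contains only the quarter and month pairs that occur in the data (three per quarter), whereas `crossed` contains all twelve months under every quarter.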
Using the above set semantics for each operator, every
expression in the algebra can be reduced to a single set,
with each entry in the set being an ordered concatenation of
zero or more ordinal values with zero or more quantitative
field names. We call this set evaluation of an expression the
normalized set form. The normalized set form of an
expression determines one axis of the table: The table axis
is partitioned into columns (or rows or layers) so that there
is a one-to-one correspondence between set entries in the
normalized set and columns. Fig. 2 illustrates the config-
urations resulting from a number of expressions.
Analysts can also combine multiple data sources in a
single Polaris visualization. When multiple data sources are
imported, each data source is mapped to a distinct layer (or
set of layers). While all data sources and all layers share the
same configuration for the x and y axes of the table, each
data source can have a different expression for partitioning
its data into layers. Fig. 3b and Fig. 3c illustrate the use of
layers to compose multiple distinct data sources into a
single visualization.
4.2 Types of Graphics
After the table configuration is specified, the next step is to
specify the type of graphic in each pane. One option, typical
of most charting and reporting tools, is to have the user
select a chart type from a predefined set of charts. Polaris
allows analysts to flexibly construct graphics by specifying
the individual components of the graphics. However, for
this approach to be effective, the specification must balance
flexibility with succinctness. We have developed a taxon-
omy of graphics that results in an intuitive and concise
specification of graphic types.
When specifying the table configuration, the user also
implicitly specifies the axes associated with each pane. We
have structured the space of graphics into three families by
the type of fields assigned to their axes:
. ordinal-ordinal,
. ordinal-quantitative,
. quantitative-quantitative.
Each family contains a number of variants depending on
how records are mapped to marks. For example, selecting a
bar in an ordinal-quantitative pane will result in a bar chart,
whereas selecting a line mark results in a line chart. The
mark set currently supported in Polaris includes the
rectangle, circle, glyph, text, Gantt bar, line, polygon, and
image.
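The classification of panes into families by the types of their axis fields can be sketched as follows. The function name `graphic_family`, the string labels for field types, and the `CHART_NAMES` lookup table are assumptions for this example; only the two chart names given in the text are included.

```python
FIELD_TYPES = ("ordinal", "quantitative")


def graphic_family(x_type, y_type):
    """Name the family of a pane from the types of its two axis fields.

    The order of the axes does not matter: a quantitative x axis with an
    ordinal y axis is still an ordinal-quantitative graphic.
    """
    for t in (x_type, y_type):
        if t not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {t!r}")
    first, second = sorted((x_type, y_type))
    return f"{first}-{second}"


# Variants within a family depend on the chosen mark (examples from the text).
CHART_NAMES = {
    ("ordinal-quantitative", "bar"): "bar chart",
    ("ordinal-quantitative", "line"): "line chart",
}

family = graphic_family("quantitative", "ordinal")
chart = CHART_NAMES[(family, "bar")]
```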
Following Cleveland [12], we further structure the space
of graphics by the number of independent and dependent
variables. For example, a graphic where both axes encode
independent variables is different than a graphic where one
axis encodes an independent variable and the other encodes
Fig. 2. The graphical interpretation of several expressions in the table algebra. Each expression in the table algebra can be reduced to a single set of
terms and that set can then be directly mapped into a configuration for an axis of the table.
a dependent variable (y = f(x)). By default, dimensions of
the database are interpreted as independent variables and
measures as dependent variables.
Finally, the precise form of the data transformations, in
particular, how records are grouped and whether aggre-
gates are formed, can affect the type of graphic. Some
graphic types best encode a single record, whereas others
can encode an arbitrary number of records.
We briefly discuss the defining characteristics of the
three families within our categorization.
4.2.1 Ordinal-Ordinal Graphics
The characteristic member of this family is the table, either
of numbers or of marks encoding attributes of the source
records.
In ordinal-ordinal graphics, the axis variables are
typically independent of each other and the task is focused
on understanding patterns and trends in some function
$f(O_x, O_y) \rightarrow R$, where $R$ represents the fields encoded in the
retinal properties of the marks. This can be seen in Fig. 3a,
where the analyst is studying sales and margin as a function
Fig. 3. (a) A standard Pivot Table except with graphical marks rather than textual numbers in each cell, showing the total sales in each state
organized by product type and month. By using a graphical mark with the total sales encoded as the size of the circle, trends may be easier to spot
than by scanning a table of numbers. (b) Gantt charts showing the correspondence between wars and major scientists, organized by country. Note
that this display is using two different data sources, one for the wars and one for the scientists, each corresponding to a different layer. Because both
data sources have the country and time dimensions, Polaris can be used to visually join the two data sets into a single display with common x and y
table axes. (c) Maps showing flights across the United States. This display has multiple data sources corresponding to different layers. The USA
data source contains data corresponding to the underlying state outlines. The Airport data source contains the coordinates of numerous airports in
the US. Finally, the Flights data source describes flights in the continental United States, including data about the region in which the flight
originated. When a field that only appears in a single data set (flights) is used to partition a layered display, the data from the other data sources
(airports and states) is replicated into each pane formed by the partitioning. Thus, each pane displays all states and airports, but only a subset of the
flights. (d) A small-multiple display of line charts overlaying dot plots. Each chart displays the profit and sales over time for a hypothetical coffee
chain, organized by state. The display is constructed using two layers, where each layer contains a copy of the same data set. The layers share the
same table structure, but use different marks and data transformations. As a result, the line chart and dot plot can be at different levels of detail. In
this case, each line chart shows average profit across all products per month, whereas the dot plot has an additional grouping transform specified
that results in separate dots displaying average profit per product per month.
of product type, month, and state for the items sold by a
hypothetical coffee chain. Fig. 7a presents another example
of an ordinal-ordinal graphic. In this figure, the analyst is
investigating the performance of a graphics rendering
library. The number of cache misses attributable to each
line of source code has been encoded in the color of the line.
The cardinality of the record set in each pane has little
effect on the overall structure of the table. When there is
more than one record per pane, multiple marks are
shown in each display, with a one-to-one correspondence
of mark to record. The marks are stacked in a specified
drawing order and the spatial placement of the marks
within the pane conveys no additional information about
the record's data.
4.2.2 Ordinal-Quantitative Graphics
Well-known representatives of this family of graphics are
the bar chart, possibly clustered or stacked, the dot plot, and
the Gantt chart.
In an ordinal-quantitative graphic, the quantitative
variable is often dependent on the ordinal variable and
the analyst is trying to understand or compare the properties
of some set of functions $f(O) \rightarrow Q$. Fig. 6c illustrates a
case where a matrix of bar charts is used to study several
functions of the independent variables product and month.
The cardinality of the record set does affect the structure of
the graphics in this family. When the cardinality of the
record set is one, the graphics are simple bar charts or dot
plots. When the cardinality is greater than one, additional
structure may be introduced to accommodate the additional
records (e.g., a stacked bar chart).
The ordinal and quantitative values may be independent
variables, such as in a Gantt chart. Here, each pane
represents all the events in a category; each event has a
type as well as a begin and end time. In Fig. 3b, major wars
over the last 500 years are displayed as Gantt charts,
categorized by country. An additional layer in that figure
displays pictures of major scientists plotted as a function of
the independent variables country of birth and date of birth.
Fig. 7c shows a table of Gantt charts, with each chart
displaying the thread scheduling and locking activity on a
CPU within a multiprocessor computer. To support Gantt
charts, we need to support intervals as an additional type of
field that exists in the meta-data only, allowing one field
name to map to a pair of columns in the database.
4.2.3 Quantitative-Quantitative Graphics
Graphics of this type are used to understand the distribu-
tion of data as a function of one or both quantitative
variables and to discover causal relationships between the
two quantitative variables. Fig. 6a illustrates a matrix of
scatterplot graphics used to understand the relationships
between a number of attributes of different products sold
by a coffee chain.
Fig. 3c illustrates another example of a quantitative-
quantitative graphic: the map. In this figure, the analyst is
studying how flight scheduling varies with the region of the
country in which the flight originated. Data about a number
of flights between major airports has been plotted as a
function of latitude and longitude; this data has been
composed with two other layers that plot the location of
airports and display the geography of each state as a
polygon.
It is extremely rare to use this type of graph with a
cardinality of one, not because it is not meaningful, but
because the density of information in such a graphic is
very low.
4.3 Visual Mappings
Each record in a pane is mapped to a mark. There are two
components to the visual mapping. The first component,
described in the previous section, determines the type of
graphic and mark. The second component encodes fields of
the records into visual or retinal properties of the selected
mark. The visual properties in Polaris are based on Bertin's
retinal variables [6]: shape, size, orientation, color (value
and hue), and texture (not supported in the current version
of Polaris).
Allowing analysts to explicitly encode different fields of
the data to retinal properties of the display greatly enhances
the data density and the variety of displays that can be
generated. However, in order to keep the specification
succinct, analysts should not be required to construct the
mappings. Instead, they should be able to simply specify
that a field be encoded as a visual property. The system
should then generate an effective mapping from the domain
of the field to the range of the visual property. These
mappings are generated in a manner similar to other
visualization systems. We discuss how this is done for the
purpose of completeness. The default mappings are illu-
strated in Fig. 4.
Shape: Polaris uses the set of shapes recommended by
Cleveland for encoding ordinal data [11]. We have extended
Fig. 4. The different retinal properties that can be used to encode fields of the data and examples of the default mappings that are generated when a
given type of data field is encoded in each of the retinal properties.
this set of shapes to include several additional shapes to
allow a larger domain of values to be encoded.
Size: Analysts can use size to encode either an ordinal or
quantitative field. When encoding a quantitative domain as
size, a linear map from the field domain to the area of the
mark is created. The minimum size is chosen so that all
visual properties of a mark with the minimum size can be
perceived [23]. If an ordinal field is encoded as size, the
domain needs to be small, at most four or five values, so
that the analyst can discriminate between different cate-
gories [6].
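A linear map from a quantitative domain to mark area might look like the Python sketch below. The function names, the area bounds, and the handling of a degenerate domain are assumptions for illustration; the text specifies only that the map is linear in area and that a perceptible minimum size is used.

```python
import math


def size_scale(domain_min, domain_max, min_area, max_area):
    """Build a function mapping a field value linearly to a mark area.

    `min_area` stands in for the smallest size at which the other visual
    properties of a mark remain perceptible (its value is an assumption).
    """
    def area(value):
        if domain_max == domain_min:
            return min_area  # degenerate domain: every mark gets the minimum size
        t = (value - domain_min) / (domain_max - domain_min)
        return min_area + t * (max_area - min_area)
    return area


def circle_radius(area):
    """Radius of a circle mark with the given area."""
    return math.sqrt(area / math.pi)


scale = size_scale(0, 100, min_area=4.0, max_area=104.0)
radius_mid = circle_radius(scale(50))
```

Mapping to area rather than radius means that equal steps in the data produce equal steps in the drawn area of a circle mark, so the radius grows with the square root of the area.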
Orientation: A key principle in generating mappings of
ordinal fields to orientation is that the orientation needs to
vary by at least 30