
Querying and Creating Visualizations by Analogy

A query-by-example interface and a mechanism for semi-automatically creating visualizations by analogy are proposed. Provenance metadata collected during the creation of pipelines can be reused to suggest similar content in related visualizations. The framework helps both expert and non-expert users perform data exploration through visualization. The authors describe an implementation of these techniques in VisTrails, a publicly available, open-source system.

Uploaded by

vthung
Copyright
© Attribution Non-Commercial (BY-NC)

1560 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 13, NO. 6, NOVEMBER/DECEMBER 2007

Querying and Creating Visualizations by Analogy


Carlos E. Scheidegger, Huy T. Vo, David Koop, Juliana Freire, Member, IEEE, and Cláudio T. Silva, Member, IEEE

Abstract— While there have been advances in visualization systems, particularly in multi-view visualizations and visual exploration, the
process of building visualizations remains a major bottleneck in data exploration. We show that provenance metadata collected during
the creation of pipelines can be reused to suggest similar content in related visualizations and guide semi-automated changes. We
introduce the idea of query-by-example in the context of an ensemble of visualizations, and the use of analogies as first-class operations
in a system to guide scalable interactions. We describe an implementation of these techniques in VisTrails, a publicly-available,
open-source system.
Index Terms—visualization systems, query-by-example, analogy

1 INTRODUCTION

Over the last 20 years, visualization research has emerged as an effective means to help scientists, engineers, and other professionals extract insight from raw data. Visualization techniques are key to understanding complex phenomena, and the field has grown into a mature area with an established research agenda [23]. Software systems have been developed that provide flexible frameworks for creating complex visualizations. These systems can be broadly classified as turnkey applications (e.g., ParaView, VisIt, Amira) [15, 5, 22] and dataflow-based systems (e.g., VTK, SCIRun, AVS, OpenDX) [27, 24, 11, 29]. In this paper, we focus on dataflow systems, since they are more general and often serve as the foundation of turnkey applications (e.g., both ParaView and VisIt are based on VTK).

Most dataflow-based systems have sophisticated user interfaces with visual programming capabilities that ease the creation of visualizations. Nonetheless, the path from "data to insight" requires a laborious, trial-and-error process, where users successively assemble, modify, and execute pipelines [30]. In the course of exploratory studies, users often build large collections of visualizations, each of which helps in the understanding of a different aspect of their data. A scientist working on a new computational fluid dynamics application might need a collection of visualizations such as 3-D isosurface plots, 2-D plots with relevant quantitative information, and some direct volume rendering images. Although in general each of these visualizations is implemented in a separate dataflow, they have a certain amount of overlap (e.g., they may manipulate the same input data sets). Furthermore, for a particular class of visualizations, the scientists might generate several different versions of each individual dataflow while fine-tuning visualization parameters or experimenting with different data sets.

In previous work, we proposed a new provenance model that uniformly captures changes to pipeline and parameter values during the course of data exploration [1, 4]. We showed that this detailed history information, combined with a multi-view visualization interface, simplifies the exploration process. It allows users to navigate through a large number of visualizations, giving them the ability to return to previous versions of a visualization, compare different pipelines and their results, and resume explorations where they left off.

In this paper, we show how this provenance information can also be used to simplify and partially automate the construction of new visualizations. Constructing insightful visualizations is a process that requires expertise in both visualization techniques and the domain of the data being explored. We propose a new framework that enables the effective reuse of this knowledge to aid both expert and non-expert users in performing data exploration through visualization.

The framework consists of two key components: an intuitive interface for querying dataflows and a novel mechanism for semi-automatically creating visualizations by analogy. The query interface supports both simple keyword-based and selection queries (e.g., find visualizations created by some user), as well as complex, structure-based queries (e.g., find visualizations that apply simplification before an isosurface computation for irregular grid data sets). The query engine is exposed to the user through an intuitive query-by-example interface whereby users query dataflows through the same familiar interface they use to create the dataflows (see Figure 1). This simple, yet powerful approach lets users easily search through a large number of visualizations and identify pipelines that satisfy user-defined criteria.

While the query interface allows users to identify pipelines (and sub-pipelines) that are relevant for a particular task, the visualization by analogy component provides a mechanism for reusing these pipelines to construct new visualizations in a semi-automated manner, without requiring users to directly manipulate or edit the dataflow specifications. As Figure 2 illustrates, our technique works by determining the difference between a source pair of analogous visualizations, and transferring this difference to a third visualization. This forms the basis for scalable updates: the user does not need to have knowledge of the exact details of the three visualization dataflows to perform the operation. Together, these contributions are a step towards scalable pipeline development and refinement as an integral part of visualization systems.

Contributions and Outline. To the best of our knowledge, this is the first work that leverages provenance information to simplify and automate the construction of new visualizations. The paper is organized as follows. We review related work in Section 2. In Section 3, we define a set of basic operations over sets of dataflows. These operations include computing the difference between two pipelines, updating pipeline definitions, and matching similar pipelines. For the latter, we describe a new algorithm based on neighborhood similarities (Section 5.3). The basic operations are used both in the query-by-example interface and in creating visualizations by analogy, which are presented in Section 4. An implementation of the proposed framework is discussed in Section 5. In Section 6, we present case studies that illustrate how our new dataflow manipulations streamline the process of constructing visualizations, and provide scalable mechanisms for exploring a large number of visualizations. We discuss the potential impact of our work on existing visualization systems in Section 7. We conclude in Section 8 where we outline directions for future work.

• Carlos E. Scheidegger, Huy T. Vo, David Koop and Cláudio T. Silva are with the Scientific Computing and Imaging (SCI) Institute at the University of Utah. email: {cscheid, hvo, dakoop, csilva}@sci.utah.edu.
• Juliana Freire is with the School of Computing at the University of Utah. email: {juliana}@cs.utah.edu
Manuscript received 31 March 2007; accepted 1 August 2007; posted online 27 October 2007. Published 14 September 2007.
For information on obtaining reprints of this article, please send e-mail to: [email protected].
1077-2626/07/$25.00 © 2007 IEEE Published by the IEEE Computer Society
Authorized licensed use limited to: IEEE Xplore. Downloaded on April 5, 2009 at 02:36 from IEEE Xplore. Restrictions apply.

2 RELATED WORK

Visualization systems have been quite successful at bringing visualization to a greater audience. Seminal systems such as AVS Explorer and Data Explorer [29, 11] enabled domain scientists to create visualizations with minimal training and effort. The early success of these systems led to the development of several alternative approaches. SCIRun [24] focuses on computational steering: the intentional placement of visualization and human intervention in the process of generating simulations. The Visualization Toolkit [27] is a library that directly exposes a powerful dataflow API for several programming languages.

Fig. 1. Querying by example. The interface for building a query over an ensemble of pipelines is essentially the same as the one for constructing and updating pipelines. In fact, they work together: portions of a pipeline can become query templates by directly pasting them onto the Query Canvas. In this figure, the user is looking for a volume rendered image of a file whose name contains the string "4877". The system highlights the matches both at the visualization level (version tree, shown in the middle) and at the module level (shown in the right insets).

However, as scientific visualization becomes more widely used, several scalability issues have arisen, which range from ensuring good performance, handling large amounts of data, capturing provenance, and providing interfaces to interact with a large number of visualizations. Distributed, parallel systems [5, 3] have been developed to address performance and dataset size concerns. Such systems provide a scalable architecture for creating and running visualization pipelines with large data.

Another important requirement that has come to the attention of developers and users of visualization systems is the ability to record provenance so that computational experiments can be reproduced and validated. Provenance-aware scientific workflow systems have been developed that record provenance both for data products (i.e., how a given result was generated) and for the exploratory process, the sequence of steps followed to design and refine the workflows used to process the data [25, 26, 4]. Provenance mechanisms have also been proposed for visualization-specific systems. Kreuseler et al. [16] proposed a history mechanism for keeping track of parameter values in visual data mining, and Jankun-Kelly et al. [13] recently proposed a formal calculus for parameter changes. VisTrails uses a scheme that uniformly captures both parameter and pipeline changes [1, 4].

As multiple workflows are manipulated in exploratory processes, it is important to provide interfaces that allow users to compare their results. Jankun-Kelly and Ma [12] have proposed a spreadsheet-like interface for quickly exploring the parameter space of a visualization. In the area of user interfaces, Kurlander et al. [18, 17] have presented approaches to streamline the repetitive tasks users often face. The seminal example of a system which uses the same interface to both manipulate and query data is the Query-By-Example database language [31]. We propose a similar approach for querying visualization ensembles in Section 4.2. Graph searching and query languages have also been investigated in database systems [28].

The algorithm we describe for matching two pipelines is similar to a technique developed to match database schemas [21]. It is also reminiscent of well-known variations of PageRank, which is the basis for Google's successful ranking algorithm [2, 19]. Our visualization-by-analogy mechanism shares some of the objectives of programming-by-example techniques [20].

3 PIPELINE OPERATIONS

Below, we review some terminology and introduce basic pipeline operations that serve as the basis for query-by-example and visualization by analogy.

Definitions. A visualization system is a system that provides functionality for graphically displaying data according to a specific set of rules. The programmatic rules for displaying this data constitute a pipeline. Executing the pipeline in the visualization system produces a visualization. The pipeline is composed of modules, which define specific operations, and connections, which specify the conceptual flow of data between modules. Each connection links an output port of one module (the source) with an input port of another module (the destination). Module state is represented by module parameters. We denote the set of all visualization pipelines as $V$.

Operations as functions on $V$. One important observation that we leverage throughout the text is that every operation performed on a pipeline (adding and deleting modules, connections and parameters, etc.) can be directly expressed as a (potentially partial) function $f : V \to V$. Many of our results depend on making these functions first-class elements in the visualization system.

3.1 Computing Pipeline Differences

Dataflow-based systems allow users to create a variety of pipelines, rather than being restricted to a predefined set of visualizations. In the process of deriving insightful visualizations, a series of pipelines is often created by iterative refinement. To understand this process as well as the derived visualizations, it is useful to compare the different pipelines. The standard representation of a pipeline is a directed graph, with labeled vertices representing operations. Given a pair of such pipelines, we want to determine the difference between the visualizations they generate. In the following, we show how to describe and manipulate differences between pipelines.

We define $\delta : V \to V$ as a function on the space of visualizations, and $\Delta : V \times V \to (V \to V)$ as a function that takes two pipelines $p_a$ and $p_b$ and produces another function $\delta$ that will transform $p_a$ to $p_b$. For brevity, let $\delta_{ab} = \Delta(p_a, p_b)$. From now on, we will use $\delta$ to refer to an arbitrary $\Delta(p_a, p_b)$. It is clear that $\delta$ is not unique: even though $\delta_{ab}(p_a) = p_b$ is a necessary constraint, there are no further restrictions. In some sense, we would like to pick the $\delta_{ab}$ that minimally changes all other pipelines. We define the distance between $p_a$ and $p_b$ as the number of changes necessary to perform the transformation. We then look for the minimal set of operations that takes $p_a$ to $p_b$. As we discuss in Section 4.1, this is computationally impractical, so we relax the minimality requirement and instead use heuristics.
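As an illustration of the definitions above, the following minimal Python sketch models a pipeline as a set of labeled modules plus a set of connections, and each action as a partial function on pipelines. The class and function names, the module labels, and the use of `None` to mean "undefined on this pipeline" are assumptions made for this sketch; they are not taken from VisTrails.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Pipeline:
    """A pipeline as a directed graph: labeled modules plus connections."""
    modules: frozenset = frozenset()      # {(module_id, label)}
    connections: frozenset = frozenset()  # {(source_id, destination_id)}

    def ids(self):
        return {mid for mid, _ in self.modules}


# An action is a (possibly partial) function V -> V.
# Returning None means the action is undefined on the given pipeline.
Action = Callable[[Pipeline], Optional[Pipeline]]


def add_module(mid: int, label: str) -> Action:
    def f(p: Pipeline) -> Optional[Pipeline]:
        if mid in p.ids():
            return None  # id already in use
        return Pipeline(p.modules | {(mid, label)}, p.connections)
    return f


def add_connection(src: int, dst: int) -> Action:
    def f(p: Pipeline) -> Optional[Pipeline]:
        if not {src, dst} <= p.ids():
            return None  # both endpoints must already exist
        return Pipeline(p.modules, p.connections | {(src, dst)})
    return f


def apply_delta(delta: Sequence[Action], p: Pipeline) -> Optional[Pipeline]:
    """Apply a difference, given as a sequence of actions, in order."""
    for f in delta:
        p = f(p)
        if p is None:
            return None
    return p


# p_a: a reader feeding a renderer (hypothetical labels).
p_a = apply_delta(
    [add_module(1, "Reader"), add_module(2, "Renderer"), add_connection(1, 2)],
    Pipeline(),
)
# delta_ab: the actions that derive p_b from p_a.
delta_ab = [add_module(3, "Smooth"), add_connection(1, 3), add_connection(3, 2)]
p_b = apply_delta(delta_ab, p_a)
# Length of this particular sequence: an upper bound on the distance.
distance = len(delta_ab)
```

The same `delta_ab` is undefined on the empty pipeline, because its connection actions refer to modules that do not exist there; this is the sense in which the functions are partial.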


Fig. 2. Visualization by analogy. The user chooses a pair of visualizations to serve as an analogy template. In this case, the pair represents a
change where a file downloaded from the WWW is smoothed. Then, the user chooses a set of other visualizations that will be used to derive new
visualizations, with the same change. These new visualizations are derived automatically. The pipeline on the left reflects the original changes, and
the one on the right reflects the changes when translated to the last visualization on the right. The pipeline pieces to be removed are portrayed in
orange, and the ones to be added, in blue. Note that the surrounding modules do not match exactly: the system figures out the most likely match.

We also restrict our initial analysis to the simple case where $p_b$ is derived from $p_a$, that is, the user created $p_b$ by applying a finite set of changes to $p_a$. We denote this relationship as $p_a < p_b$. Then, a system with some knowledge of how the pipelines were constructed should be able to determine the differences between related pipelines using this history. We demonstrate such an implementation in Section 5.

When $p_a < p_b$, we can then say $\delta_{ab}$ is the sequence of operations that was used to derive $p_b$ from $p_a$. However, few pairs of pipelines respect this property, and we would like $\Delta$ to be completely general. We start with a simple extension: if $\delta_{ab}$ exists, so should $\delta_{ba}$. In fact, we would like

$$\delta_{ab}\,\delta_{ba} = e$$

where $e$ is the identity function. We can achieve this if our sequence of changes consists of invertible atomic operations. Specifically, suppose $\delta_{ab} = f_n \circ \cdots \circ f_1$ where each $f_i$ has a well-defined inverse. For example, if $f_i$ is the operation of adding a module to the pipeline, $f_i^{-1}$ is the operation of deleting that module from the pipeline. Then,

$$\delta_{ba} = \delta_{ab}^{-1} = f_1^{-1} \circ \cdots \circ f_n^{-1}$$

From now on, we assume that any function that operates on $V$ has an inverse (note that both functions might still be partial).

Our ultimate goal is to apply the pipeline difference result $\delta$ to pipelines other than those used to create it. To analyze where $\delta$ is applicable, we introduce the domain and range context of $\delta$. Formally, the domain context of $\delta$, $D(\delta)$, is the set of all pipeline primitives required to exist for $\delta$ to be applicable. We represent these contexts as sets of identifiers. For example, if $\delta$ is a function that changes the file name parameter of a module with id 32, $D(\delta)$ is the set containing the module with id 32. Similarly, the range context of $\delta$, $R(\delta)$, is the set of all pipeline primitives that were added or modified by $\delta$. Note that $D(\delta^{-1}) = R(\delta)$, which provides an easy way to compute range contexts.

3.2 Updating Pipelines

Finding differences is not only a useful technique for analyzing pipelines, but it can also be used to create new visualizations. The idea is similar to applying patches in software systems: the difference results can be applied to modify an existing pipeline. Given a $\delta$, it is straightforward to apply it to a pipeline. Recall that $\delta$ is a sequence of actions that transform a pipeline. Thus, updating a pipeline $p_a$ is as simple as computing $\delta(p_a)$. Note, however, that $\delta$ can fail if an element of $D(\delta)$ does not exist in $p_a$. If we allow $\delta$ to continue despite one or more operations failing, we can still achieve a partial update.

3.3 Matching Pipelines

While computing pipeline differences is an integral part of reasoning about multiple visualizations, another important operation is to match similar pipelines, i.e., we wish to find correspondences between pipelines. The result of pipeline matching can either be a binary decision (whether the pipelines match) or a mapping between the two inputs. Note that different metrics and thresholds can be used to determine the similarity of two pipelines. In the remainder of this section, we discuss an approach for finding the best mapping between two pipelines.

Let $D$ represent the set of all domain contexts and define $\mathrm{map} : V \times V \to (D \to D)$ as a function which takes two pipelines, $p_a$ and $p_b$, as input and produces a (partial) map from the domain context of $p_a$ to the domain context of $p_b$. The map may be partial in cases where elements of $p_a$ do not have a match in $p_b$ or vice versa. Notice that if $p_a < p_b$, $\mathrm{map}(p_a, p_b) = \mathrm{map}_{ab}$ is the identity on all elements that were not added or deleted in the process of deriving $p_b$.

To construct such a mapping, we formulate the problem as a weighted graph matching problem. Let $G_a = (V_a, E_a)$ be the graph corresponding to the pipeline $p_a$. In a straightforward definition, $V_a$ would be the modules in $p_a$ and $E_a$ the connections in $p_a$. However, one could consider other definitions such as the dual of this representation. For the vertices, we define a scoring function $s : V_a \times V_b \to [0.0, 1.0]$ that defines the compatibility between vertices. For example, the similarity score of two modules that are exactly the same can be set to 1.0 and the score of modules $M_1$ and $M_2$ such that $M_1$ is a subclass of $M_2$ may be set to 0.6.

We define a matching between $G_a$ and $G_b$ as a set of pairs of vertices $M = \{(v_a, v_b)\}$ where $v_a \in V_a$ and $v_b \in V_b$. A matching is good when

$$\sum_{(v_a, v_b) \in M} s(v_a, v_b)$$

is maximized. A good matching on pipelines is one that corresponds to a good matching on their representative graphs. Given a good matching $M$, we can define a mapping from $p_a$ to $p_b$ as $v_a \to v_b$ for all $(v_a, v_b) \in M$.

4 SCALABLE PIPELINE MANIPULATION PRIMITIVES

Using the concepts defined in the previous section, we now introduce two new primitives for manipulating visualizations.

4.1 Complexity Analysis

The operations described in this section are theoretically hard to compute. Computing a minimal $\Delta(p_a, p_b)$, or matching two pipelines $p_a$ and $p_b$, is, in general, as hard as solving a subgraph isomorphism problem.
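To make the cost of exact matching concrete, the matching objective of Section 3.3 can be written as an exhaustive search. The Python sketch below is only an illustration under stated assumptions: it considers vertex scores only and ignores connections, it restricts matchings to be one-to-one, and the module labels and the `PARENT` subclass table are hypothetical. The scores 1.0 and 0.6 follow the example values given in Section 3.3.

```python
from itertools import permutations

# Hypothetical subclass table: GaussianSmooth is a subclass of Smooth.
PARENT = {"GaussianSmooth": "Smooth"}


def score(a: str, b: str) -> float:
    """Toy compatibility s(v_a, v_b) between two module labels."""
    if a == b:
        return 1.0
    if PARENT.get(a) == b or PARENT.get(b) == a:
        return 0.6  # one label is a subclass of the other (treated symmetrically here)
    return 0.0


def best_matching(va, vb, s=score):
    """Try every one-to-one assignment of va into vb (requires len(va) <= len(vb))."""
    best, best_total = [], -1.0
    for perm in permutations(range(len(vb)), len(va)):
        total = sum(s(va[i], vb[j]) for i, j in enumerate(perm))
        if total > best_total:
            best, best_total = list(enumerate(perm)), total
    return best, best_total


va = ["Reader", "GaussianSmooth", "Renderer"]
vb = ["Renderer", "Smooth", "Reader", "Isosurface"]
matching, total = best_matching(va, vb)
```

Even in this simplified form the loop visits every ordered selection of `len(va)` vertices from `vb`, so the number of candidates grows factorially with pipeline size, which is consistent with the paper's decision to rely on heuristics.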


Fig. 3. Query-by-example and analogy-based updates provide a simple way for users to manipulate multiple pipelines simultaneously. In this
example, the user selects parts of a query result and updates them all with an analogy that introduces a preprocessing module to the pipeline that
supersamples the original dataset.
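The analogy-based update mentioned in the caption follows the four steps listed in Section 4.3: compute the difference, compute the map, map the difference, and apply it. The Python sketch below walks through those steps on toy data. It is a simplified illustration, not the VisTrails implementation: pipelines are plain dictionaries, actions are tuples, the map is built by exact label equality (the paper uses a similarity-based matching instead), and all labels and ids are hypothetical.

```python
def match_by_label(p_a, p_c):
    """Toy map_ac: pair modules of p_a and p_c that carry the same label."""
    by_label = {label: mid for mid, label in p_c["modules"].items()}
    return {mid: by_label[label]
            for mid, label in p_a["modules"].items() if label in by_label}


def remap(delta, mapping, p_c):
    """Rewrite the ids used in delta_ab so that they refer to modules of p_c."""
    mapping = dict(mapping)
    next_id = max(p_c["modules"], default=0) + 1
    out = []
    for op, *args in delta:
        if op == "add_module":
            mid, label = args
            mapping[mid] = next_id  # modules created by delta get fresh ids in p_c
            out.append((op, next_id, label))
            next_id += 1
        else:  # connection actions
            src, dst = args
            if src in mapping and dst in mapping:
                out.append((op, mapping[src], mapping[dst]))
            # otherwise the action is dropped, giving a partial update
    return out


def apply_delta(delta, p):
    modules, conns = dict(p["modules"]), set(p["connections"])
    for op, *args in delta:
        if op == "add_module":
            modules[args[0]] = args[1]
        elif op == "add_connection":
            conns.add(tuple(args))
        elif op == "delete_connection":
            conns.discard(tuple(args))
    return {"modules": modules, "connections": conns}


p_a = {"modules": {1: "Reader", 2: "Renderer"}, "connections": {(1, 2)}}
# Step 1: delta_ab, here given directly as the recorded actions from p_a to p_b.
delta_ab = [("delete_connection", 1, 2), ("add_module", 3, "Supersample"),
            ("add_connection", 1, 3), ("add_connection", 3, 2)]
p_c = {"modules": {20: "Reader", 21: "Renderer", 22: "Colormap"},
       "connections": {(20, 21), (22, 21)}}

map_ac = match_by_label(p_a, p_c)         # Step 2: map from p_a to p_c
delta_cb = remap(delta_ab, map_ac, p_c)   # Step 3: mapped difference
p_d = apply_delta(delta_cb, p_c)          # Step 4: the new pipeline
```

In this example the supersampling module is inserted between the reader and the renderer of `p_c`, while the unrelated colormap connection is left untouched.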

The subgraph isomorphism problem is trivially reducible from the MAX-CLIQUE problem, a well-known NP-complete problem. Additionally, MAX-CLIQUE is a particularly hard problem: there is no approximation algorithm for it with a subpolynomial approximation factor [9]. Since we cannot get a good approximation, heuristics for both problems are well justified. In this work, we make use of the information stored in $\delta$ functions both to reduce the search space and to increase the effectiveness of these heuristics.

4.2 Query-By-Example

Working with multiple visualizations is problematic if they have to be treated individually. In the process of visualizing different data sets, trying different techniques, and tweaking parameters, a user may create a large number of visualizations. It is clearly impractical to locate those that match certain criteria by examining each individually. To solve the problem of locating pipelines, we introduce the idea of query-by-example for visualizations. Instead of formulating the search criteria in a structured language, a user builds a pipeline fragment that contains the desired features. The exact same interface used in building a pipeline can be used for building a query, which means that a user familiar with building pipelines can easily query them. Figure 1 shows an excerpt of the query-by-example functionality.

Our algorithm is based on the observation that searching all pipelines for a given pattern is equivalent to determining whether a candidate pipeline matches the pattern. Once a query, represented as a pipeline fragment, is constructed, we can use the pipeline matching algorithm on each candidate pipeline to determine if it satisfies the query. Depending on user preferences, we can require an exact or an approximate match. While each element of the query pipeline (modules, connections, parameters, etc.) needs to be included in the match, a candidate pipeline that contains more elements than those in the query pipeline still satisfies the query.

It should be noted that differences can help optimize our matching. For example, suppose that we have a given query pipeline $p_q$ and two candidate pipelines $p_a$ and $p_b$. If we find that $p_a$ satisfies the query, and we know $\delta_{ab}$, we can check to see if the domain context $D(\delta_{ab})$ contains any elements that matched $p_q$. If it does not, we know that $p_b$ also matches. Similarly, if $p_a$ does not match $p_q$ and $R(\delta_{ab})$ does not contain necessary elements for matching $p_q$, we know that $p_b$ will not satisfy the query. Thus, we can determine all pipelines that satisfy our query by iteratively matching and updating the matches based on differences.

4.3 Visualization by Analogy

When creating visualizations, users often have to integrate new features into existing pipelines. For example, a user may wish to improve a given visualization by adjusting parameters so they match a published result. The user might also simply want to switch to a different visualization algorithm. In either case, there usually exists an example that demonstrates the given technique. A user can infer the necessary changes, and then apply them to a particular visualization. This analogical reasoning is very powerful, and we show that visualization by analogy can be (partly) automated. Figure 2 illustrates the process of creating visualizations by analogy.

Two ordered pairs are analogous if the relationship between the first pair mirrors the relationship between the second pair. Therefore, if we know what the relationship is between the first pair, and are given the first entity of the second pair, we should be able to determine the other entity of that pair. More concretely, given a difference $\delta$ between two pipelines, we should be able to modify an arbitrary pipeline so that the resulting changes mirror $\delta$.

To automate this operation, we need to compute the difference between two pipelines and apply this difference to another (possibly unrelated) pipeline. Suppose that we have three pipelines $p_a$, $p_b$, $p_c$, and wish to compute $p_d$ so that $p_a$ is to $p_b$ as $p_c$ is to $p_d$. We discussed the problem of finding the difference in Section 3.1, but recall that updating a pipeline $p_c$ with an arbitrary $\delta$ will fail if $p_c$ does not contain the domain context of $\delta$. When this is the case, we need to map the difference so that it can be applied to $p_c$.

We wish to express $\delta_{ab}$ so that $\delta_{ab}(p_c)$ succeeds. This is exactly what $\mathrm{map}_{ac}$ does; recall that to construct this operator we need to find a match between $p_a$ and $p_c$, as described in Section 3.3. More precisely, we first compute $\delta^*_{cb} = \mathrm{map}_{ac}(\delta_{ab})$, with $\delta_{ab} = \Delta(p_a, p_b)$, and then find $\delta^*_{cb}(p_c)$.

In summary, our algorithm is:

1. Compute the difference: $\delta_{ab} = \Delta(p_a, p_b)$
2. Compute the map: $\mathrm{map}_{ac} = \mathrm{map}(p_a, p_c)$
3. Compute the mapped difference: $\delta^*_{cb} = \mathrm{map}_{ac}(\delta_{ab})$
4. Compute $p_d = \delta^*_{cb}(p_c)$

5 IMPLEMENTATION

To implement the scalable manipulation primitives introduced in Section 4, we use the freely available VisTrails system. VisTrails automatically captures the evolution of workflows, which allows straightforward implementations of the pipeline operations presented in Section 3. We provide a quick overview of some key concepts in VisTrails; the reader is referred to [1, 4] for more details.

In VisTrails, as a user constructs a visualization, the entire history of manipulation is transparently stored in the version tree (the term vistrail is used interchangeably). Each action $f$ that modifies the pipeline (e.g., adding or deleting a module, connecting modules, or changing a parameter) is represented explicitly as a function $f : V \to V$, where $V$ is the space of all possible visualizations. A pipeline is then the composition of these functions and is materialized by applying the resulting function to the empty visualization.

Fig. 4. Example of an analogy between pipelines where there is no perfect module matching. The difference in the left pipeline pair is transferred to the right pipeline pair. Note, however, that the module names are not the same; the system must find the most likely pairing based on the similarity measure described in Section 5.3.

Fig. 5. Example matching generated by the pipeline matching algorithm. Thicker edges correspond to stronger correspondences. Notice that the correspondences get progressively better as the algorithm iterates. This matching corresponds to Example 2 in Section 6.

5.1 Pipeline Differences

In a vistrail, the straightforward application of the action-based formalism allows the computation of simple differences. When $p_a < p_b$, $\Delta(p_a, p_b)$ is the sequence of actions from $p_a$ to $p_b$, which can be read directly from the vistrail. In addition, we have implemented the inverse operation of $f$ for each type of operation in VisTrails, so $\delta_{ba}$ is also easily constructed. However, it is likely that we wish to compute a difference between pipelines that are not related in such a simple manner. Specifically, suppose that neither $p_a < p_b$ nor $p_b < p_a$ holds. Note that there exists some $p_c$ (possibly the empty pipeline, which is in general the least common ancestor of both $p_a$ and $p_b$) such that $p_c < p_a$ and $p_c < p_b$. Then,

$$\delta_{ab} = \delta_{ac}\,\delta_{cb} = \delta_{ca}^{-1}\,\delta_{cb}$$

Here the product is read left to right: $\delta_{ca}^{-1}$ takes $p_a$ back to $p_c$, and $\delta_{cb}$ then takes $p_c$ to $p_b$. Thus, we can find $\Delta(p_i, p_j)$ for any two pipelines, even if they are not directly related.

5.2 Pipeline Updates

Pipeline updates are also easily computed by taking the action-based representation of a pipeline and appending the new actions given by a $\delta$. Although it is always possible to append the new actions in a vistrail, the resulting sequence of actions may be invalid. As noted earlier, this update can fail if the domain context of $\delta$ does not match $p_a$. More specifically, each operation in $\delta$ can succeed or fail based on whether the elements to be modified or deleted exist in $p_a$.

5.3 Pipeline Matching

In our matching algorithm, we use the standard graph representation where vertices correspond to modules and edges to connections. In addition, even though we still discriminate between input and output ports, we do not enforce directionality on the edges so that we can diffuse similarity along them.

Recall that our goal in pipeline matching is to determine a mapping from the context of one pipeline to another. To do so, we convert the pipelines to labeled graphs and define a scoring function for nodes based on their labels. With a graph for each pipeline, we compute the mapping by pairing nodes that score well and enforcing connectivity constraints between these pairs.

Let $G_a$ and $G_b$ be the graphs corresponding to $p_a$ and $p_b$. For our implementation, we define modules as vertices and connections as edges. Denote a connection between two vertices $a$ and $b$ as $a \sim b$ and define the scoring function that measures the pairwise compatibility of vertices by

$$c(v_a, v_b) = \frac{|\mathrm{ports}(v_a) \cap \mathrm{ports}(v_b)|}{\ldots}$$

where $\mathrm{ports}(v)$ denotes the ports of the module corresponding to the vertex $v$. This measure emphasizes port matching: it gives higher scores to modules that can be more easily substituted for each other. Such a substitution depends solely on the compatibility of the input and output ports and not on module name or functionality. Figure 4 shows an example of such an approximate matching.

Notice that this scoring function is defined only for nodes, and therefore, it does not help us in comparing the topologies of the pipelines. While a simple maximum bipartite matching [6] between nodes may succeed in finding a map between nodes, we would like to enforce some connectivity constraints on the graphs. Intuitively, we want to define the similarity between vertices as a weighted average between how compatible the modules are and how similar their neighborhoods are. The similarity score strikes a balance between the locality of pairwise compatibility and the overall similarity of the neighborhood. This definition seems circular, but, surprisingly, it leads to a very simple and elegant matching technique based on the dominant eigenvector of a Markov chain [19].

We create a graph $G = G_a \times G_b$ that combines both $G_a$ and $G_b$. In this graph, we define a vertex $v_{a,b}$ for each pair of vertices $v_a \in V_a$, $v_b \in V_b$. Similarly, an edge $v_{i,j} \sim v_{k,l}$ exists when $v_i \sim v_k$ in $G_a$ and $v_j \sim v_l$ in $G_b$. ($G$ is the graph categorical product of $G_a$ and $G_b$.) Notice that the connectivity of $G$ encodes the pairwise neighborhoods of the vertices in $G_a$ and $G_b$. We now want to translate our intuitive algorithm from the previous paragraph into an iterative algorithm. First, we need the following notation:

• $\pi_k(G)$ is the measure of pairwise similarity after $k$ steps
• $A(G)$ is the adjacency matrix of $G$ normalized so that the sum of each row is one (a row with sum zero is modified to be uniformly distributed)
• $c(G)$ is the normalized vector whose elements are the scores for the paired vertices in $G$: $c(G) = (c(v_a, v_b),\ v_a \in G_a,\ v_b \in G_b)$
• $\alpha$ is a user-defined parameter that determines the trade-off between pairwise scoring and connectivity

To iteratively refine our estimate, we diffuse the neighborhood similarity according to the following formula:

$$\pi_{k+1} = \alpha A(G)\,\pi_k + (1 - \alpha)\,c(G) = M_G\,\pi_k \qquad (1)$$

The final pairwise similarity between modules is given by $\pi_\infty = \lim_{k \to \infty} \pi_k$. For our purposes, $c(G)$ gives a good measure of similarity, so $A(G)$ is used mainly to break ties between two alternatives. Thus, we choose a small weight for the neighborhood in our implementation ($\alpha = 0.15$). Though this formulation makes intuitive sense, we want to ensure that repeated iteration always converges and does so quickly.

It is clear that $M_G$ in Equation 1 is a linear operator; therefore, if $\pi$ converges, it does so to an eigenvector. The theory of Markov chains tells us that because of the special structure of $M_G$, it has spectrum
| ports(va )| + | ports(vb )|

$(1, \alpha, \alpha^2, \cdots)$ [19], and so the iteration is exactly the power method [8] for eigenvalue calculation. Hence, the iteration will converge to the single dominant eigenvector, and each iteration will improve the estimate linearly by $1 - \alpha$. Since we are using a small $\alpha$, this ensures quick convergence. From this iteration, we obtain $\pi_\infty$ which contains the relative probabilities of $v_a \in G_a$ and $v_b \in G_b$ matching for each possible pair. For each vertex in $G_a$, the vertex in $G_b$ whose pair has the maximum value in $\pi_\infty$ will be considered the match. Figure 5 illustrates how the matchings are refined as the mapping algorithm iterates.

5.4 Query-by-example

Recall that the benefit of query-by-example is that users do not have to learn a query language or a new interface to find matching pipelines. Our implementation presents the same interface for querying an ensemble of pipelines as for building a pipeline, as shown in Figure 1. A user constructs a query pipeline by dragging modules from a list of available modules or copying and pasting pieces of existing pipelines. Parameters and connections can also be specified in a similar manner. When the user executes the query, the system searches the current version tree for all pipelines that match that query.

As discussed in Section 4, we want to find pipelines that contain the query pipeline. Currently, this matching is computed on a per-pipeline basis. Specifically, for each pipeline, we topologically sort the vertices of the graph induced by the pipeline and match the vertices of the query graph. If all vertices match, we return the candidate pipeline as a match. All matches are selected and highlighted in the version tree so that users can quickly see query results. Selecting a version will display the pipeline with the portion of the pipeline that matched the query highlighted.

5.5 Visualization by analogy

There are two steps involved in applying an analogy to a pipeline. First, the user defines the analogy template by selecting the two pipelines whose difference is to be applied to another pipeline. Second, the user selects another pipeline and applies the analogy to that pipeline, creating a new pipeline. In VisTrails, these operations can be executed in either the version tree pane of the builder window or the visualization spreadsheet. In either case, the application of the analogy creates a new version in the vistrail.

In the version tree, an analogy is defined by dragging the version representing the initial pipeline to the version representing the desired result. This operation displays the difference between the pipelines and the user is able to click a button to create an analogy from these pipelines. To apply the analogy, the user right-clicks on the version representing the pipeline and selects the desired analogy.

Creating and applying analogies in the VisTrails Spreadsheet is similar but even easier. The spreadsheet supports a viewing mode and a composition mode. In the composition mode, a user can create an analogy by dragging one cell into another cell. To apply the analogy, the user drags the pipeline to be modified to a new cell, at which point the analogy is applied and the new visualization displayed.

The computation of the analogy mirrors the algorithm described in Section 4.3. More concretely, for pipelines $p_a$ and $p_b$ defining the analogy and the pipeline to be updated $p_c$, we derive $\delta_{ab}$ using the version tree. We then match $G_a$ and $G_c$ using the algorithm described in Section 5.3 to obtain $\mathrm{map}_{ac}$ and use this function to compute $\delta_{cb}^{*}$ which can then be applied to $p_c$ to produce a new pipeline $p_d$.

6 CASE STUDIES

We present three examples that illustrate the proposed primitives.

Example 1: Updating Inputs in Multiple Pipelines. In this scenario, we want to compare different isosurface extraction techniques. In particular, we wish to investigate how resilient the techniques are to subsampled or oversampled data. Typically, the techniques are first compared using raw inputs. The task, then, is to update the pipelines with the new test data.

There are several ways to address this problem. The most straightforward one is to develop a preprocessing script that converts the files. Although this is feasible, it is not desirable since it puts the burden of managing the data on the user. At the least, it requires explicit management of intermediate files, and it does not provide an explicit record of the desired experiment. A better alternative is to directly create new dataflows that exercise the test regime. It is clear, however, that this can be time-consuming if the specialist must first examine each pipeline to determine whether it needs to be updated and only then perform the required modifications.

It is therefore desirable to automate this process. With query-by-example, we can find all matching pipelines with one operation. With analogies, we can perform the desired update once, capture that change as an analogy and apply it to the matching pipelines. Not only does this save time and effort, but it ensures that all pipelines are updated. In addition, each update is done in a similar manner; the possibility that the updates are inconsistent is reduced.

In this example, we construct a query template by copying the relevant portion of the pipeline onto the Query Canvas. This procedure returns a set of pipelines (highlighted in Figure 3) which we need to update. We first update one of the pipelines by directly adding the resampling step. Then, we define an analogy template using the original pipeline and the updated one. We apply this analogy to automatically update the other pipelines that match the query. As a result, several new results are produced without requiring the user to manually update each individual pipeline.

Example 2: Changing a rendering algorithm. In this example, we show a moderately complex change in a pipeline that replaces an entire rendering technique with another. When designing an effective visualization, one algorithm tends to perform better than the alternatives. It is natural, then, that a single visualization will be tried with different algorithms. When the best result is identified, the user must change the other visualizations to reflect this. In this example, we show that it is possible to replace an entire algorithm by analogy.

The visualization portrayed in Figure 7 renders an ITK [10] scalar field in VTK [27], using the Teem tools [14] to generate the appropriate data format. While in the original change, there was only one generated view, in the analogy target there are two renderings, so the system must correctly decide the proper one to modify.

Fig. 7. Switching the rendering technique by analogy. The analogy template on the left specifies that volume rendering modules should be replaced by isosurfacing ones. The analogy target and the resulting pipeline are shown, together with the resulting visualization.

Example 3: Chaining Analogies. We have discussed that one can modify a pipeline by analogy as a single update operation. However, one can also use analogies to quickly combine multiple examples. In this example, illustrated in Figure 6, we show how three different techniques can be combined to transform a very simple pipeline into a visualization that is not only more complicated but also more useful.

Fig. 6. Creating complex pipelines by chaining simple analogies. From three simple examples, the user creates a complex visualization that creates a web page with enhanced molecule rendering, whose results are fetched from the Protein Database, an online macromolecular database.

In many scientific fields, the amount of data and the need for interaction between researchers across the world have led to the creation of online databases that store much of the domain information required. Scientists are concerned not only with using data from these centralized repositories but also with publishing their own results for others to view. In this example, we show how analogies can be used to modify a simple pipeline that visualizes protein data stored in a local file to obtain data from an online database, create an enhanced visualization for that protein, and finally publish the results as an HTML report.

We begin with a vistrail that contains pipelines that accomplish each of the individual tasks outlined above. Specifically, we have a simple pipeline $p_0$ that reads a file with protein data and generates a visualization of that data. We also have pipelines $p_1$ and $p_1'$, where the difference between the two is that $p_1$ reads a local file and $p_1'$ reads data from an online database, and pipelines $p_2$ and $p_2'$, where $p_2$ features a simple line-based rendering and $p_2'$ improves the rendering to use a ball-and-stick model. Finally, $p_3$ displays a visualization while $p_3'$ generates an HTML report that contains the visualized image.

To create the new pipeline, we compute the analogy between $p_1$ and $p_1'$ and apply it to $p_0$. Then, we compute the analogy between $p_2$ and $p_2'$ and apply that to the result of the previous step. Finally, we compute the analogy between $p_3$ and $p_3'$ and apply it. The new pipeline $p_0^{*}$ prompts the user for a protein name, uses that information to download the data for that protein, creates a ball-and-stick visualization of the data, and embeds that image in an HTML report.
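The chaining procedure above can be sketched in a few lines. This is a minimal illustration, not the VisTrails implementation. It assumes that a pipeline is a linear tuple of module names, derives an analogy as the replaced segments between two example pipelines using Python's `difflib`, and locates those segments in the target by exact label match rather than by the similarity-based matching of Section 5.3. The module names (`FileReader`, `PDBDownload`, and so on) and the helper names `derive_analogy`, `apply_analogy`, and `chain` are hypothetical.

```python
from difflib import SequenceMatcher
from functools import reduce


def derive_analogy(pa, pb):
    """Return the segment replacements that turn pa into pb.

    Toy stand-in for the difference between two pipelines: only replaced
    or deleted segments are supported; pure insertions would need
    surrounding context and are rejected here.
    """
    edits = []
    opcodes = SequenceMatcher(a=pa, b=pb, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            continue
        if tag == "insert":
            raise NotImplementedError("pure insertions are not modeled")
        edits.append((tuple(pa[i1:i2]), tuple(pb[j1:j2])))
    return edits


def apply_analogy(pc, analogy):
    """Apply each (old, new) segment replacement to pipeline pc.

    Raises ValueError when a segment is absent, mirroring the case where
    the analogy does not fit the target pipeline.
    """
    result = list(pc)
    for old, new in analogy:
        n = len(old)
        for i in range(len(result) - n + 1):
            if tuple(result[i:i + n]) == old:
                result[i:i + n] = new
                break
        else:
            raise ValueError(f"segment {old} not found in target pipeline")
    return tuple(result)


def chain(p0, examples):
    """Fold a sequence of (before, after) example pairs over p0."""
    return reduce(
        lambda p, pair: apply_analogy(p, derive_analogy(*pair)),
        examples,
        p0,
    )


# Hypothetical module names, loosely following Example 3.
p0 = ("FileReader", "LineRenderer", "Display")
examples = [
    (("FileReader", "LineRenderer"),
     ("NameInput", "PDBDownload", "LineRenderer")),   # p1 -> p1'
    (("LineRenderer", "Display"),
     ("BallStickRenderer", "Display")),               # p2 -> p2'
    (("LineRenderer", "Display"),
     ("LineRenderer", "HTMLReport")),                 # p3 -> p3'
]
result = chain(p0, examples)
print(result)
```

Under these assumptions, `result` is `('NameInput', 'PDBDownload', 'BallStickRenderer', 'HTMLReport')`. Each analogy is applied to the output of the previous one, so the second and third examples still apply even though the first has already replaced the file reader. Exact label matching is the weakest part of the sketch: a target whose modules are named differently raises `ValueError`, which is the situation the similarity-based matching is meant to handle.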
The benefits of using analogies to generate this new pipeline not
only include faster results but also a lower level of knowledge needed to
modify pipelines. One can imagine a scientist who executes a pipeline
to create a visualization downloading a pipeline which publishes data
to the web and adding the same capability to their pipeline via analogy.
Instead of trying to find the correct modules and manually modifying
the pipeline, the scientist can use the analogy from the example pipeline
to add the new feature automatically.
7 DISCUSSION

We argue that both query-by-example and visualization by analogy are useful operations that provide efficient solutions for what are otherwise manual, time-consuming tasks. The basic operations introduced in Section 3 rely both on the graph structure of pipelines and on pipeline modification history. As discussed, global comparisons of graphs are intractable in general, but the fact that visualization pipelines translate to labeled graphs where the nodes are largely distinct allows us to define effective heuristics. We believe that this framework can be used to develop additional primitives that significantly reduce the amount of work required to maintain and integrate ensembles of visualizations.

The proposed primitives can be easily implemented in dataflow-based visualization systems that provide undo/redo capabilities. As long as undo/redo operations are represented explicitly in the system (for example, using the Command design pattern [7]), a straightforward serialization of these would achieve the desired capabilities. Module and connection representations may vary across systems, but the framework and techniques apply as long as the elements can be translated to labeled graphs.

As with most heuristics-based approaches, our approach to matching is not foolproof, and there are cases where it may fail to produce the results a user expects. For example, if a user applies an analogy to a pipeline that shares little or no similarity with the starting pipeline, the matching algorithm will return a mapping which is likely to be meaningless. However, when application of an analogy fails or produces poor results, the user can either discard or refine the resulting pipeline: analogies always construct new pipelines; they do not modify existing pipelines.

Fig. 8. A situation where creating pipelines by analogy fails. The intended effect when defining the analogy was to replace the raw file with a preprocessing step. Note, however, that there still is one lingering connection, highlighted in red.

Analogies can be highly subjective. In some cases, applying an analogy can lead to ambiguity and derive multiple results. Figure 8 shows an example of an analogy that is supposed to resample an input file before continuing with the rest of the pipeline. Instead of removing
a connection from the raw file to downstream modules, the application keeps the old connection in addition to adding the new connection to the resampling module. In this case, a user might have to “clean up” the results of the pipeline. Our current pairwise similarity score tries to establish a compromise in the absence of domain-specific knowledge about modules. Formulating and incorporating such knowledge into the matching is certainly possible and desirable. An interesting avenue for future work is to investigate how to acquire this information in an unobtrusive way, for example, by taking user feedback about derived analogies into account. Furthermore, our current implementation finds the best mapping in a greedy fashion, on a per-module basis. There are alternative ways of using $\pi_\infty$, and this investigation is part of future work.

One important consideration when introducing new manipulation primitives is the impact on how users interact with them. Query-by-example represents an intuitive way for users to query pipelines. One could imagine a querying tool that narrows results as the query is built (e.g., similar to auto-completion). Also, to extend our analogy tool, users’ input could be used to guide the matching process, especially in cases where the automatic construction fails. Constraint information might be incorporated into the matching, allowing it to generate better results in situations where the information in the pipeline definitions is not sufficient. Along the same lines, it may be useful to allow users to explore the results of many possible matchings.

8 CONCLUSIONS AND FUTURE WORK

We have described a new framework that leverages visualization provenance to simplify the construction of new visualizations. This framework provides scalable and easy-to-use primitives for querying pipeline ensembles and for creating multiple visualizations by analogy. We have also proposed efficient algorithms and intuitive interfaces for realizing these primitives in a visualization system.

There are many avenues for future work. The use of domain-specific distance measures between pipelines and modules may be useful for customizing analogy generation in some domains (for example, for transfer function design and comparison). We are currently investigating machine learning techniques for automatically determining common pipeline operations on a large database of visualizations, allowing templates to also be determined automatically.

ACKNOWLEDGMENTS

We acknowledge the generous help of many colleagues and collaborators. Suresh Venkatasubramanian helped with discussions on graph matching and complexity. Erik Anderson and João Comba graciously provided their vistrails for this work. Chems Touati and Steven Callahan helped produce the video and figures. This work uses a number of existing open-source software packages and data repositories, including Teem (http://teem.sourceforge.net), VTK (http://www.vtk.org), ITK (http://www.itk.org), trimesh2 (http://www.cs.princeton.edu/gfx/proj/trimesh2/), and the RCSB Protein Database. This work was funded by the National Science Foundation, the Department of Energy, and an IBM Faculty Award.

REFERENCES

[1] L. Bavoil, S. Callahan, P. Crossno, J. Freire, C. Scheidegger, C. Silva, and H. Vo. VisTrails: Enabling interactive, multiple-view visualizations. In Proceedings of IEEE Visualization, pages 135–142, 2005.
[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
[3] K. Brodlie, D. Duce, J. Gallop, M. Sagar, J. Walton, and J. Wood. Visualization in grid computing environments. In Proceedings of IEEE Visualization, pages 155–162, 2004.
[4] S. Callahan, J. Freire, E. Santos, C. Scheidegger, C. Silva, and H. Vo. Managing the evolution of dataflows with VisTrails. In IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), 2006.
[5] H. Childs, E. S. Brugger, K. S. Bonnell, J. S. Meredith, M. Miller, B. J. Whitlock, and N. Max. A contract-based system for large data visualization. In Proceedings of IEEE Visualization, pages 190–198, 2005.
[6] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms, chapter 26. MIT Press, 2001.
[7] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of reusable object-oriented software, chapter 5. Addison-Wesley, 1995.
[8] G. H. Golub and C. F. V. Loan. Matrix computations. Johns Hopkins University Press, Baltimore, MD, USA, 3rd. edition, 1996.
[9] J. Hastad. Clique is hard to approximate within $n^{1-\varepsilon}$. Acta Mathematica, 182:105–142, 1999.
[10] L. Ibanez, W. Schroeder, L. Ng, and J. Cates. The ITK Software Guide. Kitware, Inc. ISBN 1-930934-15-7, 2nd. edition, 2005.
[11] IBM. OpenDX. http://www.research.ibm.com/dx.
[12] T. Jankun-Kelly and K.-L. Ma. Visualization exploration and encapsulation via a spreadsheet-like interface. IEEE Transactions on Visualization and Computer Graphics, 7(3):275–287, July/September 2001.
[13] T. Jankun-Kelly, K.-L. Ma, and M. Gertz. A model and framework for visualization exploration. IEEE Transactions on Visualization and Computer Graphics, 13(2):357–369, March/April 2007.
[14] G. Kindlmann. Teem. http://teem.sourceforge.net.
[15] Kitware. ParaView. http://www.paraview.org.
[16] M. Kreuseler, T. Nocke, and H. Schumann. A history mechanism for visual data mining. In Proceedings of IEEE Information Visualization Symposium, pages 49–56, 2004.
[17] D. Kurlander and E. A. Bier. Graphical search and replace. In Proceedings of SIGGRAPH 1988, pages 113–120, 1988.
[18] D. Kurlander and S. Feiner. A history-based macro by example system. In Proceedings of UIST 1992, pages 99–106, 1992.
[19] A. N. Langville and C. D. Meyer. Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.
[20] H. Lieberman, editor. Your Wish is My Command: Programming by Example. Morgan Kaufmann, 2001.
[21] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In Proceedings of the 18th International Conference on Data Engineering, pages 117–128, 2002.
[22] Mercury Computer Systems. Amira. http://www.amiravis.com.
[23] T. Munzner, C. Johnson, R. Moorhead, H. Pfister, P. Rheingans, and T. S. Yoo. NIH-NSF visualization research challenges report summary. IEEE Computer Graphics and Applications, 26(2):20–24, 2006.
[24] S. G. Parker and C. R. Johnson. SCIRun: a scientific programming environment for computational steering. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing), 1995.
[25] Provenance challenge. http://twiki.ipaw.info/bin/view/Challenge.
[26] C. Scheidegger, D. Koop, E. Santos, H. Vo, S. Callahan, J. Freire, and C. Silva. Tackling the provenance challenge one layer at a time. Concurrency and Computation: Practice and Experience, 2007. To appear.
[27] W. Schroeder, K. Martin, and B. Lorensen. The Visualization Toolkit. Kitware Inc, 2007.
[28] D. Shasha, J. T.-L. Wang, and R. Giugno. Algorithmics and applications of tree and graph searching. In Proceedings of the ACM Symposium on Principles of Database Systems, 2002.
[29] C. Upson, J. Thomas Faulhaber, D. Kamins, D. Laidlaw, D. Schlegel, J. Vroom, R. Gurwitz, and A. van Dam. The application visualization system: A computational environment for scientific visualization. IEEE Computer Graphics and Applications, 9(4):30–42, 1989.
[30] J. J. van Wijk. The value of visualization. In Proceedings of IEEE Visualization, pages 79–86, 2005.
[31] M. Zloof. Query-by-example: a data base language. IBM Systems Journal, 16(4):324–343, 1977.