


Journal of the American Medical Informatics Association Volume 11 Number 6 Nov / Dec 2004 523

Model Formulation

QIS: A Framework for Biomedical Database Federation

LUIS MARENCO, MD, TZUU-YI WANG, PHD, GORDON SHEPHERD, MD, DPHIL,
PERRY L. MILLER, MD, PHD, PRAKASH NADKARNI, MD

Abstract: Query Integrator System (QIS) is a database mediator framework intended to address robust data
integration from continuously changing heterogeneous data sources in the biosciences. Currently in the advanced
prototype stage, it is being used on a production basis to integrate data from neuroscience databases developed for the
SenseLab project at Yale University with external neuroscience and genomics databases. The QIS framework uses
standard technologies and is intended to be deployable by administrators with a moderate level of technological
expertise: It comes with various tools, such as interfaces for the design of distributed queries. The QIS architecture is
based on a set of distributed network-based servers, data source servers, integration servers, and ontology servers, that
exchange metadata as well as mappings of both metadata and data elements to elements in an ontology. Metadata
version difference determination coupled with decomposition of stored queries is used as the basis for partial query
recovery when the schema of data sources alters.
J Am Med Inform Assoc. 2004;11:523–534. DOI 10.1197/jamia.M1506.

A major long-term goal of the national Human Brain Project (HBP),1 a loosely knit consortium of neuroscience researchers, is interoperability between the “federation” of databases produced by its members. The challenge is to devise a robust approach that, among other things, enhances the ability to answer complex research questions. At the most basic level, database interoperation means querying multiple databases in a single logical operation to retrieve data of interest from each.

This paper outlines barriers to interoperability of bioscience databases, summarizes previous interoperation approaches, and then describes Query Integrator System (QIS), a system developed to allow multidatabase network-based queries in the biosciences. QIS is based on a distributed architecture and is designed to facilitate maintenance of query integrity as the underlying database schemas evolve over time. QIS has been developed in the context of SenseLab,2,3 an ongoing neuroinformatics project at Yale University supported by the HBP. We discuss its current status, plans for future work, and lessons learned.

Affiliations of the authors: Center for Medical Informatics (LM, PLM, PN), Department of Anesthesiology (LM, PLM, PN), Department of Molecular, Cellular, and Developmental Biology (PLM), Department of Neurobiology (GS), Yale University, New Haven, CT; and Turboworx, Inc., (T-YW) Shelton, CT.
Supported by NIH grants P01 DC04732, G08 LM05583, and U01 ES10867.
The authors thank David Tuck of the Yale Department of Pathology, Kei Cheung of the Yale Center of Medical Informatics, and Mihail Bota at the University of Southern California, part of the BAMS group.
The existing QIS code base will be made freely available on request to the first author.
Correspondence and reprints: Luis Marenco, MD, Center for Medical Informatics, Yale University School of Medicine, PO Box 208009, New Haven, CT 06520-8009; e-mail: <[email protected]>.
Received for publication: 11/24/03; accepted for publication: 06/23/04.

Background

Barriers to Federation of Bioscience Databases
Considerable technical expertise is required to set up an effective federated-query infrastructure. Bioscience database schemas evolve significantly and rapidly because of scientific progress, changing research goals, and system redesign as better ways of representing data are discovered. For performance and safety reasons, participating databases typically do not support unrestricted Structured Query Language (SQL) queries. Instead, predefined, parameterized queries provide commonly requested results. Here, alterations in database structure may cause predefined queries to “break.”

In a database federation, especially one involving international collaborations, different groups may use subtly or overtly different names to refer to the same class of data (synonymy). Conversely, different groups may use the same name in subtly different ways (polysemy). These differences in meaning, if not documented explicitly, complicate interoperation. To address these issues, one needs shared controlled vocabulary support. Specifically, both data and metadata (“data that describe other data”) must be mapped to concepts in controlled vocabularies. Such curator-intensive mapping efforts yield only modest benefit if existing standard vocabularies provide insufficient domain coverage or if the field progresses faster than corresponding curatorial efforts at vocabulary enhancement. It therefore may be necessary to create federation-specific (“local”) vocabularies and to devise mechanisms to facilitate the identification of candidates for new concepts within individual databases at both the metadata and data level.

Individual participating schemas must also be accompanied by detailed textual annotations. Such annotations must be far more extensive and lucid than documentation developed for internal purposes because they must provide considerable overview for researchers who are unfamiliar with a particular group’s interests and experimental methodology.
524 MARENCO ET AL., QIS: Biomedical Database Federation

Annotations must often exist at two levels: metadata and data. Creating uniformly high-quality annotations is time-consuming, and, even when they exist, they may not be enough. Kans and Oullette4 state: “As good as annotations can be, they will never surpass a published article in fully representing large amounts of biology. It is therefore imperative to ensure the proper link between a research publication and the primary data it will cite.”

In addition, federated search mechanisms must appropriately exclude data that are still preliminary and not available for public access beyond the research group creating an individual database. Some of the work required to establish federated search mechanisms (such as describing the semantics of data carefully and defining which data/metadata are public) is unavoidable no matter what approach is used. A robust framework, however, can minimize other barriers to effective retrieval from federated databases.

Existing Approaches for Database Federation
The first step in data interoperability involves data source access. The simplest method is based on downloadable files, which are impractical for large, volatile datasets. Remote (direct) database access through vendor-independent standards such as ODBC (Open Database Connectivity) carries implications such as increased resource administration, security risks, and institutional firewall restrictions. For data sources exposed only as Web pages, the limitation of HTML (interleaving content with formatting) makes content extraction tedious, nonscalable, and fragile as data proliferate and Web page cosmetics change without notice. Second-generation Web sites, such as National Center for Biotechnology Information’s Entrez and the current version of SenseLab, circumvent this problem by allowing users to get data as Extensible Markup Language (XML) (“pure” content) using programmable ad hoc query interfaces. The use of site-specific XML data alone, however, is not intended to address federation across multiple sites.

A mechanism to relate contents in federated databases uses the global schema approach based on an agreed-on (and infrequently revised) standard definition of domain-specific data types and classes and their interrelationships. Anyone who uses this standard (or a subset) must not deviate from it. Microsoft’s “BizTalk”5 E-commerce specification exemplifies this approach, which is appropriate in situations in which the organization of domain knowledge does not change very fast. Its broad applicability in rapidly evolving areas such as bioscience seems doubtful.

In contrast to global schema-based systems, mediator systems allow a single query to be translated into the language recognized by heterogeneous databases, extracting their information and integrating the results in a single dataset. One type of mediator uses a single repository that stores both the schema description (metadata) and the data of every included database. This approach, often used within a single geographically dispersed organization, has the advantage that queries against the integrated data run relatively fast because one does not attempt to do “joins” of geographically separated tables over the Internet. However, it becomes enormously more complicated in the presence of highly heterogeneous schemas that are maintained autonomously and may change often; the extremely close coordination required between master and satellite sites may not be feasible in relatively loose research federations.

Other mediators6–8 limit themselves to metadata exchange and leave the data in their original databases and format. Here, queries are described in a common language and are translated into the specific data source syntax, and results are converted into a common output format. Some of these systems are commercially available, e.g., IBM’s DiscoveryLink7 and Genetic Exchange’s discoveryHub6,9; they are limited with respect to the ranges of data sources accessed. Further, they do not expressly address the issue of schema evolution. Although they are valuable for dealing with data source connectivity in domains where the individual logical schemas evolve very slowly, if at all, the applicability of these systems to federations where continual schema evolution is the rule is doubtful. We contend that, in the bioscience domain, one must address heterogeneous data mediation and schema adaptation together to achieve robust evolvable data integration.

Current Research in Schema Evolution and Database Federation
From the vast literature, we focus only on papers dealing with problems directly related to this paper’s theme. McBrien and Poulovassilis10 treat a database schema as a graph structure, defining changes in terms of operations (such as addition and removal) on the graph’s nodes and edges. This paper does not differentiate between nodes representing tables and nodes representing a table’s columns; also, transformations such as column datatype changes are not considered. The treatments by Ram and Shankaranarayanan11 and Li and Tari12 are much more comprehensive. The former tries to automate the transformation of one schema to another by storing the atomic changes in computable form. The second associates schema evolution with version management, identifying the former as an application of the latter and defining “versioning algebra” in terms of operations on schema elements.

Determination of differences between two schemas, or between two versions of a schema, is the inverse of the schema-evolution problem. Kim and Seo13 devised a taxonomy of schema differences at the class/table, attribute/column, and domain/data type levels. Sheth and Kashyap14 have modified this taxonomy to emphasize semantic differences.

In a loose federation, however, individual schemas tend to evolve in a far more unplanned and nonorderly fashion than conceived in the above-mentioned papers. Specifically, individual groups’ system architects may freely alter their schemas as needed but may communicate details of such changes to the federation much later. In such scenarios, it is desirable to minimize and/or streamline communication efforts by enabling discovery of schema version differences by the federation mediator.

Structural differences between schema versions (addition or removal of tables/columns, changes in column properties, and changes in table relationships) can be discovered in a fairly straightforward manner using existing database connectivity technology. The database administrator must only allow a mediator program read-only access to the database. However, the meaning of the differences in terms of the domain, or evolving domain-specific needs, cannot generally be inferred automatically, even if detailed annotations are
provided because of the generally unstructured and narrative nature of the latter.

Previous Efforts in Bioinformatics
The TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources)15,16 project creates a bioinformatics domain ontology using the GRAIL Description Logic Language,17 mapping concepts to existing information sources. Queries against that ontology access individual sources in a user-transparent manner. Although this approach is novel, scalability and expressivity are a concern. The TAMBIS team uses custom function libraries for each information source, which provide a function-based view of the source. Although its underlying query language, Collection Programming Language,18 supports issuing of SQL commands that are passed untranslated to a database, it is well known that function-based perspectives of data that are essentially tabular are not appropriate when one wishes to execute the equivalent of SQL statements that join several tables arbitrarily. Further, the depth and quality of the TAMBIS ontology, or its overlap with existing bioinformatics ontologies such as Gene Ontology (http://www.geneontology.org), are also difficult to evaluate because the ontology contents are not contributed currently to a source such as the Unified Medical Language System.19

The PQL structured language (PQL) approach of Mork et al.20,21 relies on metadata describing the entities and relationships (“links”) between entities in a federated schema plus additional metadata on intended semantics or judgments about database curation quality to discover multiple ways of answering loosely defined queries, such as finding all proteins “closely related” to a disease. The PQL query language resembles the XQuery language used to query XML documents. It provides access to several data sources, including nonstructured textual data such as Online Mendelian Inheritance in Man.22 PQL’s metadata appear to be created manually by the system developers: Certainly, there is no other way to create metadata about topics such as curation quality. This approach is interesting and valid. It is geared, however, to discovering ways to answer loosely defined queries rather than performing consistency checking on existing queries that are well defined, which is the focus of our work.

System Design Objectives
The objectives underlying the creation of the QIS were as follows:

• to integrate data sources within the HBP that use technologies and approaches not supported by commercial mediator systems, e.g., our own Entity-Attribute-Value, with Classes and Relationships (EAV/CR) approach23,24 and the “common data model for neuroscience”25
• to devise a scalable approach that can work in a loosely coupled database federation that explicitly addresses the issue of schema evolution within the participating databases
• to devise robust mechanisms for metadata exchange
• to address the “federated query fragility problem” by devising mechanisms that use differences in metadata versions to facilitate automatic detection of schema evolution and assess their impact on existing stored queries and to use such impact assessment, where possible, for partial or complete recovery of stored queries when target schemas have altered
• to support interoperation with a straightforward separation between public and private data
• to facilitate recording of system semantics through an infrastructure that integrates ontologies and detailed text annotations with federation and allows sites to contribute candidate concepts to an ontology development group in a seamless fashion
• to devise a low-cost and lightweight system that requires a relatively modest infrastructure that researchers with limited informatics skills can operate and that can eventually be distributed as open source

QIS addresses the following issues: data source connectivity, heterogeneous schema mapping, common query formulation, query adaptation, and data delivery. QIS also features simplified deployment, easy maintenance, enforced security, firewall independency, alert systems, and domain independency.

System Description
QIS belongs to the class of mediator systems that limit themselves to metadata exchange. It uses a distributed architecture that is composed of three main functional units: integrator servers (ISs), data source servers (DSSs), and the ontology server (OS). These units form the system’s middle tier, connecting “data consumers” (client applications requesting query execution) with “data providers” (back-end data sources containing the data) and knowledge sources (ontologies) (Fig. 1). All servers use a database management system (DBMS) (currently Microsoft SQL Server) in addition to a Web server (Microsoft Internet Information Server). The annotated schemas of each unit are described in Appendix 1 (available as an online data supplement at www.jamia.org).

DSSs provide access to various data sources within a single group (or cooperating groups within an institution). In addition to traditional relational databases and EAV/CR databases that are built on top of relational technology, they also access XML files and flat files (spreadsheets, text). DSS administrators add definitions of data sources to the DSS through a Web interface.

Schema capture is the process of capturing metadata about a database (e.g., its table, column, and relationship definitions) into structured form. This process is partly automated through connectivity technologies that query a database’s system data dictionary, or an XML schema. However, the captured metadata need to be manually enriched by mapping schema elements, where possible, to elements in ontologies to assist automatic schema discovery and by detailed textual annotations to provide semantic overview.

Each DSS itself is like a “metadatabase” that accesses one or more individual databases at a site. The administrators of individual databases must define the subset of the data and metadata within their own schema that is “public.” This is done through a three-step process for each data source:

• A special account with restricted privileges is created for the DSS. This account cannot alter data and can access only a limited set of tables or views.
• For tables containing both public and private data, an extra Boolean column is added to indicate whether the row in question is publicly accessible, with a default of
“No/False” so rows that are to be made public must be set manually.
• The administrator creates views that define the subset of public columns/tables. Where the views use tables with both public and private data, the view must specify a filter that the “public” flag must be “True.” The DSS account is now given permission to these views.

For non-account-oriented data, such as XML or text files, which must be accessed in their entirety, the DSS stores the URL of the source.

Figure 1. Query Integrator System: architectural overview. The QIS architecture is based on three middle-tier servers. The data source server (DSS) connects to disparate supported data sources. The Integrator Server (IS) stores queries, coordinates query execution, and returns query results to Web applications. The Ontology Server (OS) maps data sources’ metadata and data elements to concepts in standard vocabularies. EAV/CR = Entity-Attribute-Value, with Classes and Relationships; RDBMS = Relational Database Management System; XML = Extensible Markup Language; UMLS = Unified Medical Language System.

ISs store “public” metadata from DSSs as well as queries that access single or multiple data sources. They allow building of queries against the DSSs through a graphical user interface. QIS is primarily intended to allow other Web-based applications to execute predefined queries on the IS through Web-service26 mechanisms. That is, the IS operates “behind the scenes,” and the federation’s end users connect to such an application rather than to the IS directly. IS administrators perform tasks such as registering new DSSs and registering individuals with domain expertise who can design queries.

An OS maintains an integrated schema, plus content, of one or more controlled vocabularies used within the federation. Alternatively, it may provide a gateway to relate these vocabularies to standardized content maintained elsewhere, or it may replicate such content. Parts of the OS schema are recorded redundantly at the IS level.

The OS supports mapping of elements in individual data sources to vocabulary elements by curators who specialize in ontology development. More important, once this information is replicated on the DSS, mappings at both the metadata and data levels are also forwarded to the IS. The OS can now act as an information source map (ISM).27 Such information makes it possible to ask questions of varying granularity, such as “which data sources contain information on neurons” (where mappings are likely to exist at the metadata level) and “which data sources contain information on cerebellar Purkinje cells” (where mappings will likely exist at the data level). The mapping to specific schema elements in a data source allows assisted query composition against that data source that would actually return the desired data.

Data and metadata elements from DSSs that are candidates for local concept creation are exported to the OS in a facilitated fashion, as discussed later. Other potential services envisaged for the OS are term translation and unit conversion. The OS also lets ontology curators collaborate with DSS curators to jointly define new “local” (federation-specific) concepts: This is necessary when existing ontologies offer insufficient coverage. This infrastructure can also be used to submit new, curated concepts to a standard vocabulary for inclusion in a new release.

Communication between the various QIS nodes is XML encoded and HTTP delivered to support communication through network firewalls. Asynchronous processing is supported using customized queuing services. Other software technologies used by the system are Microsoft Active Data Objects for data and schema access, the XML Document Object Model and SAX (Simple Application Programming Interface for XML) for dataset manipulation, and Scalable Vector Graphics28 for standardized entity relationship (ER) diagram generation.

System Features

Dealing with Schema Evolution
For robustness of the federated query infrastructure, changes to the physical or conceptual schema at the individual data sources must be propagated efficiently to the IS. The DSS performs periodic automated schema extraction from its individual data sources and computes a schema version difference by comparing the new schema with the old. This computation uses the principle underlying the well-known diff algorithm,29 the basis of source-code control systems.30 Diff is an example of an algorithm that determines the “edit distance”31 between two objects (text files, DNA sequences), where change is defined as the series of additions, deletions, or replacements required to transform one object to the other. In QIS, changes are computed first at the aggregate (class/table/view) level and then at the atomic/column level.

Not all replacements can be inferred automatically. For example, changes in data type, length, and precision of a column are inferred reliably, but the less common renaming of a column/table appears as a combination of an addition and a deletion, and it is typically up to a curator to note that the two differences can be merged into one: Algorithms that try to infer replacements based on synonymy between old and new names are not guaranteed to work reliably for tables or columns whose names may be abbreviated or use characters such as underscores. In the case of EAV/CR data sources, automatic handling of schema evolution is facilitated because metadata identifiers are preserved regardless of element renaming.
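The two-pass comparison described above (aggregate level first, then atomic level) can be sketched as follows. This is a minimal illustration, not the QIS implementation: the `Schema` snapshot layout, the `Delta` record, and the function name `schema_deltas` are all assumptions made for the example, and it uses plain set differences on element names rather than the diff algorithm itself. Note that a renamed column surfaces as one deletion plus one addition, which is why curator review is still needed.

```python
from dataclasses import dataclass

# Hypothetical snapshot layout: {table_name: {column_name: (data_type, length)}}
Schema = dict[str, dict[str, tuple[str, int]]]


@dataclass(frozen=True)
class Delta:
    level: str    # "table" or "column"
    kind: str     # "added", "deleted", or "replaced"
    element: str  # e.g. "receptor" or "receptor.subtype"


def schema_deltas(old: Schema, new: Schema) -> list[Delta]:
    """Compare two schema snapshots, tables first and then columns."""
    deltas: list[Delta] = []

    # Pass 1: aggregate (table/view) level.
    for table in sorted(old.keys() - new.keys()):
        deltas.append(Delta("table", "deleted", table))
    for table in sorted(new.keys() - old.keys()):
        deltas.append(Delta("table", "added", table))

    # Pass 2: atomic (column) level, only for tables present in both versions.
    for table in sorted(old.keys() & new.keys()):
        old_cols, new_cols = old[table], new[table]
        for col in sorted(old_cols.keys() - new_cols.keys()):
            deltas.append(Delta("column", "deleted", f"{table}.{col}"))
        for col in sorted(new_cols.keys() - old_cols.keys()):
            deltas.append(Delta("column", "added", f"{table}.{col}"))
        for col in sorted(old_cols.keys() & new_cols.keys()):
            # A change in data type or length is a reliably inferred replacement.
            if old_cols[col] != new_cols[col]:
                deltas.append(Delta("column", "replaced", f"{table}.{col}"))
    return deltas


# Illustrative data only: the column "gene_chr" is renamed and "subtype" is widened.
old_version = {"receptor": {"subtype": ("varchar", 50), "gene_chr": ("varchar", 10)}}
new_version = {"receptor": {"subtype": ("varchar", 100), "gene_chromosome": ("varchar", 10)}}
for delta in schema_deltas(old_version, new_version):
    print(delta)
```

In this sketch the rename of `gene_chr` is reported as two unrelated column deltas; merging them into a single replacement would be the curator's step.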
The DSS computes the “deltas” (differences) between the old and new schemas. The discovery of deltas is used to notify the responsible DSS curators. The deltas can then be annotated and/or curated to identify replacements. Figure 2 shows a screenshot of the curation/annotation interface. Replacements result in a version increment of the affected element; deletions result in a version change of the parent element. In general, all ancestors’ versions are increased by one for the changed elements, and deltas contain information about the elements whose version differs. Obsolete metadata entries are moved to a metadata “history” table.

The curated deltas are sent from the DSS to the IS. At the IS, they are used to update (synchronize) the metadata for that data source. Although metadata updates are intended to be largely automatic, periodic reports on the delta audit trail can be the basis for a dialog between the IS curators and DSS curators. This may happen if, for example, the annotation of particular elements is insufficient for the IS curators’ understanding. Note that the deltas only identify structural differences; attributing meaning to these changes in terms of the domain is beyond their scope. That is why human annotations are necessary. Automatic delta identification, however, is critically important because identifying differences and presenting these to a human expert facilitate their comprehensive annotation.

Figure 2. The metadata maintenance toolset. In the rear window, the schema viewer shows the Membrane Properties Resource metadata description in tabular form. This information can alternatively be shown as an entity relationship diagram or as Extensible Markup Language (XML). Selection of the “Receptor_properties” grid shows the detailed metadata annotation: description, concept identification, and semantic relationship (only for columns). The administrator can preview the underlying data to clarify the content. The front window shows the versioning update tool showing the differences (deltas) between previous and current versions with controls to resolve them.

Composing Queries: Preserving Integrity of Federated Queries
The IS, as a query repository, provides a tool (Fig. 3) to support the creation, maintenance, storage, and execution of data queries. To create federated queries, one first creates several single data source queries and then combines them into a query of queries by specifying intersection, union, or difference operations.

A QIS query description derives from SQL-like languages but is represented in XML to facilitate syntax validation and future feature extensibility. The query is basically decomposed (“preparsed”) into its constituent elements, represented in terms of their metadata repository “unique identifiers.” Further, for atomic/column elements in a query, the IS records whether the element is part of the output (i.e., to be displayed), whether it is used in the equivalent of a “join” to bridge between two tables, and whether it is part of a query criterion/filter (in SQL parlance, part of the WHERE clause).
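The decomposition described above can be pictured as a record that lists, for each metadata identifier a query uses, the role or roles it plays. The sketch below is illustrative only: the `Role` flags, the `StoredQuery` class, the `affected_queries` helper, the query name, and the numeric identifiers are hypothetical, and the actual QIS representation is XML stored in the IS rather than Python objects.

```python
from dataclasses import dataclass, field
from enum import Flag, auto


class Role(Flag):
    """How a column element is used inside a stored query."""
    OUTPUT = auto()  # displayed in the result
    JOIN = auto()    # bridges two tables
    FILTER = auto()  # part of the criterion (WHERE clause)


@dataclass
class StoredQuery:
    name: str
    # Maps a metadata repository identifier to the roles of that element.
    elements: dict[int, Role] = field(default_factory=dict)


def affected_queries(queries: list[StoredQuery],
                     changed_ids: set[int]) -> dict[str, dict[int, Role]]:
    """Return, per query name, the changed elements it uses and their roles."""
    hits: dict[str, dict[int, Role]] = {}
    for query in queries:
        touched = {eid: role for eid, role in query.elements.items()
                   if eid in changed_ids}
        if touched:
            hits[query.name] = touched
    return hits


# Illustrative identifiers only.
receptor_query = StoredQuery(
    "receptor_gene_query",
    {101: Role.OUTPUT | Role.FILTER, 102: Role.OUTPUT, 103: Role.JOIN},
)
other_query = StoredQuery("unrelated_query", {201: Role.OUTPUT})

print(affected_queries([receptor_query, other_query], {103, 999}))
```

Because elements are keyed by identifier rather than by name, a lookup of this kind does not depend on parsing query text, and the recorded role indicates how the changed element was being used.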

Figure 3. Query designer tool at the integrator server showing information about the “getReceptorGeneChromosomeProtein_structure” query. This particular query extracts information from the “Receptor_properties” table in the Membrane Properties Resource (MPR) database. The extracted information includes the following fields: subtype, gene chromosome, and structure. This is a parameterized query that requires a specific text string pattern to fit the subtype field. Note that the tool refers to grids and columns rather than tables and fields, terms that are specific to relational databases. This tool interacts with the user in three different ways: visual query by example (current), Extensible Markup Language (XML), and entity-relationship graph. The query can be checked for correctness, executed in preview mode, and saved. The graphical user interface is based on a drill-down metaphor in which users pick one or more tables/sets from one or more data sources and then select columns/attributes from these. (When making a selection, choices are dynamically presented as pull-down menus, as shown for item #c3.) Constraints are then specified for columns/attributes of interest.
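The “query of queries” composition mentioned earlier (combining single data source queries by intersection, union, or difference) amounts to set operations over result rows. The following is a minimal sketch under the assumption that each single-source query has already returned rows with the same columns; the function name `combine` and the sample rows are hypothetical and are not part of the QIS interface.

```python
import operator
from typing import Callable, Hashable, Iterable

# Supported combination operators, mapped to Python's set operations.
_OPERATIONS: dict[str, Callable[[set, set], set]] = {
    "intersection": operator.and_,
    "union": operator.or_,
    "difference": operator.sub,
}


def combine(left: Iterable[Hashable], right: Iterable[Hashable], op: str) -> set:
    """Combine the rows of two single-source query results."""
    if op not in _OPERATIONS:
        raise ValueError(f"unsupported operation: {op}")
    return _OPERATIONS[op](set(left), set(right))


# Illustrative rows only: one-column results naming receptors.
source_a_rows = [("GABA-A",), ("NMDA",)]
source_b_rows = [("NMDA",), ("AMPA",)]

print(combine(source_a_rows, source_b_rows, "intersection"))
print(combine(source_a_rows, source_b_rows, "difference"))
```

Treating rows as hashable tuples keeps the sketch short; a real mediator would also have to reconcile column names and data types across sources before rows can be compared.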
Further details of the query language are beyond the scope of this paper but are available at http://ycmi.med.yale.edu/qis/qis_main.htm#Query.

Because this information is stored and indexed by metadata identifiers, delta computation determines which queries are affected by a change in a particular element. The authors of the affected queries are now alerted, so that the query may be fixed manually. However, the severity of this impact can also be determined automatically as described below. In many cases, even before fixing, “self-repairing” mechanisms can be activated dynamically so that the query can still run, returning partial or complete results.

• Atomic elements (columns) used in joins will typically break the query irreparably if they are missing in a new schema version because the column’s omission would cause a pathological Cartesian join.
• Changes in data type are tolerable if the join condition is modified dynamically by a type-conversion function so as to temporarily restore the old data type. Query performance, however, may be impaired because this step can be computationally expensive.
• Changes in length or precision of a column will not affect a query.
• Atomic elements used in a query filter will, if removed, make the query less selective. In the worst case, one may return an entire table instead of the desired rows: This is generally unacceptable. Data type changes in such elements can be compensated temporarily by dynamic type conversion.
• Atomic elements that are only displayed have the least impact. Such elements can simply be dropped from the query and replaced with placeholders indicating that they are now missing. Similarly, if the query is part of a union query that reaches out to multiple data sources or even multiple DSSs, data from other (unaltered) sources can still be returned.
• If a table has changed by addition of new columns, queries do not need to be altered. However, query authors using that table are informed about the new columns, so that these can be used if necessary. Queries on single data sources that return all the columns in a table/view can automatically take advantage of the new columns.
• Delta determination is ordinarily a batch process, and a user may execute a federated query against a changed schema before delta computation, and the query may fail. The DSSs will, however, sense this failure through standard er-

M1506.asp). The site invokes sample queries stored in an IS and displays code samples demonstrating how the system can be used from other Web sites.

Neuroscience
These examples are “behind-the-scenes” queries used in production SenseLab. (The SenseLab application acts as a “client” of QIS.)

The first query combines data from SenseLab’s CellPropDB, which stores experimental data on neuronal cell properties, and the membrane properties inventory resource (MPIR), a standalone MS-Access database independently developed and maintained by a Yale researcher, which stores different information on neuronal membrane data. The CellPropDB query extracts receptor and ion channel information for a user-specified cell type, which is the parameter to this query. The MPIR query extracts gene information associated with a list of receptors and ion channels (the parameter to the second query). The join yields the genes expressed in a particular cell. This example is in use by the production version of SenseLab. For the thalamic relay neuron’s information, go to http://senselab.med.yale.edu/senselab/CellPropDB/GeneData.asp?cellid=262.

The second query integrates data from CellPropDB and the University of Southern California’s Brain Architecture Management System (BAMS)32 (http://brancusi.usc.edu/bkms), to which we have restricted access as part of a collaboration. A query of brain structure (such as the hippocampus) from CellPropDB is augmented with neuroscience nomenclature provided by BAMS. To view this query, go to http://senselab.med.yale.edu/senselab/CellPropDB/cpdbRegions.asp?sr=1, and follow the hyperlink under “hippocampus” in the third column. Here, the program code uses the QIS application-programming interface to execute a specific query (of BAMS) and merge its results into a local (SenseLab) result set.

Genomics
The “Genomics-Microarray” query example from the demonstration QIS-Client Web site accesses three genomic databases, maintained by different Yale groups:

• the Yale Microarray Database (YMD), a large repository of institutionally generated experimental information
• a local gene annotation database (GAD) that contains curated genomic data on approximately 70,000 genes from the National Center for Biotechnology Information’s Locus
ror detection mechanisms. The failure triggers delta com- Link and Unigene datasets
putation and subsequent metadata update. If possible, d a Yale installation of the well-known Gene Ontology (GO)
the query will be automatically reissued, relying on the database, which is curated by the Gene Ontology consor-
self-repairing mechanisms described above to return par- tium (http://www.geneontology.org/GO.doc.html)
tial or complete results, albeit after some delay. (This sce-
nario is illustrated later.) The query is intended to get microarray experiment results
for any genes with cytokine activity. It takes about 2 minutes
Pilot Implementation: Neuroscience and Genomics to run and operates as follows. All gene ontology accession
The QIS components have been built on the Microsoft numbers where the descriptions containing ‘‘cytokine’’ are
Windows platform using a Web server (MS Internet In- pulled from GO. GAD is then queried to fetch the GenInfo
formation Server), the MS SQL server DBMS (although our ID, gene symbol, and gene name for these accession numbers.
database access code is vendor neutral), and the Micro- YMD is finally queried to fetch summaries of microarray ex-
soft .NET platform. Communication between servers is periments that were indexed by these GenInfo IDs.
XML based. We have implemented a demonstration QIS- Query Adaptation
Client Web site that displays the information provided This example is based on an actual case from the SenseLab
by one IS (at http://ycmi-hbp.med.yale.edu/QISClient-IIS/ project. The membrane properties resource database records,

for each receptor molecule, the gene from which it is derived and its chromosomal cytogenetic location. The latter was originally expressed in the string form in which it is typically recorded in the literature, e.g., “11q12-q13.” (The hyphen indicates a region of uncertainty within two cytogenetic bands.) We decided to partition the location information into its three components: chromosome, upper (short-arm, p-terminal) location extent, and lower (long-arm, q-terminal) location extent. In the above example, the values of the three fields would be 11, q12, and q13. The extents can be converted to numeric fractions, which allow searching by location range as well as generation of graphics (“ideograms”) showing uncertainty regions.

Figure 4A shows the results for the query “show me the genes and chromosomal location information for all muscarinic receptors” with the original database structure. Figure 4B shows the query after the structure has been altered: the query has failed, with an error message about the query being out of date. As a result, the DSS initiates an automated metadata refresh: The system now realizes that the “ChromosomeLocation” field is missing and shows only Gene information (Fig. 4C). (Note that partial results are still returned rather than the query failing completely; these assist troubleshooting by the query designer.)

Figures 4D and E show the results of human intervention. Figure 4D shows the result after preliminary exploration, where the field “Chromosome” replaces “Chromosome Location.” It can be seen that the results returned lack cytogenetic information. Figure 4E shows the result of a correct query rewrite, where three fields replace the initial field.

OS Operation
The OS currently hosts a replicated copy of the National Library of Medicine’s Unified Medical Language System (UMLS). Approximately a third of the concepts (metadata + data) in SenseLab have been mapped to UMLS concepts in batch mode using a tailored version of an algorithm originally described elsewhere,33 with results manually verified.

The OS currently provides an important function for SenseLab, which, although residing within a single physical database, is divided into a number of “virtual” databases or portals to provide direct access to data of interest to a variety of neuroscience communities (e.g., neuronal modelers, olfactory receptor researchers). Although this division is convenient for the regular visitor of SenseLab, it is a barrier to the new user who wants to directly access objects of interest without first having to know in which virtual database they might lie. Further, neuroscience has numerous synonyms (“5-HT,” “serotonin,” “5-hydroxytryptamine” are the same molecule, as are “norepinephrine” and “noradrenaline”). It is desirable to use UMLS’s synonymy information to facilitate query expansion during searching.

We therefore allow search of UMLS terms based on partial phrases that the user enters: UMLS terms matching the query are returned; when the user selects the term of interest, the details of the matching concept(s) are returned (note: some terms are ambiguous and map to more than one concept). Any mapped objects in our local databases are also returned. Associated hyperlinks lead the user to detailed information on each object. In Figure 5, the user has searched the UMLS for terms beginning with “pyramidal.” A list of matching terms is returned: Clicking on “Pyramidal Cells” shows details (taken from UMLS’s MRDEF table) as well as local objects mapped to that concept. The resulting page shows the pyramidal cell found in two databases in the federation: the Cortical Neuron Database (located at the Gardner Laboratory at Cornell University) and SenseLab. Clicking on the hyperlink associated with the second row then takes the user to details of the neocortical pyramidal neuron within SenseLab. Other features of the OS include finding common concepts between a pair of federated databases and vocabulary creation by promoting concepts within those databases not available in UMLS.

The OS is described in depth, with an online tutorial, at the following URL: http://ycmi.med.yale.edu/QIS/components.htm#OS/.

Current Status
Individual parts of the QIS framework are already being used on a production basis in SenseLab, which receives an average of 3,000 hits per day, excluding Web-bots. The DSS, IS, and OS are currently all housed on a single CPU in separate physical databases. Therefore, there is currently no need for a protocol for the communication of deltas between the different server units. In the geographically separated server scenario ultimately envisaged, however, such a protocol will be necessary.

• Query features: Some query aggregate functions and subqueries are allowed in some data sources that implicitly support them. Because many databases implement them differently, we allow their use in a limited fashion, resembling pass-through queries in ODBC.
• Client application interfacing: The system currently provides XML-HTTP requests. We currently do not support the Simple Object Access Protocol (SOAP) because QIS is still in too early a stage to merit the creation of specific contractual interface names and parameters. Supporting evolution of the underlying database schemas will require SOAP meta-interfaces rather than interfaces that hard code the current view of domain knowledge. To avoid burdening the reader with these technical details, sample code and documentation can be found at http://ycmi.med.yale.edu/QIS/interfacing.htm/.
• Query optimization: Little to no optimization of multidatabase queries is currently performed. The query designer must specify the order of operations on individual data sources, such as in the genomics example. Only multidatabase queries in which the composite query is explicitly designated as a union operation can be optimized by having each of the component queries run in parallel. Caching of specific intermediate data sets, based on query usage statistics, may also improve performance. QIS performance depends on several factors: query execution, data transference, and multidata-source query processing. Query execution can take from a few seconds to minutes. For this reason, all processing is asynchronous, and the system’s tracing mechanisms inform the client about elapsed events.

Figure 4. Query adaptation: this query asks for genes and chromosome location information for all muscarinic membrane
receptors. (A) The query runs against the original schema of a particular data source server and returns intended results. (B) The
schema has changed; the result field ‘‘chromosome location’’ is partitioned into three fields (chromosome and upper and lower
chromosomal location extents). When the query runs again, it fails, triggering an automatic metadata refresh at the integrator
server. (C) Partial recovery: gene information is returned, but chromosome location is not. (D) Initial exploration and use of data
preview show that the ‘‘chromosome’’ field returns only the initial part of the location information. (E) Incorporating the
additional two fields returns the original information. Only a user with query-creation privileges sees the screens in B through D
(which allow troubleshooting and query repair). To users without such privileges, an error message would simply state that the
query is obsolete and that an administrator has been notified.
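
The recovery behavior shown in Figure 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the QIS implementation: the `Element` record, the `adapt` function, the status strings, and every table and column name other than ChromosomeLocation and Chromosome are hypothetical. The sketch assumes a stored query has already been decomposed into atomic elements tagged by role (join, filter, or display). A missing join or filter element blocks execution, whereas a missing display-only element is dropped so that partial results can still be returned.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    JOIN = "join"
    FILTER = "filter"
    DISPLAY = "display"


@dataclass(frozen=True)
class Element:
    """One atomic element (column) used by a stored query (hypothetical)."""
    table: str
    column: str
    role: Role


def adapt(elements, schema):
    """Check a decomposed query against refreshed metadata.

    `schema` maps each table name to the set of column names it now has.
    Returns (status, kept, missing), where status is "ok", "partial",
    or "unrunnable".
    """
    missing = [e for e in elements
               if e.column not in schema.get(e.table, set())]
    kept = [e for e in elements if e not in missing]
    if not missing:
        return "ok", kept, missing
    # A lost join column risks a Cartesian join; a lost filter column
    # could return a whole table. Neither is repaired automatically here.
    if any(e.role is not Role.DISPLAY for e in missing):
        return "unrunnable", [], missing
    # Only display columns are gone: run the rest and report the gap.
    return "partial", kept, missing


# Hypothetical decomposition of the Figure 4 query.
query = [
    Element("Receptor", "Name", Role.FILTER),
    Element("Receptor", "Gene", Role.DISPLAY),
    Element("Receptor", "ChromosomeLocation", Role.DISPLAY),
]
old_schema = {"Receptor": {"Name", "Gene", "ChromosomeLocation"}}
new_schema = {"Receptor": {"Name", "Gene", "Chromosome",
                           "UpperExtent", "LowerExtent"}}

status_before, _, _ = adapt(query, old_schema)
status_after, kept, missing = adapt(query, new_schema)
print(status_before, status_after, [e.column for e in missing])
```

In this sketch the altered schema yields a "partial" status with ChromosomeLocation reported as missing, which corresponds to panel C; the manual rewrite in panels D and E would amount to replacing that display element with the three new columns.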

As in all distributed queries, performance is influenced by network bandwidth, CPU performance, data storage mechanisms, and RAM availability.
• Integrity and Security: To protect access to restricted functions such as query authoring and metadata annotation, both IS and DSS use login-based access control. Client applications can be restricted with passwords or to specific network addresses. For sensitive data, encryption is implemented using Secure Sockets Layer (SSL) and server certificates.

Discussion
Many mainstream DBMSs are excessively fragile even when dealing with a single (nonfederated) database. A well-known

Figure 5. Using the ontology server. The user searches the Unified Medical Language System (UMLS) for terms beginning
with ‘‘pyramidal.’’ A list of matching terms is returned (EN in the figure indicates that the term is in English). Clicking on
Pyramidal Cells shows details (taken from UMLS’s MRDEF table) as well as local objects mapped to that concept. Clicking on the
hyperlink associated with the second row (o265 is the unique internal identification of the object) takes the user to details of the
neocortical pyramidal neuron within SenseLab. One may then inspect additional information on the neuron, such as the receptors
and currents associated with individual compartments in the neuron, by clicking on further hyperlinks (details not shown). This
query involves all three types of servers: the ontology server provides access to the UMLS and also replicates the mapping of
objects in local databases to UMLS concepts: the integrator server actually mediates the query that fetches information from
SenseLab (by making a request of SenseLab’s data source).
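
The lookup shown in Figure 5 can be sketched as follows. This is a toy illustration only: the dictionaries stand in for UMLS term tables and for the OS mapping tables, the concept identifiers (`CX1` and so on) and the Cortical Neuron Database object identifier are invented, the second "pyramidal" term is included only to show that a prefix search can return several terms, and all function names are hypothetical. Only the object identifier o265 and the synonym examples come from the text.

```python
# Toy stand-in for a UMLS term table: term -> concept identifiers.
# Concept identifiers here are made up for illustration.
TERM_TO_CONCEPTS = {
    "Pyramidal Cells": ["CX1"],
    "Pyramidal Tracts": ["CX2"],
    "serotonin": ["CX3"],
    "5-HT": ["CX3"],
    "5-hydroxytryptamine": ["CX3"],
}

# Toy stand-in for OS mappings: concept -> (database, local object id).
LOCAL_OBJECTS = {
    "CX1": [("Cortical Neuron Database", "n17"), ("SenseLab", "o265")],
}


def search_terms(prefix):
    """Return terms that begin with the given phrase, ignoring case."""
    p = prefix.lower()
    return sorted(t for t in TERM_TO_CONCEPTS if t.lower().startswith(p))


def synonyms(term):
    """Return every term sharing a concept with `term` (query expansion)."""
    concepts = set(TERM_TO_CONCEPTS.get(term, []))
    return sorted(t for t, cs in TERM_TO_CONCEPTS.items()
                  if concepts & set(cs))


def local_objects(term):
    """Return local objects mapped to any concept of `term`."""
    return [obj
            for concept in TERM_TO_CONCEPTS.get(term, [])
            for obj in LOCAL_OBJECTS.get(concept, [])]


matches = search_terms("pyramidal")
objects = local_objects("Pyramidal Cells")
expanded = synonyms("5-HT")
print(matches, objects, expanded)
```

A term that maps to more than one concept would simply contribute objects from each of its concepts, which mirrors the ambiguity noted in the text.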

example is the Oracle DBMS: Adding a new column to a table causes all views defined on that table (which use all of that table’s columns) to be rendered invalid; any operations accessing these views will fail. These views must be manually “recompiled,” a tedious process requiring identifying the numerous views in the database that are bad and then fixing these, typically by text editing. The use of such technology by itself in a distributed database scenario requires considerable augmentation. A more robust dependency model could facilitate consistency maintenance. The research contribution of QIS is in the development of an explicit dependency model in the context of federated schemas.

Issues of Scalability
We provide both theoretical reasons and benchmarks to argue that the QIS architecture, which is based on metadata exchange between the three kinds of servers, is highly scalable.

• In any database, metadata are typically a small fraction of data. Thus, a table may contain millions of rows, but its metadata describe a fixed, small number of columns in that table.
• Metadata are significantly less volatile than data; that is, although changes in schema definition are common in a scientific database, schema changes do not occur every day.
• Schema capture and delta computation are distributed over several DSSs, each of which is concerned only with its registered data sources. Only the deltas propagate between the servers. Based on a preliminary version of an XML-based protocol that we are devising for delta propagation, we have estimated that the encoding of a single column change in a table should not take more than 300 bytes (including the XML tags).
• Mapping of data/metadata elements in data sources to controlled-vocabulary concepts is manually intensive. This type of task is well known in clinical informatics settings and is a critical part of system integration efforts. The process is performed against local copies of standard vocabularies (such as UMLS) and is not communication bandwidth intensive. The time/bandwidth required to transmit information about mapped concepts to an OS is negligible in comparison.

Benchmarks
The schema capture/delta computation process takes less than 2 seconds against all of SenseLab (76 classes modeled in EAV/CR, 238 attributes across all classes). We have also arranged for the Yale DSS to access, as a data source, the BAMS database at the University of Southern California (USC). This MySQL database, which uses a conventional relational database structure, has a total of 22 tables and 144 columns. To prevent contention for resources with internal users, BAMS deliberately “throttles” queries from non-USC users (i.e., runs them with low priority). Live schema capture/delta computation of BAMS from the Yale DSS takes approximately 20 seconds using a dual Pentium Xeon 2 GHz with 1 GB RAM that also hosts several other databases.

Future Work
Local ontology development is currently in its infancy within the HBP. We intend to provide extensive infrastructure support for the development and maintenance of local, domain-specific vocabularies. We are also implementing “semantic queries” in which a user can identify elements of interest in the ontology, and the system can compose appropriate queries against data sources in an automated or assisted fashion. The specifications of such queries can be saved for reuse, so that even if there are currently very few data of interest to a specific query within the federation, the same query may return more results when rerun in future as the contents of the federated databases expand.

We need to devise efficient protocols for delta transmission. There are several kinds of deltas: metadata differences between DSS schema versions (transmitted between DSS and IS), ontology version differences (between OS and IS), and changes in ontological mappings at the metadata and data levels (between DSS and OS).

We also plan to improve query responsiveness. Nonparameterized queries of relatively static data can be scheduled to run periodically, and their results cached on ISs to avoid unnecessary reprocessing. This solution is particularly useful to automatically populate client tables containing information from multiple data sources (genomic, ontological, or publication data).

Lessons Learned
• Optimizing the Use of Metadata: Metadata improve the understandability of a database’s contents. Current database engines provide limited metadata annotation facilities. Such limitations hinder the ability to formulate distributed queries. For the DSS alone, these metadata are being used to generate real-time data previews from a particular table or column. Data preview minimizes the number of iterations required for correct query formulation. The use of explicit relationships in data sources that implicitly do not support them also improves the understanding of the database structure.
• Ontology Mapping: The use of ontological annotations improves on ordinary metadata descriptions, facilitating the localization of data of interest. One contribution of this paper is in integrating ontology-based approaches with federated query technology. We believe that the National Library of Medicine’s idea of the “Information Sources Map,” which was emphasized during the early years of UMLS development, is one that needs to be resurrected and fleshed out. Rather than simply indicating that a particular database contains information about a particular topic of interest, the mapping of vocabulary terms to actual metaschema/data elements goes a step further in fetching contextual information. The problem of “meaning” of data has often been somewhat ignored in the computer science literature on database federation, which has often discussed unrealistic scenarios such as mapping elements in different databases to each other based on their names. In the future, ontological mappings could play a crucial role in mediated ontological queries (queries based on ontologies that are translated to structured queries in ontologically annotated database schemas).
• Query Adaptation by Decomposition and Versioning: The second contribution of this paper is the use of versioning approaches and decomposition of queries into their atomic components to achieve the goals of metadata refresh and the offline flagging of queries depending on altered metadata elements well before they are actually executed. Schema synchronization and versioning allow rapid determination of which queries are affected by a change to a specific element and the extent to which automated recovery mechanisms can operate successfully.
Current systems often break without any indication of, or reason for, their failure. More important, many mainstream DBMSs are excessively fragile even when dealing with a single (nonfederated) database. With a more robust dependency model, consistency maintenance could be performed automatically.

We are continuing to accumulate experience with QIS, and releasing the framework through open-source mechanisms will

help us evolve it based on the needs of a variety of groups that choose to experiment with it.

References
1. Koslow S, Huerta M. Neuroinformatics: An Overview of the Human Brain Project. Mahwah, NJ: Lawrence Erlbaum Associates, 1997.
2. Miller PL, Nadkarni P, Singer M, Marenco L, Hines M, Shepherd G. Integration of multidisciplinary sensory data: a pilot model of the human brain project approach. J Am Med Inform Assoc. 2001;8:34–48.
3. Shepherd GM, Healy MD, Singer MS, et al. Senselab: a project in multidisciplinary, multilevel sensory integration. In: Koslow SH, Huerta MF, editors. Neuroinformatics: An Overview of the Human Brain Project. Mahwah, NJ: Lawrence Erlbaum Associates, 1997, pp 21–56.
4. Kans JA, Ouellette BFF. Submitting DNA sequences to the databases. In: Baxevanis AD, Ouellette BFF, editors. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. New York: John Wiley & Sons, 1998.
5. Li P. BizTalk Server Developer’s Guide. New York: Osborne/McGraw-Hill, 2002.
6. Genetic Exchange Inc. Exploiting the life science data explosion to speed new drug discovery. Turn Massive Amounts of Data into Gems of Knowledge Using discoveryHub. Available at: http://www.geneticxchange.com/v3/product/whitepapers/WPexplosion.pdf. Accessed Dec 17, 2002.
7. Haas LM, Schwarz PM, Kodali P, Kotlar E, Rice JE, Swope WC. DiscoveryLink: A system for Integrated Access to Life Science Data Sources. IBM Syst J. 2001;40:489.
8. Josifovski V, Risch T. Query decomposition for a distributed object-oriented mediator system. Dist Parallel Databases. 2002;11:307–36.
9. Chung SY, Wong L. Kleisli: a new tool for data integration in biology. Trends Biotechnol. 1999;17:351–5.
10. McBrien P, Poulovassilis A. Schema evolution in heterogeneous database architectures, a schema transformation approach. In: Pidduck AB, editor. 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002). Berlin: Springer-Verlag, 2002, pp 484–99.
11. Ram S, Shankaranarayanan G. Dynamically Managing Schema Changes in a HDE: Pitfalls and Possibilities. 2003. Available at: http://smgnet.bu.edu/smgnet/css/staff/pub/GetFile.CFM/Shankar,_G_11.pdf?did=229&Filename=Shankar,_G_11.pdf. Accessed Jan 25, 2004.
12. Li X, Tari Z. Class versioning for schema evolution. In: Proceedings of the Australian Database Conference (ADC), Perth, Australia, 1998, pp 117–28.
13. Kim W, Seo J. Classifying schematic and data heterogeneity in multidatabase systems. IEEE Comput. 1991;24:12–8.
14. Sheth AP, Kashyap V. So far (schematically) yet so near (semantically). In: Proceedings of the International Federation for Information Processing (IFIP) Working Group on Database Semantics Conference on Interoperable Database Systems (DS-5) 1992, pp 283–312.
15. Paton NW, Stevens R, Baker PG, Goble CA, Bechhofer S, Brass A. Query processing in the TAMBIS Bioinformatics Source Integration System. In: Proceedings of the 11th International Conference on Scientific and Statistical Databases (SSDBM), 1999. Los Alamitos, CA: IEEE Press, 1999, pp 138–47.
16. Stevens R, Baker P, Bechhofer S, et al. TAMBIS: Transparent access to multiple bioinformatics information sources. Bioinformatics. 2000;16:184–6.
17. Rector AL, Bechhofer S, Goble CA, Horrocks I, Nowlan WA, Solomon WD. The GRAIL concept modeling language for medical terminology. Artif Intell Med. 1997;9:139–71.
18. Wong LS. The Collection Programming Language. 1996. Available at: http://citeseer.ist.psu.edu/wong96collection.html/. Accessed Apr 20, 2004.
19. Lindberg DA, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med. 1993;32:281–91.
20. Mork P, Halevy A, Tarczy-Hornoch P. A model for data integration systems of biomedical data applied to online genetic databases. AMIA Fall Symp. 2001:473–7.
21. Mork P, Shaker R, Halevy A, Tarczy-Hornoch P. PQL: a declarative query language over dynamic biological schemata. AMIA Fall Symp. 2002:533–7.
22. OMIM. Online Mendelian Inheritance in Man. 2002. Available at: http://www.ncbi.nlm.nih.gov/omim/. Accessed Nov 10, 2002.
23. Nadkarni PM. Management of evolving map data: data structures and algorithms based on the framework map. Genomics. 1995;30:565–73.
24. Marenco L, Tosches N, Crasto C, Shepherd G, Miller PL, Nadkarni PM. Achieving evolvable Web-database bioscience applications using the EAV/CR framework: recent advances. J Am Med Inform Assoc. 2003;10:444–53.
25. Gardner D, Knuth KH, Abato M, et al. Common data model for neuroscience data and data model exchange. J Am Med Inform Assoc. 2001;8:17–33.
26. Kaye D. Loosely Coupled: The Missing Pieces of Web Services. Kentfield, CA: RDS Press, 2003.
27. Masys D. An evaluation of the source selection elements of the prototype UMLS Information Sources Map. In: Proceedings of the Annual Symposium on Computer Applications in Medical Care. 1992:295–8.
28. Andersson O, Armstrong P, Axelsson H, et al. Scalable Vector Graphics (SVG) 1.1 Specification. 2003. Available at: http://www.w3.org/TR/2003/REC-SVG11-20030114/. Accessed Jul 7, 2003.
29. Haertel M, Hayes D, Stallman R, Tower L, Eggert P. GNU DIFF. 2.7 ed. Cambridge, MA: Free Software Foundation, 1992. Program and documentation available at: ftp://prep.ai.mit.edu/pub/gnu/diffutils-2.7.
30. Krinke J, Zeller A. Linux/Unix Programming Toolset: Version Control, Construction, Testing, and Debugging. New York: John Wiley & Sons, 2001.
31. Sankoff DE, Kruskal JB. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, MA: Pearson Addison-Wesley, 1983.
32. Bota M, Dong HW, Swanson LW. From gene networks to brain networks. Nat Neurosci. 2003;6(8):795–9.
33. Nadkarni PM, Chen RS, Brandt CA. UMLS concept indexing for production databases: a feasibility study. J Am Med Inform Assoc. 2001;8:80–91.
