CPS - Data Intensive Distributed Computing
Application Example
Dhiraj Tawani
Jyoti Tejwani
Namrata Shownkeen
[email protected]
[email protected]
[email protected]
Smt. S. H. Mansukhani Institute of Technology
Ulhasnagar, Dist. Thane, Maharashtra, India
Abstract
Modern scientific computing involves organizing, moving, visualizing, and
analyzing massive amounts of data from around the world, as well as employing large-scale computation. The distributed systems that solve large-scale problems will always
involve aggregating and scheduling many resources. Data must be located and staged,
cache and network capacity must be available at the same time as computing capacity,
etc.
Every aspect of such a system is dynamic: locating and scheduling resources,
adapting running application systems to availability and congestion in the middleware
and infrastructure, responding to human interaction, etc. The technologies, the
middleware services, and the architectures that are used to build useful high-speed, wide-area distributed systems constitute the field of data intensive computing. This paper
explores some of the history and future directions of that field, and describes a specific
medical application example.
Computational Grids enable the sharing of a wide variety of geographically
distributed resources including supercomputers, storage systems, databases, data sources
and specialized devices owned by different organizations in order to create virtual
enterprises and organizations. They allow selection and aggregation of distributed
resources across multiple organizations for solving large-scale computational and data
intensive problems in science, engineering and commerce. The parallel processing of
applications on wide-area distributed systems provides scalable computing power.
This enables exploration of large problems with huge data sets, which is essential for
creating new insights into the problem. Molecular modeling for drug design is one of the
scientific applications that can benefit from the availability of a large computational
capability.
Introduction
The advent of shared, widely available, high-speed networks is providing the
potential for new approaches to the collection, storage, and analysis of large data-objects.
Two examples of large data-object environments that, despite their very different application areas, have much in common are health care imaging information systems
and atomic particle accelerator detector data systems.
Health care information, especially high-volume image data used for diagnostic purposes - e.g. X-ray CT, MRI, and cardio-angiography - is increasingly collected at tertiary (centralized) facilities, and may now be routinely stored and used at locations other than the point of collection. The importance of distributed storage is that a hospital (or any other instrumentation environment) may not be the best environment in which to maintain a large-scale digital storage system, and an affordable, easily accessible, high-bandwidth network can provide location independence for such storage. The importance
of remote end-user access is that the health care professionals at the referring facility
(frequently remote from the tertiary imaging facility) will have ready access not only to the image analyst's reports, but also to the original image data itself.
This general scenario extends to other fields as well. In particular, the same basic
infrastructure is required for remote access to large-scale scientific and analytical
instruments, both for data handling and for direct, remote-user operation. In this paper we
describe and illustrate a set of concepts that are contributing to a generalized, high-performance, distributed information infrastructure, especially as it concerns the types of
large data-objects generated in the scientific and medical environments. We will describe
the general issues, architecture, and some system components that are currently in use to
support distributed large data-objects. We describe a health care information system that
has been built and is in prototype operation.
The concept of a high-speed distributed cache as a common element for all of the
sources and sinks of data involved in high-performance data systems has proven very
successful in several application areas, including the automated processing and
cataloguing of real-time instrument data and the staging of data from a Mass Storage
System (MSS) for high data-rate applications. For the various data sources and sinks, the
cache, which is itself a complex and widely distributed system, provides:
- a standardized approach for high data-rate interfaces;
- a unit of high-speed, on-line storage that is large compared to the available disks of the computing environments, and very large (e.g., hundreds of gigabytes) compared to any single disk.
The model for data intensive computing, shown in Figure 1, includes the following:
- Data sources deposit data in a distributed cache, and consumers take data from the cache, usually writing processed data back to the cache when the consumers are intermediate processing operations;
- A tertiary storage system manager typically migrates data to and from the cache.
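To make this data flow concrete, the following minimal sketch (in Python, with all class and function names invented for illustration, not taken from any real system) models the three roles above: sources depositing blocks into a shared cache, a consumer writing a derived result back, and a tertiary storage manager migrating blocks between the cache and an archive.

# Minimal sketch of the Figure 1 data-flow model: sources deposit data into a
# shared cache, consumers read from it (writing intermediate results back),
# and a tertiary-storage manager migrates blocks between cache and archive.
# All class and method names here are illustrative, not part of any real API.

class DistributedCache:
    """Toy stand-in for the distributed cache: a bounded block store."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = {}          # block_id -> bytes

    def put(self, block_id, data):
        if len(self.blocks) >= self.capacity:
            raise RuntimeError("cache full; tertiary manager must evict first")
        self.blocks[block_id] = data

    def get(self, block_id):
        return self.blocks.get(block_id)   # None => not yet staged


class TertiaryStorageManager:
    """Migrates blocks between the cache and an (abstract) archive."""
    def __init__(self, cache, archive):
        self.cache, self.archive = cache, archive

    def stage_in(self, block_id):
        if self.cache.get(block_id) is None and block_id in self.archive:
            self.cache.put(block_id, self.archive[block_id])

    def migrate_out(self, block_id):
        data = self.cache.blocks.pop(block_id, None)
        if data is not None:
            self.archive[block_id] = data


if __name__ == "__main__":
    archive = {"ds1/blk0": b"raw frame 0"}            # pretend tape archive
    cache = DistributedCache(capacity_blocks=1024)
    mgr = TertiaryStorageManager(cache, archive)

    cache.put("ds1/blk1", b"raw frame 1")             # instrument (data source) deposits
    mgr.stage_in("ds1/blk0")                          # manager stages archived data
    processed = cache.get("ds1/blk0") + b" [processed]"
    cache.put("ds1/blk0.derived", processed)          # consumer writes result back
    mgr.migrate_out("ds1/blk0.derived")               # derived data archived
    print(sorted(archive))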
The cache can thus serve as a moving window on the object/dataset: depending on the size of the cache relative to the objects of interest, only part of the object data may be loaded in the cache, though the full object definition is present; that is, the cache is a moving window onto the off-line object/data set. The native cache access interface is at the logical block level, but client-side libraries implement various access I/O semantics - e.g., Unix I/O (upon request, available data is returned; requests for data in the dataset, but not yet migrated to cache, cause the application-level read to block or be signaled). The cache also isolates the application from tertiary storage systems and instrument data sources. Many large data sets may be logically present in the cache by virtue of the block index maps being loaded even if the data is not yet available. Data blocks are declustered (dispersed in such a way that as many system elements as possible can operate simultaneously to satisfy a given request) across both disks and servers. This strategy allows a large collection of disks to seek in parallel, and all servers to send the resulting data to the application in parallel, as shown in Figure 2. In this way processing can begin as soon as the first data
blocks are generated by an instrument or migrated from tertiary storage.
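The declustering strategy can be illustrated with a short sketch. The block-placement rule and the server and disk counts below are assumptions chosen for the example, not the actual layout used by the cache; the point is simply that a round-robin mapping lets every disk seek, and every server send, for a single large read.

# Illustrative sketch (not real cache code) of block declustering: logical
# blocks are spread round-robin across servers and disks so that many spindles
# can seek, and many servers can send, in parallel for one request (cf. Figure 2).

from concurrent.futures import ThreadPoolExecutor

NUM_SERVERS = 4          # assumed counts for the example
DISKS_PER_SERVER = 8

def place_block(logical_block):
    """Map a logical block number to a (server, disk) pair, round-robin."""
    server = logical_block % NUM_SERVERS
    disk = (logical_block // NUM_SERVERS) % DISKS_PER_SERVER
    return server, disk

def read_block(logical_block):
    """Pretend to fetch one block from its server/disk; returns a fake payload."""
    server, disk = place_block(logical_block)
    return f"block {logical_block} from server {server}, disk {disk}"

def parallel_read(block_list, max_workers=NUM_SERVERS * DISKS_PER_SERVER):
    """Issue all block reads concurrently, as a declustered layout allows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_block, block_list))

if __name__ == "__main__":
    for line in parallel_read(range(8)):
        print(line)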
The Distributed Parallel Storage System (DPSS) [1], which implements this distributed cache, provides several important and unique capabilities for a data intensive computing environment. It provides application-specific interfaces to an extremely large
space of logical blocks; it offers the ability to build large, high-performance storage
systems from inexpensive commodity components; and it offers the ability to increase
performance by increasing the number of parallel disk servers. Various cache
management policies operate on a per-data set basis to provide block aging and
replacement.
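The specific policies are not described here, so the sketch below illustrates just one plausible scheme for per-data-set block aging and replacement: each data set receives its own quota of cache blocks and ages them out least-recently-used first. All names are invented for the example.

# A minimal sketch of per-data-set block aging and replacement. This is one
# plausible scheme (LRU within each dataset's own quota), not the actual DPSS
# policy; class, method, and variable names are invented for the example.

from collections import OrderedDict

class DatasetCachePartition:
    """LRU-managed block cache for a single dataset."""
    def __init__(self, quota_blocks):
        self.quota = quota_blocks
        self.blocks = OrderedDict()       # block_id -> data, oldest first

    def access(self, block_id, loader):
        """Return a block, loading (and possibly evicting) as needed."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)     # refresh age on a hit
            return self.blocks[block_id]
        if len(self.blocks) >= self.quota:
            evicted_id, _ = self.blocks.popitem(last=False)   # age out the oldest block
            print(f"evicting {evicted_id}")
        data = loader(block_id)                   # e.g. migrate from tertiary storage
        self.blocks[block_id] = data
        return data

if __name__ == "__main__":
    part = DatasetCachePartition(quota_blocks=2)
    fake_loader = lambda bid: f"<data for {bid}>"
    for bid in ["b0", "b1", "b0", "b2"]:          # b1 is evicted when b2 arrives
        part.access(bid, fake_loader)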
The Wide Area Large Data Object (WALDO) architecture [8] provides, among other capabilities, automatic cataloguing of the data and the metadata as the data is received (or as close to real time as possible), and the ability to manage the data on mass storage systems. For subsequent use, the data components may
be staged to a local disk and then returned as usual via the Web browser, or, as is the case
for several of our applications, moved to a high-speed cache for access by specialized
applications (e.g., the high-speed video player illustrated in the right-hand part of the
right-hand panel in Figure 3). The location of the data components on tertiary storage, how
to access them, and other descriptive material are all part of the LDO definition. The
creation of object definitions, the inclusion of standardized derived-data-objects as part
of the metadata, and the use of typed links in the object definition are intended to provide
a general framework for dealing with many different types of data, including, for
example, abstract instrument data and multi-component multimedia programs. WALDO
was used in the Kaiser project to build a medical application that automatically manages
the collection, storage, cataloguing, and playback of video-angiography data collected
at a hospital remote from the referring physician.
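As a rough illustration of what such an object definition might carry, the hypothetical data structure below captures the elements named above: component locations on tertiary storage, access information, descriptive metadata, derived-data-objects, and typed links. The field names and example values are invented for this sketch and are not taken from WALDO.

# Hypothetical sketch of a WALDO-style large data-object (LDO) definition,
# based only on the description in the text. Every field name and example
# value is invented for illustration.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Component:
    name: str                 # e.g. "angiography sequence, run 1"
    storage_uri: str          # where the bytes live on tertiary storage
    access_method: str        # e.g. "stage-to-cache", "http"

@dataclass
class TypedLink:
    relation: str             # e.g. "derived-from", "thumbnail-of"
    target: str               # identifier of another LDO or component

@dataclass
class LargeDataObject:
    object_id: str
    metadata: Dict[str, str] = field(default_factory=dict)
    components: List[Component] = field(default_factory=list)
    derived_objects: List[Component] = field(default_factory=list)
    links: List[TypedLink] = field(default_factory=list)

if __name__ == "__main__":
    ldo = LargeDataObject(
        object_id="angio/1997-04-03/study-001",
        metadata={"modality": "cardio-angiography", "site": "referring hospital"},
    )
    ldo.components.append(
        Component("video sequence 1", "mss://archive/angio/0001", "stage-to-cache"))
    ldo.derived_objects.append(
        Component("preview clip", "http://example.invalid/preview/0001", "http"))
    ldo.links.append(TypedLink("derived-from", "video sequence 1"))
    print(ldo.object_id, len(ldo.components), "component(s)")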
Using a shared, metropolitan area ATM network and a high-speed distributed data
handling system, video sequences are collected from the video-angiography imaging
system, then processed, catalogued, stored, and made available to remote users. This
permits the data to be made available in near-real time to remote clinics (see Figure 3).
The LDO becomes available as soon as the catalogue entry is generated; derived data is added as the processing required to produce it completes. Whether the
storage systems are local or distributed around the network is entirely a function of
optimizing logistics.
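This "catalogue first, derive later" behaviour can be sketched as follows. The sketch is illustrative only (the function names and timing are made up): the catalogue entry is registered as soon as the raw data arrives, so the LDO is immediately visible to remote users, and a background task attaches derived products when their processing finishes.

# Sketch of catalogue-first ingest: the LDO is visible as soon as the raw data
# is catalogued, and derived products are appended asynchronously later.
# Names, sleeps, and the in-memory catalogue are stand-ins, not a real system.

import threading, time

catalogue = {}                      # object_id -> list of available parts
catalogue_lock = threading.Lock()

def ingest(object_id, raw_name):
    """Register the LDO immediately; kick off derivation in the background."""
    with catalogue_lock:
        catalogue[object_id] = [raw_name]         # visible to remote users now
    threading.Thread(target=derive, args=(object_id, raw_name)).start()

def derive(object_id, raw_name):
    time.sleep(0.1)                               # stand-in for real processing
    with catalogue_lock:
        catalogue[object_id].append(raw_name + ".preview")

if __name__ == "__main__":
    ingest("angio/run-42", "video-sequence")
    print("immediately after ingest:", catalogue["angio/run-42"])
    time.sleep(0.2)
    print("after processing:        ", catalogue["angio/run-42"])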
In the Kaiser project, cardio-angiography data was collected directly from a Philips
scanner by a computer system in the San Francisco Kaiser hospital Cardiac
Catheterization Laboratory. This system is, in turn, attached to an ATM network provided
by the NTON and BAGNet testbeds. When the data collection for a patient is complete
(about once every 20-40 minutes), 500-1000 megabytes of digital video data is sent
across the ATM network to LBNL (in Berkeley) and stored first on the DPSS distributed
cache (described above), and then the WALDO object definitions are generated and made
available to physicians in other Kaiser hospitals via BAGNet.
Auxiliary processing and archiving to one or more mass storage systems proceeds
independently. This process goes on 8-10 hours a day.
In a related experiment, described next, a sustained 57 megabytes/sec of data was delivered from datasets in the distributed cache to the memory of a remote application, ready for the analysis algorithms to commence operation. This experiment is an example of our data intensive computing model in operation.
The prototype application was the STAR analysis system that analyzes data from
high energy physics experiments [3]. A four-server DPSS located at LBNL
was used as a prototype front end for a high-speed mass storage system. A 4-CPU Sun E4000 located at SLAC was a prototype for a physics data analysis computing cluster, as
shown in Figure 2. The National Transparent Optical Network testbed (NTON; see [7])
connects LBNL and SLAC and provided a five-switch, 100-km, OC-12 ATM path. All
experiments were application-to-application, using TCP transport.
Multiple instances of the STAR analysis code read data from the DPSS at LBNL
and moved that data into the memory of the STAF application where it was available to
the analysis algorithms. This experiment resulted in a sustained data transfer rate of 57
MBytes/sec from DPSS cache to application memory. This is the equivalent of about 4.5
TeraBytes/day. The goal of the experiment was to demonstrate that high-speed mass
storage systems could use distributed caches to make data available to the systems
running the analysis codes. The experiment was successful, and the next steps will
involve completing the mechanisms for optimizing the MSS staging patterns and
completing the DPSS interface to the bit file movers that interface to the MSS tape
drives.
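The transfer pattern of that experiment - several readers pulling blocks concurrently from different cache servers and handing them to the analysis code - can be sketched as below. The server names, block size, and queue-based hand-off are assumptions made for the example; no real DPSS or STAF interfaces are shown.

# Minimal sketch of the parallel transfer pattern: one reader per cache server
# feeds a shared queue that the analysis consumer drains, so delivered
# bandwidth is roughly the sum of the per-server streams. Server names, block
# size, and block counts are invented; no real DPSS or STAF interfaces appear.

import queue, threading

BLOCK_SIZE = 64 * 1024                      # assumed block size, bytes
blocks_ready = queue.Queue()

def server_reader(server_name, num_blocks):
    """One reader per cache server; in reality this would be a TCP stream."""
    for _ in range(num_blocks):
        blocks_ready.put((server_name, b"\0" * BLOCK_SIZE))
    blocks_ready.put((server_name, None))   # end-of-stream marker

def analysis_consumer(num_servers):
    """Consume blocks as they arrive; stands in for the analysis algorithms."""
    finished, total_bytes = 0, 0
    while finished < num_servers:
        server, data = blocks_ready.get()
        if data is None:
            finished += 1
        else:
            total_bytes += len(data)
    print(f"received {total_bytes / 1e6:.1f} MB from {num_servers} servers")

if __name__ == "__main__":
    servers = ["cache0", "cache1", "cache2", "cache3"]   # four servers, as in the test
    for name in servers:
        threading.Thread(target=server_reader, args=(name, 100)).start()
    analysis_consumer(len(servers))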
Conclusion
We believe this architecture, and its integration with systems like Globus, will enable the
next generation of configurable, distributed, high-performance, data-intensive systems;
computational steering; and integrated instrument and computational simulation. We also
believe that a high-performance network cache system such as the DPSS will be an important component of these computational grids and metasystems.
References
[1] DPSS, The Distributed Parallel Storage System, http://www-didc.lbl.gov/DPSS/
[2] Globus, The Globus Project, http://www.globus.org/
[3] Greiman, W., W. E. Johnston, C. McParland, D. Olson, B. Tierney, C. Tull, High-Speed Distributed Data Handling for HENP, Computing in High Energy Physics, April,
1997. Berlin, Germany. http://www-itg.lbl.gov/STAR/
[4] Grimshaw, A., A. Ferrari, G. Lindahl, K. Holcomb, Metasystems, Communications of the ACM, November, 1998, Vol. 41, No. 11.
[5] Foster, I., C. Kesselman, eds., The Grid: Blueprint for a New Computing
Infrastructure, Morgan Kaufmann, publisher. August, 1998.
[6] Fuller, B., and I. Richer, The MAGIC Project: From Vision to Reality, IEEE Network, May, 1996, Vol. 10, No. 3. http://www.magic.net/
[7] NTON, National Transparent Optical Network Consortium. See http://www.ntonc.org/.
[8] Johnston, W., G. Jin, C. Larsen, J. Lee, G. Hoo, M. Thompson, B. Tierney, J.
Terdiman, Real-Time Generation and Cataloguing of Large Data-Objects in Widely
Distributed Environments, International Journal of Digital Libraries - Special Issue on
Digital Libraries in Medicine. November, 1997. (Available at http://www-itg.lbl.gov/WALDO/)
[9] Thompson, M., W. Johnston, J. Guojun, J. Lee, B. Tierney, and J. F. Terdiman,
Distributed health care imaging information systems, PACS Design and Evaluation:
Engineering and Clinical Issues, SPIE Medical Imaging 1997. (Available at http://www-itg.lbl.gov/Kaiser.IMG)
[10] Tierney, B., W. Johnston, B. Crowley, G. Hoo, C. Brooks, D. Gunter, The NetLogger Methodology for High Performance Distributed Systems Performance Analysis,
Seventh IEEE International Symposium on High Performance Distributed Computing,
Chicago, Ill., July 28-31, 1998. (Available at http://www-itg.lbl.gov/DPSS/papers.html)
[11] Tierney, B., W. Johnston, J. Lee, and G. Hoo, Performance Analysis in High-Speed
Wide Area ATM Networks: Top-to-bottom End-to-end Monitoring, IEEE Network, May 1996. (Available at http://www-itg.lbl.gov/DPSS/papers)