GRID DATA POOLING Exploitation of The Grid System For Environmental Applications
Abstract: This paper gives an introduction to Grid technology. The Grid is a
unified source of distributed resources connected through a fast network; it
supplies its users with a network of computational and storage elements.
Our main goal is to create a single pool of reference for data collected from
the Internet that is of use to geoscientists, and to make it available to them
in a grid-like way. The catalogue includes items such as the GTOPO and ETOPO
databases, as well as larger datasets such as freely available ozone data. When
data is registered to the Grid it is assigned a unique logical file name (LFN).
The catalogue contains these file names, with which Grid users can access the
data.
Introduction
The basic idea behind a Computing Grid architecture is that of the electric power
grid: a variety of resources contribute power into a shared "pool" for many consumers
to access on an as-needed basis [2]. The Grid concept goes well beyond simple
communication between computers and aims ultimately to turn the global network of
computers into one vast computational resource. Grid computing is a form of
distributed computing that involves coordinating and sharing computing,
application, data, storage, or network resources across dynamic and geographically
dispersed organizations. The LCG/EGEE Grid is a service for sharing computer
power and data storage capacity over the Internet [3].
The Grid was designed for the LHC experiment, the largest scientific instrument in
the world, which will begin operations at CERN in 2007 and produce data at about
10 Petabytes per year [4]. Its use, however, branches out into every field of
science that needs great computing power. In theory, the Grid can be used by any
application that requires a computer, since the Grid is essentially one very large
computer, and over time it is being adopted by many more scientific groups. In
practice, the first Grid users are those with very demanding applications that
cannot be run on simple computer systems. Today the Grid is used in High Energy
Physics, Astronomy, Biomedicine, Chemistry, Environmental Sciences and other
fields. Scientists form collaborating communities called Virtual Organizations
(VOs); for the LHC these include ATLAS, CMS, LHCb and ALICE, and there are more
for other sciences, such as MAGIC, BIOMED, COMPCHEM and ESR.
Typical data analysis, e.g. in climate applications, involves the collection, cross
correlation (intercomparison) and adaptation of model data, as well as validation
against observational data. Until recently, model data was produced in dedicated
HPC centers and stored in dispersed data center facilities, with no logical
links between the different copies (replicas). Individual dataset sizes for ESR
normally range from hundreds of Megabytes to tens of Gigabytes, even for single
files, and the overall data managed by individual data centers is often of the
order of Petabytes.
Thus the major challenge in ESR-related data analysis is access to, and transfer
of, huge amounts of data, a task the Grid can handle. The most common data formats
are self-describing: NetCDF, HDF and GRIB. Typical software used in pre- and
postprocessing includes PINGO, CDO and AFTERBURNER [5]. Transferring the data can
take a great deal of time when a personal computer is used for the analysis. We
can do better if the data is already in place when processing takes place on the
Grid.
The Grid provides storage and computing resources that reduce the time consumed to
a fraction of that needed with one or two personal computers. Replication of data
helps further.
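To give a feeling for the transfer times involved, the following minimal sketch
computes idealised transfer times for dataset sizes in the range quoted above. The
link speeds are assumptions chosen for illustration, not measurements, and protocol
overhead is ignored.

```python
def transfer_seconds(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Idealised transfer time: payload size in bits divided by link speed."""
    return size_bytes * 8 / bandwidth_bits_per_s


# Dataset sizes in the range quoted in the text
# (hundreds of Megabytes up to tens of Gigabytes).
SIZES = {"500 MB": 500e6, "10 GB": 10e9}

# Hypothetical sustained link speeds (assumptions, not measurements).
LINKS = {"10 Mbit/s": 10e6, "1 Gbit/s": 1e9}

for size_label, size in SIZES.items():
    for link_label, speed in LINKS.items():
        minutes = transfer_seconds(size, speed) / 60
        print(f"{size_label} over {link_label}: {minutes:.1f} min")
```

Under these assumptions a single file of tens of Gigabytes keeps a slow link busy
for hours, which is why keeping the data, or a replica of it, close to where the
processing runs matters.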
Many applications in the earth sciences use large datasets such as the ones above.
The first ESR application ported to the Grid was the comparison of ozone profiles
obtained by different means: the GOME satellite and ground-based lidar stations.
As GOME produces many more ozone profiles than the ground lidars, the key factor
of the comparison was the ability to accurately find satellite data that can be
compared with the existing lidar data. Data coincidence is determined by two
criteria: location and date. The Grid helps the scientific community analyze the
ozone retrieval by providing a single computing environment for the different
steps, allowing the data to be easily shared between "producer" and "consumer".
The sharing of computing resources allows time-consuming calculations to be
carried out faster for the benefit of everyone. In this case, the OPERA algorithm
is the one that stresses the Grid environment the most, through the computing
resources it requires and the large number of files it generates [6].
This is why metadata availability and security are crucial for the ozone experiment.
The community is heavily involved in the testing of Grid solutions for relational
database management.
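The two coincidence criteria mentioned above, location and date, can be sketched as
follows. This is only an illustration and not the procedure used by the GOME
application: the Profile record, the distance threshold of 500 km and the time
window of 12 hours are all assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Profile:
    """Hypothetical minimal metadata record for one ozone profile."""
    source: str
    lat: float   # degrees north
    lon: float   # degrees east
    time: datetime


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km, using a mean Earth radius of 6371 km."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def coincident(sat, lidar, max_km=500.0, max_dt=timedelta(hours=12)):
    """True if both criteria hold: close in space and close in time.

    The default thresholds are illustrative assumptions.
    """
    close = great_circle_km(sat.lat, sat.lon, lidar.lat, lidar.lon) <= max_km
    near_in_time = abs(sat.time - lidar.time) <= max_dt
    return close and near_in_time


def match(sat_profiles, lidar_profiles, **thresholds):
    """Return (lidar, satellite) pairs that satisfy both criteria."""
    return [
        (lidar, sat)
        for lidar in lidar_profiles
        for sat in sat_profiles
        if coincident(sat, lidar, **thresholds)
    ]
```

Because the lidar profiles are far fewer than the satellite profiles, the outer loop
runs over the lidar data and selects, for each ground measurement, the satellite
profiles worth comparing with it.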
Implementation
To address this issue, our main goal is to create a single pool of reference for
data that is already available on the Internet, organised in a grid-like way that
is of use to geoscientists. The proposed catalogue includes items such as the
GTOPO and ETOPO databases, as well as similar and larger datasets such as freely
available ozone data.
The method of working involves the following steps. The datasets are first retrieved
and moved to a specific file repository in a Unix account on a User Interface,
which acts as the entrance door to the LCG/EGEE Grid. Most of this work has been
done using the tool wget [7]. Then each file is copied with the command lcg-cr to
a Storage Element [8], which is provided by the ESR-VO for its users. When
“registering” the file to the Grid, a logical file name (LFN) must be assigned for
further reference and use of the file. An example follows.
The ECMWF has a large database of weather measurements for the last 40 years. We
retrieved a part of this data and registered it to the Grid as described above.
The complete list of files moved to the Grid and specifically to the storage element
se01.isabella.grnet.gr can be found at [9].
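The retrieve-and-register steps described above can be sketched as a pair of
command builders. This is a minimal illustration, not the exact commands we ran:
the source URL, the staging directory, the file name and the LFN are hypothetical,
the VO name "esr" is an assumption, and the lcg-cr options follow the general form
given in the LCG-2 User Guide [8] and should be checked against the installed
version. The commands are only built, not executed, since they require a User
Interface with valid Grid credentials.

```python
from pathlib import PurePosixPath

VO = "esr"                                   # assumption: VO name passed to lcg-cr
STORAGE_ELEMENT = "se01.isabella.grnet.gr"   # storage element named in the text


def retrieval_command(url: str, staging_dir: PurePosixPath) -> list[str]:
    """wget call that saves the remote file into the staging directory."""
    return ["wget", "-P", str(staging_dir), url]


def registration_command(local_file: PurePosixPath, lfn: str) -> list[str]:
    """lcg-cr call that copies a local file to the SE and registers its LFN.

    local_file must be absolute, because lcg-cr expects a file:// URI.
    """
    return [
        "lcg-cr",
        "--vo", VO,
        "-d", STORAGE_ELEMENT,
        "-l", f"lfn:{lfn}",
        local_file.as_uri(),
    ]


if __name__ == "__main__":
    staging = PurePosixPath("/home/user/staging")           # hypothetical
    url = "https://example.org/data/sample.grb"             # hypothetical
    print(" ".join(retrieval_command(url, staging)))
    print(" ".join(registration_command(staging / "sample.grb", "esr_sample.grb")))
```

On a User Interface the two printed lines could be run in order, for example
through Python's subprocess module, once the names have been replaced by real ones.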
The files that were moved and registered to the Grid are only a small segment of the
Earth Sciences data that is available on the Internet and needed by users. The aim
was to demonstrate the method of registering data to the Grid and to apply it to a
specific range of data.
Future Improvements
At the time of our initial work the ESR-VO followed the RLS (Replica Location
Service) architecture, which had two drawbacks:
first, only flat files could be registered to the Grid, and
second, a number of security issues did not allow for confidentiality.
The newer LFC (LCG File Catalog) allows branched files to be created and lets the
user grant secure access to certain users through passwords etc.
Today the ESR-VO has already migrated to LFC. Registering files to the Grid for the
ESR-VO must now be done using the corresponding LFC commands, which are mostly the
same [10].
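The difference between a flat and a branched catalogue can be illustrated with a
short sketch. It assumes that LFC names are organised like directory paths under
/grid/<vo>/, and that the parent directory is created with lfc-mkdir before a file
is registered; the dataset, year and file names are hypothetical, and the exact
command options should be checked against the LFC documentation [10].

```python
from pathlib import PurePosixPath

VO = "esr"  # assumption: VO name used as the top-level catalogue directory


def hierarchical_lfn(dataset: str, year: int, filename: str) -> str:
    """Build a directory-like LFN such as lfn:/grid/esr/<dataset>/<year>/<file>."""
    path = PurePosixPath("/grid") / VO / dataset / str(year) / filename
    return f"lfn:{path}"


def mkdir_command(lfn: str) -> list[str]:
    """lfc-mkdir call that creates the catalogue directory holding the LFN."""
    parent = PurePosixPath(lfn.removeprefix("lfn:")).parent
    return ["lfc-mkdir", "-p", str(parent)]


if __name__ == "__main__":
    lfn = hierarchical_lfn("era40", 1999, "t2m.grb")   # hypothetical names
    print(lfn)
    print(" ".join(mkdir_command(lfn)))
```

With a flat catalogue the same information has to be packed into the file name
itself, whereas here the dataset and year become directories that can be listed
and given access rights separately.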
Apart from the drawbacks of the RLS protocol, there were other obstacles in the
process of registering files to the Grid. The procedure was quite time consuming
because each file was migrated and registered individually. Before registering the
files, a new name had to be chosen for each one, a name that would give future
users a good idea of the contents of the file. For a few files this would not be
an issue. In our case, however, we migrated and registered 180 files from the
ECMWF data alone. In the future we hope to work around such obstacles with
scripts wherever possible.
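A script of the kind envisaged here could look like the following minimal sketch,
which derives a descriptive name for each file from a few metadata fields and
prepares one registration command per file. The naming scheme, the metadata fields
and the example values are assumptions made for illustration; the lcg-cr options
follow the general form in the LCG-2 User Guide [8]. The commands are returned as
text so they can be reviewed before anything is registered.

```python
from pathlib import PurePosixPath


def descriptive_name(source, variable, year, month, suffix):
    """Hypothetical naming scheme: <source>_<variable>_<YYYYMM><suffix>."""
    return f"{source}_{variable}_{year:04d}{month:02d}{suffix}".lower()


def registration_plan(files, vo, storage_element):
    """Build one lcg-cr command line per file.

    files: iterable of (absolute local path, source, variable, year, month).
    Raises ValueError if two files would receive the same logical file name.
    """
    seen = set()
    commands = []
    for local, source, variable, year, month in files:
        local = PurePosixPath(local)
        name = descriptive_name(source, variable, year, month, local.suffix)
        if name in seen:
            raise ValueError(f"duplicate logical file name: {name}")
        seen.add(name)
        commands.append(
            f"lcg-cr --vo {vo} -d {storage_element} "
            f"-l lfn:{name} {local.as_uri()}"
        )
    return commands


if __name__ == "__main__":
    # Hypothetical staging files and metadata.
    files = [
        ("/home/user/staging/a.grb", "ECMWF", "t2m", 1999, 1),
        ("/home/user/staging/b.grb", "ECMWF", "t2m", 1999, 2),
    ]
    for line in registration_plan(files, "esr", "se01.isabella.grnet.gr"):
        print(line)
```

The duplicate check matters because a logical file name must be unique in the
catalogue; catching a clash before registration is cheaper than cleaning up after
it.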
The point of our work is to highlight the importance of data pooling in the Grid, not
only for environmental data but for other sciences as well. Future work would be to
register databases to the Grid in cooperation with scientists and other users seeking
to run their applications on the Grid. Ultimately, this work can be done for all
sciences and should prove of great use in the implementation of such applications.
References
[1] I. Foster, C. Kesselman, S. Tuecke, The Anatomy of the Grid
[2] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing
Infrastructure, Morgan Kaufmann Publishers, 1998, p. 3.
[6] Julian Linford, The GOME Application Deployment on EGEE, Technical Report,
March 2005
[7] Linux / Unix Command: wget, http://linux.about.com/od/commands/l/blcmdl1_wget.htm
[8] Antonio Delgado Peris, Patricia Méndez Lorenzo, Flavia Donno, Andrea Sciabà,
Simone Campana, Roberto Santinelli, LCG-2 User Guide, 2004, p. 64.
[9] Vayia Panagiotidi, Exploitation of the Grid Systems for Environmental
Applications, Diploma Dissertation, NTUA, October 2005,
http://www.hep.ntua.gr/files/vayia.pdf
[10] Tony Calanducci, LFC: The LCG File Catalog, User Training and Induction,
June 2005, www.phenogrid.dur.ac.uk/howto/LFC.pdf