
Mini Project Report On

Setting up Grid Environment and Executing GRAM

Submitted in partial fulfillment of the requirements for the


award of the degree of

Master of Technology
in

Network Engineering

By
Abin Paul
(Uni. Reg. No: 17001)

Under the guidance of


Mr. Biju Paul

Department of Information Technology


Rajagiri School of Engineering and Technology
Rajagiri Valley, Kakkanad, Kochi, 682039

February 2014
DEPARTMENT OF INFORMATION TECHNOLOGY
RAJAGIRI SCHOOL OF ENGINEERING AND TECHNOLOGY
RAJAGIRI VALLEY, KAKKANAD, KOCHI, 682039

CERTIFICATE

Certified that the project work entitled “Setting up Grid Environment and Executing
GRAM” is a bonafide work done by Mr. Abin Paul (University Register No. 17001)
in partial fulfillment of the requirements for the award of the Degree of Master of Technology
in Network Engineering from Mahatma Gandhi University, Kottayam, Kerala, during the
academic year 2013-2014.

Ms. Kuttyamma A. J.           Ms. Preetha K G               Mr. Biju Paul
Head of Department            Project Coordinator           Project Guide
Dept. of IT, RSET             Asst. Professor               Asst. Professor
                              Dept. of IT, RSET             Dept. of IT, RSET

External Examiner                                           Internal Examiner


ACKNOWLEDGEMENTS

Management is efficiency in climbing the ladder of success; leadership determines whether


the ladder is leaning against the right wall. Our Principal, Dr. J. Isaac, has always made
sure that our ladder to success was leaning against the right wall. I thank him for his
help and support.

I am thankful to my Head of the Department, Ms. Kuttyamma A. J., whose help and
guidance have been a major factor in completing my journey.

I express my gratitude to the project co-ordinator, Ms. Preetha K G, Asst.
Professor, Dept. of Information Technology, for her support and guidance.

I extend my sincere and heartfelt thanks to my guide, Mr. Biju Paul, Asst. Professor,
Dept. of Information Technology, for helping me in my presentation and providing
timely advice and guidance.

Mr. Binu A, Asst. Professor, Dept. of Information Technology, was a great mentor
throughout this project. He was a great driving force who went out of his comfort zone
to help me in my work and to provide timely advice and guidance. I thank him
for his immense support.

Growth is never by mere chance; it is the result of forces working together. This work
is the culmination of many forces joining hands together. For this I thank all my
friends, especially Mr. Joseph John, for the support and encouragement they have given
me during the course of my work.

Abin Paul
ABSTRACT

Although ”the Grid” is still just a dream, grid computing is already a reality. Imagine
several million computers from all over the world, owned by thousands of different
people. They might include desktops, laptops, supercomputers, data vaults, and instru-
ments like mobile phones, meteorological sensors and telescopes. Now imagine that all of
these computers can be connected to form a single, huge and super-powerful computer!
This huge, sprawling, global computer is what many people dream ”the Grid” will be.

A middleware named Globus is bringing this dream closer to reality. Globus has been used
by the National Center for Supercomputing Applications (NCSA), the European DataGrid
project based at CERN, the NASA Information Power Grid, and so on. The aim of
this project is to set up a grid using Globus and to make a performance evaluation of
the resulting setup. The goal of this evaluation is to find the inherent communication problems
associated with Globus protocols and to replace them with a better technology.
Contents

Acknowledgements ii

Abstract iii

List of Figures v

1 Introduction 1
1.1 Preamble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Scope and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Organization of the report . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Globus Toolkit 3
2.1 Evolution of Globus Toolkit . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Virtual Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Grid and Globus Toolkit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Grid Security Infrastructure (GSI) . . . . . . . . . . . . . . . . . . 8
2.3.2 Grid Resource Allocation Manager (GRAM) . . . . . . . . . . . . . 15
2.3.3 GridFTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Hardware and Software Specification 19


3.1 Hardware Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Software Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 System Analysis and Design 23


4.1 Building a grid architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.4 Grid architecture models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 Installation of Globus Toolkit 31


5.1 Setting up the first machine . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Setting up your second machine . . . . . . . . . . . . . . . . . . . . . . . . 34

6 Results 38

7 Discussion 41

8 Concluding remarks 43
8.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

References 44

List of Figures

2.1 Globus Toolkit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


2.2 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Virtual Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.1 Software Specification 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21


3.2 Software Specification 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Software Specification 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5.1 Step 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Step 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 Step 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.4 Step 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5 Step 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.6 Step 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.7 Step 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.8 Step 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.9 Step 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.10 Step 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.11 Step 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.12 Step 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.13 Step 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.14 Step 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.15 Step 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.16 Step 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.17 Step 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.18 Step 17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.19 Step 18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.20 Step 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

6.1 Gridmap Entry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.2 Adding user to myproxy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.3 Error in simpleCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.4 SimpleCA installation 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.5 SimpleCA installation 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

7.1 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

Chapter 1

Introduction

1.1 Preamble

In an increasing number of scientific disciplines, large data collections and intensive


computing are emerging as important elements. In domains as diverse as global climate
change, high energy physics, and computational genomics, the volume of interesting data
is already measured in petabytes. The communities of researchers that need to access
and analyze this data are often large and are almost always geographically distributed, as
are the computing and storage resources that these communities rely upon to store and
analyze their data.

This combination of large dataset size, geographic distribution of users and resources,
and computationally intensive analysis results in complex and stringent performance de-
mands that are not satisfied by any traditional infrastructure. A large scientific collabora-
tion may generate many queries, each involving access to supercomputer-class computa-
tions on gigabytes or terabytes of data. Efficient and reliable execution of these queries
may require careful management of terabytes of data transfer over wide area networks,
co-scheduling of data transfers and supercomputer computation, accurate performance es-
timations to guide the selection of dataset replicas, and other advanced techniques that
collectively maximize use of scarce storage, networking, and computing resources. In order
to deal with these growing challenges, the best option is to implement grid technology.

1.2 Motivation

Grid computing is the collection of computer resources from multiple locations to reach
a common goal. In the past most computing was done in a centralized manner involving
machines with large computation power, but this model can be quite expensive and doesn’t
scale well. Using a grid, we have the ability to distribute jobs to many smaller server
components using load sharing software that distributes the load evenly based on resource
availability and policies. Now instead of having one heavily burdened server the load
can be spread evenly across many smaller computers. The distributed nature of grid
computing is transparent to the user. When a user submits a job they don’t have to
think about which machine their job is going to get executed on. The ”grid software” will
perform the necessary calculations and decide where to send the job based on policies.
Many research institutions are using some sort of grid computing to address complex
computational challenges.

1.3 Scope and Objectives

Grid computing refers to the automated sharing and coordination of the collective pro-
cessing power of many widely scattered, robust computers that are not normally centrally
controlled, and that are subject to open standards. Grid technology helps in building up
enormous amounts of computing power and data storage, and offers a possibility to share expensive
computational resources. Grid covers a variety of areas such as computation management,
storage management, security provisioning, data movement, monitoring, agreement nego-
tiation, notification mechanisms, trigger services, and information aggregation.

1.4 Organization of the report

Here, Chapter 1 gives an introduction to the whole concept. Then, we move into
the primary concerns and origins of the Globus Toolkit in Chapter 2. In the next chapter,
we analyze the hardware and software specification. Chapter 4 gives the system analysis
and design. Chapter 5 describes the installation of the Globus Toolkit. Chapter 6 presents
the results and Chapter 7 the discussion. Finally, Chapter 8 closes the report with the
concluding remarks, followed by the reference list.

1.5 Summary

The first chapter gave an overview of the mini project. The motivation for the topic,
its scope and objectives, and the major contributions of the mini project were also
discussed. The next chapter gives a description of the relevance of the toolkit, its
installation and related work.

Chapter 2

Globus Toolkit

The open source Globus Toolkit is a fundamental enabling technology for the ”Grid,”
letting people share computing power, databases, and other tools securely online across
corporate, institutional, and geographic boundaries without sacrificing local autonomy.
The toolkit includes software services and libraries for resource monitoring, discovery,
and management, plus security and file management. In addition to being a central part
of science and engineering projects that total nearly a half-billion dollars internationally,
the Globus Toolkit is a substrate on which leading IT companies are building significant
commercial Grid products. The toolkit includes software for security, information infras-
tructure, resource management, data management, communication, fault detection, and
portability. It is packaged as a set of components that can be used either independently or
together to develop applications. Every organization has unique modes of operation, and
collaboration between multiple organizations is hindered by incompatibility of resources
such as data archives, computers, and networks. The Globus Toolkit was conceived to
remove obstacles that prevent seamless collaboration. Its core services, interfaces and
protocols allow users to access remote resources as if they were located within their own
machine room while simultaneously preserving local control over who can use resources
and when.

The Globus Toolkit has grown through an open-source strategy similar to the Linux
operating system’s, and distinct from proprietary attempts at resource-sharing software.
This encourages broader, more rapid adoption and leads to greater technical innovation,
as the open-source community provides continual enhancements to the product.

Figure 2.1: Globus Toolkit

2.1 Evolution of Globus Toolkit

In late 1994 Rick Stevens, director of the mathematics and computer science division at
Argonne National Laboratory, and Tom DeFanti, director of the Electronic Visualization
Laboratory at the University of Illinois at Chicago, proposed establishing temporary links
among 11 high-speed research networks to create a national grid (the ”I-WAY”) for two
weeks before and during the Supercomputing ’95 conference. A small team led by Ian
Foster at Argonne created new protocols that allowed I-WAY users to run applications
on computers across the country. This successful experiment led to funding from the
Defense Advanced Research Projects Agency (DARPA), and 1997 saw the first version
of the Globus Toolkit, which was soon deployed across 80 sites worldwide. The U.S.
Department of Energy (DOE) pioneered the application of grids to science research, the
National Science Foundation (NSF) funded creation of the National Technology Grid to
connect university scientists with high-end computers, and NASA started similar work on
its Information Power Grid.

Grids first emerged in the use of supercomputers, as scientists and engineers across the
U.S. sought access to scarce high-performance computing resources that were concentrated

at a few sites. Begun in 1996, the Globus Project was initially based at Argonne, ISI,
and the University of Chicago (U of C). What is now called the Globus Alliance has
expanded to include the University of Edinburgh, the Royal Institute of Technology in
Sweden, the National Center for Supercomputing Applications, and Univa Corporation.
Project participants conduct fundamental research and development related to the Grid.
Sponsors include federal agencies such as DOE, NSF, DARPA, and NASA, along with
commercial partners such as IBM and Microsoft.

The project has spurred a revolution in the way science is conducted. High-energy
physicists designing the Large Hadron Collider at CERN are developing Globus-based
technologies through the European Data Grid, and through U.S. efforts like the Grid Physics
Network (GriPhyN) and Particle Physics Data Grid. Other large-scale e-science projects
relying on the Globus Toolkit include the Network for Earthquake Engineering and Simu-
lation (NEES), FusionGrid, the Earth System Grid (ESG), the NSF Middleware Initiative
and its GRIDS Center, and the National Virtual Observatory. In addition, many univer-
sities have deployed campus Grids, and deployments in industry are growing rapidly.

Much as the World Wide Web brought Internet computing onto the average user’s
desktop, the Globus Toolkit is helping to bridge the gap for commercial applications
of Grid computing. Since 2000, companies like Avaki, DataSynapse, Entropia, Fujitsu,
Hewlett-Packard, IBM, NEC, Oracle, Platform, Sun and United Devices have pursued
Grid strategies based on the Globus Toolkit. This widespread industry adoption has
brought a new set of objectives, with the cardinal purpose being to preserve the open-
source, non-profit community in which the Globus Project has thrived, while seeding
commercial grids based on open standards. 2004 saw the formation of Univa Corporation,
a company devoted to providing commercial support for Globus software, and 2005 the
creation of the Globus Consortium by a group of companies with an interest in supporting
Globus Toolkit enhancements for enterprise use. From version 1.0 in 1998 to the 2.0
release in 2002 and now the latest 4.0 version based on new open-standard Grid services,
the Globus Toolkit has evolved rapidly into what The New York Times called ”the de
facto standard” for Grid computing. In 2002 the project earned a prestigious R&D 100
award, given by R&D Magazine in a ceremony where the Globus Toolkit was named ”Most

Promising New Technology” among the year’s top 100 innovations. Other honors include
project leaders Ian Foster of Argonne National Laboratory and the University of Chicago,
Carl Kesselman of the University of Southern California’s Information Sciences Institute
(ISI), and Steve Tuecke of Argonne being named among 2003’s top ten innovators by
InfoWorld magazine, and a similar honor from MIT Technology Review, which named
Globus Toolkit-based Grid computing one of the ”Ten Technologies That Will Change the World.”

Figure 2.2: Evolution

2.2 Virtual Organization

The real and specific problem that underlies the Grid concept is coordinated resource
sharing and problem solving in dynamic, multi-institutional virtual organizations. The
sharing that we are concerned with is not primarily file exchange but rather direct access
to computers, software, data, and other resources, as is required by a range of collabora-
tive problem-solving and resource-brokering strategies emerging in industry, science, and
engineering. This sharing is, necessarily, highly controlled, with resource providers and
consumers defining clearly and carefully just what is shared, who is allowed to share, and
the conditions under which sharing occurs.

A set of individuals and/or institutions defined by such sharing rules form what we
call a virtual organization (VO). The following are examples of VOs: the application

service providers, storage service providers, cycle providers, and consultants engaged by
a car manufacturer to perform scenario evaluation during planning for a new factory;
members of an industrial consortium bidding on a new aircraft; a crisis management
team and the databases and simulation systems that they use to plan a response to an
emergency situation; and members of a large, international, multiyear high-energy physics
collaboration. Each of these examples represents an approach to computing and problem
solving based on collaboration in computation- and data-rich environments.

As these examples show, VOs vary tremendously in their purpose, scope, size, duration,
structure, community, and sociology. Nevertheless, careful study of underlying technology
requirements leads us to identify a broad set of common concerns and requirements. In
particular, we see a need for highly flexible sharing relationships, ranging from client-
server to peer-to-peer; for sophisticated and precise levels of control over how shared
resources are used, including fine-grained and multi-stakeholder access control, delegation,
and application of local and global policies; for sharing of varied resources, ranging from
programs, files, and data to computers, sensors, and networks; and for diverse usage
modes, ranging from single user to multi-user and from performance sensitive to cost-
sensitive and hence embracing issues of quality of service, scheduling, co-allocation, and
accounting.

Figure 2.3: Virtual Organization

2.3 Grid and Globus Toolkit

The grid architecture can be divided into five layers: the fabric layer, connectivity layer,
resource layer, collective layer and application layer.

The fabric layer deals with controlling things locally: basically, control of and access to re-
sources. The connectivity layer deals with communication and security between components
in the grid. The resource layer deals with sharing resources, negotiating access, and controlling
use of a resource. The job of the collective layer is coordinating multiple resources, ubiq-
uitous infrastructure services and application-specific distributed services. The application
layer handles all the services pertaining to the applications running on the grid.

The Globus Toolkit mainly covers the connectivity layer and the resource layer, and some parts
of the collective layer. Connectivity in the Globus Toolkit mainly deals with security. Security
is provided by the Grid Security Infrastructure (GSI), which enables collaborators to share
resources without blind trust. The resource layer in GT carries out resource management and
data transfer. Resource management is done using the Grid Resource Allocation Manager
(GRAM) and data transfer is done using the Grid File Transfer Protocol (GridFTP).

2.3.1 Grid Security Infrastructure (GSI)

The Grid Security Infrastructure (GSI) is the portion of the Globus Toolkit that pro-
vides the fundamental security services needed to support Grids. GT supports both
message-level and transport-level security. ”Message-level security” means support for
the WS-Security standard and the WS-SecureConversation specification to provide mes-
sage protection for SOAP messages. ”Transport-level security” means authentication via
TLS with support for X.509 proxy certificates.

GT support for message-level security is important as it allows us to comply with the


WS-Interoperability Basic Security Profile. However, because current message-level secu-
rity implementations have relatively poor performance, GT services use transport-level
security by default. This choice is driven by user performance demands. The poor perfor-
mance of message-level security implementations seems to be partly an implementation
issue and partly a specification issue, and it is not clear when it will improve. If and when

the performance of message-level security does improve, GT can be expected to move to
using message-level security as a default. Eventually, transport-level support could be
deprecated. However, this would only be after a transition period and will not occur any
time soon.

GSI may be thought of as being composed of four distinct functions: message protection,
authentication, delegation, and authorization. Implementations of different standards are
used to provide each of these functions:

• TLS (transport-level) or WS-Security and WS-SecureConversation (message-level)
are used as message protection mechanisms in combination with SOAP.

• X.509 End Entity Certificates or Username and Password are used as authentication
credentials.

• X.509 Proxy Certificates and WS-Trust are used for delegation.

• SAML assertions are used for authorization.

Message Protection

The Web Services portions of GT4 use SOAP [1] as their message protocol for com-
munication. Message protection can be provided either by transporting SOAP messages
over TLS, known as Transport-level security, or by signing and/or encrypting portions of
the SOAP message using the WS-Security standard, known as Message-level Security. In
this section we describe these two methods.

Transport-level Security

Transport-level security entails SOAP messages conveyed over a network connection


protected by TLS. TLS provides for both integrity protection and privacy (via encryp-
tion). Transport-level security is supported today as a higher-performance alternative
to the more standards driven message-level security. If and when message-level security
improves in performance, driven by a combination of implementation and specification
factors, we expect a gradual deprecation of transport-level security. Transport-level se-
curity is normally used in conjunction with X.509 credentials for authentication, but can

also be used without such credentials to provide message protection without authentica-
tion, often referred to as anonymous transport-level security. In this mode of operation,
authentication may be done on a different level, e.g. via username and password in a
SOAP message, or communications may be truly unauthenticated.

Message-level Security

The SOAP specification allows for the abstraction of the application-specific portion of
the payload from any security (e.g., digital signature, integrity protection, or encryption)
applied to that payload, allowing GSI security to be applied in a consistent manner
across SOAP messages for any GT Web Service-based application or component. GSI
implements the WS-Security standard and the WS-SecureConversation specification to
provide message protection for SOAP messages. (We use the term specification to denote
a scheme that has been well documented but has not passed through a public standards
body.) The WS-Security standard from OASIS defines a framework for applying security
to individual SOAP messages; GSI conforms to this standard. GSI uses these mechanisms
to provide security on a per-message basis, i.e., to an individual message without any
preexisting context between the sender and receiver (outside of sharing some set of trust roots).
WS-SecureConversation is a proposed standard from IBM and Microsoft that allows for
an initial exchange of messages to establish a security context which can then be used to
protect subsequent messages in a manner that requires less computational overhead (i.e.,
it allows the trade-off of initial overhead for setting up the session for lower overhead
for messages). Note that SecureConversation is only offered with GSI when using X.509
credentials as described in the subsequent section on authentication. Both WS-Security
and WS-SecureConversation are intentionally neutral to the specific types of credentials
used to implement this security. GSI, as described further in the subsequent section
on authentication, allows for both X.509 public key credentials and the combination of
username and password for this purpose. GSI used with either username/password or
X.509 credentials uses the WS-Security standard to allow for authentication; that is, a
receiver can verify the identity of the communication initiator. When used with X.509
credentials GSI uses WS-Security and WS-SecureConversation to allow for the following
additional protection mechanisms (which can be combined):

• Integrity protection: a receiver can verify messages were not altered in transit from
the sender.

• Encryption: messages can be protected to provide confidentiality.

• Replay prevention: a receiver can verify that it has not received the same mes-
sage previously.

The specific manner in which these protections are provided varies between WS-Security
and WS-SecureConversation. In the case of WS-Security, the keys associated with the
sender’s and receiver’s X.509 credentials are used. In the case of WS-SecureConversation,
the X.509 credentials are used to establish a session key that is used to provide the message
protection.

Public Key Cryptography

The most important thing to know about public key cryptography is that unlike earlier
cryptographic systems, it relies not on a single key (a password or a secret ”code”), but
on two keys. These keys are numbers that are mathematically related in such a way that
if either key is used to encrypt a message, the other key must be used to decrypt it.
Also important is the fact that it is next to impossible (with our current knowledge of
mathematics and available computing power) to obtain the second key from the first one
and/or any messages encoded with the first key.

By making one of the keys available publicly (a public key) and keeping the other key
private (a private key), a person can prove that he or she holds the private key simply by
encrypting a message. If the message can be decrypted using the public key, the person
must have used the private key to encrypt the message. It is critical that private keys be
kept private; anyone who knows the private key can easily impersonate the owner.

Digital Signatures

Using public key cryptography, it is possible to digitally ”sign” a piece of information.


Signing information essentially means assuring a recipient of the information that the

information hasn’t been tampered with since it left your hands. To sign a piece of infor-
mation, first compute a mathematical hash of the information. (A hash is a condensed
version of the information. The algorithm used to compute this hash must be known to
the recipient of the information, but it isn’t a secret.) Using your private key, encrypt the
hash, and attach it to the message. Make sure that the recipient has your public key.

To verify that your signed message is authentic, the recipient of the message will com-
pute the hash of the message using the same hashing algorithm you used, and will then
decrypt the encrypted hash that you attached to the message. If the newly-computed
hash and the decrypted hash match, then it proves that you signed the message and that
the message has not been changed since you signed it.
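
As a concrete illustration of this sign-and-verify flow (these commands use standard
OpenSSL rather than any Globus-specific tool, and the file names are placeholders),
the same steps can be tried on the command line:

    # generate an RSA key pair: the private key stays secret, the public key is shared
    openssl genrsa -out private.pem 2048
    openssl rsa -in private.pem -pubout -out public.pem

    # sign: hash message.txt and encrypt the hash with the private key
    openssl dgst -sha256 -sign private.pem -out message.sig message.txt

    # verify: recompute the hash and check it against the attached signature
    openssl dgst -sha256 -verify public.pem -signature message.sig message.txt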

Certificates

A central concept in GSI authentication is the certificate. Every user and service on
the Grid is identified via a certificate, which contains information vital to identifying
and authenticating the user or service. A GSI certificate includes four primary pieces of
information:

• A subject name, which identifies the person or object that the certificate represents.

• The public key belonging to the subject.

• The identity of a Certificate Authority (CA) that has signed the certificate to certify
that the public key and the identity both belong to the subject.

• The digital signature of the named CA.

Note that a third party (a CA) is used to certify the link between the public key
and the subject in the certificate. In order to trust the certificate and its contents, the
CA’s certificate must be trusted. The link between the CA and its certificate must be
established via some non-cryptographic means, or else the system is not trustworthy.
GSI certificates are encoded in the X.509 certificate format, a standard data format for
certificates established by the Internet Engineering Task Force (IETF). These certificates

can be shared with other public key-based software, including commercial web browsers
from Microsoft and Netscape.
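
As a rough illustration (the paths below follow the common per-user Globus convention
of keeping credentials under ~/.globus, which may differ on a given installation), the
contents of such a certificate can be examined either with the grid-cert-info utility
shipped with the toolkit or with plain OpenSSL:

    # show the subject name and the CA that issued the certificate
    grid-cert-info -subject -issuer

    # equivalent check with OpenSSL on the default certificate location
    openssl x509 -in ~/.globus/usercert.pem -noout -subject -issuer -dates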

Mutual Authentication

If two parties have certificates, and if both parties trust the CAs that signed each
other’s certificates, then the two parties can prove to each other that they are who they
say they are. This is known as mutual authentication. The GSI uses the Secure Sockets
Layer (SSL) for its mutual authentication protocol, which is described below. (SSL is
also known by a new, IETF standard name: Transport Layer Security, or TLS.) Before
mutual authentication can occur, the parties involved must first trust the CAs that signed
each other’s certificates. In practice, this means that they must have copies of the CAs’
certificates (which contain the CAs’ public keys) and that they must trust that these
certificates really belong to the CAs.

To mutually authenticate, the first person (A) establishes a connection to the second
person (B). To start the authentication process, A gives B his certificate. The certificate
tells B who A is claiming to be (the identity), what A’s public key is, and what CA is
being used to certify the certificate. B will first make sure that the certificate is valid by
checking the CA’s digital signature to make sure that the CA actually signed the certificate
and that the certificate hasn’t been tampered with. (This is where B must trust the CA
that signed A’s certificate.) Once B has checked out A’s certificate, B must make sure
that A really is the person identified in the certificate. B generates a random message
and sends it to A, asking A to encrypt it. A encrypts the message using his private key,
and sends it back to B. B decrypts the message using A’s public key. If this results in the
original random message, then B knows that A is who he says he is.

Now that B trusts A’s identity, the same operation must happen in reverse. B sends A
her certificate, A validates the certificate and sends a challenge message to be encrypted.
B encrypts the message and sends it back to A, and A decrypts it and compares it with
the original. If it matches, then A knows that B is who she says she is. At this point, A
and B have established a connection to each other and are certain that they know each
other’s identities.
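
For a rough feel of such a certificate-based handshake outside of GSI (GSI layers
proxy-certificate handling on top of ordinary TLS, so this is only an analogy; the host
name and file names are placeholders), OpenSSL’s s_client can present a client
certificate to a TLS server and report whether verification succeeded:

    # connect to a TLS service, presenting our certificate and trusting the given CA
    openssl s_client -connect node2.example.org:443 \
        -cert usercert.pem -key userkey.pem -CAfile ca.pem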

Confidential Communication

By default, the GSI does not establish confidential (encrypted) communication between
parties. Once mutual authentication is performed, the GSI gets out of the way so that
communication can occur without the overhead of constant encryption and decryption.
The GSI can easily be used to establish a shared key for encryption if confidential com-
munication is desired. Recently relaxed United States export laws now allow us to include
encrypted communication as a standard optional feature of the GSI.

A related security feature is communication integrity. Integrity means that an eaves-


dropper may be able to read communication between two parties but is not able to modify
the communication in any way. The GSI provides communication integrity by default.
(It can be turned off if desired). Communication integrity introduces some overhead in
communication, but not as large an overhead as encryption.

Securing Private Keys

The core GSI software provided by the Globus Toolkit expects the user’s private key
to be stored in a file in the local computer’s storage. To prevent other users of the
computer from stealing the private key, the file that contains the key is encrypted via a
password (also known as a passphrase). To use the GSI, the user must enter the passphrase
required to decrypt the file containing their private key. We have also prototyped the use
of cryptographic smartcards in conjunction with the GSI. This allows users to store their
private key on a smartcard rather than in a filesystem, making it still more difficult for
others to gain access to the key.
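
In practice (assuming the usual per-user Globus layout under ~/.globus; exact paths can
vary between installations), the passphrase protection is combined with restrictive
filesystem permissions on the key file:

    chmod 400 ~/.globus/userkey.pem     # private key readable only by its owner
    chmod 644 ~/.globus/usercert.pem    # the certificate itself is public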

Delegation and Single Sign-On

The GSI provides a delegation capability: an extension of the standard SSL protocol
which reduces the number of times the user must enter his passphrase. If a Grid computa-
tion requires that several Grid resources be used (each requiring mutual authentication),
or if there is a need to have agents (local or remote) requesting services on behalf of a
user, the need to re-enter the user’s passphrase can be avoided by creating a proxy.

A proxy consists of a new certificate (with a new public key in it) and a new private key.
The new certificate contains the owner’s identity, modified slightly to indicate that it is a
proxy. The new certificate is signed by the owner, rather than a CA. (See diagram below.)
The certificate also includes a time notation after which the proxy should no longer be
accepted by others. Proxies have limited lifetimes. The proxy’s private key must be kept
secure, but because the proxy isn’t valid for very long, it doesn’t have to be kept quite as
secure as the owner’s private key. It is thus possible to store the proxy’s private key in
a local storage system without being encrypted, as long as the permissions on the file
prevent anyone else from looking at it easily. Once a proxy is created and stored,
the user can use the proxy certificate and private key for mutual authentication without
entering a password. When proxies are used, the mutual authentication process differs
slightly. The remote party receives not only the proxy’s certificate (signed by the owner),
but also the owner’s certificate. During mutual authentication, the owner’s public key
(obtained from her certificate) is used to validate the signature on the proxy certificate.
The CA’s public key is then used to validate the signature on the owner’s certificate. This
establishes a chain of trust from the CA to the proxy through the owner. Note that the
GSI and software based on it (notably the Globus Toolkit, GSI-SSH, and GridFTP) is
currently the only software which supports the delegation extensions to TLS (a.k.a. SSL).
The Globus Project is actively working with the Grid Forum and the IETF to establish
proxies as a standard extension to TLS so that GSI proxies may be used with other TLS
software.
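
A typical session with the proxy tools shipped with the toolkit looks roughly like the
following (the 12-hour lifetime is just an example value):

    grid-proxy-init -hours 12    # asks for the passphrase once, then writes a short-lived proxy
    grid-proxy-info              # shows the proxy subject, type and remaining lifetime
    grid-proxy-destroy           # removes the proxy when it is no longer needed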

2.3.2 Grid Resource Allocation Manager (GRAM)

GRAM is the module that provides the remote execution and status management of the
execution. When a job is submitted by a client, the request is sent to the remote host
and handled by the gatekeeper daemon located in the remote host. Then the gatekeeper
creates a job manager to start and monitor the job. When the job is finished, the job
manager sends the status information back to the client and terminates.
GRAM comprises the following:

• The globusrun command

• Resource Specification Language (RSL)

• The gatekeeper daemon

• The job manager

• The forked process

• Global Access to Secondary Storage (GASS)

• Dynamically-Updated Request Online Coallocator (DUROC)

The globusrun command

The globusrun command submits and manages remote jobs and is used by almost
all GRAM client tools. This command provides the following functions:

• Request of job submission to remote machines: job submission uses security functions
(such as the GSS-API) to check mutual authentication between clients and servers, and
also to verify the rights to submit the job.

• Transfer of the executable files and the resulting job-submission output files: the
globusrun command can get the standard output of job results from remote machines.
It uses GASS to provide secure file transfer between grid machines.

Resource Specification Language (RSL)

RSL is the language used by the clients to submit a job. All job submission requests
are described in RSL, including the executable file and condition on which it must be
executed. You can specify, for example, the amount of memory needed to execute a job
in a remote machine.
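
A minimal sketch of such a request is shown below (the host name and job manager
contact are placeholders, and the exact attributes accepted depend on the local setup):

    globusrun -o -r node2.example.org/jobmanager-fork \
        "&(executable=/bin/hostname)(count=1)(maxMemory=64)"

Here -r names the remote resource (its gatekeeper contact string), -o streams the job
output back to the client over GASS, and the quoted string is the RSL description of the
job, including the requested memory in megabytes.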

Gatekeeper

The gatekeeper daemon builds the secure communication between clients and servers.
The gatekeeper daemon is similar to the inetd daemon in terms of functionality; however,
the gatekeeper provides secure communication. It communicates with the GRAM client
(globusrun) and authenticates the right to submit jobs. After authentication, gatekeeper
forks and creates a job manager delegating the authority to communicate with clients.

Job manager

Job manager is created by the gatekeeper daemon as part of the job requesting process.
It provides the interfaces that control the allocation of each local resource manager, such
as a job scheduler like PBS, LSF, or LoadLeveler. The job manager functions are:

• Parse the resource language: breaks down the RSL scripts.

• Allocate job requests to the local resource managers: the local resource manager
is usually a job scheduler like PBS, LSF, or LoadLeveler. The resource manager
interface is written in the Perl language, which easily allows you to create a new job
manager for a local resource manager, if necessary.

• Send callbacks to clients, if necessary.

• Receive the status and cancel requests from clients.

• Send output results to clients using GASS, if requested.

Global Access to Secondary Storage (GASS)

GRAM uses GASS to provide the mechanism to transfer the output file from servers
to clients. Some APIs are provided under the GSI protocol to furnish secure transfers.
This mechanism is used by the globusrun command, gatekeeper, and job manager.

Dynamically-Updated Request Online Coallocator (DUROC)

By using the DUROC mechanism, users are able to submit jobs to different job managers
at different hosts or to different job managers at the same host. The RSL script that
contains the DUROC syntax is parsed at the GRAM client and allocated to different job
managers. The grammar and attributes of RSL and DUROC are explained in Resource
Specification Language (RSL).

2.3.3 GridFTP

GridFTP provides secure and reliable data transfer among grid nodes. The word
GridFTP can refer to a protocol, a server, or a set of tools.

GridFTP protocol

GridFTP is a protocol intended to be used in all data transfers on the grid. It is based
on FTP, but extends the standard protocol with facilities such as multistreamed transfer,
auto-tuning, and Globus-based security. This protocol is still at the draft level, so for more
information, please refer to the following Web site.
As the GridFTP protocol is still not completely defined, Globus Toolkit does not
support the entire set of the protocol features currently presented. A set of GridFTP
tools is distributed by Globus as additional packages. Globus Project has selected some
features and extensions defined already in IETF RFCs and added a few additional features
to meet requirements from current data grid projects.

GridFTP server and client

Globus Toolkit provides the GridFTP server and GridFTP client, which are implemented
by the in.ftpd daemon and by the globus-url-copy command, respectively. They support
most of the features defined on the GridFTP protocol. The GridFTP server and client
support two types of file transfer: standard and third-party. The standard file transfer is
where a client sends the local file to the remote machine, which runs the FTP server. Third-
party file transfer is where there is a large file in remote storage and the client wants to
copy it to another remote server.
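
As a rough usage sketch (host names and paths are placeholders), the two transfer styles
map onto globus-url-copy as follows:

    # standard transfer: push a local file to a remote GridFTP server
    globus-url-copy file:///tmp/input.dat gsiftp://node2.example.org/tmp/input.dat

    # third-party transfer: move data directly between two remote servers
    globus-url-copy gsiftp://node1.example.org/data/big.dat \
        gsiftp://node2.example.org/data/big.dat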

GridFTP tools

Globus Toolkit provides a set of tools to support GridFTP type of data transfers. The
gsi-ncftp package is one of the tools used to communicate with the GridFTP server. The
GASS API package is also part of the GridFTP tools. It is used by the GRAM to transfer
the output file from servers to clients.

Chapter 3

Hardware and Software Specification

3.1 Hardware Specification

In order to build, install, and run the Globus Toolkit on your system, please follow
or consider these requirements or recommendations about your system’s CPU, physical
memory, and disk space.
CPU : The Globus software itself is not CPU intensive, but the computing power
required to run the Globus Toolkit depends on what kind of host your system is used as.
If the host that will be running the Globus Toolkit will not be providing computational
services for Globus Toolkit jobs then a moderately equipped system will suffice. In this
configuration the purpose of the host is to provide a gateway to other resources that
Globus Toolkit jobs can use. For example, for an array of multiprocessor systems or a
cluster of workstations the gateway may act as a central entry point from which Globus
Toolkit jobs will be distributed to other resources. If the host is also to provide computing
services for Globus Toolkit jobs, then the computing power should be enough to service
the computational needs of the Globus Toolkit jobs targeted for the host.
Physical Memory : The Globus Toolkit itself is not memory intensive, therefore,
the host on which it will run need only have a nominal amount of memory for the sake of
the Globus Toolkit code.
Disk Space : Disk space requirements for building, installing, and deploying the
Globus Toolkit can vary depending on the number of architectures and the number of
development libraries that are built. Thus only approximate disk space requirements can
be given.
Source Code : The approximate size of the compressed Globus Toolkit tar file is 10
megabytes (Mbyte). The approximate size of an uncompressed Globus tar file is 60 Mbyte.
Building the Globus Toolkit : The approximate size of the build space depends on

how many communication packages, protocols, or development libraries are being built. By
default, the Globus Toolkit build process keeps the build space to a minimum by cleaning
up the build space of unneeded files. If, however, the builddirs-persist option is used,
then all files in the build directory remain intact. This option is used for debugging the
build process and is usually not used for a normal build procedure. If the option is used,
additional disk space will be needed to accommodate the build files.
Installing the Globus Toolkit : The Globus Toolkit installation procedure allows for a
common location for Globus Toolkit code to be installed. This approach uses architecture-
dependent subdirectories for differentiating between binaries from each architecture; thus,
the amount of disk space required depends on the number of architectures for which the
Globus Toolkit is built. For example, the approximate size of an install directory supporting
three architectures and two types of communication packages per architecture is 150 Mbyte.
Deploying the Globus Toolkit : You should deploy the Globus Toolkit onto local
disk space. During the deployment process, architecture-specific binary files are copied
to a separate deployment location. Because some of these binaries are daemons typically
run at machine bootup time, it is generally preferable that they not run from a mounted
filesystem, in order to ensure proper startup. The Globus Toolkit log files are also kept
in one of the deploy subdirectories. Under normal conditions the log files do not grow
very large, however, a rotation and/or archiving plan should still be considered. The
approximate size of the deploy directory is 20 Mbyte.

3.2 Software Specification

The details of the software needed are as given in the figures below.

Figure 3.1: Software Specification 1

Figure 3.2: Software Specification 2

Figure 3.3: Software Specification 3

Chapter 4

System Analysis and Design

This chapter provides architectural design considerations for grid computing, especially
for the Globus toolkit. Other design topics that will be discussed are different grid topolo-
gies, grid infrastructure design, and grid architecture models. At a glance, the following
topics are discussed:

• Grid architecture design concepts

• Different grid topologies

• Grid architecture models

• Building a grid architecture

• Grid architecture conceptual model

4.1 Building a grid architecture

The foundation of a grid solution design is typically built upon an existing infrastructure
investment. However, a grid solution does not come to fruition by simply installing
software to allocate resources on demand. Given that grid solutions are adaptable to
meet the needs of various business problems, differing types of grids are designed to meet
specific usage requirements and constraints. Additionally, differing topologies are designed
to meet varying geographical constraints and network connectivity requirements. The
success of a grid solution is heavily dependent on the amount of thought the IT architect
puts into the solution design.

Once the functional and non-functional requirements are known, the IT architect should
readily be able to select the type of grid and the best topology required to satisfy the

majority of the business requirements. When armed with this information, the high level
grid design will be easier to complete, and by leveraging the use of known grid types and
topologies, articulating the solution design will require much less effort.

It is important to focus on starting small and to begin building the basic framework
of the design. Rather than setting out to build the desired end state grid solution all at
once, consider building the grid solution in a phased approach. The milestone for the
initial phase is to provide an intragrid solution, which is essentially a grid sandbox that
supports a basic set of grid services. This solution would support a single location built
upon the core grid components, such as a security model, information services, workload
management, and the host devices. As long as this model supports the same protocols
and standards, this design can be expanded as needed.

An easy way to begin the design is to start with the grid security model. The grid
security model is typically built upon a Public Key Infrastructure (PKI) framework, and
is the foundation for grid user authentication. Knowing the grid type, grid topology,
and the desired security model is fundamental to the customization of the high level
grid solution design. Given that the primary characteristic of a grid solution is that the
network and hardware infrastructure is shared by multiple users and potentially multiple
locations, it makes logical sense to make early architectural decisions based upon the
implications of the security requirements. The first step of the design process is to build a
graphical representation of the grid components. The subsequent phases of the design will
be focused on the next level of architecture. This phase of the design is a starting point
for architects, technical managers, and executives to understand the overall configuration
of the architecture.

Within any networked environment, there is going to be some risk and exposure in-
volved with the security of your infrastructure. Unless the computers are unplugged in
a locked room, there is the potential that someone may bypass the security and get ac-
cess to protected resources. Whether the weaknesses are exploited in the infrastructure,
application, configuration, or administration, there is some level of risk. The security
objectives are in place to help to reduce that risk to an acceptable level. While no design

is 100 percent secure, the level of risk is reduced and controlled through the use of security
controls. The goal of the security objectives is to examine the security requirements and
implement the necessary tools and processes to reduce the risk involved. The degree of
security involved is based on the type of grid topology and the data the security will be
protecting. The security requirements for a grid design within a bank will be completely
different from those of an academic institution doing research. Whatever the security re-
quirements may be, the security design objectives for the grid design need to be a central
focus for the conceptual architecture. Considering that the basic grid security model is
based on PKI, it is imperative that the security components are designed and thought
out carefully. While PKI has been around for a while, there are different components and
necessary processes that should be identified. Rushing this process could lead to many
problems in the future. With the PKI architecture being the focus of the initial design,
there are still areas that need attention. The infrastructure components (firewalls, IDS,
anti-virus, and encryption) and the processes to manage these pieces are all part of the
security objectives. Knowing which areas match up with your existing environment is the
first step to robust security. The following bullet points are an example of some security
questions that will be answered during the course of the design:

• Where will my CA be deployed and how will we manage it?

• Do I have the necessary processes in place to administer my own CA?

• What are the responsibilities for managing my own CA?

• How will I administer security on the local servers?

• Are my servers of a uniform build or common operating environment?

• Do I have a consistent software build across critical grid infrastructure systems?

• Which processes are running on my servers?

4.2 Availability

Availability in its simplest terms commonly refers to the percentage of time that a site
is up and servicing job requests. Determining how much availability should be built into
the design is part of the availability objectives. This leads down the path of discovering
how many potential single points of failure exist and how much redundancy should be built
into the design. It is inevitable that some components will fail during a lifetime of usage,
but this can be managed by using redundant components where possible. Whenever you
review various availability scenarios, there are always discussions about the amount of
availability that is required. In this respect, a grid design is no different from any other
infrastructure. A good start is to list the potential components within the design that
should be resilient to failure. Once these components have been identified, you can seek
out the specific availability options for those components. In the following examples,
some different infrastructure options are described. An important point that needs to be
discussed is the availability of dynamic resources within a grid environment. Grid is not
like a standard environment where resources are fixed and do not change regularly. Within
Globus environments, the resources are constantly changing according to the membership
and participation of the grid. When grid resources are active, they can register with
the information services (GIIS) within the grid to alert the system of their state. It
is important to make sure that when you design your grid that you keep this in mind.
Besides the grid middleware components, the different infrastructure components will also
require different levels of availability. Some components will be more critical than others
and it will be up to your design to make sure that you account for this. When going
through the different availability requirements, make sure that you account for both the
grid and infrastructure components. The following lists are some examples of availability
resources that should be accounted for:

• Grid middleware: workload management, grid directory and indexing service, security
services, data storage, grid software clustering

• Networks: load balancing, high-availability routing protocols, redundant and diverse
network paths

• Security: redundant firewalls

• Datastore: mirroring, data replication, parallel processing

• Systems management: backup and recovery, LDAP replicas, alerts and monitoring to
signal a failure within the environment

Every so often, components necessary to the workflow process fail and disrupt
availability of the system. You can help mitigate the risk involved by elimi-
nating the single points of failure within your environment through the use of redundant
software or hardware components. To give you a better idea of some different availability
targets, the following list presents the expected system availability in a whole year:

• Normal commercial availability (single node): 99 to 99.5 percent, 87.6 to 43.8 hours
of system down

• High availability: 99.9 percent, 8.8 hours of system down

• Fault resilient: 99.99 percent, 53 minutes of system down

• Fault tolerant: 99.999 percent, 5 minutes of system down

• Continuous processing: 100 percent, 0 minutes of system down

Keep in mind, however, that the redundancy that is added to the grid infrastructure will
normally increase the costs within the infrastructure. It is up to the business to help justify
the costs that would bring an environment from 99.9 percent availability per year up to
99.99 percent per year. While the difference in time between those two numbers is about
eight hours, the costs associated may be too much to justify the increased availability.
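These targets follow directly from the availability percentage when a year is taken as
8,760 hours:

    downtime per year = (1 - availability) x 8760 hours

For example, 99.9 percent availability gives 0.001 x 8760 = 8.76 hours of downtime, while
99.99 percent gives 0.876 hours (about 53 minutes); the gap between the two levels is
roughly 7.9 hours, which is the figure of about eight hours referred to above.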

4.3 Performance

The performance objective for a grid environment is to most efficiently utilize the
various resources within the grid. Whether that includes spare CPU cycles, access to
federated databases, or application processing, it is up to you to match the performance
goals of the business and design accordingly. If your application can take advantage of
multiple resources, you can design your grid to be broken up into smaller instances and
have the work distributed throughout the grid. The goal is to take advantage of the grid
as a whole in order to increase the performance of the application. Through intelligent
workload management and scheduling, your application can take advantage of whatever
resources within the grid are available. Part of the performance objective rests on workload
management that keeps all resources within the grid actively servicing jobs or requests.

4.4 Grid architecture models

There are different types of grid architectures to fit different types of business problems.
Some grids are designed to take advantage of extra processing resources, whereas some
grid architectures are designed to support collaboration between various organizations.
The type of grid selected is based primarily on the business problem that is being solved.
Taking the goals of the business into consideration will help you choose the proper type
of grid framework. A business that wants to tap into unused resources for calculating
risk analysis within their corporate datacenter will have a much different design than a
company that wants to open their distributed network to create a federated database
with one or two of their main suppliers. Such different types of grid applications will
require proportionately different designs, based on their respective unique requirements.
The selection of a specific grid type will have a direct impact on the grid solution design.
Additionally, it should be mentioned that grid technologies are still evolving and tactical
modifications to a grid reference architecture may be required to satisfy a particular
business requirement.

Computational grid

A computational grid aggregates the processing power from a distributed collection of


systems. A well-known example of a computational grid is the SETI@home grid. This
type of grid primarily comprises low-powered computers with minimal application
logic awareness and minimal storage capacity.

Rather than simply painting images of flying toasters, the idle cycles of the personal
computers on the SETI@home grid are combined to create a computational grid used to
analyze radio transmissions received from outer space in the Search for Extra Terrestrial
Intelligence. Most businesses interested in computational grids will likely have similar IT
initiatives in common. While they probably will not want to search for extraterrestrials,
there will likely be a business initiative to expand capabilities and maximize the
utilization of existing computing resources through aggregation and sharing. The business
may require more computer capacity than is available, or may be interested in modifying
specific vertical applications for parallel computing opportunities. Additional uses for a
computational grid include mathematical equations, derivatives, pricing, portfolio valua-
tion, and simulation (especially risk measurement), as well as data-intensive and high-
throughput computing, order and transaction processing, market information dissemination,
and enterprise risk management. Note that not all algorithms are able to leverage parallel
processing, and in many cases the grid architecture model is not (yet) suitable for real-time
applications. Computational grids can be recognized by these primary characteristics:

• Made up of clusters of clusters

• Enables CPU scavenging to better utilize resources

• Provides the computational power to process large scale jobs

• Satisfies the business requirement for instant access to resources on demand

The primary benefits of computational grids are a reduced Total Cost of Ownership
(TCO) and shorter deployment life cycles. Besides the SETI@home grid, the Distributed
Terascale Facility (TeraGrid) and the UK and Netherlands national grids are all examples
of computational grids. The next generation of computational grid computing will shift focus
towards solving real-time computational problems.

Data grid

While computational grids are more suited for aggregating resources, data grids focus
on providing secure access to distributed, heterogeneous pools of data. Through collabo-
ration, data grids can also support newer concepts such as a federated database. Within
a federated database, a data grid makes a group of databases available that function as
a single virtual database. Through this single interface, the federated database provides
a single query point, data modeling, and data consistency. Data grids also harness data,
storage, and network resources located in distinct administrative domains, respect local
and global policies governing how data can be used, schedule resources efficiently, again
subject to local and global constraints, and provide high-speed and reliable access to data.

Businesses interested in data grids typically have IT initiatives to expand data mining
abilities while maximizing the utilization of an existing storage infrastructure investment,
and to reduce the complexity of data management.

Chapter 5

Installation of Globus Toolkit

5.1 Setting up the first machine

The initial step is to set up the first machine in the grid. To do this we first install
Ubuntu 12.04 (Precise) on the machine. The hostname is set to elephant and the IP
address is set to 192.168.1.1/24. Then the following steps are carried out.

Figure 5.1: Step 1

This will install the GridFTP, GRAM, and MyProxy services, as well as set up a basic
SimpleCA so that you can issue security credentials for users to run the Globus services.
The next step is setting up security on your first machine. The Globus Toolkit uses
X.509 certificates and proxy certificates to authenticate and authorize grid users. We use
the Globus SimpleCA tools to manage our own Certificate Authority, so that we don’t need
to rely on any external entity to authorize our grid users. In many deployment scenarios,
certificates for both services and users are obtained through one or more third party CAs.
In such scenarios, it is unnecessary to use SimpleCA or MyProxy to issue certificates.
Since this quickstart is intended to describe a simple, standalone deployment scenario, we
describe how to use these tools to issue your own certificates. When the globus-simple-ca
package is installed, it will automatically create a new Certificate Authority and deploy
its public certificate into the globus trusted certificate directory. It will also create a host
certificate and key, so that the Globus services will be able to run. We’ll also need to
copy the host certificate and key into place so that the myproxy service can use it as well.
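For reference, on Ubuntu the installation and the host-certificate copy might look like the
following; the package names and paths are assumptions drawn from the Globus Toolkit
quickstart and may differ between toolkit releases:

    sudo apt-get install globus-gridftp globus-gram5 globus-gsi \
         myproxy myproxy-server myproxy-admin
    # let the MyProxy service use the host credential created by SimpleCA
    sudo mkdir -p /etc/grid-security/myproxy
    sudo install -o myproxy -m 644 /etc/grid-security/hostcert.pem \
         /etc/grid-security/myproxy/hostcert.pem
    sudo install -o myproxy -m 600 /etc/grid-security/hostkey.pem \
         /etc/grid-security/myproxy/hostkey.pem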
Creating a MyProxy server is the next step in the installation. We are going to create
a MyProxy server on elephant. This will be used to store our users’ certificates. In order

Figure 5.2: Step 2

to enable myproxy to use the SimpleCA, modify the /etc/myproxy-server.config file by
uncommenting every line in the section Complete Sample Policy 1, so that every line in
that section is active.
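For reference, the uncommented section typically resembles the following; this is a sketch
based on a default myproxy-server.config, and the directives in your version may differ
slightly:

    accepted_credentials        "*"
    authorized_retrievers       "*"
    default_retrievers          "*"
    authorized_renewers         "*"
    default_renewers            "none"
    trusted_retrievers          "*"
    default_trusted_retrievers  "none"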
We’ll next add the myproxy user to the simpleca group so that the myproxy server
can create certificates.
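Assuming the SimpleCA package created a group named simpleca, this is a one-line change:

    sudo usermod -a -G simpleca myproxy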

Figure 5.3: Step 3

Now we start the myproxy server and check whether it is running properly.

Figure 5.4: Step 4

As a final check, we’ll make sure the myproxy TCP port 7512 is in use via the netstat
command:
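Assuming a Debian-style init script named myproxy-server, these checks might look like:

    sudo service myproxy-server start
    sudo netstat -an | grep 7512    # the myproxy port should show up as LISTEN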
The next step is creating the user credentials. We’ll need to specify a full name and
a login name for the user we’ll create credentials for. We’ll be using the QuickStart User
as the user’s name and quser as the user’s account name. You can use this as well if you first
create a quser unix account. Otherwise, you can use another local user account. Run the
myproxy-admin-adduser command as the myproxy user to create the credentials. You’ll
be prompted for a passphrase, which must be at least 6 characters long, to encrypt the
private key for the user. You must communicate this passphrase to the user who will be
accessing this credential. The user can use the myproxy-change-passphrase command to change
the passphrase. The command to create the myproxy credential for the user is
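A likely form of that command, run as the myproxy user with the full name and login used
in this walkthrough, is:

    sudo -u myproxy myproxy-admin-adduser -c "QuickStart User" -l quser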

Figure 5.5: Step 5

Figure 5.6: Step 5

User authorization is the next step in the installation. We’ll create a grid-mapfile
entry for this credential, so that the holder of that credential can use it to access globus
services. We’ll use the grid-mapfile-add-entry program for this. We need to use the exact
string from the output above as the parameter to the -dn command-line option, and the
local account name of the user to authorize as the parameter to the -ln command-line option.
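For example, with a purely illustrative distinguished name (substitute the exact string
printed when the credential was created):

    sudo grid-mapfile-add-entry \
         -dn "/O=Grid/OU=GlobusTest/OU=simpleCA-elephant/CN=QuickStart User" \
         -ln quser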
The next step is starting the GridFTP server and checking whether it works.
Now the GridFTP server is waiting for a request, so we’ll generate a proxy from the
myproxy service by using myproxy-logon and then copy a file from the GridFTP server
with the globus-url-copy command. We’ll use the passphrase used to create the myproxy
credential for quser.
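A sketch of this verification, under the same assumptions (hostname elephant, user quser;
the destination file name is arbitrary):

    sudo service globus-gridftp-server start
    myproxy-logon -s elephant        # prompts for the quser passphrase
    globus-url-copy gsiftp://elephant/etc/group file:///tmp/quser.test.copy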
At this point, we’ve configured the myproxy and GridFTP services and verified that
we can create a security credential and transfer a file.
Now that we have security and GridFTP set up, we can set up GRAM for resource
management. There are several different Local Resource Managers (LRMs) that one could
configure GRAM to use, but this guide will explain the simple case of setting up a “fork”
jobmanager, without auditing. The GRAM service will use the same host credential as
the GridFTP service, and is configured by default to use the fork manager, so all we need
to do now is start the service. We start the GRAM gatekeeper:
We can now verify that the service is running and listening on the GRAM5 port.
The gatekeeper is set up to run, and is ready to authorize job submissions and pass

Figure 5.7: Step 6

Figure 5.8: Step 7

them on to the fork job manager. We can now run a couple of test jobs:
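A possible form of these commands, assuming the service script is named globus-gatekeeper
and GRAM5 listens on its default port 2119:

    sudo service globus-gatekeeper start
    sudo netstat -an | grep 2119     # confirm the gatekeeper is listening
    globus-job-run elephant /bin/hostname
    globus-job-run elephant /bin/date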

5.2 Setting up your second machine

Setting up the second machine is the next step in the installation. We start by
installing the packages needed.
Now we get security set up on the second machine. We’re going to trust the original
simpleCA to this new machine; there’s no need to create a new one. First, we’ll bootstrap
trust of the SimpleCA running on elephant:
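One plausible form of that bootstrap step, run as root on donkey, is:

    sudo myproxy-get-trustroots -b -s elephant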
This allows clients and services on donkey to trust certificates which are signed by
the CA on elephant machine. If we weren’t going to run any Globus services on donkey,
then we could stop here. Users on donkey could acquire credentials using the myproxy-
logon command and perform file transfers and execute jobs using the globus-url-copy
and globus-job-run commands. However, we’ll continue to configure the GridFTP and
GRAM5 services on donkey as well. We’re going to create the host certificate for donkey,
but we create it on elephant, so that we don’t have to copy the certificate request between
machines. The myproxy-admin-addservice command will prompt for a passphrase for this
credential. We will use this passphrase to retrieve the credential on donkey.
Next we’ll retrieve the credential on donkey as the root user.
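A rough sketch of the two halves of this exchange follows; the credential name, command
options, and output paths are assumptions and should be checked against the
myproxy-admin-addservice and myproxy-retrieve man pages:

    # on elephant, as the myproxy user: store a host credential for donkey
    sudo -u myproxy myproxy-admin-addservice -c "host/donkey" -l "host/donkey"

    # on donkey, as root: retrieve that credential and move it into place
    myproxy-retrieve -s elephant -l "host/donkey"
    mv ~/.globus/usercert.pem /etc/grid-security/hostcert.pem
    mv ~/.globus/userkey.pem  /etc/grid-security/hostkey.pem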

Figure 5.9: Step 8

Figure 5.10: Step 9

At this point, we no longer need to have donkey’s host certificate on elephant’s myproxy
server, so we’ll delete it.
As a final setup, we’ll add quser’s credential to the grid-mapfile on donkey, so that the
quser account can access services there as well.
At this point, we have set up security on donkey to trust the CA on elephant. We
have created a host certificate for donkey so that we can run Globus services on donkey,
and we have enabled the quser account to use services on donkey. The last thing to do is
to turn on the Globus services on donkey. Now we test GridFTP and GRAM for the second
machine.
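As an illustrative end-to-end check from donkey (hostnames as in this setup; the destination
file name is arbitrary):

    myproxy-logon -s elephant
    globus-url-copy gsiftp://donkey/etc/group file:///tmp/donkey.test.copy
    globus-job-run donkey /bin/hostname
    globus-job-run elephant /bin/hostname    # job submission across machines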

Figure 5.11: Step 10

Figure 5.12: Step 11

Figure 5.13: Step 12

Figure 5.14: Step 13

Figure 5.15: Step 14

Figure 5.16: Step 15

Figure 5.17: Step 16

Figure 5.18: Step 17

Figure 5.19: Step 18

Figure 5.20: Step 19

Chapter 6

Results

Given below are screenshots of a few operations carried out during the installation.

Figure 6.1: Gridmap Entry

Figure 6.2: Adding user to myproxy

Figure 6.3: Error in simpleCA

Figure 6.4: SimpleCA installation 1

Figure 6.5: SimpleCA installation 2

Chapter 7

Discussion

GridFTP is the name of a variation of the File Transfer Protocol (FTP) which has
Grid Security Infrastructure (GSI) enabled authentication using X.509 certificates. globus-
url-copy is the command used for the data transfer. The features, or rather shortcomings,
of globus-url-copy are as given below:

• Does support parallel file transfers

• Does support third party transfers

• Not really an end user command and very awkward to use

• Does not support wildcards

• Does not support recursive copies

• Does not create needed intermediate directories

On a WAN connection the network latency can drastically diminish the bandwidth if


the application waits for acknowledgement of each small packet. The latency problem has
been solved for globus-url-copy and other tools by

• Increasing the size of the TCP buffers and the size of the data that is acknowledged
during the transfer

• Allowing several simultaneous TCP/IP streams within one logical transfer (see the
sketch after this list)
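Both tunings can be requested directly on the globus-url-copy command line. The port
range, buffer size, and stream count below are arbitrary example values, and the hostnames
and file paths simply follow the setup in Chapter 5:

    export GLOBUS_TCP_PORT_RANGE=50000,51000    # match the local firewall rules
    globus-url-copy -p 4 -tcp-bs 2097152 \
         gsiftp://elephant/home/quser/bigfile gsiftp://donkey/tmp/bigfile

Here -p sets the number of parallel streams and -tcp-bs sets the TCP buffer size in bytes.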

Speeds of around 20 MB/s have been reached between CERN and Helsinki, even without
really searching for the optimal parameters. The variable GLOBUS_TCP_PORT_RANGE
needs to be set by hand, according to the local firewall configuration, for parallel streams
to work. On a LAN connection the latency is smaller, so there is not such a big
difference between different file transfer tools. It seems that pushing data on a LAN with
globus-url-copy is faster than pulling data. Even better results can be obtained by tuning
the operating system's TCP/IP stack and the switch and router parameters in addition
to using parallel file transfers. Given below is a graph of the amount of data transferred
against the time taken for the transfer.

Figure 7.1: Performance Evaluation

Chapter 8

Concluding remarks

8.1 Conclusion

Grid technology has been around for quite a long time, but its relevance was subdued
for the better part of its evolution. Earlier confined to the scientific community alone, the
technology is now breaking out of its barriers. This project was aimed at setting up a grid
using the Globus Toolkit environment and evaluating the working of the grid, with emphasis
on GridFTP. With the advent of ubiquitous technologies, which will demand more of everything
from computation to data storage, grid technologies are going to be the need of the day. This
project is also aimed at opening a window to the area of research in grid technologies.

8.2 Future Works

The Globus Toolkit is the de facto standard in grid computing. Even with that stature,
the Globus Toolkit suffers from the inherent problems associated with its underlying
protocols. Due to this, communication between two VOs that use the Globus Toolkit lags
even when the distance between them is small in grid terms. It is better to use protocols
that enhance communication in the grid than to depend on protocols simply because they
have been used for a long time. As a future enhancement I would like to integrate the
RBUDP protocol into the Globus Toolkit and evaluate the performance of GridFTP after the
integration.

References

[1] I. Foster, C. Kesselman, S. Tuecke, ”The Anatomy of the Grid: Enabling Scalable
Virtual Organizations”, International J. Supercomputer Applications, 15(3), 2001.

[2] I. Foster, C. Kesselman, J. Nick, S. Tuecke, ”The Physiology of the Grid: An Open
Grid Services Architecture for Distributed Systems Integration”, Open Grid Service
Infrastructure WG, Global Grid Forum, June 2002.

[3] Eric He, Jason Leigh, Oliver Yu, Thomas A. DeFanti, ”Reliable Blast UDP: Pre-
dictable High Performance Bulk Data Transfer”, IEEE International Conference on
Cluster Computing, 2002

[4] William Allcock, John Bresnahan, Rajkumar Kettimuthu and Joseph Link, ”The
Globus eXtensible Input/Output System (XIO): A protocol independent IO system for
the Grid”, IEEE International Parallel and Distributed Processing Symposium, 2005

[5] Ann L. Chervenak, Robert Schuler, Matei Ripeanu, Muhammad Ali Amer, Shishir
Bharathi, Ian Foster, Adriana Iamnitchi, and Carl Kesselman, ”The Globus Replica
Location Service: Design and Experience”, IEEE Transactions on Parallel and Distributed
Systems, Vol. 20, No. 9, Sept. 2009

[6] Ian Foster, ”Globus Online: Accelerating and Democratizing Science through Cloud-
Based Services”, IEEE Internet Computing, 2011.

[7] Aaron Brown, Ezra Kissel and Martin Swany, ”Improving GridFTP Performance
Using The Phoebus Session Layer”, ACM, 2009

