0% found this document useful (0 votes)
27 views

Scalability and Heterogeneity: Colin Perkins

Uploaded by

BARUTI JUMA
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views

Scalability and Heterogeneity: Colin Perkins

Uploaded by

BARUTI JUMA
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 25

Scalability and Heterogeneity

Colin Perkins
http://csperkins.org/teaching/2004-2005/gc5/
Lecture Outline

• Review of Traditional Distributed Systems


• How is Grid Computing Different?
– Aspects of Heterogeneity
– Aspects of Scalability
– Implications for System Design
• Preparation for Tutorial 1

The aims of today:


• To understand why grid computing is difficult, raise a number of
issues to consider throughout the module
Copyright © 2004 University of Glasgow

• To give more examples of Grid computing systems


Review of Traditional Distributed Systems

“A distributed system is a collection of independent computers that


appears to its users as a single coherent system.”
[Tanenbaum & van Steen, 2002]

• The machines are autonomous, but the users think they’re dealing
with a single system
• Typically distributed systems are used to share resources within
an organization:
– Homogeneity eases management, fault tolerance, scheduling, authentication
– E.g. a departmental fileserver, database of exam marks
Copyright © 2004 University of Glasgow

What if we break the assumption of homogeneity?


…we move from distributed systems to grid computing!
What is Grid Computing?

Infrastructure for Internet-scale Distributed Systems

• A software system, implemented in terms of a middleware layer,


that provides dependable, consistent, pervasive, and inexpensive
access to high-end computational capabilities
• A system that allows sharing of remote resources as-if they were
local, across geographical and organizational boundaries
• A large and widely distributed collection of independent systems
that appears to its users as a single coherent system
Copyright © 2004 University of Glasgow

There are many definitions, depending who you ask…


How is Grid Computing Different?

• A computational grid integrates disparate resources into a single


virtual organization
– Varying applications using the services of the Grid
– Large and varied amounts of data to be processed
– Varying classes of user, with different rights and responsibilities
– Running on a range of networks, using varying hardware and software
– Across different administrative and legal domains
• Implies a more heterogeneous environment and greater scaling
than traditional distributed systems

• How does this affect system design?


Copyright © 2004 University of Glasgow
Aspects of Heterogeneity

Heterogeneity comes from several factors:


• Users, applications and data
• Software and the hardware on which it runs
• Interconnection networks
• Organizations
Copyright © 2004 University of Glasgow
Heterogeneous Users, Applications & Data

• Large scale grid computing started to service the needs of the


e-science community

• The EGEE project (“Enabling Grids for e-Science in Europe”)


– A typical e-science grid development project
– European Union funding: €30 million
– Building a computational grid for physics, health and bio-sciences, earth
sciences, astronomy, etc.
– Share resources of 70 sites in 27 countries; aiming for thousands of active
users, wide range of applications, lots of data
• The Grid2003 Production Grid for physics and astronomy [1]
Copyright © 2004 University of Glasgow

Consider diversity of user locations and environment,


job processing, data, trust and security models…
Heterogeneous Users, Applications & Data

• Many of the concepts of grid computing finding their way into


commercial applications

• Google, Amazon or iTunes


– Large database/commerce sites; significant financial value
– Accessed directly via web-site, or embedded in an application
– Worldwide user community; millions of users and transactions
• Business process automation
– Automatic inventory processing, ordering management
– E.g. airline reservation systems, stock trading and financial modelling
Copyright © 2004 University of Glasgow

Consider diversity of user locations and environment,


job processing, data, trust and security models…
Heterogeneous Hardware and Software

• EGEE aiming to allow users at 70 different sites to share data, run


distributed computational jobs
• Google is reputed to manage a distributed system of ~100,000
hosts around the world; caching the entire web in memory
• In large-scale grids like these, you cannot standardize on a
particular hardware or software environment
– By the time you’ve synchronised the system software, some hardware will
have failed, requirements will have changed
– Client software will vary widely
– Likely multiple versions of server and client software in use
Copyright © 2004 University of Glasgow

Design for compatibility, interoperability


and cross-platform operation
Heterogeneous Networks

• Grids are widely distributed systems connected by the Internet


• What are the characteristics of the Internet?
– Big and complex
– Best effort service; no performance guarantees
– Fragmented ownership

• Implication: the variation in the network will affect how we build


a computational grid
– Paper [2] discusses how network heterogeneity affects design and
modelling of new protocols
Copyright © 2004 University of Glasgow
Heterogeneous Networks

• Big, and getting bigger:


– Size of the network has more than doubled every year
since early 1980s
– Approximately 100 million hosts at end of year 2000
– What happens if 0.1% of hosts behave atypically?

• Traffic patterns shift rapidly:


– World-wide web: doubled every ~7 weeks for 2 years
– Mbone: some sites reported >50% traffic in 1995 was
multicast; now virtually none
– Peer-to-peer: many reports of network congestion due
Copyright © 2004 University of Glasgow

to Napster, Kazaa, BitTorrent, etc.


– Worms and malicious traffic: Nimda; from release to
100 probes-per-second in 30 minutes
Heterogeneous Networks

• At least 6 orders of magnitude variation in link speed:


– 9.6 kbps GSM wireless → 10 Gbps optical fibre
– Link capacity growing faster than Moore’s law
• At least 4 orders of magnitude variation in latency:
– Sub-millisecond LAN connections; hundreds of milliseconds worldwide
– Varies with queuing delay, network congestion, path changes
– A fundamental limiting factor for synchronous protocols (e.g. web services)
• Wide variation in packet loss rates:
45
40
35
Loss Rate (percent)

30
25
Copyright © 2004 University of Glasgow

20
15
10
5
0
8:00 10:00 12:00 14:00 16:00 18:00
Time (BST)
Implications of Heterogeneous Networks

• Systems and protocols must be adaptive and scalable


• Decentralisation is essential, to handle load
• Global synchronisation is difficult, tending to impossible, due to
latency

Your system works in the lab today…


Will it still work in a few months,
Implications for systems
when you have thousands of users?

Widely distributed or peer-to-peer


Copyright © 2004 University of Glasgow

Asynchronous and weakly consistent


Location transparent
Loss tolerant, rate/latency adaptive
Organizational Heterogeneity

• Goal is to share resources across organizational boundaries, to


form new virtual organizations
• How to authenticate users and resources?
– Who do you trust to do the authentication?
– Do you trust users to delegate authority?
• To other users? Significant implications
• To software agents?
on security infrastructure
– Do you trust the servers? The data?
• How to provide, control and limit access?
– Full user accounts or a limited subset of functionality
– Firewalls
– Malicious users/applications
Copyright © 2004 University of Glasgow

• Who sets the acceptable use policy?


– Is it consistent worldwide? Can/should it be?
How is Grid Computing Different?

• A computational grid integrates disparate resources into a single


virtual organization
– Varying applications using the services of the Grid
– Large and varied amounts of data to be processed
– Varying classes of user, with different rights and responsibilities
– Running on a range of networks, using varying hardware and software
– Across different administrative and legal domains
• Implies a more heterogeneous environment and greater scaling
than traditional distributed systems

• How does this affect system design?


Copyright © 2004 University of Glasgow
Aspects of Scalability

When building a grid, need to consider how it will scale in terms of:
• Data Storage and Distribution
• Software
• Scheduling
• Robustness and Fault Tolerance
• System Management
Copyright © 2004 University of Glasgow
Scalability of Data Storage & Distribution

Storage is cheap:
• Apple Xserve RAID: 3.5Tbytes = £8,799
5.25×17×18.4 inches
• Consider the storage available on a large distributed system…

Grid computing applications produce a lot of data:


• The ATLAS experiment at CERN will produce 1.3 petabytes/year
of raw data (a stack of CDROMs 10 miles high…)
– Simulation and analysis software routinely produces data files around 2
gigabytes in size
• Measurements on Grid2003 show 2 terabytes/day transferred to
Copyright © 2004 University of Glasgow

support experiments on a 27 site grid


– Continuous 200 Mbits/second transfer rate
Scalability of Data Storage & Distribution

• How to manage, distribute this much data?


– Do you move the data to the job? Or the job to the data?
– Are you allowed to move the data?
• Copyright, confidentiality, privacy, legal reasons
• E.g. Grid computing for oil exploration: governments won’t let geological data
out of the country – remote access to terabytes of data in Africa from the US?
– How to transfer large datasets?
• Manually?
• Automatically and transparently? How?
• How to index and search this much data?
• Need interoperable and standardised data formats
– Long term archival and curation; efficient short term access
Copyright © 2004 University of Glasgow

• How to do data fusion across heterogeneous databases/sources?


– Transparent database queries across multiple systems, worldwide
• How to maintain data provenance?
Scalable Scheduling

• Job scheduling on single and parallel computers well understood


• Evolving towards job scheduling for clusters:
– VAX/VMS clustering in the mid-1980s
– Condor, OpenPBS, Sun Grid Engine, etc. more recently

• How to move to Internet-scale job scheduling?


– Policy compliance
– Load balancing
– Co-scheduling
– System monitoring and failure handling
– Distributing jobs and data
Copyright © 2004 University of Glasgow

– Location transparency
An open research problem…
See also paper [3]
Naming, Addressing and Middleware

• How to write an application that runs over thousands of hosts,


when you don’t know which hosts it’s using?

• Need a naming scheme and communication protocol that works


independently of location
– Can’t use DNS or IP addresses directly; tied to organizational structure,
network topology
• Peer-to-peer protocols solve some of these problems:
– Distributed hash tables/content addressable networks and event notification
systems built on them
– e.g. Pastry and Scribe
• Lots of research; no standards yet
Copyright © 2004 University of Glasgow

Paper [3] addresses some of these issues in more depth


Robustness and Fault Tolerance

• Systems fail; an internet-scale distributed system might never be


completely operational
– If a system is large enough, statistically likely something will have failed
• Grid2003 reports job failure and restart rates of 30% in some cases… [1]
– How big can a system be before failures become overwhelmingly likely?

• How to detect and recover from failures?


– Routing around failures?
– Recovering from failure while a job is running?
– Avoiding cascading failures?
• Distributed systems and parallel computing has given us many
useful algorithms
Copyright © 2004 University of Glasgow

– Complicated by the scale of computational grids, and cross organizational


management issues
See paper [4]
System Configuration Management

• Independent of job scheduling and resource management, need to


manage the configuration of the grid
– Operating system updates + patches
– New versions of application software
– Detecting and fixing hardware and software failures

• How to manage thousands of hosts?


• How to manage a system that’s never completely functional?

• Build applications that monitor the system, and reconfigure it as


needed – leads to the idea of autonomic computing [5]
Copyright © 2004 University of Glasgow
Summary

• The two biggest challenges to designing a computational grid are


heterogeneity and scalability
• These distinguish grids from traditional distributed systems

• Have asked lots of questions… the reading list will raise more
issues
• The rest of the module will try to answer some of these questions;
others are open research topics…
Copyright © 2004 University of Glasgow

• Next week: discussion of current standard architectures and


protocols for grid computing
References

[1] I. Foster et al., “The Grid2003 Production Grid: Principles and Practice”,
Proceedings of the 13th IEEE Intl. Symp. on High Performance Distributed
Computing, 2004.
[2] S. Floyd and V. Paxson, “Difficulties in Simulating the Internet”, IEEE/ACM
Transactions on Networking, Vol. 9, No. 4, August 2001.
[3] J. A. Crowcroft, S. M. Hand, T. L. Harris, A. J. Herbert, M. A. Parker and I.
A. Pratt, “FutureGRID: A Program for long-term research into GRID systems
architecture”, Proceedings of the UK e-Science All Hands Meeting, Sept 2003.
[4] M. Amin, “Toward Self-Healing Infrastructure Systems”, IEEE Computer,
August 2000.
[5] J. O. Kephart and D. M. Chess, “The Vision of Autonomic Computing”, IEEE
Computer, January 2003.
Copyright © 2004 University of Glasgow
Preparation for Tutorial 1

We will be discussing two papers on Monday:


• “Computational Grids”
• “The Anatomy of the Grid”

Between now and Monday:


• You should all read both papers
• Prepare a summary of each paper, explain “what is a grid?”
– Work in groups to do this, discuss the papers in advance
– Use the material from Research Techniques to help you prepare
On Monday, two people will be chosen at random for each paper:
Copyright © 2004 University of Glasgow

– They will stand in front of the class and present the paper (5 minutes)
– Then, the rest of the class will then discuss the paper, to see if they agree
with that view of grid computing

You might also like