Haddad PhD Thesis
Ibrahim Haddad
March 2006
By: _______________________________________________________________________
Entitled:____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
complies with the regulations of the University and meets the accepted standards with respect to
originality and quality.
__________________________________________ Chair
__________________________________________ Examiner
__________________________________________ Examiner
Approved by
_________________________________________
Chair of Department or Graduate Program Director
____________2006 _______________________
Dr. Nabil Esmail, Dean
Faculty of Engineering and Computer Science
Abstract
The HAS Architecture: A Highly Available and Scalable Cluster Architecture for Web Servers
Ibrahim Haddad, Ph.D.
Concordia University, 2006
This dissertation proposes a novel architecture, called the HAS architecture, for scalable and highly
available web server clusters. The prototype of the Highly Available and Scalable Web Server
Architecture was validated for scalability and high availability. It provides non-stop service and is
able to maintain the baseline performance of approximately 1000 requests per second per processor,
for up to 16 traffic processors in the cluster, achieving close to linear scalability. The architecture
supports dynamic traffic distribution using a lightweight distribution scheme, and supports connection
synchronization to ensure that web connections survive software or hardware failures. Furthermore,
the architecture supports different redundancy models and high availability capabilities, such as Ethernet and NFS redundancy, that contribute to increasing the availability of the service and to eliminating single points of failure.
This dissertation presents current methods for scaling web servers, discusses their limitations, and investigates how clustering technologies can help overcome some of these challenges and enable the design of scalable web servers based on a cluster of workstations. It examines various ongoing research projects in academia and industry that are investigating scalable and highly available architectures for web servers. It discusses their scope and architecture, provides a critical analysis of their work, and presents their advantages, drawbacks, and contributions to this dissertation.
The proposed Highly Available and Scalable Web Server Architecture builds on current knowledge,
and provides contributions in areas such as scalability, availability, performance, traffic distribution,
and cluster representation.
Acknowledgments
The work that has gone into this thesis has been thoroughly enjoyable largely because of the
interaction that I have had with my supervisors and colleagues. I would like to express my gratitude
to my supervisor Professor Greg Butler, whose expertise, understanding, and patience, added
considerably to my graduate experience. I appreciate his vast knowledge and skills in many areas, and
his encouragement that provided me with much support, guidance, and constructive criticism.
I would like to thank the other members of my committee, Professor J. William Atwood, Dr. Ferhat
Khendek, and Professor Thiruvengadam Radhakrishnan for the assistance they provided at all levels
of the project. The feedback I received from members of my committee as early as during my
doctoral proposal was very important and had influence on the direction of the work.
I would also like to acknowledge the support I received from Ericsson Research, which granted me unlimited access to their remarkable research lab in Montréal, Canada.
I would also like to thank and express my gratitude to my wife, parents, brother, and sister for their
love, encouragement, and support.
Ibrahim Haddad
March 2006
Table of Contents
Abstract .................................................................................................................................................iii
Acknowledgments .................................................................................................................................iv
Table of Contents ................................................................................................................................... v
List of Figures .....................................................................................................................................viii
List of Tables.........................................................................................................................................xi
Chapter 1 Introduction and Motivation .................................................................................................. 1
1.1 Internet and Web Servers ............................................................................................................. 1
1.2 The Need for Scalability............................................................................................................... 2
1.3 Web Servers Overview................................................................................................................. 3
1.4 Properties of Internet and Web Applications ............................................................................... 8
1.5 Study Objectives......................................................................................................................... 10
1.6 Scope of the Study...................................................................................................................... 11
1.7 Thesis Contributions................................................................................................................... 13
1.8 Dissertation Roadmap ................................................................................................................ 14
Chapter 2 Background and Related Work............................................................................................ 16
2.1 Cluster Computing ..................................................................................................................... 16
2.2 SMP versus Clusters................................................................................................................... 22
2.3 Cluster Software Components.................................................................................................... 23
2.4 Cluster Hardware Components................................................................................................... 23
2.5 Benefits of Clustering Technologies .......................................................................................... 23
2.6 The OSI Layer Clustering Techniques ....................................................................................... 26
2.7 Clustering Web Servers.............................................................................................................. 32
2.8 Scalability in Internet and Web Servers ..................................................................................... 36
2.9 Overview of Related Work......................................................................................................... 43
2.10 Related Work: In-depth Examination....................................................................................... 46
Chapter 3 Preparatory Work................................................................................................................. 65
3.1 Early Work ................................................................................................................................. 65
3.2 Description of the Prototyped Web Cluster................................................................................ 65
3.3 Benchmarking Environment....................................................................................................... 67
3.4 Web Server Performance............................................................................................................ 69
3.5 LVS Traffic Distribution Methods ............................................................................................. 70
3.6 Benchmarking Scenarios............................................................................................................ 74
3.7 Apache Performance Test Results ............................................................................................. 74
3.8 Tomcat Performance Test Results ............................................................................................. 79
3.9 Scalability Results...................................................................................................................... 81
3.10 Discussion ................................................................................................................................ 83
3.11 Contributions of the Preparatory Work.................................................................................... 84
Chapter 4 The Architecture of the Highly Available and Scalable Web Server Cluster ..................... 85
4.1 Architectural Requirements ....................................................................................................... 85
4.2 Overview of the Challenges....................................................................................................... 87
4.3 The HAS Architecture ............................................................................................................... 88
4.4 HAS Architecture Components ................................................................................................. 91
4.5 HAS Architecture Tiers ............................................................................................................. 94
4.6 Characteristics of the HAS Cluster Architecture ....................................................................... 96
4.7 Availability and Single Points of Failures ................................................................................. 99
4.8 Overview of Redundancy Models............................................................................................ 102
4.9 HA Tier Redundancy Models .................................................................................................. 103
4.10 SSA Tier Redundancy Models............................................................................................... 107
4.11 Storage Tier Redundancy Models.......................................................................................... 109
4.12 Redundancy Model Choices .................................................................................................. 109
4.13 The States of a HAS Cluster Node......................................................................................... 111
4.14 Example Deployment of a HAS Cluster ................................................................................ 113
4.15 The Physical View of the HAS Architecture ......................................................................... 116
4.16 The Physical Storage Model of the HAS Architecture .......................................................... 118
4.17 Types and Characteristics of the HAS Cluster Nodes ........................................................... 123
4.18 Local Network Access ........................................................................................................... 126
4.19 Master Nodes Heartbeat......................................................................................................... 127
4.20 Traffic Nodes Heartbeat using the LDirectord Module ......................................................... 128
4.21 CVIP: A Cluster Virtual IP Interface for the HAS Architecture............................................ 130
4.22 Connection Synchronization .................................................................................................. 136
4.23 Traffic Management............................................................................................................... 140
4.24 Access to External Networks and the Internet ....................................................................... 149
4.25 Ethernet Redundancy ............................................................................................................. 150
4.26 Dependencies and Interactions between Software Components ............................................ 151
4.27 Scenario View of the Architecture ......................................................................................... 155
4.28 Network Configuration with IPv6 .......................................................................................... 172
Chapter 5 Architecture Validation...................................................................................................... 176
5.1 Introduction .............................................................................................................................. 176
5.2 Validation of Performance and Scalability............................................................................... 176
5.3 The Benchmarked HAS Architecture Configurations.............................................................. 178
5.4 Test-0: Experiments with One Standalone Traffic Node ......................................................... 180
5.5 Test-1: Experiments with a 4-nodes HAS Cluster.................................................................... 183
5.6 Test-2: Experiments with a 6-nodes HAS Cluster.................................................................... 186
5.7 Test-3: Experiments with a 10-nodes HAS Cluster.................................................................. 188
5.8 Test-4: Experiments with an 18-nodes HAS Cluster................................................................ 191
5.9 Scalability Charts...................................................................................................................... 192
5.10 Validation of High Availability.............................................................................................. 194
5.11 HA-OSCAR Architecture: Modeling and Availability Prediction......................................... 199
5.12 Impact of the HAS Architecture on Open Source .................................................................. 204
5.13 HA-OSCAR versus Beowulf Architecture............................................................................. 205
5.14 The HA-OSCAR Architecture versus the HAS Architecture................................................. 207
5.15 HAS Architecture Impact on Industry.................................................................................... 210
Chapter 6 Contributions, Future Work, and Conclusion .................................................................... 212
6.1 Contributions ............................................................................................................................ 212
6.2 Future Work ............................................................................................................................. 220
6.3 Conclusion................................................................................................................................ 226
Bibliography....................................................................................................................................... 228
Glossary.............................................................................................................................................. 241
List of Figures
Figure 1: Web server components ......................................................................................................... 4
Figure 2: Request handling inside a web server.................................................................................... 5
Figure 3: Analysis of a web request....................................................................................................... 5
Figure 4: The SMP architecture ........................................................................................................... 17
Figure 5: The MPP architecture ........................................................................................................... 18
Figure 6: Generic cluster architecture .................................................................................................. 19
Figure 7: Cluster architectures with and without shared disks ............................................................ 19
Figure 8: A cluster node stack.............................................................................................................. 20
Figure 9: The L4/2 clustering model.................................................................................................... 26
Figure 10: Traffic flow in an L4/2 based cluster ................................................................................. 27
Figure 11: The L4/3 clustering model.................................................................................................. 28
Figure 12: The traffic flow in an L4/3 based cluster........................................................................... 29
Figure 13: The process of content-based dispatching – L7 clustering model...................................... 30
Figure 14: A web server cluster ........................................................................................................... 32
Figure 15: Using a router to hide the web cluster ................................................................................ 33
Figure 16: Hierarchical redirection-based web server architecture ..................................................... 47
Figure 17: Redirection mechanism for HTTP requests........................................................................ 48
Figure 18: The web farm architecture with the dispatcher as the central component.......................... 51
Figure 19: The SWEB architecture...................................................................................................... 53
Figure 20: The functional modules of a SWEB scheduler in a single processor ................................. 54
Figure 21: The LSMAC implementation ............................................................................................. 56
Figure 22: The LSNAT implementation .............................................................................................. 56
Figure 23: The architecture of the IP sprayer....................................................................................... 58
Figure 24: The architecture with the HACC smart router.................................................................... 58
Figure 25: The two-tier server architecture.......................................................................................... 61
Figure 26: The flow of the web server router ...................................................................................... 62
Figure 27: The architecture of the prototyped web cluster .................................................................. 66
Figure 28: The architecture of the WebBench benchmarking tool ...................................................... 68
Figure 29: The architecture of the LVS NAT method ......................................................................... 71
Figure 30: The architecture of the LVS DR method............................................................................ 72
Figure 31: Benchmarking results of NAT versus DR.......................................................................... 73
Figure 32: Benchmarking results of the Apache web server running on a single processor ............... 75
Figure 33: Apache reaching a peak of 5,903 KB/s before the Ethernet driver crashes ....................... 75
Figure 34: Benchmarking results of Apache on one processor – post Ethernet driver update ............ 76
Figure 35: Results of a two-processor cluster (requests per second) ................................................... 77
Figure 36: Results of a four-processor cluster (requests per second) .................................................. 77
Figure 37: Results of eight-processor cluster (requests per second).................................................... 78
Figure 38: Results of Tomcat running on two processors (requests per second)................................. 79
Figure 39: Results of a four-processor cluster running Tomcat (requests per second)........................ 80
Figure 40: Results of an eight-processor cluster running Tomcat (requests per second).................... 80
Figure 41: Scalability chart for clusters consisting of up to 12 nodes running Apache....................... 82
Figure 42: Scalability chart for clusters consisting of up to 12 nodes running Tomcat....................... 82
Figure 43: The HAS architecture ......................................................................................................... 90
Figure 44: Built-in redundancy at different layers of the HAS architecture ...................................... 101
Figure 45: The process of the network adapter swap......................................................................... 102
Figure 47: The 1+1 active/standby redundancy model ...................................................................... 104
Figure 48: Illustration of the failure of the active node ...................................................................... 104
Figure 49: The 1+1 active/active redundancy model ......................................................................... 106
Figure 50: The N+M and N-way redundancy models........................................................................ 107
Figure 51: The N+M redundancy model with support for state replication ....................................... 108
Figure 52: The N+M redundancy model, after the failure of an active node ..................................... 108
Figure 53: The redundancy models at the physical level of the HAS architecture.............................. 109
Figure 54: The state diagram of the state of a HAS cluster node ....................................................... 112
Figure 55: The state diagram including the standby state .................................................................... 113
Figure 56: A HAS cluster using the HA NFS implementation .......................................................... 114
Figure 57: The HA-OSCAR prototype with dual active/standby head nodes.................................... 114
Figure 58: The physical view of the HAS architecture ...................................................................... 117
Figure 59: The no-shared storage model ........................................................................................... 119
Figure 60: The HAS storage model using a distributed file system ................................................... 120
Figure 61: The NFS server redundancy mechanism .......................................................................... 120
Figure 62: DRBD disk replication for two nodes in the 1+1 active/standby redundancy model ....... 122
Figure 63: A HAS cluster with two specialized storage nodes .......................................................... 123
Figure 64: The master node stack....................................................................................................... 124
Figure 65: The traffic node stack........................................................................................................ 125
Figure 66: The redundant LAN connections within the HAS architecture ........................................ 126
Figure 67: The topology of the heartbeat Ethernet broadcast............................................................. 128
Figure 68: The CVIP generic configuration ....................................................................................... 131
Figure 69: Level of distribution.......................................................................................................... 132
Figure 70: Network termination concept............................................................................................ 133
Figure 71: The CVIP framework........................................................................................................ 134
Figure 72: Step 1 - Connection Synchronization................................................................................ 138
Figure 73: Step 2 - Connection Synchronization................................................................................ 138
Figure 74: Step 3 - Connection Synchronization................................................................................ 139
Figure 75: Step 4 - Connection Synchronization................................................................................ 139
Figure 76: Peer-to-peer approach ....................................................................................................... 140
Figure 77: The CPU information available in /proc/cpuinfo.............................................................. 143
Figure 78: The memory information available in /proc/meminfo ...................................................... 144
Figure 79: Example list of traffic nodes and their load index ............................................................ 146
Figure 80: Illustration of the interaction between the traffic client and the traffic manager .............. 147
Figure 81: The direct routing approach – traffic nodes reply directly to web clients......................... 149
Figure 82: The restricted access approach – traffic nodes reply to master nodes, who in turn reply to
the web clients ........................................................................................................................... 150
Figure 83: The dependencies and interconnections of the HAS architecture system software .......... 152
Figure 84: The sequence diagram of a successful request with one active master node.................... 157
Figure 85: The sequence diagram of a successful request with two active master nodes .................. 158
Figure 86: A traffic node reporting its load index to the traffic manager........................................... 159
Figure 87: A traffic node joining the HAS cluster ............................................................................. 160
Figure 88: The boot process of a diskless node.................................................................................. 161
Figure 89: The boot process of a traffic node with disk – no software upgrades are performed ....... 162
Figure 90: The process of rebuilding a node with disk ...................................................................... 163
Figure 91: The process of upgrading the kernel and application server on a traffic node.................. 164
Figure 92: The sequence diagram of upgrading the hardware on a master node ............................... 165
Figure 93: The sequence diagram of a master node becoming unavailable ....................................... 166
Figure 94: The NFS synchronization occurs when a master node becomes unavailable ................... 166
Figure 95: The sequence diagram of a traffic node becoming unavailable ....................................... 167
Figure 96: The scenario assumes that node C has lost network connectivity .................................... 168
Figure 97: The scenario of an Ethernet port becoming unavailable .................................................. 169
Figure 98: The sequence diagram of a traffic node leaving the HAS cluster .................................... 169
Figure 99: The LDirectord restarting an application process............................................................. 171
Figure 100: The network becomes unavailable ................................................................................. 172
Figure 101: The sequence diagram of the IPv6 autoconfiguration process ....................................... 173
Figure 102: A functional HAS cluster supporting IPv4 and IPv6...................................................... 175
Figure 103: A screen capture of the WebBench software showing 379 connected clients................ 177
Figure 104: The network setup inside the benchmarking lab ............................................................ 178
Figure 105: The benchmarked HAS cluster configurations showing Test-[1..4] .............................. 179
Figure 106: The results of benchmarking a standalone processor -- transactions per second ........... 181
Figure 107: The throughput benchmarking results of a standalone processor................................... 182
Figure 108: The number of failed requests per second on a standalone processor ........................... 182
Figure 109: The number of successful requests per second on a HAS cluster with four nodes ........ 184
Figure 110: The throughput results (KB/s) on a HAS cluster with four nodes.................................. 185
Figure 111: The number of failed requests per second on a HAS cluster with four nodes................ 185
Figure 112: The number of successful requests per second on a HAS cluster with six nodes .......... 187
Figure 113: The throughput results (KB/s) on a HAS cluster with six nodes ..................................... 187
Figure 114: The number of failed requests per second on a HAS cluster with six nodes.................. 188
Figure 115: The number of successful requests per second on a HAS cluster with 10 nodes ........... 190
Figure 116: The throughput results (KB/s) on a HAS cluster with 10 nodes .................................... 190
Figure 117: The number of successful requests per second on a HAS cluster with 18 nodes ........... 191
Figure 118: The throughput results (KB/s) on a HAS cluster with 18 nodes .................................... 192
Figure 119: The results of benchmarking the HAS architecture prototype ....................................... 193
Figure 120: The scalability chart of the HAS architecture prototype ................................................ 194
Figure 121: The possible connectivity failure points......................................................................... 195
Figure 122: The tested setup for data redundancy ............................................................................ 198
Figure 123: The modeled HA-OSCAR architecture, showing the three sub-models ........................ 200
Figure 124: A screen shot of the SPNP modeling tool ...................................................................... 201
Figure 125: System instantaneous availabilities ................................................................................ 203
Figure 126: Availability improvement analysis of HA-OSCAR versus the Beowulf architecture.... 204
Figure 127: The architecture of a Beowulf cluster............................................................................. 205
Figure 128: The architecture of HA-OSCAR .................................................................................... 207
Figure 129: The CGL cluster architecture based on the HAS architecture........................................ 210
Figure 130: The contributions of the HAS architecture..................................................................... 212
Figure 131: The untested configurations of the HAS architecture..................................................... 221
Figure 133: The architecture logical view with specialized nodes .................................................... 224
List of Tables
Table 1: Classification of clusters by usage and functionality ............................................................. 21
Table 2: Characteristics of SMP and cluster systems........................................................................... 22
Table 3: Expected service availability per industry type...................................................................... 24
Table 4: Advantages and drawbacks of clustering techniques operating at the OSI layer................... 31
Table 5: Web performance metrics ...................................................................................................... 69
Table 6: The results of benchmarking with Apache............................................................................. 78
Table 7: The results of benchmarking with Tomcat............................................................................. 81
Table 8: The possible redundancy models per each tier of the HAS architecture.............................. 110
Table 9: The supported redundancy models per each tier in the HAS architecture prototype ........... 111
Table 10: The performance results of one standalone processor running the Apache web server..... 180
Table 11: The results of benchmarking a four-nodes HAS cluster .................................................... 183
Table 12: The results of benchmarking a HAS cluster with six nodes............................................... 186
Table 13: The results of benchmarking a HAS cluster with 10 nodes ............................................... 189
Table 14: The summary of the benchmarking results of the HAS architecture prototype ................. 192
Table 15: Input parameters for the HA-OSCAR model ..................................................................... 201
Table 16: System availability for different configurations................................................................. 202
Table 17: The changes made to the Linux kernel to support NFS redundancy.................................. 217
Chapter 1
Introduction and Motivation
and are robust enough to accommodate rapid changes in load. Furthermore, the variations in load
experienced by web servers intensify the challenges of building scalable and highly available web
servers. It is not uncommon to experience more than 100-fold increases in demand when a web site
becomes popular [8].
When the terrorist attacks on New York City and Washington DC occurred on September 11, 2001,
Internet news services reached unprecedented levels of demand. CNN.com, for instance, experienced
a two-and-a-half hour outage with load exceeding 20 times the expected peak [8]. Although the site
team managed to grow the server farm by a factor of five by borrowing machines from other sites,
this arrangement was not sufficient to deliver adequate service during the load spike. CNN.com came
back online only after replacing the front page with a text-only summary in order to reduce the load
[9]. Web sites are also subject to sophisticated denial-of-service attacks, often launched
simultaneously from thousands of servers, which can knock a service out of commission. Denial-of-
service attacks have had a major impact on the performance of sites such as Yahoo! and
whitehouse.gov [10]. The number of concurrent sessions and hits per day to Internet sites translates
into a large number of I/O and network requests, placing enormous demands on underlying resources.
[Figures 1 to 3: web server components, request handling inside a web server, and the analysis of a web request, showing the web clients, the DNS lookup, and the Apache web server software with its disk storage, CGI programs, and web objects (steps 1 through 5).]
Figure 3 illustrates the two main phases that a web request goes through from outside the web server:
the lookup phase includes steps (1), (2), and (3), and the request phase includes steps (4) and (5).
When the user requests a web site from the browser in the form of a URL, the request arrives (1) at the local DNS server, which consults (2) the authoritative DNS server responsible for the requested web site. The local domain name system (DNS) server then sends back (3) the IP address of the web server hosting the requested web site to the client. The client requests (4) the document from the web server using the web server's IP address, and the web server responds (5) to the client with the requested web document.
Web servers should cope with numerous incoming requests using minimal system resources. They have to multitask to deal with more than one request at a time. They provide mechanisms to control access authorization and to ensure that incoming requests are not a threat to the host system on which the web server software runs. In addition, web servers respond to error messages they receive, negotiate a style and language of response with the client, and in some cases run as a proxy server. Web servers also generate logs of all connections for statistics and security purposes.
handle an incoming request, reading from the network, looking up the requested document, reading
the document from disk, and writing the document onto the network.
reliability and availability. Therefore, web servers should deploy hardware and software fault-
tolerance and redundancy mechanisms to ensure reliability, to prevent single points of failure, and
to maintain availability in case of a hardware or software failure.
- Ability to sustain a guaranteed number of connections: This requirement obliges the web server to maintain a minimum number of connections per second and to process these connections simultaneously. The ability to sustain a guaranteed number of connections, also described as maintaining the base performance, has a direct effect on the total number of requests the web server can process at any point in time.
- High storage capacity: Web servers provide I/O and storage capacity to hold the data and the variety of information they host. In addition, with the increased demand for multimedia content, fast data retrieval has become an essential requirement.
- Cost effectiveness: An important requirement governing the future of web servers is their cost effectiveness. If the cost per transaction grows as the number of transactions grows, cost becomes a decisive factor in choosing which server architecture and software to use in any given deployment.
Designing a high performance and scalable web and Internet server is a challenging task. This dissertation aims to understand what causes scalability problems in web server clusters and explores how we can scale a web server cluster. The dissertation focuses on the design of a next generation cluster architecture that meets the requirements discussed above. The architecture needs to be able to scale linearly for up to 16 processors and to support service availability and reliability. The architecture will inherently meet other requirements such as better cluster resource utilization and the ability to handle different types of traffic.
1.4.1 High Concurrency
The growth in popularity and functionality of Internet and web services has been astounding. While the world wide web itself is growing in size, with recent estimates placing it anywhere between 1 billion and 2.5 billion unique documents, the number of users on the web is also growing at a staggering rate [16][17]. In April 2002, Nielsen NetRatings estimated that there were over 422 million Internet users worldwide [18]. Consequently, Internet and web applications need to support unprecedented concurrency demands, and these demands are increasing over time.
not overcommit its resources and degrade in a way that makes all clients suffer. Rather, the service needs to be aware of overload conditions and attempt to adapt to them, by degrading the quality of service delivered to clients, or by predictably shedding load, such as by giving users some indication that the service is saturated. It is far better for an overloaded service to inform users of the overload than to silently drop requests.
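A minimal sketch of this load-shedding idea, assuming an operator-chosen threshold on in-flight requests; the threshold value and the 503/Retry-After response are illustrative choices, not part of the HAS prototype.

```python
# Sketch of explicit load shedding: reply with 503 Service Unavailable instead
# of silently dropping requests once in-flight requests exceed a limit.
# MAX_IN_FLIGHT is an assumed, operator-chosen threshold.
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_IN_FLIGHT = 100
_in_flight = 0
_lock = threading.Lock()

class SheddingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global _in_flight
        with _lock:
            overloaded = _in_flight >= MAX_IN_FLIGHT
            if not overloaded:
                _in_flight += 1
        if overloaded:
            # Tell the client the service is saturated instead of stalling.
            self.send_response(503)
            self.send_header("Retry-After", "30")
            self.end_headers()
            return
        try:
            body = b"OK\n"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        finally:
            with _lock:
                _in_flight -= 1

if __name__ == "__main__":
    ThreadingHTTPServer(("", 8080), SheddingHandler).serve_forever()
```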
and validate the scalability of the architecture in the lab without resorting to building a theoretical
model and simulating it.
1.6.1 Goal
The goal of this study is to propose an architecture for scalable and highly available web server clusters. The architecture provides the following properties: fast access, linear scalability for up to 16 processors, architecture transparency, high availability, and robustness of the offered services.
This dissertation does not address multimedia servers, streams, sessions, states, or application servers. It also does not address, nor try to fix, problems with networking protocols. In addition, high performance computing (HPC) is not in the scope of the study. HPC is a branch of computer science that concentrates on developing software to run on supercomputers. HPC research focuses on developing parallel processing algorithms that divide a large computational task into small pieces so that separate processors can execute them simultaneously. Architectures in this category focus on maximizing compute performance for floating point operations. This branch of computing is unrelated to the dissertation.
The architecture targets servers providing services over the Internet with the characteristics previously mentioned. The architecture applies to systems with short response times, such as, but not exclusively, web servers, Authentication, Authorization and Accounting (AAA) servers, Policy servers, Home Location Register (HLR) servers, and Service Control Point (SCP) servers, without requiring specialized extensions at the architectural level.
1.6.5 Scalability
Existing server scaling methods rely on adding more hardware, upgrading processors and memory, or distributing the incoming load and traffic by partitioning users or data across several servers. These schemes, discussed in Chapter 2, suffer from different shortcomings, which can cause uneven load distribution, create bottlenecks, and obstruct the scalability of a system. These scaling methods are out of our scope and we do not aim to improve them.
Our goal with the architecture is to achieve scalability through clustering, where we dynamically direct incoming web requests to the appropriate cluster node and scale the number of serving nodes with as little overhead or drop in baseline performance as possible. Therefore, our focus is scalability while maintaining a high throughput. The cluster needs to be able to distribute the application load across N cluster nodes with linear or close-to-linear scalability.
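To make the close-to-linear goal concrete, scaling efficiency can be expressed as the measured cluster throughput divided by N times the single-processor baseline. The sketch below computes this ratio using the roughly 1000 requests per second per processor baseline cited in the abstract; the cluster throughput values are illustrative placeholders, not measured results.

```python
# Scaling efficiency relative to a single-processor baseline.
# BASELINE_RPS reflects the ~1000 requests/s per processor figure cited in the
# abstract; the per-cluster throughputs below are illustrative placeholders.
BASELINE_RPS = 1000.0

def scaling_efficiency(nodes: int, cluster_rps: float) -> float:
    """Return achieved throughput as a fraction of perfect linear scaling."""
    return cluster_rps / (nodes * BASELINE_RPS)

for nodes, cluster_rps in [(4, 3900.0), (8, 7700.0), (16, 15200.0)]:
    eff = scaling_efficiency(nodes, cluster_rps)
    print(f"{nodes:2d} nodes: {cluster_rps:7.0f} req/s, efficiency {eff:.0%}")
```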
categories: HA stateless, with no saved state information, and HA stateful, with state information that allows the web application to maintain sessions across a failover. Our scope focuses on HA stateless web applications, although we can apply the same principles to HA stateful web applications.
address. It provides a single entry to the cluster by hiding the complexity of the cluster, and provides
address location transparency, to address a resource in the cluster without knowing or specifying the
processor location. Section 4.21 presents this contribution.
Application availability: The architecture provides the capabilities to monitor the health of the
application server running on the traffic nodes, and dynamically exclude the node from the cluster in
the event the application process fails. Section 4.20 discusses these capabilities. Furthermore, with
connection synchronization between the two master nodes, in the event of failure of a master node,
the standby node is able to continue serving the established connections. Section 4.22 discusses
connection synchronization.
Contribution to Open Source: This work has resulted in several contributions to the HA-OSCAR
project [21], whose architecture is based on the HAS architecture. Section 5.12 discusses these
contributions.
Contributions to the industry: The Carrier Grade industry initiative [26] at the Open Source
Development Labs [27] has adopted the HAS architecture as the base standard architecture for carrier
grade clusters running telecommunication applications. Section 5.15 discusses this contribution.
Other contributions include benchmarking current solutions, providing enhancements to their capabilities, adding functionality to existing system software, and providing best practices for building benchmarking environments for large-scale systems.
scaling Internet and web servers. It presents a survey of academic and industry research projects and discusses their focus areas, results, and contributions. It also presents the contributions of these projects to this dissertation and how they help us achieve our goal of a scalable and highly available web server platform.
Chapter 3 summarizes the technical preparatory work we completed in the laboratory prior to
designing the Highly Available and Scalable (HAS) architecture. This chapter describes the
prototyped web cluster that uses existing components and mechanisms. It also describes the
benchmarking environment we built specifically to test the performance and scalability of web
clusters and presents the benchmarking results of the tests we conducted on the prototyped cluster.
Chapter 4 focuses on describing and discussing the HAS architecture. It presents the architecture, its
components, and their characteristics. The chapter then discusses the conceptual, physical, and
scenario architecture views, the supported redundancy models, the traffic distribution scheme, and the
dependencies between the various components. It also covers the architecture characteristics as
related to eliminating single points of failure.
Chapter 5 presents the validation of the architecture and illustrates how it scales for up to 16
processors without performance degradation. The validation covers two aspects: scalability and
availability. The chapter presents the results of the benchmarking tests we conducted on the HAS
architecture prototype. It also presents the results of experiments we conducted to test the availability
features in a HAS cluster.
Chapter 6 presents the contributions and future work in the areas of scalability and performance of
Internet and web servers.
Chapter 2
Background and Related Work
more) processor(s) and manages access to the shared resources among all the processors. A single
copy of the operating system is in charge of all the processors. SMP systems available on the market
(at the time of writing) do not exceed 16 processors, with configurations available in two, four, eight,
and 16 processors.
[Figure 4: The SMP architecture, with processors sharing memory and I/O over a system bus.]
SMP systems are not scalable because all processors have to access the same shared resources. In addition, SMP systems have a limit on the number of processors they can support. They require considerable investment in upgrades, or an entire replacement of the system, to accommodate a larger capacity. Furthermore, an SMP system runs a single copy of the operating system, where all processors share the same copy of the operating system data. If one processor becomes unavailable because of either a hardware or a software error, it can leave locks held, data structures in partially updated states, and potentially I/O devices in partially initialized states. As a result, the entire system becomes unavailable on account of a single processor. In addition, SMP architectures are not highly available. SMP systems have several single points of failure (cache, memory, processor, bus); if one subsystem becomes unavailable, it brings the system down and makes the service unavailable to the end users.
use of fully distributed memory. In an MPP system, each processor is self-contained with its own
cache and memory chips.
[Figure 5: The MPP architecture, with self-contained processors connected through an interconnecting network.]
Another distinct characteristic of MPP systems is the job scheduling subsystem; job scheduling is achieved through a single run queue. MPP systems tackle one very large computational problem at a time and are used to solve HPC problems. In addition, MPP systems suffer from the same issues as SMP systems in the areas of scalability, single points of failure, and their impact on high availability, as well as the need to shut down the system to perform software or hardware upgrades.
Figure 6 illustrates the generic cluster architecture, which consists of multiple standalone nodes that are connected through redundant links and provide a single entry point to the cluster.
[Figure 6: Generic cluster architecture, showing nodes A through N connected over redundant LANs, with a single entry point between the users on the Internet and the cluster.]
Cluster nodes interconnect in different ways. Figure 7 illustrates two common variations. In the first
variation, Figure 7-A, cluster nodes share a common disk repository; in the second variation, Figure
7-B, cluster nodes do not share common resources and use their own local disk for storage.
[Figure 7: Cluster architectures with (A) and without (B) shared disks.]
The phrase single, unified computing resource in Greg Pfister's definition of a cluster evokes a wide variety of possible applications and uses, and is deliberately vague in describing the services provided
by the cluster. At one end of the spectrum, a cluster is nothing more than the collection of whole
computers available for use by a sophisticated distributed application. At the other end, the cluster
creates an environment where existing non-distributed programs can benefit from increased
availability because of cluster-wide fault masking, and increased performance because of the
increased computing capacity.
A cluster is a group of independent COTS servers interconnected through a network. The servers, called cluster nodes, appear as a single system, and they share access to cluster resources such as shared disks, network file systems, and the network. A network interconnects all the nodes in a cluster and is separate from the cluster's external environment, such as the local intranet or the Internet. The interconnection network employs local area network or system area network technology.
Clusters can be highly available because of the built-in redundancy that prevents the presence of a single point of failure (SPOF). As a result, failures are contained within a single node. Monitoring software continually runs checks, by sending signals also called heartbeats, to ensure that the cluster node and the application running on it are up and available. If these signals stop, the system software initiates a failover to recover from the failure. The presumably dead or unavailable system or application is then isolated from I/O access, disks, and other resources such as access to the network; furthermore, incoming traffic is redirected to other available nodes within the cluster. As for performance, clusters make it possible to add nodes and scale up the performance, the capacity, and the throughput of the cluster as the number of users or the traffic increases.
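A simplified sketch of such heartbeat monitoring follows, assuming hypothetical probe() and failover() hooks rather than any specific cluster software: a node that misses several consecutive heartbeats is removed from the active set and its traffic is failed over.

```python
# Sketch of heartbeat-based failure detection: a node that misses several
# consecutive heartbeats is declared failed and removed from the active set.
# probe() and failover() are assumed hooks, not a real cluster API.
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between checks (assumed value)
MAX_MISSED = 3             # missed heartbeats before declaring failure

def monitor(nodes, probe, failover):
    """nodes: list of node names; probe(node) -> bool; failover(node) -> None."""
    missed = {node: 0 for node in nodes}
    active = set(nodes)
    while True:
        for node in list(active):
            if probe(node):
                missed[node] = 0
            else:
                missed[node] += 1
                if missed[node] >= MAX_MISSED:
                    active.discard(node)   # isolate the failed node
                    failover(node)         # redirect its traffic elsewhere
        time.sleep(HEARTBEAT_INTERVAL)
```

In practice the probe would be a UDP heartbeat message or an HTTP health check, and the failover hook would update the traffic distribution tables.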
[Figure 8: A cluster node stack, consisting of applications, operating system, interconnect protocol, interconnect technology, and nodes.]
Table 1: Classification of clusters by usage and functionality

High performance computing clusters (2nd column)
- Goal: maximize floating point computation performance.
- Description: many nodes working together on a single compute-based problem. Performance is measured as the number of floating point operations (FLOP) per second.
- Examples: Beowulf-class clusters such as MOSIX [30][31], Rocks [32], OSCAR [33][34][35], and Ganglia [36].

Clusters for scalability and load balancing (3rd column)
- Goal: maximize throughput and performance.
- Description: many nodes working on similar tasks, distributed in a defined fashion based on system load characteristics. Performance is measured as throughput in terms of KB/s. Capacity is increased by adding more nodes to the cluster. These clusters can be network oriented (network throughput) or data oriented (data transactions).
- Examples: the Linux Virtual Server [23] and TurboLinux [37], in addition to commercial database products.

Clusters for high availability (4th column)
- Goal: maximize service availability.
- Description: redundancy and failover provide fault tolerance of services, for both stateless and stateful applications. Availability is measured as the percentage of time the system is up and providing service; Section 4.7 presents the formula for calculating the availability.
- Examples: the HA-OSCAR project [21], in addition to commercial clustering products.

Clusters for server consolidation (5th column)
- Goal: maximize ease of management of multiple computing resources.
- Description: also called Single System Image (SSI) clusters; they provide central management of cluster resources and treat the cluster as a single management unit.
- Examples: the OpenSSI project [38], the OpenGFS project [39], and the Oracle Cluster File System [40].
Clustering for scalability (Table 1, 3rd column) focuses on distributing web traffic among cluster
nodes using distribution algorithms such as round robin DNS.
Clustering for high availability (Table 1, 4th column) relies on redundant servers to ensure that critical
applications remain available if a cluster node fails. There are two methods for failover solutions:
software-based failover solutions discussed in Section 2.7.1.2, and hardware-based failover devices
discussed in Section 2.7.1.1. Software-based failover detects when a server has failed and
automatically redirects new incoming HTTP requests to the cluster members that are available.
Hardware-based failover devices have limited built-in intelligence and require an administrator's
intervention when they detect a failure.
Many of the clustering products available fit into more than one of the above categories. For instance,
some products include both failover and load-balancing components. In addition, SSI products that fit
into the server consolidation category (Table 1, 5th column) provide certain HA failover capabilities.
Our goal with this dissertation is a cluster architecture that targets both scalability and high
availability.
SMP systems have limited scalability, while clusters have virtually unlimited scaling capabilities
since we can always continue to add more nodes to the cluster. As for high availability, an SMP
system has several single points of failure, where a single error can lead to system downtime. In a cluster, by contrast, functionality is redundant and spread across multiple cluster nodes. As for
management, an SMP system is a single system, while a cluster is composed of several nodes, some
of which can be SMP machines.
2.5.1 High Availability
High availability (HA) refers to the availability of resources in a computer system [41]. We achieve HA through redundant hardware, specialized software, or both [41][42][43]. With clusters, we can provide service continuity by isolating or reducing the impact of a failure in a node, resource, or device through redundancy and failover techniques. Table 3 presents the various levels of HA, the annual downtime, and the types of applications for various classes of systems [44].
9's Availability Downtime per year Example Areas for Deployments
1 90.00% 36 days 12 hours Personal clients
2 99.00% 87 hours 36 minutes Entry-level businesses
3 99.90% 8 hours 46 minutes ISPs, mainstream businesses
4 99.99% 52 minutes 33 seconds Data centers
5 99.999% 5 minutes 15 seconds Telecom system, medical, banking
6 99.9999% 31.5 seconds Military defense, carrier grade routers
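The downtime column in Table 3 follows directly from the availability percentage. The short sketch below reproduces these figures; the dissertation's own availability formula, based on failure and repair times, appears in Section 4.7 and is not shown here.

```python
# Annual downtime implied by an availability level, as in Table 3.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the expected minutes of downtime per year for a given availability."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR

for nines, a in [(2, 99.0), (3, 99.9), (4, 99.99), (5, 99.999), (6, 99.9999)]:
    minutes = downtime_minutes_per_year(a)
    print(f"{nines} nines ({a}%): about {minutes:.1f} minutes of downtime per year")
```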
It is important not only that a service be down for no more than N minutes a year, but also that the length of outages be short enough, and the frequency of outages low enough, that the end user does not perceive them as a problem. Therefore, the goal is to have a small number of failures and a prompt recovery time. This concept is termed Service Availability, meaning that whatever services the user wants are available in a way that meets the user's expectations.
2.5.2 Scalability
Clusters provide means to reach high levels of scalability by expanding the capacity of a cluster in
terms of processors, memory, storage, or other resources, to support users and traffic growth [1].
2.5.5 Manageability
Clusters require a management layer that allows us to manage all cluster nodes as a single entity [28].
Such cluster management facilities help reduce system management costs. A significant number of cluster management software packages exist; almost all of them originated in research projects and have since been adopted by commercial vendors.
2.5.8 Transparency
The SSI layer represents the nodes that make up the cluster as a single server. It allows users to use a
cluster easily and effectively without the knowledge of the underlying system architecture or the
number of nodes inside the cluster. This transparency frees the end-user from having to know where
an application runs.
cluster structure not only benefits the end user but the cluster vendor as well, yielding a wide array of
system capabilities and cost tradeoffs to meet customer demands.
[Figure 9: The L4/2 clustering model, in which client requests pass through the dispatcher to servers 1 through n, and the servers send their replies directly to the clients.]
In L4/2 based clusters, the dispatcher and all the servers in the cluster share the cluster network-layer
address using primary and secondary IP addresses. While the primary address of the dispatcher is the
same as the cluster address, each cluster server is configured with the cluster address as a secondary address, either through interface aliasing or by changing the address of the loopback device on the cluster servers. The nearest gateway is configured such that all packets arriving for the cluster address
are addressed to the dispatcher at layer two using a static Address Resolution Protocol (ARP) cache
entry. If the packet received corresponds to a TCP/IP connection initiation, the dispatcher selects one
of the servers in the server pool to service the request (Figure 9).
The selection of the server to respond to the incoming request relies on a traffic distribution algorithm
such as round robin. When an incoming request arrives at the dispatcher, the dispatcher creates an
entry in a connection map that includes information such as the origin of the connection and the
chosen cluster server. The layer two destination address is then rewritten to the hardware address of
the chosen cluster server, and the frame is placed back on the network. If the incoming packet is not
for a connection initiation, the dispatcher examines its connection map to determine if it belongs to a
currently established connection. If it does, the dispatcher rewrites the layer two destination address
to be the address of the cluster server previously selected, and forwards the packet to the cluster
server as before. In the event that the received packet does not correspond to an established
connection and is not a connection initiation packet, then the dispatcher drops it.
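The L4/2 dispatching logic just described can be summarized in the following Python sketch. The frame object, its fields, and the round robin selector are illustrative assumptions rather than part of any cited implementation; the essential point is that only the layer two destination address is rewritten.

import itertools

class L42Dispatcher:
    """Sketch of L4/2 dispatching: only the layer two destination address is rewritten."""

    def __init__(self, server_macs):
        self.next_server = itertools.cycle(server_macs)    # simple round robin traffic distribution
        self.connection_map = {}                           # (client IP, client port) -> server MAC

    def handle(self, frame):
        key = (frame.src_ip, frame.src_port)
        if frame.is_syn:                                   # connection initiation: select a server
            self.connection_map[key] = next(self.next_server)
        elif key not in self.connection_map:
            return None                                    # not a SYN and not an established connection: drop
        frame.dst_mac = self.connection_map[key]           # rewrite only the layer two destination
        return frame                                       # the frame is placed back on the network

Because the servers answer the clients directly, the sketch needs no return path; the dispatcher only ever sees the incoming stream.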
Figure 10 illustrates the traffic flow in an L4/2 clustered environment [45]. A web client sends an
HTTP packet (1) with A as the destination IP address. The immediate router sends the packet to the
dispatcher at IP address A (2). Based on the traffic distribution algorithm and the session table, the
dispatcher decides which back-end server will handle this packet, server 2 for instance, and sends the
packet to server 2 by changing the MAC address of the packet to server 2's MAC address and
forwarding it (3). Server 2 accepts the packet and replies directly to the web client.
[Figure 10: Traffic flow in an L4/2 clustered environment, showing the router, the dispatcher, and the back-end servers.]
L4/2 clustering has a performance advantage over L4/3 clustering because of the downstream bias of
web transactions. Since the network address of the cluster server to which the packet is delivered is
identical to the one the web client used originally in the request packet, the cluster server handling
that connection may respond directly to the client rather than through the dispatcher. As a result, the
dispatcher processes only the incoming data stream, which is a fraction of the entire transaction.
Moreover, the dispatcher does not need to re-compute expensive integrity codes (such as the IP
checksums) in software since only layer two parameters are modified. Therefore, the two parameters that limit the scalability of the cluster are the network bandwidth and the sustainable request rate of the dispatcher for the incoming stream, which is the only portion of the transaction actually processed by the dispatcher.
One restriction on L4/2 clustering is that the dispatcher must have a direct physical connection to all
network segments that house servers (due to layer two frame addressing). This contrasts with L4/3
clustering (Section 2.6.2), where the server may be anywhere on any network with the sole constraint
that all client-to-server and server-to-client traffic must pass through the dispatcher. In practice, this
restriction on L4/2 clustering has little appreciable impact since servers in a cluster are likely to be
connected via a single high-speed LAN.
Among the research and commercial products implementing layer two clustering are ONE-IP, developed at Bell Laboratories [46], IBM's eNetwork Dispatcher [47], and LSMAC from the University of Nebraska-Lincoln (Section 2.10.4).
2.6.2 L4/3 Clustering
[Figure: L4/3 clustering. Both requests and replies pass through the dispatcher between the clients and Server 1 through Server n.]
Similar to L4/2 clustering, the selection of the cluster server relies on a traffic distribution algorithm.
The dispatcher then creates an entry in the connection map noting the origin of the connection, the
chosen server, and other relevant information. However, unlike the L4/2 approach, the dispatcher
rewrites the destination IP address of the packet as the address of the cluster server selected to service
this request. Furthermore, the dispatcher re-calculates any integrity codes affected such as packet
checksums, cyclic redundancy checks, or error correction checks. The dispatcher then sends the
modified packet to the cluster server corresponding to the new destination address of the packet. If the
incoming web client traffic is not a connection initiation, the dispatcher examines its connection map
to determine if it belongs to a currently established connection. If it does, the dispatcher rewrites the
destination address as the server previously selected, re-computes the checksums, and forwards the
packet to the cluster server as we described earlier. In the event that the packet does not correspond to
an established connection and it is not a connection initiation packet, then the dispatcher drops the
packet.
The traffic sent from the cluster servers to the web clients travels through the dispatcher since the
source address on the response packets is the address of the particular server that serviced the request,
not the cluster address. The dispatcher rewrites the source address to the cluster address, re-computes
the integrity codes, and forwards the packet to the web client.
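For comparison, the sketch below outlines the L4/3 path in the same style. The packet object and its recompute_checksums method are illustrative assumptions; the point to note is that both directions traverse the dispatcher and that the integrity codes must be recomputed after every rewrite.

import itertools

class L43Dispatcher:
    """Sketch of L4/3 dispatching: network-layer addresses are rewritten in both directions."""

    def __init__(self, cluster_ip, server_ips):
        self.cluster_ip = cluster_ip
        self.next_server = itertools.cycle(server_ips)
        self.connection_map = {}                           # (client IP, client port) -> server IP

    def inbound(self, packet):
        key = (packet.src_ip, packet.src_port)
        if packet.is_syn:
            self.connection_map[key] = next(self.next_server)
        elif key not in self.connection_map:
            return None                                    # unknown, non-initiation packet: drop
        packet.dst_ip = self.connection_map[key]           # rewrite the destination IP address
        packet.recompute_checksums()                       # IP and TCP integrity codes must be redone
        return packet

    def outbound(self, packet):
        packet.src_ip = self.cluster_ip                    # replies are rewritten to the cluster address
        packet.recompute_checksums()
        return packet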
[Figure 12: Traffic flow in an L4/3 clustered environment, with the dispatcher owning IP address A.]
Figure 12 illustrates the traffic flow in an L4/3 clustered environment [45]. A web client sends an
HTTP packet with A as the destination IP address (1). The immediate router sends the packet to the
dispatcher (2), since the dispatcher machine is the owner of the IP address A. Based on the traffic
distribution algorithm and the session table, the dispatcher decides to forward this packet to the back-
end server, Server 2 (3). The dispatcher then rewrites the destination IP address as B2, recalculates
the IP and TCP checksums, and sends the packet to B2 (3). Server 2 accepts the packet and replies to
the client via the dispatcher (4), which the back-end server sees as a gateway. The dispatcher rewrites
the source IP address of the replying packet as A, recalculates the IP and TCP checksums, and sends
the packet to the web client (5).
RFC 2391, Load Sharing using IP Network Address Translation, presents the L4/3 clustering
approach [48]. The LSNAT from the University of Nebraska-Lincoln provides a non-kernel space
implementation of the L4/3 clustering approach [49]. Section 2.10.4 discusses the project and the
implementation.
L4/2 clustering theoretically outperforms L4/3 clustering because of the overhead that L4/3 clustering imposes: the necessary integrity code recalculation, coupled with the fact that all traffic must flow through the dispatcher, means that an L4/3 dispatcher processes more traffic than an L4/2 dispatcher does. Therefore, the total data throughput of the dispatcher, more than the sustainable request rate, limits the scalability of the system.
2.6.3 L7 Clustering
A layer 7 (L7) web switch works at the application level. The web switch establishes a connection with the web client and inspects the HTTP request content to decide about dispatching. The L7 clustering technique is also known as content-based dispatching since it operates based on the contents of the client request. The Locality-Aware Request Distribution (LARD) dispatcher developed by researchers at Rice University is an example of L7 clustering. LARD partitions a web document tree into disjoint sub-trees. The dispatcher then allocates to each server in the cluster one of these sub-trees to serve. As such, LARD provides content-based dispatching as the dispatcher receives web client requests.
[Figure 13: L7 clustering. The dispatcher classifies incoming requests by type (a, b, c) and forwards each type to the server responsible for it.]
Figure 13 presents an overview of the processing with the L7 clustering approach [45]. Server 1 processes requests of type a; Server 2 processes requests of types b and c. The dispatcher separates the stream of requests into two streams: one stream for Server 1 with requests of type a, and one stream for Server 2 with requests of types b and c. As requests arrive from clients for the web cluster, the dispatcher accepts the connection and the request. It then classifies the requested document and dispatches the request to the appropriate server. The dispatching of requests requires support from a modified kernel that enables the connection handoff protocol. After establishing the
connection, identifying the request, and choosing the cluster server, the dispatcher informs the cluster
server of the status of the network connection, and the cluster server takes over that connection, and
communicates directly with the web client. Following this approach, LARD allows the file system cache of each cluster server to cache a separate part of the web tree rather than having to cache the entire tree, as is the case with L4/2 and L4/3 clustering. Additionally, it is possible to have specialized server nodes where, for instance, dynamically generated content is offloaded to special compute servers while other requests are dispatched to servers with less processing power. LARD requires modifications to the operating system on the servers to support the TCP handoff protocol.
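A minimal sketch of content-based dispatching in the LARD style follows. The URL-prefix partitioning and server names are hypothetical, and the actual connection transfer relies on the modified-kernel TCP handoff protocol described above.

class ContentDispatcher:
    """Sketch of L7 dispatching: requests are routed by URL prefix, one sub-tree per server."""

    def __init__(self, subtree_to_server):
        # Hypothetical partitioning, e.g. {"/stocks/": "server1", "/weather/": "server2"}.
        self.subtree_to_server = subtree_to_server

    def choose_server(self, url):
        for prefix, server in self.subtree_to_server.items():
            if url.startswith(prefix):
                return server
        return None          # unmapped content; a default or least loaded server could be chosen

# Usage sketch: after accepting the connection and reading the request, the dispatcher calls
# choose_server() and then hands the established connection off to the selected cluster server.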
HA and fault tolerance:
  L4/2: Varies; several single points of failure
  L4/3: Varies; several single points of failure
  L7: Varies; several single points of failure
Restrictions:
  L4/2: Incoming traffic passes through the dispatcher
  L4/3: The dispatcher lies between client and server; all incoming and outgoing traffic passes through the dispatcher
  L7: Incoming traffic passes through the dispatcher
Table 4: Advantages and drawbacks of clustering techniques operating at different layers of the OSI model
Each of the approaches creates bottlenecks that limit scalability and presents several single points of failure. For L4/2 dispatchers, system performance is constrained by the ability of the dispatcher to set up, look up, and tear down connection entries; hence, the most telling performance metric is the sustainable request rate. The limitation of L4/3 dispatchers is their ability to rewrite packets and recalculate the checksums for the massive numbers of packets they process; hence, the most telling performance metric is the throughput of the dispatcher. Lastly, the L7 clustering approach has limitations related to the complexity of the content-based routing algorithm and the size of its cache.
[Figure: Web clients accessing the web server cluster through the Internet.]
The following sub-sections explore the software and hardware techniques used to build web clusters.
[Figure: A hardware-based clustering solution. Web clients reach the web cluster nodes and the network attached storage through the Internet and a router.]
Hardware-based clustering solutions use routers to provide a single IP interface to the cluster and to
distribute traffic among various cluster nodes. These solutions are a proven technology; they are
neither complicated nor complex by design. However, they have certain limitations, such as limited intelligence, unawareness of the applications running on the cluster nodes, and the presence of a SPOF.
Limited intelligence: Packet routers can load balance in a round robin fashion, and some can detect
failures and automatically remove failed servers from a cluster and redirect traffic to other nodes.
These routers are not fully intelligent network devices. They do not provide application-aware traffic
distribution. While they can redirect requests upon discovering a failure, they do not allow
configuring redirection thresholds for individual servers in a cluster, and therefore, they are unable to
manage load to prevent failures.
Lack of Dynamism: A router cannot measure the performance of a web application server or make an
intelligent decision on where to route the request based on the load of the cluster node and its
hardware characteristics.
Single point of failure: The packet router constitutes a SPOF for the entire cluster. If the router fails, the
cluster is not accessible to end users and the service becomes unavailable.
A newer version of MSCS promises to support larger clusters and to include enhanced services to simplify the creation of highly scalable, cluster-aware applications [54]. The current version of MSCS suffers from scalability issues, as it only supports two servers that require upgrading as the traffic increases.
The Linux Virtual Server (LVS) is an open source project that aims to provide a high performance
and highly available software clustering implementation for Linux [23]. It implements layer 4
switching in the Linux kernel, providing a virtual server layer built on a cluster of real servers and
allowing TCP and UDP sessions to be load balanced across multiple real servers. The virtual service is defined by an IP address, port, and protocol. The front-end of the real servers is a load balancer, which schedules requests to the different servers and makes the parallel services of the cluster appear as a virtual service on a single IP address. The architecture of the cluster is transparent to end users, who only see the address of the virtual server. The LVS is available in three
different implementations [55]: Network Address Translation (NAT), Direct Routing (DR), and IP
tunneling. Sections 3.5.1, 3.5.2, and 3.5.3, present and discuss the NAT, DR, and IP tunneling
methods, respectively.
We have experimented with both the NAT and DR methods. Section 3.5.4 presents the benchmarking
results comparing the performance of both methods. Each of these techniques for providing a virtual
interface to a web cluster has its own advantages and disadvantages. Based on our lab experiments
discussed in Chapter 3, we concluded that the common disadvantage among these schemes is their
limited scalability (Figure 31). When the traffic load increases, the load balancer becomes a
bottleneck for the whole cluster and the local director crashes under heavy load or stops accepting
new incoming requests. In both cases, the local director replies very slowly to ongoing requests.
Software clustering solutions have three main advantages that make them a better alternative to hardware clustering solutions: flexibility, intelligence, and availability. First, software clustering solutions can augment existing hardware devices, thereby providing a more robust traffic distribution and failover solution. Additionally, by integrating hardware with software, an organization diminishes, if not eliminates, losses on capital expenditures it has already made. Secondly, they provide a level of intelligence that enables preventive traffic distribution measures that minimize the chance of servers becoming unavailable. In the event that a server becomes overloaded or actually fails, some software can automatically detect the problem and reroute HTTP requests to other nodes in the cluster.
Thirdly, with software clustering solutions, we can support high availability capabilities to avoid
single points of failure. An individual server failure does not affect the service availability since
functionalities and failover capabilities are distributed among the cluster servers.
However, we need to consider several issues when evaluating software clustering solutions, mainly the differences among feature sets, platform constraints, and HA and scaling capabilities. Software clustering solutions have different capabilities and features, such as their capability of providing automatic failure detection, notification, and recovery. Some solutions have significantly delayed
failure detection; others allow the configuration of the load thresholds to enable preventive measures.
In addition, they can support different redundancy models such as the 1+1 active/standby, 1+1
active/active, N+M and N-way. Therefore, we need to determine the needs or requirements for
scalability and failover and pick the solution accordingly. In addition, software solutions have limited platform compatibility; they are available to run only on specific operating systems or computing environments. Furthermore, the capability of the clustering solution to scale is important. Some
solutions have limited capabilities restricted to four, eight, or 16 nodes, and therefore have scaling
limitations.
As such, scalability presents itself as a crucial factor for the success or failure of online services and it
is certainly one important challenge faced when designing servers that provide interactive services for
a wide clientele.
Many factors can negatively affect the scalability of systems [59]. The first common factor is the growth of the user base, which causes serious capacity problems for servers that can only serve a certain number of transactions per second. If the server is not able to cope with the increased number of users and traffic, the server starts rejecting requests. A second key factor negatively affecting the scalability of servers is the number and size of data objects; in particular, large audio and video files strain the network and I/O capacity, causing scalability problems. The increasing amount of accessible data makes data search, access, and management more difficult, which causes processing problems and eventually leads to rejecting incoming requests. Finally, the non-uniform request distribution imposes strains on the servers and network at certain times of the day or for certain requested data. These
factors can cause servers to suffer from bottlenecks, and run out of network, processing, and I/O
resources.
Mobile Internet servers host next generation interactive and multimedia services. These servers suffer from scalability problems as the number of mobile subscribers is increasing at a fast pace [1]. To cope with the increased number of
users and traffic, mobile operators are resorting to upgrading servers or buying new servers with more
processing power [59], a process that proved to be expensive and iterative. According to Ericsson
Research, the growth rate of mobile subscribers in 2004 was approximately 500,000 users per day
[62]. This raises the question of whether the Mobile Internet servers and the applications running on
those servers will be able to cope with such growth.
When servers are not able to cope with increased traffic, the result is a failure to meet the high expectations of paying customers, who expect services to be available at all times with acceptable performance levels [63][64], and a failure to meet and manage service level agreements. Service level
agreements dictate the percentage of the time services will be available, the number of users that can
be served simultaneously, specific performance benchmarks to which actual performance will be
periodically compared, and access availability. If ISPs, for instance, are not able to cope with the increasing number of users, they will break their service level agreements, causing them to lose money and potentially lose customers. Similarly, mobile operators face large financial losses if their servers are not available to their subscribers.
The response time is the time that elapses between the moment a user gives an input, or posts a request, and the moment the user receives an answer from the server. Total response time includes
the time to connect, the time to process the request on the server, and the time to transmit the response
back to the client:
Total response time = connect time + process time + response transit time
When throughput is low, the response transit time is insignificant. However, as throughput
approaches the limit of network bandwidth, the server has to wait for bandwidth to become available
before it can transmit the response.
The response time in a distributed system consists of all the delays created at the source site, in the
network, and at the receiver site. The possible reasons for the delays and their length depend on the
system components and the characteristics of the transport media. The response time consists of the
delays in both directions.
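As a simple numerical illustration of the formula above, the sketch below computes a total response time, treating the response transit time as the object size divided by the available bandwidth. The numbers are illustrative, not measurements.

def total_response_time(connect_time, process_time, object_bytes, bandwidth_bytes_per_second):
    """Total response time = connect time + process time + response transit time."""
    transit_time = object_bytes / bandwidth_bytes_per_second
    return connect_time + process_time + transit_time

# Example: a 10 KB page with 10 ms to connect, 5 ms of server processing,
# and 1 MB/s of spare bandwidth gives roughly 0.025 seconds in total.
print(total_response_time(0.010, 0.005, 10_000, 1_000_000))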
The widely deployed scaling methods for clustered web servers are round robin DNS and packet routing devices that distribute incoming traffic.
2.8.4 Principles of Scalable Architecture
This section discusses the principles of scalable architectures. After presenting the HAS architecture
in Chapter 4, we discuss how the HAS architecture design meets the architectural scaling principles presented in this section.
We can characterize the applications by their consumption of four primary system resources:
processor, memory, file system bandwidth, and network bandwidth. We can achieve scalability by
simultaneously optimizing the consumption of these resources and designing an architecture that can
grow modularly by adding more resources.
Several design principles are required to design scalable systems. The list includes divide and
conquer, asynchrony, encapsulation, concurrency, and parsimony [65]. Each of these principles
presents a concept that is important in its own right when designing a scalable system. There are also
tensions between these principles; we can sometimes apply one principle at the cost of another. The
root of a solid system design is to strike the right balance among these principles. In the following subsections, we present each of these principles.
2.8.4.2 Asynchrony
The asynchrony principle means that the system carries out the work based on available resources
[65]. Synchronization constrains a system under load because application components cannot process
work in random order, even if resources do exist to do so. Asynchrony decouples functions and lets
the system schedule resources more freely and thus potentially more completely. This principle
allows us to implement strategies that effectively deal with stress conditions such as peak load.
2.8.4.3 Encapsulation
The encapsulation principle is the concept of building the system using loosely coupled components,
with little or no dependence among components [65]. This principle often, but not always, correlates
with asynchrony. Highly asynchronous systems tend to have well encapsulated components and vice
versa. Loose coupling means that components can pursue work without waiting for work from others.
2.8.4.4 Concurrency
The concurrency principle means that there are many moving parts in a system and the goal is to split
the activities across hardware, processes, and threads [65]. Concurrency aids scalability by ensuring
that the maximum possible work is active at all times and addresses system load by spawning new
resources on demand within predefined limits. Concurrency also maps directly to the ability to scale
by rolling in new hardware. The more concurrency applications exploit, the better the possibilities to
expand by adding new hardware.
2.8.4.5 Parsimony
The parsimony principle indicates that the designer of the system needs to be economical in what he
or she designs [65]. Each line of code and each piece of state information has a cost, and, collectively,
the costs can increase exponentially. A developer has to ensure that the implementation is as efficient
and lightweight as possible. Paying attention to thousands of micro details in a design and
implementation can eventually pay off at the macro level with improved system throughput.
Parsimony also means that designers carefully use scarce or expensive resources. No matter what
design principle a developer applies, a parsimonious implementation is appropriate. Some examples
include algorithms, I/O, and transactions. Parsimony ensures that algorithms are optimal to the task
since several small inefficiencies can add up and kill performance. Furthermore, performing I/O is
one of the more expensive operations in a system and we need to keep I/O activities to the bare
minimum. Moreover, transactions constrain access to costly resources by imposing locks that prohibit
read or write operations. Applications should work outside of transactions whenever feasible and exit each transaction in the shortest time possible.
2.8.5 Strategies for Achieving Scalability
Section 2.8.4 presented the five principles of scalable architectures. This section presents the design strategies to achieve a scalable architecture.
The researchers at the Korea Advanced Institute of Science and Technology have developed an adaptive load balancing method that changes the number of scheduling entities according to the workload [71]. It behaves exactly like a dispatcher-based scheme under low or intermediate workload, taking advantage of fine-grained load balancing. When the dispatcher is overloaded, the DNS servers distribute the dispatching jobs to other entities such as the back-end servers. In this way, they relieve the hot spot at the dispatcher. Based on simulation results, they demonstrated that the adaptive dispatching method improves the overall performance under a realistic workload simulation.
In [72], the authors present and evaluate an implementation of a prototype scalable web server
consisting of a balanced cluster of hosts that collectively accept and service TCP connections. The
host IP addresses are advertised using the round robin DNS technique, allowing any host to receive requests from a client. They use a low-overhead technique called distributed packet rewriting (DPR) to
redirect TCP connections. Each host keeps information about the remaining hosts in the system. Their
performance measurements suggest that their prototype outperforms round robin DNS. However,
their benchmarking was limited to a five-node cluster, where each node reached a peak of 632
requests per second, compared to the over 1,000 requests per second per node that we achieved with our early prototype (Section 3.7).
In [45], the authors discuss clustering as a preferred technique to build scalable web servers. The
authors examine early products and a sample of contemporary commercial offerings in the field of
transparent web server clustering. They broadly classify transparent server clustering into three
categories: L4/2, L4/3, and L7 clustering, and discuss their advantages and disadvantages.
In [73], the authors present their two implementations for traffic manipulation inside a web cluster:
MAC-based dispatching (LSMAC) and IP-based dispatching (LSNAT). The authors discuss their
results, and the advantages and disadvantages of both methods. Section 2.10.4 discusses those
approaches.
The researchers from Lucent Technologies and the University of Texas at Austin present in [74] their
architecture for a scalable web cluster. The distributed architecture consists of independent servers
sharing the load through a round robin traffic distribution mechanism.
In [75], the authors present optimizations to the NCSA HTTP server [76] to make it more scalable and allow it to serve more requests.
2.10 Related Work: In-depth Examination
This section discusses six projects that share the common goal of increasing the performance and
scalability of web clusters. These projects had different focus areas, such as traffic distribution algorithms, new architectures, and presenting the cluster as a single server through a virtual IP layer. This section examines these research projects, presents their respective areas of research and their architectures, highlights their status and plans, and discusses the contributions of their research to our work. The works discussed are the following:
- “Redirectional-based Web Server Architecture” at University of Texas (Austin): The goal of this
project is to design and prototype a redirectional-based hierarchical architecture that eliminates
bottlenecks in the cluster and allows the administrator to add hardware seamlessly to handle
increased traffic [77]. Section 2.10.1 discusses this project.
- “Scalable policies for Scalable Web clusters” at the University of Roma: The goal of the project
is to provide scalable scheduling policies for web clusters [68][78]. Section 2.10.2 discusses this
project.
- “The Scalable Web Server (SWEB)” at the University of California (Santa Barbara): The project
investigates the issues involved in developing a scalable web server on a cluster of workstations.
The objective is to strengthen the processing capabilities of such servers by utilizing the power of clustered computers to match the huge demand for simultaneous access requests from the Internet [78].
Section 2.10.3 discusses this project.
- “LSMAC and LSNAT”: The project at the University of Nebraska-Lincoln investigates server
responsiveness and scalability in clustered systems and client/server network environments [79].
The project is focusing on different server infrastructures to provide a single entry into the cluster
and traffic distribution among the cluster nodes [73]. Section 2.10.4 examines the project and its
results.
- “Harvard Array of Clustered Computers (HACC)”: The HACC project aims to design and prototype a cluster architecture for scalable web servers [81]. The focus of the project is on a
technology called “IP Sprayer”, a router component that sits between the Internet and the cluster
and is responsible for traffic distribution among the nodes of the cluster [82]. Section 2.10.5
discusses this project.
- “IBM Scalable and Highly Available Web Server”: This project is investigating scalable and highly available web clusters. The goal of the project is to develop a scalable web cluster that
will host web services on IBM proprietary SP-2 and RS/6000 systems [83]. Section 2.10.6
discusses this project.
2.10.1 Redirection-Based Web Server Architecture
[Figure 16: The hierarchical redirection-based web server architecture. Clients are distributed by round robin DNS across redirectional servers 1 through k, which redirect requests to the HTTP servers.]
Figure 16 illustrates the architecture of the hierarchical redirection based web server approach. Each
HTTP server stores a portion of the data available at the site. The round robin DNS distributes the
load among the redirection servers [75]. The redirection servers in turn redirect the requests to the
HTTP servers where a subset of the data resides. The redirection mechanism is part of the HTTP
protocol and it is completely transparent to the user. The browser automatically recognizes the
redirection message, derives the new URL from it, and connects to the new server to fetch the file.
The original goal of the redirection mechanism supported in HTTP was to facilitate moving files from
one server to another. When a client later uses the old URL from its cache or bookmarks, and the file referenced by the old URL has moved to a new server, the old server returns a redirection message, which contains the new URL. The cluster administrator partitions the documents stored at
the site among the different servers based on their content. For instance, server 1 could store stock
price data, while server 2 stores weather information and server 3 stores movie clips and reviews. All
requests for stock quotes are directed to server 1, while requests for weather information are directed to server 2.
It is possible to implement the architecture described with server software modifications. However, in
order to provide more flexibility in load balancing and additional reliability, there is a need to
replicate contents on multiple servers. Implementing data replication requires modifying the data
structure containing the mapping information. If there is replication of data, a logical file name is
mapped to multiple URLs on different servers. In this case, the redirection server has to choose one of
the servers containing the relevant information data. Intelligent strategies for choosing the servers can
be implemented to better balance the load among the HTTP servers. Many approaches are possible
including round robin and weighted round robin.
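As an illustration of the redirection mechanism, the sketch below implements a minimal redirection server on top of Python's standard library. The prefix-to-server table, host names, and port are hypothetical, and replicas of the same content are selected in round robin order; it is a sketch of the idea, not the authors' implementation.

import itertools
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from URL prefix to the replicas that store that content.
CONTENT_MAP = {
    "/stocks/":  itertools.cycle(["http://server1.example.com"]),
    "/weather/": itertools.cycle(["http://server2.example.com", "http://server3.example.com"]),
}

class RedirectionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        for prefix, replicas in CONTENT_MAP.items():
            if self.path.startswith(prefix):
                self.send_response(302)                                  # HTTP redirection message
                self.send_header("Location", next(replicas) + self.path)
                self.end_headers()
                return
        self.send_error(404)                                             # content not mapped to any server

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectionHandler).serve_forever()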
[Figure 17: Steps of a web request in the redirection-based architecture, involving the browser, the DNS, the redirectional server, and the target HTTP server.]
Figure 17 illustrates the steps a web request goes through until the client gets a response back from
the HTTP server. The web user types a web request into the web browser (1). The DNS server
resolves the address and returns the IP address of the server, which in this case is the address of the
redirectional server (2). When the request arrives at the redirectional server (3), it is examined and
forwarded to the appropriate HTTP server (4,5). The HTTP server processes the request and replies to
the web client (6).
The authors implement load balancing by having each HTTP server report its load periodically to a
load monitoring coordinator. If the load on a particular server exceeds a certain threshold, the load
balancing procedure is triggered. Some portions of the content on the overloaded server are then
moved to another server with lower load. Next, the redirection information is updated in all
redirection servers to reflect the data move.
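The threshold-triggered rebalancing loop described above can be sketched as follows. The threshold value, the data structures, and the update_mapping call are assumptions standing in for the mechanisms of the actual prototype.

LOAD_THRESHOLD = 0.8             # assumed utilization level that triggers rebalancing

def rebalance(server_loads, content_map, redirection_servers):
    """server_loads: {server: load}; content_map: {server: [documents]} (illustrative structures)."""
    overloaded = [s for s, load in server_loads.items() if load > LOAD_THRESHOLD]
    for server in overloaded:
        target = min(server_loads, key=server_loads.get)    # server with the lowest reported load
        if target == server or not content_map[server]:
            continue
        document = content_map[server].pop()                # move a portion of the content
        content_map[target].append(document)
        for redirector in redirection_servers:              # keep every redirection server consistent
            redirector.update_mapping(document, target)     # hypothetical update call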
The authors implemented a prototype of the redirection-based server architecture using one
redirectional server and three HTTP servers. Measurements using the WebStone [86] benchmark
demonstrate that the throughput scales up with the number of machines added. Measurements of
connection times to various sites on the Internet indicate that the additional connection to the redirection server accounts for approximately a 20% increase in latency [74].
This architecture is implemented using COTS hardware and server software. Web clients see a single
logical web server without knowing the actual location of the data, or the number of current servers
providing the service. The administrator of the system partitions the document store among the available cluster nodes; however, this is a tedious, manual process. The architecture does not provide dynamic load balancing; rather, it requires the intervention of the system administrator to move data to different servers and to update the redirection rules manually.
One important characteristic of the implementation is the size of the mapping table. The HTTP server
stores the redirection information in a table that is created when the server is started and stored in
main memory. This mapping table is searched on every access to the redirection server. If the table grows too large, it increases the lookup time on the redirection server.
The architecture assumes that all HTTP servers have disk storage, which is not very realistic as many
real deployments take advantage of diskless nodes and network storage. The maintenance and update
of all copies of the data is difficult. In addition, web requests require an additional connection
between the redirection server and the HTTP server.
Other drawbacks of the architecture include the lack of redundancy at the main redirectional server.
The authors did not focus on incorporating high availability capabilities within the architecture. In
addition, since the architecture assumes a single redirectional server, there was no effort to investigate
a single IP interface to hide all the redirectional servers. As a result, the redirectional server poses a
SPOF and limits the performance and scalability of the architecture. Furthermore, the authors did not
investigate the scaling limitations of the architecture. Overall, the architecture offers only a limited level of scalability.
We classify the main inputs from this project into four essential points. First, the research provided us
with a confirmation that a distributed architecture is the right way to proceed forward. A distributed
architecture allows us to add more servers to handle the increase in traffic in a transparent fashion.
The second input is the concept of specialization. Although very limited in this study, node
specialization can be beneficial where different nodes within the same cluster handle different traffic
depending on the application running on the cluster nodes. The third input to our work relates to load
balancing and moving data between servers. The redirectional architecture achieves load balancing by
manually moving data to different servers, and then updating the redirection information stored on the
redirectional server. This load balancing scheme is an interesting concept for small configurations;
however, it is not practical for large web clusters and we do not consider this approach for our
architecture. The fourth input to our work is the need for a dynamic traffic distribution mechanism
that is efficient and lightweight.
2.10.2 Scalable Policies for Scalable Web Clusters
Figure 18: The web farm architecture with the dispatcher as the central component
Figure 18 presents the architecture of the web cluster with n servers connected to the same local
network and providing service to incoming requests. The dispatcher server connects to the same
network as the cluster servers, provides an entry point to the web cluster, and retains transparency of
the distributed architecture for the users [84]. The dispatcher receives the incoming HTTP requests and distributes them to the back-end cluster servers.
Although web clusters consist of several servers, all servers use one site hostname to provide a single interface to all users. Moreover, to have a mechanism that controls the totality of the requests reaching the site and to mask the service distribution among multiple back-end servers, the web server farm provides a single virtual IP address that corresponds to the address of the front-end server(s). This entity is the dispatcher, which acts as a centralized global scheduler that receives incoming requests and routes them among the back-end servers of the web cluster. To distribute the load among the web servers, the dispatcher identifies each server in the web cluster uniquely through a private address.
The researchers argue that the dispatcher cannot use highly sophisticated algorithms for traffic distribution because it has to make fast decisions for hundreds of requests per second. Static algorithms are the fastest solution because they do not rely on the current state of the system at the time of making the distribution decision. Dynamic distribution algorithms have the potential to outperform static algorithms by using some state information to help dispatching decisions. However, they require a mechanism that collects, transmits, and analyzes that information, thereby incurring overhead.
The research project considered three scheduling policies that the dispatcher can execute [84]:
random (RAN), round robin (RR) and weighted round robin (WRR). The project does not consider
sophisticated traffic distribution algorithms to prevent the dispatcher from becoming the primary
bottleneck of the web farm.
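The three policies considered, RAN, RR, and WRR, reduce to a very small selection step. The sketch below is our own illustration of that step and does not reproduce the project's implementation.

import itertools
import random

def make_random_policy(servers):
    return lambda: random.choice(servers)                   # RAN: uniform random selection

def make_round_robin_policy(servers):
    cycle = itertools.cycle(servers)
    return lambda: next(cycle)                              # RR: strict circular assignment

def make_weighted_round_robin_policy(server_weights):
    # WRR: each server appears in the cycle in proportion to its statically assigned weight.
    expanded = [server for server, weight in server_weights.items() for _ in range(weight)]
    cycle = itertools.cycle(expanded)
    return lambda: next(cycle)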
Based on modeling simulations, the project observed that bursts of arrivals and skewed service times alone do not motivate the use of sophisticated global scheduling algorithms. Instead, an important feature to consider for the choice of the dispatching algorithm is the type of services provided by the web site. If the dispatcher mechanism has full control over client requests and clients require HTML pages or submit light queries to a database, system scalability is achieved even without sophisticated scheduling algorithms. In these instances, straightforward static policies are as effective as their more complex dynamic counterparts. Scheduling based on dynamic state information appears to be necessary only for sites where the majority of client requests are three or more orders of magnitude more expensive to serve than a static HTML page with some embedded objects.
The project observes that for web sites characterized by a large percentage of static information, a static dispatching policy such as round robin provides satisfactory performance and load balancing. Their interpretation of this result is that a light-to-medium load is implicitly balanced by the fully controlled circular assignment among the server nodes that is guaranteed by the dispatcher of the web
farm. When the workload characteristics change significantly, so that very long services dominate,
the system requires dynamic routing algorithms such as WRR to achieve a uniform distribution of the
workload and a more scalable web site. However, in high traffic web sites, dynamic policies become
a necessity.
The researchers did not prototype the architecture as a real system or run benchmarking tests on it to validate its performance, scalability, and high availability. In addition, the project did not design or
prototype new traffic distribution algorithms for web servers; instead, it relied on existing distribution
algorithms such as the DNS routing and RAN, RR, and WRR distribution. The architecture presents
several single points of failure. In the event of the dispatcher failure, the cluster becomes unreachable.
Furthermore, if a cluster node becomes unavailable, there is no mechanism in place to notify the dispatcher of the failure of individual nodes. Moreover, the dispatcher presents a bottleneck to the cluster under heavy traffic load.
The main input from this project is that dynamic routing algorithms are a core technology to achieve a uniform distribution of the workload and to reach a scalable web cluster. The key is the simplicity of the dynamic scheduling algorithms.
2.10.3 The Scalable Web Server (SWEB)
[Figure 19: The SWEB architecture. The DNS routes HTTP requests from users on the Internet to the SWEB processors; each processor runs a scheduler, a load information module, and httpd with its disk, connected over an internal network.]
Figure 19 illustrates the SWEB architecture. The DNS routes the user requests to the SWEB
processors using round robin distribution. The DNS assigns the requests without consulting the
dynamically changing system load information. Each processor in the SWEB architecture contains a
scheduler, and the SWEB processors collaborate with each other to exchange system load
information. After the DNS sends a request to a processor, the scheduler on that processor decides whether to process this request or assign it to another SWEB processor. The architecture uses URL
redirection to achieve re-assignment. The SWEB architecture does not allow SWEB servers to
redirect HTTP requests more than once to avoid the ping-pong effect.
[Figure 20: Functional structure of the SWEB scheduler. A broker accepts each request and either handles it locally or reroutes it to the chosen server, consulting an oracle module that characterizes requests and a loadd module that manages distributed load information alongside httpd.]
Figure 20 illustrates the functional structure of the SWEB scheduler. The SWEB scheduler contains an HTTP daemon based on the source code of the NCSA HTTP server [76] for handling HTTP requests, in addition to the broker module that determines the best possible processor to handle a given request.
The broker consults with two other modules, the oracle module and the loadd module. The oracle
module is a miniature expert system, which uses a user-supplied table that characterizes the processor
and disk demands for a particular task. The loadd module is responsible for updating the system processor, network, and disk load information periodically (every 2 to 3 seconds), and for marking as unavailable the processors that have not responded within the time limit. When a processor leaves or joins the resource pool, the loadd module is aware of the change as long as the processor is in the original list of processors set up by the administrator of the SWEB system.
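The broker decision of Figure 20 can be paraphrased in the sketch below. The oracle and loadd interfaces, the request attributes, and the additive cost model are simplified assumptions, and the single-redirection rule reflects the ping-pong avoidance mentioned earlier.

def choose_processor(request, me, processors, oracle, loadd):
    """Sketch of the SWEB broker decision: serve locally or redirect once to a better processor."""
    demand = oracle.estimate(request)                        # estimated processor and disk demand
    cost = {p: loadd.current_load(p) + demand
            for p in processors if loadd.is_available(p)}    # skip processors that stopped responding
    best = min(cost, key=cost.get)
    if best == me or request.already_redirected:             # redirect at most once to avoid ping-pong
        return ("serve_locally", me)
    return ("redirect", best)                                # answered with a URL redirection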
The SWEB architecture investigates several concepts. It supports a limited flavor of dynamism while
monitoring the processor and disk usage on processors. The loadd module collects processor and disk usage information and feeds this information back to the broker to make better distribution decisions.
The drawback of this mechanism is that it does not report available memory as part of the metrics,
which is as important as the processor information; instead it reports local disk information for an
architecture that relies on a network file system for storage.
The SWEB architecture does not provide high availability features, making it vulnerable to single
points of failures. The oracle module expects as input from the administrator a list of processors in
the SWEB system and the processor and disk demands for a particular task. It is not able to collect
this information automatically. As a result, the administrator of the cluster must intervene every time a processor is added or removed.
The SWEB implementation modified the source code of the web server and created two additional
software modules [78]. The implementation is not flexible and does not allow the use of those modules outside the SWEB-specific architecture.
The researchers have benchmarked the SWEB architecture built using a maximum of four processors
with an in-house benchmarking tool, not using a standardized tool such as WebBench with a
standardized workload. The results of the tests demonstrate a maximum of 76 requests per second for a 1 KB request size, and 11 requests per second for a 1.5 MB request size, which ranks low compared to our initial benchmarking results (Section 3.7).
The project contributes to our work by providing a how-to on actively monitoring processor usage, I/O channels, and network load. This information allows us to distribute HTTP requests effectively across cluster nodes. Furthermore, the concept of a web cluster without master nodes,
and having the cluster nodes provide the services master nodes usually provide, is a very interesting
concept.
2.10.4 LSMAC and LSNAT
Figure 21 presents the LSMAC approach [80]. A client sends an HTTP packet (1) with A as the
destination IP address. The immediate router sends the packet to the dispatcher at IP address A (2).
Based on the load sharing algorithm and the session table, the dispatcher decides that this packet
should be handled by the back-end server, Server 2, and sends the packet to Server 2 by changing the
MAC address and forwarding it (3). Server 2 accepts the packet and replies directly to the client (4).
[Figures 21 and 22: The LSMAC and LSNAT approaches. HTTP requests with destination IP address A pass through the router to the dispatcher and are forwarded to the selected back-end server.]
Figure 22 illustrates the LSNAT approach [73][80]. The LSNAT implementation follows RFC 2391
[48]. A client (1) sends an HTTP packet with A as the destination IP address. The immediate router
sends the packet to the dispatcher (2) on A, since the dispatcher machine is assigned the IP address A.
Based on the load sharing algorithm and the session table, the dispatcher decides that this packet
should be handled by the back-end server, Server 2. It then rewrites the destination IP address as that of Server 2, recalculates the IP and TCP checksums, and sends the packet to Server 2 (3). Server 2 accepts the packet (4) and replies to the client via the dispatcher, which the back-end servers see as a gateway. The dispatcher rewrites the source IP address of the reply packet as A, recalculates the IP and TCP checksums, and sends the packet to the client (5).
The dispatcher in both approaches, LSMAC and LSNAT, is not highly available and presents a SPOF
that can lead to service discontinuity. The work did not focus on providing high availability capabilities; therefore, in the event of a node failure, the failed node continues to receive traffic. Moreover,
the architecture does not support scaling the number of servers. The largest setup tested was a cluster
that consists of four nodes. The authors did not demonstrate the scaling capabilities of the proposed
architecture beyond four nodes [73][80]. The performance measurements were performed using the
benchmarking tool WebStone [86]. The LSMAC implementation running on a four-node cluster averaged 425 transactions per second per traffic node. The LSNAT implementation running on a four-node cluster averaged 200 transactions per second per traffic node [79].
Furthermore, the architecture does not provide adaptive optimized distribution. The dispatcher does
not take into consideration the load of the traffic nodes nor their heterogeneous nature to optimize its
traffic distribution. It assumes that all the nodes have the same hardware characteristics such as the
same processor speed and memory capacity.
2.10.5 Harvard Array of Clustered Computers (HACC)
[Figures 23 and 24: Distribution of the document store across cluster nodes. In a conventional cluster every node serves documents a, B, and C, while with the HACC smart router each node is responsible for only one part of the document store.]
Figure 24 illustrates the concept of the HACC smart router. Instead of being responsible for the entire
working set, each node in the cluster is responsible for only a fraction of the document store. The size
of the working set of each node decreases each time we add a node to the cluster, resulting in a more
efficient use of resources per node. The smart router uses an adaptive scheme to tune the load
presented to each node in the cluster based on that node’s capacity, so that it can assign each node a
fair share of the load. The idea of HACC bears some resemblance to the affinity based scheduling
schemes for shared memory multiprocessor systems [88][89], which schedule a task on a processor
where relevant data already resides.
2.10.5.1 HACC Implementation
The main challenge in realizing the potential of the HACC design is building the Smart Router, and
within the Smart Router, designing the adaptive algorithms that direct requests at the cluster nodes
based on the locality properties and capacity of the nodes [81].
The smart router implementation consists of two layers: the low smart router (LSR) and the high
smart router (HSR). The LSR corresponds to the low-level kernel resident part of the system and the
HSR implements the high-level user-mode brain of the system. The authors conceived this
partitioning to create a separation of mechanism and policy, with the mechanism implemented in the
LSR and the policy implemented in the HSR.
The Low Smart Router: The LSR encapsulates the networking functionality. It is responsible for
TCP/IP connection setup and termination, for forwarding requests to cluster nodes, and for forwarding the results back to clients. The LSR listens on the web server port for a connection request. When the LSR receives a connection request, TCP passes a buffer to the LSR containing the HTTP request. The
HSR extracts and copies the URL from the request. The LSR queues all data from this incoming
request and waits for the HSR to indicate which cluster node should handle the request. When the
HSR identifies the node, the LSR establishes a connection with it and forwards the queued data over
this connection. The LSR continues to ferry data between the client and the cluster node serving the
request until either side closes the connection.
The High Smart Router: The HSR monitors the state of the document store, the nodes in the cluster,
and properties of the documents passing through the LSR. It uses this information to decide how to
distribute requests over the HACC cluster nodes. The HSR maintains a tree that models the structure
of the document store. Leaves in the tree represent documents and nodes represent directories. As the
HSR processes requests, it annotates the tree with information about the document store to be applied in load balancing. This information could include node assignment, document sizes, request latency
for a given document, and in general, sufficient information to make an intelligent decision about
which node in the cluster should handle the next document request. When a request for a particular
file is received for the first time, the HSR adds nodes representing the file and newly reached
directories to its model of the document store, initializing the file’s node with its server assignment.
In the current prototype, incoming new documents are assigned to the least loaded server node. After the first request for a document, subsequent requests go to the same server and thus improve the locality of reference.
Dynamic Load Balancing: Dynamic load balancing is implemented using Windows NT’s
performance data helper (PDH) interface [90]. The PDH interface allows collecting a machine’s
performance statistics remotely. When the smart router initializes, it spawns a performance
monitoring thread that collects performance data from each cluster node at a fixed interval. The HSR
then uses the performance data for load balancing in two ways. First, it identifies a least loaded node
and assigns new requests to it. Second, when a node becomes overloaded, the HSR tries to offload a
portion of the documents for which the overloaded node is responsible to the least loaded node.
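A compact sketch of the assignment and offloading rules just described follows. The dictionary bookkeeping, the load figures, and the offload fraction are simplified assumptions standing in for the HSR's annotated document tree and the collected performance statistics.

def assign_node(url, assignments, node_loads):
    """First request for a document goes to the least loaded node; later requests reuse the assignment."""
    if url not in assignments:
        assignments[url] = min(node_loads, key=node_loads.get)
    return assignments[url]

def offload(overloaded_node, assignments, node_loads, fraction=0.1):
    """Move a portion of an overloaded node's documents to the least loaded node."""
    target = min(node_loads, key=node_loads.get)
    documents = [url for url, node in assignments.items() if node == overloaded_node]
    for url in documents[: max(1, int(len(documents) * fraction))]:
        assignments[url] = target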
2.10.6 IBM Scalable and Highly Available Web Server
IBM Research is investigating the concept of a scalable and highly available web server that offers
web services via a Scalable Parallel (SP-2) system, a cluster of RS/6000 workstations. The goal is to
support a large number of concurrent users, high bandwidth, real time multimedia delivery, fine-
grained traffic distribution, and high availability. The server will provide support for large multimedia
files such as audio and video, real time access to video data with high access bandwidth, fine-grained
traffic distribution across nodes, as well as efficient back-end database access. The project is focusing
on providing efficient traffic distribution mechanisms and high availability features. The server
achieves traffic distribution by striping data objects across the back-end nodes and disks. It achieves
high availability by detecting node failures and reconfiguring the system appropriately. However,
there is no mention of the time to detect the failure and to recover.
[Figure 25: Architecture of the IBM scalable and highly available web server. Users connect over the external network to load-balancing front-end nodes, which reach the application-specific software on the back-end nodes and their disks through a communication switch.]
Figure 25 illustrates the architecture of the web cluster. The architecture consists of a group of nodes
connected by a fast interconnect. Each node in the cluster has a local disk array attached to it. The disks of a node can either maintain a local copy of the web documents or share them among the nodes.
The nodes of the cluster are of two types: front-end (delivery) nodes and back-end (storage) nodes.
The round robin DNS is used to distribute incoming requests from the external network to the front-
end nodes, which also run httpd daemons. The logical front-end node then forwards the required
command to the back-end nodes that have the data (document), using a shared file system. Next, the back-end nodes send the results to the front-end nodes through the switch, and then the results are
transmitted to the user. The front-end nodes run the web daemons and connect to the external network. To balance the load among them, clients spread the load across the front-end nodes using RR DNS [51]. All the front-end nodes are assigned a single logical name, and the RR DNS maps the name to multiple IP addresses.
[Figure 26: TCP router-based traffic distribution. Requests from the Internet reach the TCP router nodes, which forward them through the switch to the web server nodes.]
Figure 26 illustrates another approach for achieving traffic distribution. One or more nodes of the cluster serve as TCP routers, forwarding client requests to the different front-end nodes in the cluster in round robin order. The name and IP address of the router are public, while the addresses of the other nodes in the cluster are private. If there is more than one router node, a single name is used and the round robin DNS maps the name to the multiple router nodes. The flow of the web server router (Figure 26) is as follows. When a client sends requests to the router node (1), the router node forwards (2) all packets belonging to a particular TCP connection to one of the server front-end nodes. The router can use different algorithms to select which node to route to, or use a round robin scheme. The server nodes
directly reply to the client (3) without using the router. However, the server nodes change the source
address on the packets sent back to the client to be that of the router node. The back-end nodes host
the shared file system used by the front-ends to access the data.
There are several main drawbacks preventing the architecture from achieving a scalable and highly available web cluster: limited traffic distribution performance, limited scalability, lack of high availability capabilities, the presence of several SPOF, and the lack of a dynamic feedback mechanism.
The architecture relies on round robin DNS to distribute traffic among server nodes. The scheme is
static, does not adjust based on the load of the cluster nodes, and does not accommodate the
heterogeneous nature of the cluster nodes. The authors proposed an improved traffic distribution
mechanism [83] that involves changing packet headers but still relies on round robin DNS to
distribute traffic among router server nodes. The concept was prototyped with four front-end nodes and four back-end nodes. The project did not demonstrate whether the architecture is capable of scaling beyond four traffic nodes or whether failures at the node level are detected and accommodated dynamically. The architecture does not provide features that allow service continuity. The switch as
shown in Figure 25 and Figure 26 is a SPOF. The network file system where data resides is also
vulnerable to failures and presents another SPOF. Furthermore, the architecture does not support a
dynamic feedback loop that allows the router to forward traffic depending on the capabilities of each
traffic node.
2.10.7 Discussion
The surveyed projects share some common results and conclusions. Current server architectures do not provide the scalability needed to handle large traffic volumes and a large number of web users. There seems to be a consensus in the surveyed literature on the need to design a new server architecture that is able to meet the vision for next generation Internet servers. A distributed architecture that consists of independent servers sharing the load is more appropriate than single server architectures for implementing a scalable web server.
Surveyed projects such as [63], [64], [65], [68], [72], [91], [92], [93], [94], and [95] show that using clustering technologies helps increase the performance and scalability of the web server. Clustering is the dominant technology that will help us achieve better scalability and higher performance. The focus is not on clustering as a technology; rather, the focus is on using this technology as a means to achieve scalability, high capacity, and high availability. Several of the surveyed projects, such as [68], [79], [81], and [82], focused on providing efficient traffic distribution mechanisms and high availability features that enable continuous service availability. However, looking at current benchmarking results, adding high availability and fault tolerance features negatively affects the performance and scalability of the cluster. Traffic distribution is an essential aspect of achieving a highly scalable platform. Hardware-based traffic distribution solutions are not scalable; they constitute both a performance bottleneck and a SPOF.
There is a clear need to provide a software single virtual IP layer that hides the cluster nodes and makes them transparent to end users. There is also a need for an unsophisticated design and consequently a simple implementation. An uncomplicated design allows a smooth integration of many components into a well-defined architecture. The benefits are a lightweight design and a faster and more robust system. One interesting observation is that the surveyed works, with the exception of the HACC project [81], use the Linux operating system to prototype and implement Internet and web servers.
A highly available and scalable web cluster requires a smart and efficient traffic distribution mechanism that distributes incoming traffic from the cluster interface to the least busy nodes in the cluster based on a dynamic feedback loop. The distribution mechanism and the cluster interface should constitute neither a bottleneck nor a SPOF. To scale such a cluster, we would like to have the ability to add nodes into the cluster without disruption of the provided services, and the capability to increase the number of nodes to meet traffic demands while achieving close to linear scalability.
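To make this requirement concrete, the following minimal Python sketch (the function names and the load-reporting mechanism are hypothetical illustrations, not part of any surveyed system) shows a dispatcher that keeps the load indexes reported by traffic nodes and always selects the least busy node that has reported recently:

    import time

    node_loads = {}        # traffic node address -> (load index, time of last report)
    STALE_AFTER = 5.0      # ignore nodes that have not reported for this many seconds

    def report_load(node, load_index):
        # Called whenever a traffic node reports its current load index.
        node_loads[node] = (load_index, time.time())

    def pick_node():
        # Return the least busy node with a recent report, or None if none is available.
        now = time.time()
        fresh = {n: load for n, (load, ts) in node_loads.items() if now - ts <= STALE_AFTER}
        return min(fresh, key=fresh.get) if fresh else None

    report_load("10.0.0.11", 0.35)
    report_load("10.0.0.12", 0.80)
    assert pick_node() == "10.0.0.11"   # the less busy node is selected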
Chapter 3
Preparatory Work
This chapter describes the preparatory work conducted as part of the early investigations. It describes
the prototyped web server cluster, presents the benchmarking environment, and reports the
benchmarking results.
eight processors has three SCSI disks with a capacity of 54 GB in total; the other eight processors are
diskless. All processors have access to a common storage volume provided by the master nodes.
Although we used NFS, we also experimented with the Parallel Virtual File System (PVFS) to provide a shared disk space among all nodes. PVFS supports high performance I/O over the web [91][92][105][106]. To provide high availability, we implemented redundant Ethernet connections and redundant network file systems, in addition to software RAID [104]. The LVS [23] provides a single IP interface for the cluster and provides the HTTP traffic distribution mechanism among the servers in the cluster. As for the web server software, we run Apache release 2.08 [24] and Tomcat release 3.1 [25] on the traffic nodes. Other studies and benchmarking we performed in this area include [99], [100], [103], [107], and [108].
[Figure: The prototyped cluster: the LVS virtual IP interface provides a single entry point to the cluster, two master nodes provide storage (HA NFS) and cluster services, and redundant LANs (LAN 1 and LAN 2) interconnect the nodes]
As for the local network, all cluster nodes are interconnected using redundant dedicated links. Each
node connects to two networks through two Ethernet ports over two redundant network switches.
External network access is restricted to master nodes unless traffic nodes require direct access, which
is configurable for some scenarios such as direct routing. The dotted connection indicates that the
processor connects to LAN 1. The solid connection indicates that the processor connects to LAN 2.
The components and parameters of the prototyped system include two master nodes and 12 traffic
nodes, a traffic distribution mechanism, a storage sub-system, and local and external networks. The
master nodes provide a single entry point to the system through the virtual IP address receiving
incoming web traffic and distributing it to the traffic nodes using the traffic distribution algorithm.
The master nodes provide cluster-wide services such as I/O services through NFS, DHCP, and NTP
services. They respond to incoming web traffic and distribute it among the traffic nodes. The traffic
nodes run the Apache web server and their primary responsibility is to respond to web requests. They
rely on master nodes for cluster services. We used the LVS in different configurations to distribute
incoming traffic to the traffic nodes. Section 3.5 discusses the network address translation and direct
routing methods, and demonstrates their capabilities.
The workload tree provided by WebBench contains the test files that the WebBench clients access when we execute a test suite. The WebBench workload tree is the result of studying real-world sites such as Microsoft, USA Today, and the Internet Movie Database. The tree uses multiple directories and different directory depths. It contains over 6,200 static pages and executable test files. WebBench provides static (STATIC.TST) and dynamic (WBSSL.TST) test suites. The static suites use HTML and GIF files. The dynamic suites use applications that run on the server.
WebBench keeps all the transaction information at run time and uses this information to compute the final metrics presented when the tests are completed. The standard test suites of WebBench begin with one client and add clients incrementally until they reach a maximum of 60 clients per client machine. WebBench provides numerous standardized test suites. For our testing purposes, we executed a mix of the STATIC.TST (90%) and WBSSL.TST (10%) tests. Each run of this combination of test suites takes on average two and a half hours.
3.4 Web Server Performance
In considering the performance of a web server, we should pay special regard to its software, operating system, and hardware environment, because each of these factors can dramatically influence the results. In a distributed web server, this environment is complicated further by the presence of multiple components, which require connection handoffs, process activations, and request dispatching. A complete performance evaluation of all layers and components of a distributed web server system is very complex, if not impossible. Hence, a benchmarking study needs to define its goals and scope clearly. In our case, the goal is to evaluate the end-to-end performance of a web cluster. Our main interest is not in the hardware and operating system, which in many cases are given.
Web server performance refers to the efficiency of a server when responding to user requests according to defined benchmarks. Many factors affect a server's performance, such as application design and construction, database connectivity, network capacity and bandwidth, and hardware server resources. In addition, the number of concurrent connections to the web server has a direct impact on
its performance. Therefore, the performance objectives include two dimensions: the speed of a single
user's transaction and the amount of performance degradation related to the increasing number of
concurrent connections.
Metric name: Description
Throughput: The rate at which data is sent through the network, expressed in Kbytes per second (KB/s)
Connection rate: The number of connections per second
Request rate: The number of client requests per second
Reply rate: The number of server responses per second
Error rate: The percentage of errors of a given type
DNS lookup time: The time to translate the hostname into the IP address
Connect time: The time interval between sending the initial SYN and the last byte of a client request, and the receipt of the first byte of the corresponding response
Latency time: In a network, latency, a synonym for delay, is an expression of how much time it takes for a packet of data to get from one designated point to another
Transfer time: The time interval between the receipt of the first response byte and the last response byte
Web object response time: The sum of the latency time and the transfer time
Web page response time: The sum of the web object response times pertaining to a single web page, plus the connect time
Session time: The sum of all web page response times and user think time in a user session
Table 5 presents the common metrics for web system performance. In reporting the results of the
benchmarking tests, we report the connection rate, number of successful transactions per second, and
the throughput as KB/s.
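As a minimal illustration of how the timing metrics in Table 5 compose (assuming the definitions above; the function names are ours and purely illustrative), the web object and web page response times can be derived as follows:

    def web_object_response_time(latency_time, transfer_time):
        # Web object response time = latency time + transfer time
        return latency_time + transfer_time

    def web_page_response_time(object_response_times, connect_time):
        # Web page response time = sum of the object response times of the page + connect time
        return sum(object_response_times) + connect_time

    # Example: a page composed of three objects fetched over one connection (times in seconds).
    objects = [web_object_response_time(0.020, 0.015),
               web_object_response_time(0.018, 0.040),
               web_object_response_time(0.025, 0.010)]
    print(web_page_response_time(objects, connect_time=0.005))   # 0.133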
virtual server service (according to the virtual server rule table), then the scheduling algorithm, round
robin by default, chooses a real server from the cluster to serve the request, and adds the connection
into the hash table which records all established connections. The load balancer server rewrites the
destination address and the port of the packet to match those of the chosen real server, and forwards
the packet to the real server. The real server processes the request (3) and returns the reply to the load
balancer. When an incoming packet belongs to this connection and the established connection exists
in the hash table, the load balancer rewrites and forwards the packet to the chosen server. When the
reply packets come back from the real server to the load balancer, the load balancer rewrites the
source address and port of the packets (4) to those of the virtual service, and submits the response
back to the client (5). The LVS removes the connection record from the hash table when the connection terminates or times out.
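The following Python sketch is a conceptual model of the NAT forwarding logic described above (simplified packet dictionaries rather than real kernel structures; the addresses are examples): a new connection is scheduled round robin, requests are rewritten toward the chosen real server, and replies are rewritten to carry the virtual service address.

    import itertools

    VIRTUAL_IP, VIRTUAL_PORT = "192.168.1.100", 80
    real_servers = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
    scheduler = itertools.cycle(real_servers)   # round robin, the default scheduling algorithm
    connections = {}                            # hash table: (client ip, client port) -> real server

    def forward_request(packet):
        # Rewrite the destination of a client packet and forward it to a real server.
        key = (packet["src_ip"], packet["src_port"])
        if key not in connections:              # new connection: choose a real server
            connections[key] = next(scheduler)
        packet["dst_ip"], packet["dst_port"] = connections[key], 80
        return packet                           # sent on to the chosen real server

    def rewrite_reply(packet):
        # Rewrite the source of a reply so that it appears to come from the virtual service.
        packet["src_ip"], packet["src_port"] = VIRTUAL_IP, VIRTUAL_PORT
        return packet                           # sent back to the client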
[Figure: The LVS NAT method: (1) client requests arrive from the Internet/intranet at the load balancer (a Linux box), (2) the load balancer schedules a real server and rewrites the packets, (3) the real server processes the request, (4) the load balancer rewrites the replies, and (5) the replies are returned to the client]
When a user accesses a virtual service provided by the server cluster (1), the packet destined for the virtual IP address arrives at the load balancer. The load balancer examines (2) the packet's destination address and port. If it matches a virtual service, the scheduling algorithm chooses a real server (3) from the cluster to serve the request and adds the connection into the hash table that records connections. Next, the load balancer forwards the request to the chosen real server. If new incoming packets belong to this connection and the chosen server is available in the hash table, the load balancer directly routes the packets to the real server. When the real server receives the forwarded packet, the server finds that the packet is for the address on its alias interface or for a local socket, so it processes the request (4) and returns the result directly to the user (5). The LVS removes the connection record from the hash table when the connection terminates or times out.
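A corresponding sketch of the direct routing idea (again conceptual; the MAC addresses and helper names are invented for illustration): the director rewrites only the link-layer destination, and the real server accepts the packet because the virtual IP address is configured on a local alias interface, replying directly to the client.

    VIRTUAL_IP = "192.168.1.100"
    real_server_macs = {"10.0.0.11": "00:11:22:33:44:01",
                        "10.0.0.12": "00:11:22:33:44:02"}

    def director_forward(frame, chosen_server):
        # The IP packet is left untouched (its destination is still the virtual IP);
        # only the Ethernet destination is rewritten to the chosen real server.
        frame["eth_dst"] = real_server_macs[chosen_server]
        return frame                      # handed to the internal network; no NAT rewriting

    def real_server_accepts(packet, alias_ips=(VIRTUAL_IP,)):
        # The real server has the virtual IP configured on an alias (non-ARP) interface,
        # so it treats the packet as local, serves the request, and replies directly
        # to the client using the virtual IP as the source address.
        return packet["dst_ip"] in alias_ips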
[Figure: The LVS direct routing method: (1) requests arrive from the Internet/intranet at the Linux director over the virtual IP address, (2) the director examines the packet destination, (3) forwards the request to a real server over the internal network, and (4) the real server processes the request and replies directly to the user]
The main advantage of using tunneling is that the real servers (i.e. the traffic nodes) can be on a different network. We did not experiment with the IP tunneling method because of the unstable status of its implementation, and because it does not provide additional capabilities over the DR method. However, we present it for completeness.
[Figure: Requests per second versus the number of WebBench clients (1 to 60 clients) for the LVS traffic distribution benchmarking tests]
In both tests, the bottleneck occurs at the load balancer node that was unable to accept more traffic
and distribute it to the traffic servers. Instead, the LVS director was rejecting incoming connections
resulting in unsuccessful requests. This test demonstrates that the DR approach is more efficient than
the NAT approach and allows better performance and scalability. In addition, it demonstrates the
bottleneck at the director level of the LVS.
[Figure 32: Benchmarking results of the Apache web server running on a single processor (requests per second versus number of clients)]
[Figure 33: Apache reaching a peak throughput of 5,903 KB/s before the Ethernet driver crashes]
Apache served 1,053 requests per second before it suddenly stopped servicing incoming requests; in fact, as far as the WebBench tool could tell, Apache had crashed (Figure 32). We initially thought that the Apache server had crashed under heavy load. However, that was not the case. The Apache server process was still running when we logged locally into the machine. It turned out that the Ethernet device driver had crashed, which caused the processor to disconnect from the network and become unreachable. Figure 33 illustrates the throughput achieved on one processor (5,903 KB/s) before the processor disconnects from the network due to the device driver crash. We investigated the device driver problem, fixed it, and made the updated source code publicly available. We did not face the driver crash problem in further testing. Figure 34 presents the benchmarking result of Apache on a single processor after fixing the device driver problem. Apache served an average of 1,043 requests per second.
Figure 34: Benchmarking results of Apache on one processor – post Ethernet driver update
Next, we set up the cluster in several configurations with two, four, six, eight, 10, and 12 processors and performed the benchmarking tests. In these benchmarks, the LVS forwarded the HTTP traffic to the traffic nodes following the DR distribution method.
Figure 35 presents the results of the benchmarking test we performed on a cluster with two processors
running Apache. The average number of requests per second per processor is 945.
[Figure 35: Results of a two-processor cluster running Apache (requests per second versus number of clients)]
Figure 36 presents the results of the benchmarking test we performed on a cluster with four
processors running Apache. The average number of requests per second per processor is 1,003.
[Figure 36: Results of a four-processor cluster running Apache (requests per second versus number of clients)]
Figure 37 presents the results of the benchmarking test we performed on a cluster with eight
processors running Apache. The average number of requests per second per processor is 892.
Table 6 presents the results of Apache benchmarking for all the cluster configurations including the
single standalone node.
Processors in the cluster    Maximum requests per second    Transactions per second per processor
1                            1053                           1053
2                            1890                           945
4                            4012                           1003
6                            5847                           974
8                            7140                           892
10                           7640                           764
12                           8230                           685
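To make the scalability trend in Table 6 explicit, the following short Python computation derives the per-processor rate and the efficiency relative to the single-processor baseline (the values are taken directly from the table):

    results = {1: 1053, 2: 1890, 4: 4012, 6: 5847, 8: 7140, 10: 7640, 12: 8230}
    baseline = results[1]                       # single-processor rate in requests per second

    for processors, max_rps in results.items():
        per_processor = max_rps / processors
        efficiency = per_processor / baseline   # 1.0 would be perfectly linear scaling
        print(f"{processors:2d} processors: {per_processor:6.0f} req/s per processor "
              f"({efficiency:.0%} of the baseline)")
    # The 12-processor configuration delivers about 686 req/s per processor,
    # roughly 65% of the single-processor baseline.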
Figure 38: Results of Tomcat running on two processors (requests per second)
Figure 39 presents the results of the benchmarking test we performed on a system with four processors running Tomcat. The average number of requests per second per processor is 75. Figure 40 presents the results of the benchmarking test we performed on a system with eight processors running Tomcat. The average number of requests per second per processor is 71.
[Figure 39: Results of a four-processor cluster running Tomcat (requests per second versus number of clients)]
[Figure 40: Results of an eight-processor cluster running Tomcat (requests per second versus number of clients)]
Table 7 presents the results of testing the prototyped cluster running the Tomcat application server. For each cluster configuration, we present the number of processors in the cluster, the maximum performance achieved by the cluster in terms of requests per second, and the average number of requests per second per cluster processor.
Number of processors in the cluster    Cluster maximum requests per second    Transactions per second per processor
1                                      81                                     81
2                                      152                                    76
4                                      300                                    75
6                                      438                                    73
8                                      568                                    71
10                                     700                                    70
12                                     804                                    67
Figure 41: Scalability chart for clusters consisting of up to 12 nodes running Apache
Figure 42: Scalability chart for clusters consisting of up to 12 nodes running Tomcat
3.10 Discussion
Web servers have a limited capacity in serving incoming requests. In the case of Apache, the capacity
limit is around 1,000 requests per second when running on a single processor. Beyond this threshold,
the server starts rejecting incoming requests.
We have demonstrated non-linear scalability with clusters of up to 12 nodes running the Apache and Tomcat web servers. In the case of Apache, for instance, when we scale the cluster from a single processor to 12 processors, the number of successful requests per second per processor drops from 1,053 to 685, a decrease of 35%. These results reveal major performance degradation. Theoretically, as we add more processors into the cluster, we would like to achieve linear scalability and maintain the baseline performance of 1,000 requests per second per processor.
We experimented with the NAT and DR traffic distribution approaches. The NAT approach, although
widely used, has limited performance and scalability compared to the DR approach as demonstrated
in Section 3.5.4.
Our results demonstrate that the bottleneck occurs at the master node level, largely because of inefficiencies in the traffic distribution mechanism. We plan to propose an enhanced method based on the DR approach for our highly available and scalable architecture. The planned improvements include a daemon running on all traffic nodes that reports the node load to the distribution mechanism, allowing it to perform a more efficient and dynamic distribution.
We experienced some problems with the Ethernet device drivers crashing under high traffic load (a throughput of 5,903 KB/s – Figure 33). We improved the device driver code and, as a result, it is now able to sustain a higher throughput. We contributed the improvements back to the Ethernet card provider (ZNYX Networks) and to the open source community.
As far as the benchmarking tests are concerned, it would have helped to include other metrics, such as processor utilization as well as file system and disk performance metrics, to provide more insight on bottlenecks. These metrics are not available in WebBench, and therefore the only way to obtain them is to either implement a separate tool ourselves or use an existing tool.
We acknowledge that the performance of the network file system has certain effects on the total
performance of the system since the network file system hosts the web documents shared between all
traffic nodes. However, the performance of the file system is out of our scope.
3.11 Contributions of the Preparatory Work
We examined current ways of solving scalability challenges and demonstrated scalability problems in
a real system through building a web cluster and benchmarking it for performance and scalability.
Many factors differentiate our early experimental work from others. We did not rely on simulation models to define our system and benchmark it. Instead, we followed a systematic approach, building the web cluster using existing system components and following best practices. In many instances, we contributed system software, such as Ethernet and NFS redundancy, and introduced enhancements to existing implementations. Similarly, we built our benchmarking environment from
scratch and we did not rely on simulation models to get performance and scalability results. This
approach gave us much flexibility and allowed us to test many different configurations in a real world
setting. One unique aspect of our experiments, which we did not see in related work (Sections 2.9 and
2.10), is the scale of the benchmarking environment and the tests we conducted. Our early prototyped
cluster consisted of 12 processors and our benchmarking environment consisted of 17 machines.
Surveyed projects were limited in their resources and were not able to demonstrate the negative scalability effects we experienced when reaching 12 processors, simply because they only tested with up to eight processors. In addition, other works relied on simulation to get a sense of how their architecture would perform. In contrast, we performed our benchmarking using an industry-standardized tool and workload. The benchmarking tool, WebBench, uses standardized workloads and is capable of generating more traffic and compiling more test results and metrics than the other available tools used in the surveyed work, such as SURGE [109], S-Clients [110], WebStone [111], SPECweb99 [112], and TPC-W [113]. A paper comparing these benchmarking tools is available from [114]. Furthermore, the entire server environment uses open source technologies, allowing us access to the source code and granting the freedom to modify it and introduce changes to suit our needs.
Chapter 4
The Architecture of the Highly Available and Scalable Web Server
Cluster
This chapter describes the architecture of the highly available and scalable (HAS) web server cluster. The chapter reviews the initial requirements discussed in previous chapters, and then presents the HAS cluster architecture, its tiers, and its characteristics. It discusses the architecture components, presents how they interact with each other, illustrates the supported redundancy models, and discusses the various types of cluster nodes and their characteristics. Furthermore, the chapter includes examples of sample deployments of the HAS architecture as well as case studies that demonstrate how the architecture scales to support increased traffic. The chapter also addresses the subject of fault tolerance and the high availability features. It discusses the traffic management scheme responsible for dynamic traffic distribution and discusses the cluster virtual IP interface, which presents the HAS architecture as a single entity to the outside world. The chapter concludes with the scenario view of the HAS architecture and examines several use cases.
4.1.1 Scalability
The expansion of the user base requires scaling the capacity of the infrastructure to be in line with
demand. However, with a single server, this means upgrading the server vertically by adding more
memory, or replacing the processor(s) with a faster one. Each upgrade brings the service down for the
duration of the upgrade, which is not desirable. In some cases, the upgrade involves a full deployment
on entirely new hardware resulting in extended downtime. Clustering an application allows us to scale
it horizontally by adding more servers into the service cluster without necessitating service downtime.
However, even when utilizing clustering techniques to scale, the performance gain is not linear.
4.1.2 Minimal Response Time
Achieving a minimal response time is a crucial factor for the success of distributed web servers.
Many studies argue that 0.1 second is about the limit for having the user feel that the system is
reacting instantaneously [13][14][15]. Web users expect the system to process their requests and to
provide responses quickly and with high data access rates. Therefore, the server needs to minimize
response times to live up to the expectations of the users. The response time consists of the connect time, the process time, and the response transit time. The goal is to minimize all three of these parameters, resulting in a faster total response time.
monitoring the availability of the web server application running on the traffic nodes and ensuring
that the application is up and running. Otherwise, the master nodes will forward traffic to a node that
has an unresponsive web server application. Section 4.20 discusses this functionality.
without performance degradation and while maintaining the level of performance for up to 16
processors (Chapter 5).
Figure 43 illustrates the conceptual model of the architecture showing the three tiers of the
architecture and the software components inside each tier. In addition, it shows the supported
redundancy models per tier. For instance, the high availability (HA) tier supports the 1+1 redundancy
model (active/active and active/standby) and can be expanded to support the N+M redundancy model,
where N nodes are active and M nodes are standby. Similarly, the scalability and service availability
(SSA) tier supports the N-way redundancy model where all traffic nodes are active and servicing
requests. This tier can be expanded as well to support the N+M redundancy model. Sections 4.9, 4.10,
and 4.11 discuss the supported redundancy models.
The HAS architecture is composed of three logical tiers: the high availability (HA) tier, the scalability
and service availability (SSA) tier, and the storage tier. This section presents the architecture tiers at a
high level.
The high availability tier: This tier consists of front-end systems called master nodes. Master nodes
provide an entry-point to the cluster acting as dispatchers, and provide cluster services for all HAS
cluster nodes. They forward incoming web traffic to the traffic nodes in the SSA tier according to the
scheduling algorithm. Section 4.5.1 covers the characteristics of this tier. Section 4.17.1 presents the
characteristics of the master nodes. Section 4.9 discusses the supported redundancy models of the HA
tier.
The scalability and service availability tier: This tier consists of traffic nodes that run application
servers. In the event that all servers are overloaded, the cluster administrator can add more nodes to
this tier to handle the increased workload. As the number of nodes increases in this tier, the cluster
throughput increases and the cluster is able to respond to more traffic. Section 4.6.2 describes the
characteristics of this tier. Section 4.17.2 presents the characteristics of the traffic nodes. Section 4.10
discusses the supported redundancy models of the SSA tier.
The storage tier: This tier consists of nodes that provide storage services for all cluster nodes so that
web servers share the same set of content. Section 4.6.3 describes the characteristics of this tier and
Section 4.17.3 describes the characteristics of the storage nodes. Section 4.11 presents the supported
redundancy models of the storage tier. The HAS cluster prototype did not utilize specialized storage
nodes. Instead, it utilized a contributed extension to the NFS to support HA storage.
[Figure 43: The conceptual model of the HAS architecture, showing the High Availability (HA) tier, the Scalability and Service Availability (SSA) tier, and the optional Storage tier. Legend: CCP1/CCP2 = Cluster Communication Paths 1 and 2 (LAN 1 and LAN 2), VIP = cluster virtual IP layer, DHCP = Dynamic Host Configuration Protocol, RAD = IPv6 router advertisement daemon, EthD = Ethernet redundancy daemon, RCM = redundancy configuration manager, NTP = Network Time Protocol, TM = traffic manager, NFS = network file server, TCD = traffic client daemon, HBD = heartbeat mechanism, CCM = cluster configuration manager, LDirectorD = Linux director daemon]
4.4 HAS Architecture Components
Each of the HAS architecture tiers consists of several nodes, and each node runs specific software components. A software component (or system software) is a stand-alone set of code that provides
service either to users or to other system software. A component can be internal to the cluster and
represents a set of resources contained on the cluster physical nodes; a component can also be
external to the cluster and represents a set of resources that are external to the cluster physical nodes.
Components can be either software or hardware components. Core components are essential to the
operation of the cluster. On the other hand, optional components are used depending on the usage and
deployment model of the HAS cluster.
The HAS architecture is flexible and allows administrators of the cluster to add their own software
components. The following sub-sections present the components of the HAS architecture, categorize
the components as internal or external, discuss their capabilities, functions, input, output, interfaces,
and describe how they interact with each other.
as how many nodes exist or where the applications run. The CVIP allows a virtually infinite number
of clients to reach a virtually infinite number of servers presented as a single virtual IP address,
without impact on client or server applications. The CVIP operates at the IP level, enabling
applications that run on top of IP to take advantage of the transparency it provides. The CVIP
supports IPv6 and is capable of handling incoming IPv6 traffic. Section 4.21 discusses the cluster
virtual IP interface.
The traffic manager daemon (TM) is a core system component that runs on the cluster master nodes.
The traffic manager receives the load index of the traffic nodes from the traffic client daemons,
maintains the list of available traffic nodes and their load indexes, and distributes traffic to the traffic nodes based on the distribution policy defined in its configuration file. The current traffic manager implementation supports round robin and the HAS distribution policy; however, the traffic manager can support additional policies. The traffic manager supports IPv6. Section 4.23.3 discusses the
traffic manager.
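A minimal sketch of this behaviour (the class and method names are hypothetical and do not correspond to the prototype's source code) keeps the reported load indexes and selects a node either round robin or by lowest load index:

    import itertools

    class TrafficManager:
        # Conceptual sketch of the traffic manager's node table and distribution policies.
        def __init__(self, policy="round_robin"):
            self.policy = policy              # "round_robin" or "least_loaded"
            self.load_index = {}              # traffic node -> last reported load index
            self._cycle = None

        def update_load(self, node, load):
            # Record the load index reported by the traffic client daemon on a node.
            self.load_index[node] = load
            self._cycle = itertools.cycle(sorted(self.load_index))

        def remove_node(self, node):
            # Drop a node reported as unavailable so no traffic is forwarded to it.
            self.load_index.pop(node, None)
            self._cycle = itertools.cycle(sorted(self.load_index)) if self.load_index else None

        def select_node(self):
            if not self.load_index:
                return None
            if self.policy == "round_robin":
                return next(self._cycle)
            return min(self.load_index, key=self.load_index.get)   # least loaded node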
The cluster configuration manager (CCM) is system software that manages all the configuration files that control the operation of the HAS architecture software components. It provides a centralized single access point for editing and managing all the configuration files. For the purpose of this work, we did not implement the cluster configuration manager; however, it is a high priority future work item. At the time of writing, with the HAS architecture prototype, we maintain the configuration files of the various software components on the network file system.
The redundancy configuration manager (RCM) is responsible for switching the redundancy configuration of each cluster tier from one redundancy configuration to another, such as from the 1+1 active/standby to the 1+1 active/active. It is also responsible for switching service between components when the cluster tiers follow the N+M redundancy model. Therefore, it should be aware of the active nodes in the cluster and their corresponding standby nodes. For the purpose of this work, we did not implement the redundancy configuration manager. Section 6.2.4 discusses the RCM as a future work item.
The IPv6 router advertisement daemon is an optional system software component that is used only when the HAS architecture needs to support IPv6. It offers automatic IPv6 configuration of the network interfaces of all cluster server nodes. It ensures that all the HAS cluster nodes can communicate with each other and with network elements outside the HAS architecture over IPv6.
The cluster administrator has the option of using the DHCP daemon (an optional server service) to assign IPv4 addresses to the cluster server nodes.
The NTP service is a required system service used to synchronize the time on all cluster server nodes. It is essential to the operation of other software components that rely on time stamps to verify whether a node is in service. Alternatively, we can use a time synchronization service provided by an external server located on the Internet; however, this poses security risks and is not recommended.
As for storage, we provide an enhanced implementation of the Linux kernel NFS server
implementation that supports NFS redundancy and eliminates the NFS server as a SPOF. Section 4.16
discusses the storage models and the various available possibilities.
The TFTP service daemon is an optional software component used in collaboration with the DHCP
service to provide the functionalities of an image server. The image server provides an initial kernel
and ramdisk image for diskless server nodes within the HAS system. The TFTP daemon supports
IPv6 and is capable of receiving requests to download kernel and ramdisk images over IPv6.
The heartbeat service (HBD) runs on master nodes and sends heartbeat packets across the network to
the other instances of Heartbeat (running on other master nodes) as a keep-alive type message. When
the standby master node no longer receives heartbeat packets, it assumes that the active master node
is dead, and then the standby node becomes primary. The heartbeat mechanism is a contribution from
the Linux-HA project [20]. We have contributed enhancements to the heartbeat service to
accommodate for the HAS architecture requirements. Section 4.19 discusses the heartbeat service and
its integration with the HAS architecture.
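A conceptual sketch of the keep-alive logic on the standby node (the interval and threshold values are illustrative assumptions, not taken from the Linux-HA implementation):

    import time

    HEARTBEAT_INTERVAL = 1.0                    # seconds between heartbeat packets (assumed)
    DEAD_AFTER = 3 * HEARTBEAT_INTERVAL         # declare the active node dead after 3 missed beats

    last_heartbeat = time.time()

    def on_heartbeat_received():
        # Called on the standby master node whenever a heartbeat packet arrives.
        global last_heartbeat
        last_heartbeat = time.time()

    def standby_should_take_over():
        # The standby becomes primary when the active node stops sending heartbeats.
        return time.time() - last_heartbeat > DEAD_AFTER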
The Linux director daemon (LDirectord) is responsible for monitoring the availability of the web
server application running on the traffic nodes by connecting to them, making an HTTP request, and
checking the result. If the LDirectord module discovers that the web server application is not
available on a traffic node, it communicates with the traffic manager to ensure that the traffic manager
does not forward traffic to that specific traffic node. Section 4.20 presents the functionalities of the
LDirectord.
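As an illustration of this kind of application-level check (a sketch only, using Python's standard http.client; the path, port, and timeout are assumptions rather than the LDirectord configuration):

    import http.client

    def web_server_alive(node, path="/index.html", timeout=2):
        # Return True if the web server on the traffic node answers an HTTP request.
        try:
            connection = http.client.HTTPConnection(node, 80, timeout=timeout)
            connection.request("GET", path)
            response = connection.getresponse()
            connection.close()
            return response.status == 200
        except (OSError, http.client.HTTPException):
            return False

    # A monitor would call web_server_alive() periodically for every traffic node and
    # tell the traffic manager to stop forwarding traffic to any node that fails the check.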
4.5.1 The High Availability Tier
The HA tier consists of master nodes that act as a dispatcher for the SSA tier. The role of the master
node is similar to a connection manager or a dispatcher. The HA tier does not tolerate service
downtime. If the master nodes are not available, the traffic nodes in the SSA tier become unreachable
and as a result, the HAS cluster cannot accept incoming traffic. The primary functions of the nodes in
this tier are to handle incoming traffic and distribute it to traffic nodes located in the SSA tier, and to
provide cluster infrastructure services to all cluster nodes.
The HA tier consists of two nodes configured following the 1+1 active/standby redundancy model.
The architecture supports the extension redundancy model of this tier to the 1+1 active/active
redundancy model. With the 1+1 active/active redundancy model, master nodes share servicing
incoming traffic to avoid bottlenecks at the HA tier level. Another possible extension to the
redundancy model is the support of the N-way and the N+M redundancy models, which would allow the HA tier to scale the number of master nodes one at a time. However, this requires a complex implementation and is not yet supported. Section 4.8 describes the supported redundancy models.
Furthermore, the master nodes provide cluster wide services to nodes located in the SSA and the
storage tiers. The HA tier controls the activity in the SSA tier, since it forwards incoming requests to
the traffic nodes. Therefore, the HA tier needs to determine the status of traffic nodes and be able to
reliably communicate with each traffic node. The HA tier uses traffic managers to receive load
information from traffic nodes. Section 4.6.1 presents the characteristics of the HA tier. Section 4.8
discusses the redundancy models supported by this tier.
way model: all nodes are active and there are no standby nodes. As such, redundancy is at node level.
Section 4.6.2 presents the characteristics of the SSA tier. Section 4.10 discusses the redundancy
models supported by this tier.
updated kernel and the new version of the ramdisk (if the node is diskless) or a new disk image (if the node has a disk) from the image server. Section 4.27.5 presents this upgrade scenario with a sequence diagram.
- Hosting application servers: Traffic nodes run application servers that can be stateful or stateless.
If an application requires state information, then the application saves the state information on the
shared storage and makes it available to all cluster nodes.
A = (MTBF / (MTBF + MTTR)) * 100,
where A is the percentage of availability, MTBF is the mean time between failures and MTTR is the
mean time to repair or resolve a particular problem. According to the formula, we calculate
availability A as the percentage of uptime for a given period, taking into account the time it requires
for the system to recover from unplanned failures and planned upgrades. As MTTR approaches zero,
the availability percentage A increases towards 100 percent. As the MTBF value increases, MTTR
has less impact on A. Following this formula, there are two possible ways to increase availability:
increasing MTBF and decreasing MTTR. Increasing MTBF involves improving the quality or
robustness of the software and using redundancy to remove single points of failure. As for decreasing MTTR, our focus in the implementation of the system software is to streamline and accelerate fail-over, respond quickly to fault conditions, and make faults more granular in time and scope, so that we have many short faults rather than a smaller number of long ones, and so that the scope of faults is limited to smaller components.
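As a small numerical illustration of the formula (the MTBF and MTTR values below are arbitrary examples, not measurements of the prototype):

    def availability(mtbf_hours, mttr_hours):
        # A = MTBF / (MTBF + MTTR) * 100, expressed as a percentage.
        return mtbf_hours / (mtbf_hours + mttr_hours) * 100

    # A component that fails on average every 1,000 hours and takes one hour to repair
    # is about 99.9% available; cutting MTTR to 0.1 hours raises this to roughly 99.99%.
    print(availability(1000, 1.0))   # ~99.90
    print(availability(1000, 0.1))   # ~99.99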
To increase the MTBF of the HAS architecture components, we need to avoid any SPOF. The following subsections discuss eliminating SPOFs at the level of master nodes, traffic nodes, application servers, networks and network interfaces, and storage nodes.
The HAS architecture supports fault tolerance through features such as the hot-standby data
replication to enable node failure recovery, storage mirroring to enable disk fault recovery, and LAN
redundancy to enable network failure recovery. The topology of the architecture enables failure
tolerance because of the various built-in redundancies within all layers of the HAS architecture.
Figure 44 illustrates the supported redundancy at the different layers of the HAS architecture. The cluster virtual IP interface (1) provides a transparent layer that hides the internals of the cluster. We can add or remove master nodes from the cluster without interruptions to the services (2). Each cluster node has two connections to the network (3), ensuring network connectivity. Many factors contribute towards achieving network and connection availability, such as the availability of redundant routers and switches, redundant network connections, and redundant Ethernet cards. We contributed an Ethernet redundancy mechanism to ensure high availability for network connections. As for traffic nodes (4), redundancy is at the node level, allowing us to add and remove traffic nodes transparently and without service interruption. We can guarantee service availability by providing multiple instances of the application running on multiple redundant traffic nodes. The HAS architecture supports storage redundancy (5) through a customized HA implementation of the NFS server; alternatively, we can also use redundant specialized storage nodes.
[Figure 44: The redundancy supported at the different layers of the HAS architecture: (1) the cluster virtual IP interface to the outside world, (2) redundant master nodes, (3) redundant network connections, (4) redundant traffic nodes, and (5) storage redundancy, including NFS redundancy and RAID 5, provided by Storage Node A and Storage Node B]
The following sub-sections discuss eliminating SPOF at each of the HAS architecture layers.
4.8.1 The 1+1 Redundancy Model
There are two types of the 1+1 redundancy model (also called two-node redundancy model): the
active/standby, which is also called the asymmetric model, and the active/active or the symmetric
redundancy model [115]. With the 1+1 active/standby redundancy model, one cluster node is active
performing critical work, while the other node is a dedicated standby, ready to take over should the
active master node fails. In the 1+1 active/active redundancy model, both nodes are active and doing
critical work. In the event that either node should fail, the survivor node steps in to service the load of
the failed node until the first node is back to service.
[Figure: The 1+1 active/standby redundancy model: the active and standby master nodes exchange heartbeat messages and connect to the shared network storage over dual redundant data paths; the standby node's data path is physically connected but not logically in use, and clients reach the pair over the public network]
[Figure 47: The 1+1 active/standby pair after failover: the master node that was standby is now active, and its data path to the shared network storage is physically connected and in use due to the failure of the other master node]
Figure 47 illustrates the 1+1 active/standby pair after the failover has completed. The active/standby
redundancy model supports connection synchronization between the two master nodes. Section 4.22
discusses connection synchronization.
The HA tier can transition from the 1+1 active/standby to the 1+1 active/active redundancy model
through the redundancy configuration manager, which is responsible for switching from one
redundancy model to another. The 1+1 active/standby redundancy model provides high availability;
however, it requires a master node to sit idle waiting for the active node to fail so it can take over. The
active/standby model leads to a waste of resources and limits the capacity of the HA tier.
The 1+1 active/active redundancy model, discussed in the following section, addresses this problem
by allowing the two master nodes to be active and to serve incoming requests for the same virtual
service.
[Figure: The 1+1 active/active redundancy model: both master nodes are active, exchange heartbeat messages, and each node's data path to the shared network storage is physically connected and in use, providing a redundant data path for the other master node; clients reach the pair over the public network]
4.10 SSA Tier Redundancy Models
The SSA tier supports the N+M and the N-way redundancy models. In the N+M model, N is the
number of active traffic nodes hosting the active web server application, and M is the number of
standby traffic nodes. When M=0, it is the N-way redundancy model where all traffic nodes are
active. Following the N-way redundancy model, traffic nodes operate without standby nodes. Upon
the failure of an active traffic node, the traffic manager running on the master node removes the failed
traffic node from its list of available traffic nodes (Section 4.27.9) and redirects traffic to available
traffic nodes.
Figure 49 illustrates the N-way redundancy model. The state information of the web server
application running on the active nodes is saved (1) on the HA shared storage (2). When the
application running on the active node fails, the application on the standby node accesses the saved
state information on the shared storage (3) and provides continuous service.
[Figure 49: State information is saved (1) by the active process on the HA shared storage (2); after a failure, the process on another node accesses the saved state (3) from the shared storage]
In Figure 50, the state information of the application running on the active node is saved (1) on the
HA shared storage.
Figure 50: The N+M redundancy model with support for state replication
In Figure 51, when the application running on the active node fails, the application on the standby
node accesses the saved state info on the shared storage and provides a continuous service.
Figure 51: The N+M redundancy model, after the failure of an active node
Supporting applications that require maintaining state information is not within the scope of this dissertation. However, the WebBench tool provides dynamic test suites. Therefore, in our testing, we used a combination of both static (STATIC.TST) and dynamic (WBSSL.TST) test suites. The static test suite contains over 6,200 static pages and executable test files. The dynamic test suites use applications that run on the server and require maintaining state. Although supporting applications with state is not in the scope of the work, the HAS architecture still handles them.
4.11 Storage Tier Redundancy Models
Although storage is outside the scope of our work, the redundancy models of the storage tier depend
on the physical storage model described in Section 4.16.
Figure 52: the redundancy models at the physical level of the HAS architecture
Master nodes follow the 1+1 redundancy model. The HA tier hosts two master nodes that interact
with each other following the active/standby model or the active/active (load sharing) model. Traffic
nodes follow one of two redundancy models: N+M (N active and M standby) or N-way (all nodes are
active). In the N+M active/standby redundancy model, N is the number of active traffic nodes
available to service requests. We need at least two active traffic nodes, N ≥ 2. M is the number of
standby traffic nodes, available to replace an active traffic node as soon as it becomes unavailable.
The N-way redundancy model is the N+M redundancy model with M = 0. In the N-way redundancy
model, all traffic nodes are in the active mode and servicing requests with no standby traffic nodes.
When a traffic node becomes unavailable, the traffic manager stops sending traffic to the unavailable
node and redistributes incoming traffic among the remaining available traffic nodes. However, when standby nodes are available, the throughput of the cluster does not suffer from the loss of a traffic node, since a standby node takes over from the unavailable traffic node.
As for the storage tier, the redundancy model depends on various possibilities ranging from hosting
data on the master nodes to having separate and redundant nodes that are responsible for providing
storage to the cluster. The redundancy configuration manager is responsible for switching from one
redundancy model to another. For the purpose of the work, we did not implement the redundancy
configuration manager (Section 6.2.4). Rather, we relied on re-starting the cluster nodes with
modified configuration files when we wanted to experiment with a different redundancy model.
Table 8 provides a summary of the possible redundancy models. The HAS architecture allows the
support for all redundancy models and supporting them is an implementation issue.
Storage tier: 1+1 active/active, 1+1 active/standby, N+M, N-way, and no redundancy
Table 8: The possible redundancy models per each tier of the HAS architecture
Table 9 illustrates the implemented redundancy models for the HAS architecture prototype. At the
HA tier, both the 1+1 active/standby and the 1+1 active/active redundancy models are supported. At
the SSA tier, the HAS architecture supports the N-way redundancy model. The storage tier supports
the 1+1 active/active redundancy model.
HA tier: 1+1 active/active, 1+1 active/standby, and no redundancy (one master node)
SSA tier: N-way and no redundancy (one traffic node)
Storage tier: 1+1 active/active and no redundancy (one NFS server)
Table 9: The supported redundancy models per each tier in the HAS architecture prototype
[Figure 53: The state diagram of a HAS cluster node. From the Active state (accepting and servicing traffic), a node that encounters problems stops accepting new requests (In-Transition) and moves to the Out of Cluster state, where it neither serves traffic nor provides services to cluster nodes; once the problem is fixed, the node re-joins the cluster and transitions back through the In-Transition state to Active.]
Figure 54 represents the state diagram after we expand it to include the standby state, in which the node is not currently providing service but is prepared to take over the active state. This scenario is only applicable to nodes in the HA tier, which supports the 1+1 active/standby redundancy model. When the node is in the active, in-transition, or standby state and it encounters software or hardware problems, it becomes unstable and will no longer be a member of the HAS cluster. Its state becomes out-of-cluster and it is no longer available to service traffic. If the transition is from active to standby, the node stops receiving new requests and providing services, but keeps providing service to ongoing requests until their termination, when possible; otherwise, ongoing requests are terminated. The system software components that manage the state transitions are the traffic manager and the heartbeat daemon running on the master nodes, and the traffic client and the LDirectord running on the traffic nodes.
[Figure 54: The expanded state diagram of a HAS cluster node, including the Standby state; an error in the Active, Standby, or In-Transition state takes the node out of the cluster, and the In-Transition state covers both switching from standby to active and ceasing to accept new requests]
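A minimal sketch of this state machine (the state and event names follow the figures, while the dictionary encoding is our illustration, not the prototype's implementation):

    # Allowed transitions of a HAS cluster node (illustrative encoding of Figures 53 and 54).
    TRANSITIONS = {
        ("active", "problem detected"): "in-transition",        # stop accepting new requests
        ("in-transition", "left cluster"): "out-of-cluster",
        ("out-of-cluster", "problem fixed"): "in-transition",   # re-joining the cluster
        ("in-transition", "switch complete"): "active",
        ("standby", "active node failed"): "in-transition",     # switching from standby to active
        ("active", "error"): "out-of-cluster",
        ("standby", "error"): "out-of-cluster",
        ("in-transition", "error"): "out-of-cluster",
    }

    def next_state(state, event):
        # Return the next node state; an event that does not apply leaves the state unchanged.
        return TRANSITIONS.get((state, event), state)

    assert next_state("active", "error") == "out-of-cluster"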
[Figure: The HA and SSA tiers: master nodes and DFS nodes exchange heartbeat messages and synchronize data, while traffic nodes with local disks connect through LAN 1 and LAN 2 behind the cluster virtual IP]
Figure 56 illustrates the hardware configuration of an HA-OSCAR cluster [117]. The system consists
of a primary server, a standby server, two LAN connections, and multiple compute clients, where all
the compute clients have homogeneous hardware.
Figure 56: The HA-OSCAR prototype with dual active/standby head nodes
A server is responsible for serving user requests and distributing the requests to specified clients. A compute client is dedicated to computation [118]. Each server has three network interface cards: one interface card connects to the Internet through a public network address, and the other two connect to a private LAN, which consists of a primary Ethernet LAN and a standby LAN. Each LAN consists of network interface cards, a switch, and network wires, and provides communication between servers and clients, and between the primary server and the standby server. The primary server provides the services and processes all user requests. The standby server activates its services and waits to take over from the primary server when a failure is detected. Heartbeat messages are transmitted periodically across the Ethernet LAN between the two servers and act as a health check of the primary server. When a primary server failure occurs, the heartbeat detection on the standby server no longer receives any response message from the primary server. After a prescribed time, the standby server takes over the alias IP address of the primary server, and control of the cluster transfers from the primary server to the standby server. User requests are processed on the standby server from then on. From the user's point of view, the transfer is almost seamless except for the short prescribed time. The failed primary server is repaired after the standby server takes over control. Once the repair is completed, the primary server activates its services, takes over the alias IP address, and begins to process user requests. The standby server releases its alias IP address and goes back to its initial state.
At a regular interval, the running server polls all the LAN components specified in the cluster
configuration file, including the primary LAN cards, the standby LAN cards, and the switches.
Network connection failures are detected in the following manner. The standby LAN interface is assigned to be the poller. The polling interface sends packet messages to all other interfaces on the LAN and receives packets back from all other interfaces on the LAN. If an interface cannot receive or send a message, the count of packets sent and received on that interface does not increment for a certain amount of time, at which point the interface is considered down. When the primary LAN goes down, the standby LAN takes over the network connection. When the primary LAN is repaired, it takes back the connection from the standby LAN.
Whenever a client fails while the cluster is in operation, the cluster undergoes a reconfiguration to
remove the corresponding failing client node or to admit a new node into the cluster. This process is
referred to as a cluster state transition. The HA-OSCAR cluster uses a quorum voting scheme to maintain the system performance requirement, where the quorum Q is the minimum number of functioning clients required for the HPC system. Consider a system with N clients, and assume that each client is assigned one vote. The minimum number of votes required for a quorum is given by (N+2)/2 [119].
Whenever the total number of votes contributed by all the functioning clients falls below the quorum value, the system suspends operation. Upon the availability of a sufficient number of clients to satisfy the quorum, a cluster resumption process takes place and brings the system back to an operational state.
The HAS architecture requires the availability of two routers (or switches) to provide a highly
available and reliable communication path.
An image server is a machine that holds the operating system and ramdisk images of the cluster nodes. This machine (or two of them for redundancy purposes) is responsible for propagating the images over the network to the cluster nodes every time there is an upgrade or a new node joining the cluster. Master nodes can provide the functionalities of the image server; however, for large deployments this might slow down the performance of the master nodes. Image servers are external and optional components of the architecture.
[Figure: A sample HAS deployment: users reach the cluster through the cluster virtual IP, master nodes A and B distribute traffic to four traffic nodes over LAN 1 and LAN 2, two storage nodes provide shared storage, and an image server is attached to the cluster]
We divide the cluster components into the following functional units: master nodes, traffic nodes, storage nodes, local networks, external networks, network paths, and routers.
The m master nodes form the HA tier of the HAS architecture and implement the 1+1 redundancy model. The number of master nodes is m = 2. These nodes can run in active/standby or active/active mode. When m > 2, the redundancy model becomes the N+M model; however, we did not implement this redundancy model in the HAS prototype. The t traffic nodes are located in the SSA tier, where t ≥ 2. If t = 1, then the single traffic node constitutes a SPOF. Let s denote the number of storage nodes. If s = 0, then the cluster does not include specialized storage nodes; instead, the master nodes in the HA tier provide shared storage using a highly available distributed file system. The HA file system uses the disk space available on the master nodes to host application data. When s ≥ 2, at least two specialized nodes provide storage. When s ≥ 2, we introduce the notion of d shared disks, where d ≥ 2 × s; d is the total number of shared disks in the cluster. The l local networks provide connectivity between cluster nodes. For redundancy purposes, l ≥ 2 so that redundant network paths are available. However, this depends on two parameters: the number of routers r available (r ≥ 2, one router per network path) and the number of network interfaces eth available on each node (one eth interface per network path). The HAS architecture requires a minimum of two Ethernet cards per cluster node; therefore eth ≥ 2. The cluster can be connected to outside networks, identified as e, where e ≥ 1 to reflect that the cluster is connected to at least one external network.
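The following short sketch (ours; the function and parameter names are illustrative, not part of the prototype) restates these structural constraints as executable checks:

def validate_has_config(m, t, s, d, l, r, eth, e):
    """Check the structural constraints on a HAS cluster configuration:
    m master nodes, t traffic nodes, s storage nodes, d shared disks,
    l local networks, r routers, eth Ethernet interfaces per node,
    e external networks."""
    assert m == 2, "the prototype implements the 1+1 model with two master nodes"
    assert t >= 2, "a single traffic node would constitute a SPOF"
    assert s == 0 or s >= 2, "storage nodes, when present, must be redundant"
    if s >= 2:
        assert d >= 2 * s, "at least two shared disks per storage node"
    assert l >= 2 and r >= 2 and eth >= 2, "redundant network paths are required"
    assert e >= 1, "the cluster connects to at least one external network"

# Example: a configuration similar to the 18-node prototype of Chapter 5,
# where the master nodes provide the shared storage (s = 0).
validate_has_config(m=2, t=16, s=0, d=0, l=2, r=2, eth=2, e=1)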
[Figure: The HAS storage model using the local disks of the traffic nodes; the HA tier (master nodes A and B, exchanging heartbeat) and the SSA tier (traffic nodes with local disks) are connected over LAN 1 and LAN 2]
However, it is worth mentioning that other research projects (Section 2.10.5) have adopted this model as their preferred way of handling data and dividing it across multiple traffic nodes [82]. In their architectures, a traffic node receives a connection only if it stores the requested data locally.
Figure 59: The HAS storage model using a distributed file system
Figure 60 illustrates how the HAS architecture achieves NFS server redundancy. In Figure 60-A, master-a is the name of the Master Node A server, and master-b is the name of the Master Node B server. Both master nodes run the modified HA version of the network file system server. Using the modified mount program, we mount a common storage repository on both master nodes:
% mount -t nfs master-a,master-b:/mnt/CommonNFS
[Figure 60: The two master nodes (master-a as the primary NFS server, master-b as the secondary NFS server) on LAN 1 and LAN 2. (Figure 60-A) Storage view from outside the cluster: one storage repository. (Figure 60-B) Changes in content are synchronized with the rsync utility.]
When the rsync utility detects a change in the contents, it performs the synchronization to ensure that
both repositories are identical. If the NFS server on master-a becomes unavailable, data requests to
/mnt/CommonNFS will not be disturbed because the secondary NFS server on master-b is still running
and hosting the /mnt/CommonNFS network file system.
The rsync utility is open source software that provides incremental file transfer between two sets of files across a network connection, using an efficient checksum-search algorithm [120]. It brings remote files into sync by sending just the differences in the files across the network link [121].
The rsync utility can update whole directory trees and file systems, preserves symbolic links, hard
links, file ownership, permissions, devices and times, and uses pipelining of file transfers to minimize
latency costs. It uses ssh or rsh for communication, and can run in daemon mode, listening on a
socket, which is used for public file distribution.
We used the rsync utility to synchronize data on both NFS servers running on the two master nodes in
the HA tier of the HAS architecture.
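As an illustration only (the exact synchronization job of the prototype is not reproduced here; the interval and rsync options are assumptions), a periodic rsync pass from the active to the standby repository could look as follows:

import subprocess
import time

# Assumed values for illustration: the shared repository path follows the
# /mnt/CommonNFS example above, and master-b is the peer master node.
SRC = "/mnt/CommonNFS/"
DEST = "master-b:/mnt/CommonNFS/"
SYNC_INTERVAL = 30  # seconds between synchronization passes

while True:
    # -a preserves permissions, ownership, times, devices, and symbolic links;
    # --delete removes files that no longer exist in the source repository.
    subprocess.run(["rsync", "-a", "--delete", SRC, DEST], check=False)
    time.sleep(SYNC_INTERVAL)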
Figure 61: DRBD disk replication for two nodes in the 1+1 active/standby redundancy model
The DRBD utility provides intelligent resynchronization as it only resynchronizes those parts of the
device that have changed, which results in less synchronization time. It grants read-write access only
to one node at a time, which is sufficient for the usual fail-over HA cluster.
The drawback of the DRBD approach is that it does not work when we have two active nodes, because multiple writes may target the same block. If more than one node concurrently modifies the distributed devices, we face the problem of deciding which part of the device is up to date on which node, and which blocks need to be resynchronized in which direction.
[Figure: A HAS cluster with a dedicated storage tier: the HA tier (master nodes A and B), the SSA tier (traffic nodes 1 through 4), and the storage tier (specialized storage nodes 1 and 2), connected over LAN 1 and LAN 2 behind the cluster virtual IP]
Figure 63 illustrates the software and hardware stack of a master node in the HAS architecture.
[Figure 63: The master node software and hardware stack: the master node system software runs over the interconnect protocol (IPv4 and IPv6), the interconnect technology (Ethernet, TCP/IP/UDP), and the processors]
Master nodes provide an IP layer abstraction hiding all cluster nodes and provide transparency
towards the end user. Master nodes have a direct connection to external networks. They do not run
server applications; instead, they receive incoming traffic through the cluster virtual IP interface and
distribute it to the traffic nodes using the traffic manager and a dynamic distribution mechanism
(Section 4.23). Master nodes provide cluster-wide services for the traffic nodes such as DHCP server,
IPv6 router advertisement, time synchronization, image server, and network file server. Master nodes
run a redundant and synchronized copy of the DHCP server; DHCP is a communications protocol that allows network administrators to centrally manage and automate the assignment of IP addresses. The
configuration files of this service are available on the HA shared storage. The router advertisement
daemon (radvd) [123] runs on the master nodes and sends router advertisement messages to the local
Ethernet LANs periodically and when requested by a node sending a router solicitation message.
These messages are specified by RFC 2461 [124], Neighbor Discovery for IP Version 6, and are
required for IPv6 stateless autoconfiguration. The time synchronization server, running on the master
nodes, is responsible for maintaining a synchronized system time. In addition, master nodes provide
the functionalities of an image server.
When the SSA tier consists of diskless traffic nodes, there is a need for an image server to provide
operating system images, application images, and configuration files. The image server propagates
this data to each node in the cluster and solves the problem of coordinating operating system and
application patches by putting in place and enforcing policies that allow operating system and
software installation and upgrade on multiple machines in a synchronized and coordinated fashion.
Master nodes can optionally provide this service. In addition, master nodes provide shared storage via
a modified, highly available version of the network file server. We also prototyped a modified mount program to allow master nodes to mount multiple servers over the same mount point. Master nodes can optionally provide this service.
[Figure: The traffic node software and hardware stack: the TCD, LDirectord, and Ethd daemons run over the interconnect protocol (IPv4 and IPv6), the interconnect technology (Ethernet, TCP/IP/UDP), and the processors]
Traffic nodes run the Apache web server application. They reply to incoming requests forwarded to them by the traffic manager running on the master nodes. Each traffic node runs a copy of the traffic client, LDirectord, and the Ethernet redundancy daemon.
Traffic nodes rely on cluster storage to access application data and configuration files, as well as for cluster services such as DHCP, FTP, NTP, and NFS. Traffic nodes have the option to boot from the local disk (available for nodes with disks), the network (two networks for redundancy purposes), flash disk (for CompactPCI architectures), CDROM, DVDROM, or floppy. The default booting method is through the network. Traffic nodes also run the NTP client daemon, which continually keeps the system time in step with the master nodes. With the HAS prototype, we experimented with booting traffic nodes from the local disk, the network, and the flash disk, which we mostly used for troubleshooting purposes.
4.17.3 Storage Nodes
Cluster storage nodes provide storage that is accessible to all cluster nodes. Section 4.16 presents the
physical storage model of the HAS architecture.
Figure 65: The redundant LAN connections within the HAS architecture
This connectivity model ensures highly available access to the network and prevents the network from being a SPOF. The HAS architecture supports both the IPv4 and IPv6 Internet Protocols. Supporting IPv4 does not imply additional implementation considerations. However, supporting IPv6 requires a router advertisement daemon that is responsible for the automatic configuration of IPv6 Ethernet interfaces. The router advertisement daemon also acts as an IPv6 router: it sends router advertisement messages, specified by RFC 2461 [124], to a local Ethernet LAN periodically and when requested by a node sending a router solicitation message. These messages are required for IPv6 stateless autoconfiguration. As a result, in the event that we need to reconfigure the network addressing of the cluster nodes, this is achievable in a transparent fashion and without disturbing the service provided to end users.
[Figure: Master node 1 and master node 2 connected through redundant paths (1 and 2) and a router, over which they exchange heartbeat messages]
With heartbeat, master nodes are able to coordinate their role (active and standby) and track their
availability. Heartbeat discussions are presented in [20], [127], and [128].
on the Linux Director daemon (LDirectord) to monitor the health of the applications running on the
traffic nodes. Each traffic node runs a copy of the LDirectord daemon.
The LDirectord daemon performs a connect check of the services on the traffic nodes by connecting to them and making an HTTP request to the communication port where the service is running. This check ensures that it can open a connection to the web server application. When the application check fails, LDirectord connects to the traffic manager and sets the load index of that specific traffic node to zero. As a result, existing connections to the traffic node may continue; however, the traffic manager stops forwarding new connections to it. Section 4.27.12 discusses this scenario. This method is also useful for gracefully taking a traffic node offline.
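The following minimal sketch (our own illustration, not LDirectord's actual implementation) shows the idea behind such a negotiate check: open a connection to the service port, request a known page, and verify that the expected string comes back, mirroring the request and receive directives of the configuration file shown below.

import http.client

def negotiate_check(host, port=80, request="index.html",
                    receive="Home Page", timeout=10.0):
    """Return True if the web server on the traffic node answers the check
    request and the reply contains the expected string."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", "/" + request)
        body = conn.getresponse().read().decode(errors="replace")
        conn.close()
        return receive in body
    except (OSError, http.client.HTTPException):
        return False

# Example: when the check fails, the traffic node's load index is set to zero
# so that no new connections are forwarded to it.
if not negotiate_check("142.133.69.33"):
    print("check failed: set this traffic node's load index to zero")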
The LDirectord module loads its configuration from the ldirectord.cf configuration file, which
contains the configuration options. An example configuration file is presented below. It corresponds
to a virtual web server available at address 192.68.69.30 on port 80, with round robin distribution
between the two nodes: 142.133.69.33 and 142.133.69.34.
# Global Directives
checktimeout=10
checkinterval=2
autoreload=no
logfile="local0"
quiescent=yes

# Virtual Server for HTTP
virtual=192.68.69.30:80
        fallback=127.0.0.1:80
        real=142.133.69.33:80 masq
        real=142.133.69.34:80 masq
        service=http
        request="index.html"
        receive="Home Page"
        scheduler=rr
        protocol=tcp
        checktype=negotiate
Once the LDirectord module starts, the virtual server table in the kernel is populated. The listing below uses the ipvsadm command-line tool, which is used to set up, maintain, or inspect the virtual server table in the Linux kernel, to display that table. The listing shows the virtual service, with the virtual address on port 80 and the two hosts providing this virtual service.
% ipvsadm -L -n
IP Virtual Server version 1.0.7 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 192.68.69.30:80 rr
-> 142.133.69.33:80 Masq 1 0 0
-> 142.133.69.34:80 Masq 1 0 0
-> 127.0.0.1:80 Local 0 0 0
By default, the LDirectord module uses the quiescent feature to add and remove traffic nodes. When a traffic node is to be removed from the virtual service, its weight is set to zero and it remains part of the virtual service. As such, existing connections to the traffic node may continue, but the traffic node is not allocated any new connections. This mechanism is particularly useful for gracefully taking real servers offline. This behavior can be changed to remove the real server from the virtual service by setting the global configuration option quiescent=no.
network terminations. The following sub-sections present the CVIP framework and discuss the
architecture and the various concepts.
Figure 68 illustrates the level of distribution within the HAS cluster using the CVIP as the interface towards the outside networks. There are two distribution points: the network termination, which distributes packets based on IP address to the correct master (or front-end) node, and the traffic manager, which distributes network connections to the applications on the traffic nodes. Section 4.21.1.1 discusses the network termination concept.
[Figure: Distribution of IP packets from the network terminations to the processors running the Apache web server (or application) software]
[Figure: The CVIP as seen from the network terminations. All Linux processors have their own IP address. In the 1+1 active/standby model, one front-end (master) node owns the virtual IP address; in the 1+1 active/active model, each master node claims to be an IP router for the CVIP address. More front ends can be added at runtime, and the OSPF protocol is used to monitor the router links.]
[Figure: Traffic nodes running HTTPD and the TCD, reached from the Intra/Internet through the CVIP]
4.21.3.2 Scalable
The CVIP offers a unique scalability advantage. We can increase the number of network terminations, master nodes, or traffic nodes independently and without affecting how the cluster is presented to the outside world.
With the CVIP, we can cluster multiple servers to use the same virtual IP address and port numbers over a number of processors to share the load. As we add new nodes to the HA and SSA tiers, we increase the capacity of the system and its scalability. The number of clients or servers using the virtual IP address is not limited; the framework is scalable, and we can add more servers to increase the system capacity. In addition, although we have only presented HTTP servers, the applications on top may include any server application that runs over IP, such as an FTP server for file transfer.
4.21.3.4 Availability
Since the CVIP is supported by multiple servers, it does not constitute a SPOF. In the HAS architecture prototype, the CVIP was provided by the two master nodes in the HA tier. If one master node crashes, the web clients and web servers are not affected.
4.21.3.6 Support for multiple application servers
Since the CVIP interface operates at the IP level and is transparent to the application servers running on the traffic nodes, it is independent of the type of traffic it accepts and forwards. As a result, with the CVIP, the HAS architecture supports all types of application servers that work at the IP level.
cluster can minimize, and even eliminate, lost connections caused by the failure of the active master node. When the information about ongoing connections is synchronized between the master nodes, then if the standby master node becomes the active master node, it retains the information about the currently established and active connections, and as a result the new active master node continues to forward their packets to the traffic nodes in the SSA tier.
[Figure: Step 1 of connection synchronization: a web user opens connection-1; the active master node A (sync-master) forwards it to a traffic node, and connection-1 is synchronized to the standby master node]
In step 2 (Figure 72), a fail-over occurs and the master node B becomes the active master node.
Connection-1 is able to continue because the connection synchronization took place in step 1.
The master/slave implementation of the connection synchronization works with two master nodes: the
active master node sends synchronization information for connections to the standby master node,
and the standby master node receives the information and updates its connection table accordingly.
The synchronization of a connection takes place when the number of packets passes a predefined
threshold and then at a certain configurable frequency of packets. The synchronization information
for the connections is added to a queue and periodically flushed. The synchronization information for
up to 50 connections can be packed into a single packet that is sent to the standby master node using
multicast. A kernel thread, started through an init script, is responsible for sending and receiving
synchronization information between the active and standby master nodes.
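As a rough sketch of this mechanism (ours; the multicast group, port, and record format are assumptions rather than the prototype's actual wire format), queued connection records could be packed and multicast to the standby master node as follows:

import socket
import struct

SYNC_GROUP = "224.0.0.81"     # assumed multicast group for sync messages
SYNC_PORT = 8848              # assumed port
MAX_CONNS_PER_PACKET = 50     # up to 50 connection records per packet

def flush_sync_queue(queue):
    """Pack and send the queued connection records to the standby master node."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    while queue:
        batch, queue = queue[:MAX_CONNS_PER_PACKET], queue[MAX_CONNS_PER_PACKET:]
        # Each record: client IP and port, traffic node IP and port.
        payload = b"".join(
            socket.inet_aton(cip) + struct.pack("!H", cport) +
            socket.inet_aton(rip) + struct.pack("!H", rport)
            for cip, cport, rip, rport in batch)
        sock.sendto(payload, (SYNC_GROUP, SYNC_PORT))
    sock.close()

# Example: two established connections queued for synchronization.
flush_sync_queue([("10.0.0.7", 45120, "142.133.69.33", 80),
                  ("10.0.0.9", 51200, "142.133.69.34", 80)])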
the current standby node (previously active) becomes active again. To illustrate this drawback, we
continue discussing the example of connection synchronization from the previous section.
In Step 3 (Figure 73), a web user opens connection-2. Master node B receives this connection, and
forwards it to a traffic node. Connection synchronization does not take place because master node B
is a sync-slave.
[Figure: Step 3: a web user opens connection-2; the active master node B (sync-slave) forwards it to a traffic node, and no connection synchronization takes place]
In step 4 (Figure 74), another fail-over takes place and master node A is again the active master node.
Connection-2 is unable to continue because it was not synchronized.
[Figure: Master node A, active again as sync-master, forwards connections to the traffic nodes and synchronizes them to the standby master node]
Our survey of similar work (Sections 2.9 and 2.10) identified that the added performance from complex algorithms is negligible. The recommendation was to focus on a distribution algorithm that is uncomplicated, has low overhead, and minimizes serialized computing steps to allow for faster execution.
Scalable web server clusters require three core components: a scheduling mechanism, a scheduling
algorithm, and an executor. The scheduling mechanism directs clients’ requests to the best web
server. The scheduling algorithm defines the best web server to handle the specific request. The
executor carries out the scheduling algorithm using the scheduling mechanism. The following sub-sections present these three core components in the HAS architecture.
a configuration file that lists the addresses of all traffic nodes, the traffic distribution policy, the communication port, the timeout limit, and the addresses of the master nodes.
The /proc file system is a real-time, memory-resident file system that tracks the processes running on the machine and the state of the system, and maintains highly dynamic data on the state of the operating system. The information in the /proc file system is continuously updated to match the current state of the operating system. The contents of the /proc file system are used by many utilities, which read the data from a particular /proc entry and display it.
The traffic client uses two parameters from /proc to compute the load_index of the traffic node:
the processor speed and free memory. The /proc/cpuinfo file provides information about the
processor, such as its type, make, model, cache size, and processor speed in BogoMIPS [128]. The
BogoMIPS parameter is an internal representation of the processor speed in the Linux kernel.
Figure 76 illustrates the contents of the /proc/cpuinfo file at a given moment in time and
highlights the BogoMIPS parameter used to compute the load_index of the traffic node. The
processor speed is a constant parameter; therefore, we only read the /proc/cpuinfo file once when
the TC starts.
% more /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 1.70GHz
stepping : 6
cpu MHz : 598.186
cache size : 2048 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr mce cx8 sep mtrr pge mca
cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe est tm2
bogomips : 1185.43
The /proc/meminfo file reports a large amount of valuable information about RAM usage. It describes the current state of physical RAM in the system, including a full breakdown of total, used, free, shared, buffered, and cached memory utilization in kilobytes, in addition to information on swap space.
Figure 77 illustrates the contents of the /proc/meminfo file at a given moment in time and highlights the MemFree parameter used to compute the load_index of the traffic node. Since MemFree is a dynamic parameter, it is read from the /proc/meminfo file every time the TC calculates the load_index.
% more /proc/meminfo
MemTotal: 775116 kB
MemFree: 6880 kB
Buffers: 98748 kB
Cached: 305572 kB
SwapCached: 2780 kB
Active: 300348 kB
Inactive: 286064 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 775116 kB
LowFree: 6880 kB
SwapTotal: 1044184 kB
SwapFree: 1040300 kB
Dirty: 16 kB
Writeback: 0 kB
Mapped: 237756 kB
Slab: 171064 kB
Committed_AS: 403120 kB
PageTables: 1768 kB
VmallocTotal: 245752 kB
VmallocUsed: 11892 kB
VmallocChunk: 232352 kB
HugePages_Total: 0
HugePages_Free: 0
# TC Configuration File
# List of master nodes to which the TC daemon reports load
master1 <IP Address of Master Node 1>
master2 <IP Address of Master Node 2>

# Port to connect to at the master - this port number can be anywhere
# between 1024 and 49151. We can also use ports 49152 through 65535
port <port_number>

# Frequency of load updates in ms.
updates <frequency_of_updates>

# Reporting errors -- needed for troubleshooting purposes
ErrorLog <full_path_to_error_log_file>

# Amount of RAM in the node with the least RAM in the cluster
RAM <num_of_ram>
Load_a = (3,358.72 × 524,288) / 262,144 ≈ 6,717
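The sketch below (ours) reads the two /proc values and applies the load formula implied by the worked example above, namely the BogoMIPS processor speed multiplied by the free memory and divided by the RAM of the least-equipped node in the cluster:

def read_bogomips(path="/proc/cpuinfo"):
    """Read the processor speed (BogoMIPS); done once when the TC starts."""
    with open(path) as f:
        for line in f:
            if line.lower().startswith("bogomips"):
                return float(line.split(":")[1])
    raise RuntimeError("bogomips entry not found")

def read_memfree_kb(path="/proc/meminfo"):
    """Read the current amount of free memory (MemFree, in kB)."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1])
    raise RuntimeError("MemFree entry not found")

def load_index(bogomips, memfree_kb, min_ram_kb):
    """Load index as implied by the worked example:
    (BogoMIPS x MemFree) / RAM of the least-equipped cluster node."""
    return (bogomips * memfree_kb) / min_ram_kb

# Reproducing the figures above: 3,358.72 BogoMIPS, 524,288 kB free memory,
# and 262,144 kB of RAM on the smallest node give a load index of about 6,717.
print(load_index(3358.72, 524288, 262144))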
The traffic manager maintains a list of nodes and their loads. Figure 78 illustrates a case example of a
HAS cluster that consists of eight traffic nodes.
Figure 78: Example list of traffic nodes and their load index
When the traffic manager receives an incoming request, it examines the list of nodes and forwards the
request to the least busy node on the list. The list of traffic nodes is a sorted linked list that allows us
to maintain an ordered list of nodes without having to know ahead of time how many nodes we will
be adding. To build this data structure, we used two class modules: one for the list head and another
for the items in the list. The list is a sorted linked list; as we add nodes into the list, the code finds the
correct place to insert them and adjusts the links around the new nodes accordingly.
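A minimal sketch of this data structure (ours, not the prototype's class modules) keeps the traffic nodes ordered by load index, assuming, as the formula above suggests, that a higher load index indicates more spare capacity, so that the least busy node is always at the head of the list:

class TrafficNodeItem:
    """One item in the list: a traffic node and its current load index."""
    def __init__(self, ip, load_index):
        self.ip = ip
        self.load_index = load_index
        self.next = None

class TrafficNodeList:
    """Sorted linked list of traffic nodes; the least busy node comes first."""
    def __init__(self):
        self.head = None

    def insert(self, ip, load_index):
        item = TrafficNodeItem(ip, load_index)
        # Find the correct place for the new node and adjust the links.
        if self.head is None or load_index > self.head.load_index:
            item.next, self.head = self.head, item
            return
        cur = self.head
        while cur.next is not None and cur.next.load_index >= load_index:
            cur = cur.next
        item.next, cur.next = cur.next, item

    def least_busy(self):
        return None if self.head is None else self.head.ip

# Example: the traffic manager forwards the next request to the head of the list.
nodes = TrafficNodeList()
nodes.insert("142.133.69.33", 6717)
nodes.insert("142.133.69.34", 4210)
print(nodes.least_busy())   # 142.133.69.33, the node with the most spare capacity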
Figure 79: Illustration of the interaction between the traffic client and the traffic manager
(1) The traffic client reads the /proc file system and retrieves the BogoMIPS processor speed from
/proc/cpuinfo, and then retrieves the amount of free memory from /proc/meminfo.
(2) The traffic client computes the node load index based on the formula in Section 4.23.6.
(3) The traffic client reports the load index to the traffic manager as a string that consists of pair
parameters: the traffic node IP address and the load index (traffic_node_IP, load_index).
(4) The traffic manager receives the load index and updates its internal list of traffic nodes to reflect
the new load_index of the traffic_node_IP.
(5) For illustration purposes, we assume that an incoming request reaches the virtual interface.
(6) The routed daemon forwards the request to the traffic manager. The traffic manager examines the list of traffic nodes and chooses a traffic node as the target for this request.
(7) The traffic manager forwards the request to the traffic node.
(8) The web server running on the traffic node receives the request and retrieves the requested document from the distributed storage.
(9) The web server sends the requested document directly to the web user.
This scenario also illustrates several drawbacks, such as the number of steps involved and the communication overhead between the system software components. One item of future work is to reduce the number of system software components and, consequently, the communication overhead.
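As a rough sketch of the reporting path in steps (1) through (4) above (ours; the message format, the use of UDP, and the example addresses are assumptions, and in the prototype these values come from the TC configuration file), the traffic client could periodically push its load index to both master nodes:

import socket
import time

MASTERS = [("192.0.2.1", 9000), ("192.0.2.2", 9000)]   # placeholder master addresses
NODE_IP = "142.133.69.33"
UPDATE_INTERVAL = 1.0    # seconds between load updates (assumed)

def report_load(load_index):
    """Send the (traffic_node_IP, load_index) pair to both master nodes."""
    msg = "{},{:.0f}".format(NODE_IP, load_index).encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for master in MASTERS:
        sock.sendto(msg, master)
    sock.close()

# Example loop, reusing read_bogomips(), read_memfree_kb(), and load_index()
# from the earlier sketch:
# while True:
#     report_load(load_index(read_bogomips(), read_memfree_kb(), 262144))
#     time.sleep(UPDATE_INTERVAL)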
4.24 Access to External Networks and the Internet
The two classical methods to access external networks are the direct access method and the restricted method. In the direct access method, traffic nodes reply to web clients directly. In the restricted method, traffic nodes forward their responses to one of the master nodes, which then rewrites the response header and forwards it to the web client. With the latter method, access to external networks is restricted by the master nodes in the HA tier, which is achieved using forwarding, filtering, and masquerading mechanisms. As a result, master nodes monitor and filter all accesses to the outside world. The HAS architecture supports both methods, as the choice is independent of the architecture and depends on configuration.
Based on the survey of similar work (Sections 2.9 and 2.10), the direct access method is the most efficient traffic distribution method and helps improve system scalability. The HAS cluster prototype supports this model, and we used it to perform our benchmarking tests.
Figure 80 illustrates the scenario of direct access. When a request arrives at the cluster (1), the master node examines it (2), decides where the request should be forwarded, and forwards it (3) to the appropriate traffic node. The traffic node processes the request and replies directly (4) to the web client. The HAS architecture also supports the restricted access method, as this is implementation specific.
Figure 80: The direct routing approach – traffic nodes reply directly to web clients
Figure 81 illustrates the scenario of restricted access, which requires re-writing of IP packets. When a request arrives at the cluster (1), the master node examines it (2), decides where the request should be forwarded, rewrites the packets, and forwards them (3) to the appropriate traffic node. The traffic node processes the request and replies (4) to the master node. The master node then re-writes the packets (5) and sends the final reply to the web client (6).
Figure 81: The restricted access approach – traffic nodes reply to master nodes, who in turn
reply to the web clients
The example shown above indicates that eth0 has eth1 as its backup link. If we do not specify
parameters in the command, it defaults to the equivalent of "erd eth0 eth1". We automated this
command on system startup.
The servers in our prototype use Tulip Ethernet cards. We patched the tulip.c driver to make the MAC addresses for ports 0 and 1 identical. Alternatively, we were able to get the same result by issuing the following commands (on Linux) to set the MAC address for an Ethernet port, where <MAC_address> is the desired address:
% ifconfig eth[X] down
% ifconfig eth[X] hw ether <MAC_address>
% ifconfig eth[X] up
We also modified the source code of the Ethernet device driver tulip.c to toggle the RUNNING bit in the dev->flags variable, which allows ifconfig to report the state of an Ethernet link. The state of the RUNNING bit for the primary link is accessed by erd via the ioctl system call.
Figure 82: The dependencies and interconnections of the HAS architecture system software
4.26.2 The Saru Module and the Heartbeat Daemon
When the HA tier is in the active/active redundancy model, the saru module runs in coordination with
heartbeat on each of the master nodes. The saru module is responsible for dividing the incoming
connections between the two master nodes. The heartbeat daemon provides a mechanism to
determine which master node is available and the saru module uses this information to divide the
space of all possible incoming connections between both active master nodes.
4.26.9 The saru Module and routed Process
The saru module relies on the routed process to receive incoming traffic from the cluster virtual IP
interface.
The traffic client daemon depends on the /proc file system to retrieve memory and processor usage, which are the metrics needed to compute the load index of a traffic node.
- Traffic node becomes unavailable: In some cases, the traffic node can become unavailable
because of hardware or software error. This scenario illustrates how the cluster reacts to the
unresponsiveness of a traffic node.
- Ethernet port becomes unavailable: A cluster node can face networking problems because of Ethernet card or Ethernet driver issues. This scenario examines how a HAS cluster node reacts when it faces Ethernet problems.
- Traffic node leaving the cluster: When a traffic node is not available to serve traffic, the traffic
manager disconnects it from the cluster. This scenario illustrates how a traffic node leaves the
cluster.
- Application server process dies on a traffic node: When the application becomes unresponsive, it
stops serving traffic. This scenario examines how to recover from such a situation.
- Network becomes unavailable: This scenario presents the chain of events that takes place when the network to which the cluster is connected becomes unavailable.
Figure 83: The sequence diagram of a successful request with one active master node
Figure 84: The sequence diagram of a successful request with two active master nodes
Figure 85: A traffic node reporting its load index to the traffic manager
[Figure 85 shows master node 1, master node 2, and traffic node B, with their traffic managers (TM) and the traffic client daemon (TCD). The traffic client daemon is aware of the master nodes since the IP addresses of those nodes are provided in its configuration file. Once the load index is reported, traffic node B is added to the list of available traffic nodes, and the traffic manager starts forwarding incoming traffic to traffic node B.]
mounts other file systems, and starts the init process. The init process brings up the customized Linux
services for the node, and the node is now fully booted and all initial processes are started.
[Figure: The boot process of a diskless traffic node from the DHCP/image server; step 6 is the TFTP transfer of the diskless_node_ramdisk image]
1. Ensure that the MAC address of the NIC on the diskless node is associated with a traffic node and configured on the master nodes as a diskless traffic node. The notion of diskless is important since the traffic node will download a kernel and ramdisk image from the image server. Traffic nodes have their BIOS configured to do a network boot. When the administrator starts the traffic nodes, the PXE client that resides in the NIC ROM sends a DHCP_DISCOVER message.
2. The DHCP server, running on the master node, sends the IP address for the node with the address
of the TFTP server and the name of the PXE bootloader file that the diskless traffic node should
download.
3. The NIC PXE client then uses TFTP to download the PXE bootloader.
4. The diskless traffic node receives the kernel image (diskless_node) and boots with it.
5. Next, the diskless traffic node sends a TFTP request to download a ramdisk.
6. The image server sends the ramdisk to the diskless traffic node. The diskless traffic node
downloads the ramdisk and executes it.
When the diskless traffic node executes the ramdisk, the traffic client daemon starts and periodically reports the load of the node to the master nodes. The traffic manager, running on the master nodes, adds the traffic node to its list of available traffic nodes and starts forwarding traffic to it.
Figure 88: The boot process of a traffic node with disk – no software upgrades are performed
Figure 89 illustrates the process of upgrading the ramdisk on a traffic node. To rebuild a traffic node
or upgrade the operating system and/or the ramdisk image, we re-point the symbolic link in the
DHCP configuration to execute a specific script, which results in the desired upgrade. At boot time,
the DHCP server checks if the traffic node requires an upgrade and if so, it executes the
corresponding script.
[Figure 89: Upgrading a traffic node with disk from the DHCP/image server; step 7 is the TFTP transfer of the node_with_disk image and step 9 is the FTP transfer of the node_disk_ramdisk image]
Figure 90: The process of upgrading the kernel and application server on a traffic node
Figure 91: The sequence diagram of upgrading the hardware on a master node
[Figure: The sequence diagram of a master node failover: (1) master node 1 becomes unavailable due to a major failure; (2) heartbeat on master node 1 stops sending heartbeat messages to the heartbeat instance running on master node 2, and the timeout limit is exhausted; (3) the heartbeat instance on master node 2 declares master node 1 unavailable and makes master node 2 the primary; (4)(5) master node 2 becomes the owner of the virtual services; (6) new requests from web users arrive at master node 2.]
Figure 93 illustrates the sequence diagram of synchronizing storage when one of the master nodes
fails.
Figure 93: The NFS synchronization occurs when a master node becomes unavailable
When the initial active master node becomes available again for service, there is no need for a
switchback to active status between the two master nodes. The new master node acts as a hot standby
for the current active master node. As a future work, we would like the standby master node to switch
to the load sharing mode (1+1 active/active), helping the active master node to direct traffic to the
traffic nodes once the active master node reaches a pre-defined threshold limit. When master node 1 becomes available again (3), its NFS server is re-started; it mounts the storage and re-syncs its local content with master node 2 using the rsync utility.
When a traffic node becomes unavailable (1), the traffic client daemon (running on that node)
becomes unavailable and does not report the load index to the master nodes. As a result, the traffic
manager daemons do not receive the load index from the traffic node (2). After a timeout, the traffic
managers remove the traffic node from their list of available traffic nodes (3). However, if the traffic node becomes available again (4), the traffic client daemon reports the load index to the traffic manager running on the master node (5). When the traffic manager receives the load index from the traffic node, it is an indication that the node is up and ready to provide service. The traffic manager then adds the traffic node (6) to the list of available traffic nodes. A traffic node is declared unavailable if it does not send its load statistics to the master nodes within a specific, configurable time.
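A minimal sketch of this timeout-based bookkeeping on the traffic manager side (ours; the timeout value is an assumed configuration parameter) could look as follows:

import time

AVAILABILITY_TIMEOUT = 5.0   # seconds without a load report (assumed value)

class TrafficNodeTable:
    """Track traffic node availability from their periodic load reports."""
    def __init__(self):
        self.last_report = {}   # traffic node IP -> time of last load report
        self.load = {}          # traffic node IP -> last reported load index

    def on_load_report(self, node_ip, load_index):
        # Receiving a load index indicates the node is up and ready for service.
        self.last_report[node_ip] = time.monotonic()
        self.load[node_ip] = load_index

    def available_nodes(self):
        # Drop nodes that have not reported within the configurable timeout.
        now = time.monotonic()
        for ip in list(self.last_report):
            if now - self.last_report[ip] > AVAILABILITY_TIMEOUT:
                del self.last_report[ip]
                del self.load[ip]
        return list(self.last_report)

# Example: a node that stops reporting is removed after the timeout expires.
table = TrafficNodeTable()
table.on_load_report("142.133.69.33", 6717)
print(table.available_nodes())   # ['142.133.69.33']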
Figure 95: The scenario assumes that node C has lost network connectivity
Figure 95 illustrates a traffic node losing network connectivity. The scenario assumes that traffic node C has lost network connectivity and, as a result, is no longer a member of the HAS cluster. The traffic manager now forwards incoming traffic to the remaining traffic nodes.
[Figure: The Ethernet redundancy scenario: (1) Ethernet port 1 becomes unavailable; (2) the Ethernet redundancy daemon detects the failure of Ethernet port 1 and performs a failover to Ethernet port 2]
Figure 97: The sequence diagram of a traffic node leaving the HAS cluster
When traffic managers stop receiving messages from the traffic node reporting its load index (1)(2),
after a defined timeout, the traffic managers remove the node from the list of available traffic nodes
(3). The scenario of a traffic node leaving the HAS cluster is similar to the scenario Traffic Node
Becomes Unavailable presented in Section 4.27.9.
(4) The traffic manager updates its list of available traffic nodes and stops forwarding traffic to the
traffic node.
(5) The LDirectord needs to ensure that the traffic client does not update the load_index while the
application is not responsive. The LDirectord sets the load_index_report_flag to 0.
(6) When the load_index_report_flag = 0, the traffic client stops reporting its load to the
traffic managers.
(7) On the next loop cycle, LDirectord checks whether the application is still unresponsive. If the application is still not available, then no action is required from LDirectord.
(8) If the application check returns positive, then LDirectord connects to the traffic manager and resets the load_index_report_flag to 1. When the load_index_report_flag = 1, the traffic client resumes reporting its load to the traffic manager.
(9) The traffic client reports the new load_index that overwrites the 0 value.
(10) The traffic manager updates its list of available traffic nodes and starts forwarding traffic to
the traffic node.
4.27.13 Network Becomes Unavailable
In the event that one network becomes unavailable, the HAS cluster needs to survive such a failure
and switch traffic to the redundant available network. Figure 99 illustrates this scenario.
[Figure 99: The network failure scenario: (1) the user sends a request; (2) the request is forwarded to Ethernet port 1 of the traffic node; (3) the switch/router becomes unavailable; (4) the reply is sent; (5) a timeout occurs; (6) the reply is re-sent through switch/router 2; (7)(8) the user receives the reply]
When the router becomes unavailable (3), the reply sent through Ethernet port 1 times out (5). At this point, Ethernet port 1 uses its secondary route through router 2. We could use the heartbeat mechanism to monitor the availability of routers; however, since routers are outside our scope, we do not pursue how to use heartbeat to discover and recover from router failures.
advertisement message also includes an indication of whether the host should use a stateful address
configuration protocol.
There are two types of auto-configuration. Stateless configuration requires the receipt of router
advertisement messages. These messages include stateless address prefixes and preclude the use of a
stateful address configuration protocol. Stateful configuration uses a stateful address configuration
protocol, such as DHCPv6, to obtain addresses and other configuration options. A host uses stateful
address configuration when it receives router advertisement messages that do not include address
prefixes and require that the host use a stateful address configuration protocol. A host also uses a
stateful address configuration protocol when there are no routers present on the local link. By default,
an IPv6 host can configure a link-local address for each interface. The main idea behind IPv6
autoconfiguration is the ability of a host to auto-configure its network setting without manual
intervention.
Autoconfiguration requires the routers of the local network to run a program that answers the autoconfiguration requests of the hosts. The radvd (Router ADVertisement Daemon) provides this functionality: it listens to router solicitations and answers with router advertisements.
[Figure 100: The IPv6 autoconfiguration exchange between a traffic node and the master node (or router): (1) the node boots; (2) it generates its link-local address; (3) it sends a router solicitation message; (4) the router advertisement is returned, specifying the subnet prefix, lifetimes, and default router]
Figure 100 illustrates the process of auto-configuration. This scenario assumes that the router
advertisement daemon is started on at least one master node, and that cluster nodes support the IPv6
protocol at the operating system level, including its auto-configuration feature. The node starts (1). As
the node is booting, it generates its link local address (2). The node sends a router solicitation
message (3). The router advertisement daemon receives the router solicitation message from the
cluster node (4); it replies with the router advertisement, specifying subnet prefix, lifetimes, default
router, and all other configuration parameters. Based on the received information, the cluster node
generates its IP address (5). The last step is when the cluster node verifies the usability of the address
by performing the Duplicate Address Detection process. As a result, the cluster node has now fully
configured its Ethernet interfaces for IPv6.
[Figure: The IPv6 connectivity setup: an upstream IPv6 provider and a DNS server connected to the traffic nodes over LAN 1 and LAN 2]
Chapter 5
Architecture Validation
5.1 Introduction
The initial goal of this work was to propose an architecture that allows web clusters to scale for up to
16 nodes while maintaining the baseline performance of each individual cluster node. The validation
of the architecture is an important activity that allows us to determine and verify if the architecture
meets our initial requirements. Network and telecom equipment providers use professional services of
specialized validation test centers to test and validate their products.
This chapter presents three types of validation for the HAS architecture. The first is the validation of scalability and high availability. It presents the benchmarking results that demonstrate the ability to scale the HAS architecture to 18 nodes (2 master nodes and 16 traffic nodes) while maintaining the baseline performance across all traffic nodes. In addition, this chapter presents the results of the
HA testing to validate the HA capabilities. The second validation is the external validation by open
source projects. It describes the impact of the work on the HA-OSCAR project. The third validation is
the adoption of the architecture by the industry as the base architecture for communication platforms
that run telecom applications providing mission critical services.
returns to the client. These client machines simulate web browsers. When the server replies to a client
request, the client records information such as how long the server took and how much data it
returned and then sends a new request. When the test ends, WebBench calculates two overall server
scores, requests per second and throughput in bytes per second, as well as individual client scores.
WebBench maintains at run-time all the transaction information and uses this information to compute
the final metrics presented when the tests are completed.
Figure 102: A screen capture of the WebBench software showing 379 connected clients
Figure 102 is a screen capture from the WebBench controller that shows 379 connected clients from
the client machines that are ready to generate traffic.
The benchmarking tests took place at the Ericsson Research lab in Montréal, Canada. Although the
lab connects to the Ericsson Intranet, our LAN segment is isolated from the rest of the Ericsson
network and therefore our measurement conditions are under well-defined control.
Figure 103 illustrates the network setup in the lab. The client computers run WebBench to generate
web traffic with one computer running WebBench as the test manager. These computers connect to a
fiber capable Cisco switch (2) through 100 MB/s links. The Cisco switch connects to the HAS cluster
(3) through a 1 Gbps fiber link. We conducted most benchmarking tests over IPv4, with some additional tests conducted over IPv6. The IPv6 results demonstrate that we are able to achieve results similar to IPv4, albeit with a slight decrease in performance [131].
[Figure 103: The lab network setup: (1) 31 client machines running WebBench to generate web traffic and one machine running WebBench as the test manager, connected over 100 Mbps links to (2) a fiber-capable Cisco c2948g switch, which connects over a 1 Gbps fiber link to (3) the HAS cluster; permanent 100 Mbps links connect the switch to the rest of the lab backbone]
We experienced a decrease in the number of successful transactions per second per processor ranging
between -2% and -4% [133]. We believe that this is the direct result of the immaturity of the IPv6
networking stack compared to the mature IPv4 networking stack.
[Figure: The benchmarked HAS cluster configuration: the HA tier with two master nodes in the 1+1 active/hot-standby model, and the SSA tier with traffic nodes 1 through 16]
5.4 Test-0: Experiments with One Standalone Traffic Node
This test consists of generating web traffic to a single standalone server running the Apache web
server software. This test reveals the performance limitation of a single node. We use the results of
this test to define the baseline performance. Apache 2.0.35 was running on this node, and the NFS server hosting the document repository was running on the same network segment.
Table 10 presents the results of the benchmark with a single server node. The results of Test-0 are consistent with the tests conducted in 2002 and 2003 with an older version of Apache (Section 3.7). The main lesson to learn from this benchmark is that the maximum capacity of a standalone server is an average of 1,033 requests per second. If the server receives requests beyond its baseline capacity, it becomes overloaded and unable to respond to all of them; hence the high number of failed requests illustrated in the table below. Table 10 presents the number of clients generating web traffic, the number of requests per second the server completed, and the throughput. WebBench generates this table automatically as it collects the results of the benchmarking test.
Table 10: The performance results of one standalone processor running the Apache web server
Figure 105: The results of benchmarking a standalone processor -- transactions per second
In Figure 105, we plot the results from Table 10: the number of clients versus the number of requests per second. We notice that as we reach 16 clients, Apache is unable to process additional incoming web requests and the scalability curve levels off. Even though we are increasing the number of clients generating traffic, the application server has reached its maximum capacity and is unable to process more requests. From this exercise, we conclude that the maximum number of requests per second we can achieve with a single processor is 1,035. We use this number to measure how our cluster scales as we add more processors.
Figure 106 presents the throughput achieved with one processor. We plot the results from Table 10: the number of clients versus the throughput in KB/s. The maximum throughput possible with a single processor averages around 5,800 KB/s. In addition, WebBench provides statistics about failed requests. Table 10 presents the number of clients generating traffic and the number of failed requests. Apache starts rejecting incoming requests when we reach 16 simultaneous WebBench clients generating over 1,300 requests per second.
Figure 106: The throughput benchmarking results of a standalone processor
Figure 107: The number of failed requests per second on a standalone processor
Figure 107 illustrates the curve of successful requests per second combined with the curve of failed
requests per second. As we increase the number of clients generating traffic to the processor, the
number of failed requests increases. Based on the benchmarks with a single node, we can draw two
main conclusions. The first is that a single processor can process up to one thousand requests per
second before it reaches its threshold. The second conclusion is that after reaching the threshold, the
application server starts rejecting incoming requests.
The traffic nodes in the HAS cluster start rejecting new incoming requests once WebBench generates 2,073 requests per second (1,976 successful requests versus 97 failed requests). As WebBench adds more web clients to generate traffic, we notice an increase in failed requests, while the number of successful requests remains almost constant, ranging between 2,030 and 2,089 requests per second.
Figure 108: The number of successful requests per second on a HAS cluster with four nodes
Figure 108 presents the results with a 4-processor cluster, showing the number of transactions per second versus the number of clients. Figure 109 shows the throughput curve, illustrating the throughput achieved with a 4-processor cluster as we increase the number of client machines generating traffic to the cluster.
Figure 109: The throughput results (KB/s) on a HAS cluster with four nodes
Figure 110: The number of failed requests per second on a HAS cluster with four nodes
Figure 110 shows the curve of successful requests per second combined with the curve of failed requests per second. As we increase the number of clients generating traffic to the cluster, the number of failed requests increases.
Table 12: The results of benchmarking a HAS cluster with six nodes
The maximum number of successful requests per second is 4,220, and the maximum throughput
reached is 26,491 KB/s.
Figure 111: The number of successful requests per second on a HAS cluster with six nodes
Figure 112: The throughput results (KB/s) on a HAS cluster with six nodes
Figure 113: The number of failed requests per second on a HAS cluster with six nodes
Figure 113 presents the curve of successful requests per second combined with the curve of failed requests per second. As we increase the number of clients generating traffic to the cluster, the number of failed requests increases.
Number of Clients    Requests Per Second    Throughput (Bytes/Sec)    Throughput (KBytes/Sec)
32_clients 3339 21372529 20872
36_clients 3660 23498460 22948
40_clients 4042 25416830 24821
44_clients 4340 26272234 25656
48_clients 4560 27561631 26916
52_clients 4800 29637244 28943
56_clients 5090 31580773 30841
60_clients 5352 33656387 32868
64_clients 5674 35681682 34845
68_clients 5930 37298145 36424
72_clients 6324 39770012 38838
76_clients 6641 41770148 40791
80_clients 6910 43462088 42443
84_clients 7211 45550281 44483
88_clients 7460 46789359 45693
92_clients 7680 48292606 47161
96_clients 7871 49192039 48039
100_clients 8052 50833660 49642
104_clients 8158 51116698 49919
100_clients 8209 51311680 50109
104_clients 8278 52060159 50840
92_clients 8293 52154505 50932
96_clients 8310 52267720 51043
100_clients 8312 52280300 51055
104_clients 8307 52248851 51024
108_clients 8316 52299169 51073
112_clients 8313 52280300 51055
114_clients 8311 52274010 51049
118_clients 8310 52261431 51037
122_clients 8306 52242561 51018
126_clients 8319 52318038 51092
130_clients 8302 52217403 50994
134_clients 8308 52255141 51030
138_clients 8311 52267720 51043
142_clients 8312 52280300 51055
Figure 114 presents the curve of performance illustrating the number of successful requests per
second achieved with a HAS cluster that consists of two master nodes and eight traffic nodes. The
master nodes are in the 1+1 active/standby model and the traffic nodes follow the N-way redundancy
model, where all traffic nodes are active. Figure 115 shows the throughput curve of the 10-processor
HAS cluster.
Figure 114: The number of successful requests per second on a HAS cluster with 10 nodes
Figure 115: The throughput results (KB/s) on a HAS cluster with 10 nodes
5.8 Test-4: Experiments with an 18-node HAS Cluster
This test consists of generating web traffic to a HAS cluster made up of two master nodes and 16 traffic nodes. It is the largest test we conducted and consists of 18 nodes in the HAS cluster and 32 machines in the benchmarking environment, 31 of which generate traffic. Figure 116 presents the number of successful transactions per second achieved with the 18-processor HAS cluster.
[Chart: Requests Per Second; x-axis: Number of Clients; y-axis: Requests per Second (0 to 18,000)]
Figure 116: The number of successful requests per second on a HAS cluster with 18 nodes
In this test, the HAS cluster with 18 nodes achieved 16,001 successful requests per second, an
average of 1,000 successful requests per second per traffic node in the HAS cluster.
[Chart: Throughput (KBytes/Sec), 2 Master Nodes and 16 Traffic Nodes; x-axis: Number of Clients; y-axis: Throughput (0 to 120,000 KB/s)]
Figure 117: The throughput results (KB/s) on a HAS cluster with 18 nodes
Traffic Cluster Nodes Total Transactions Average Transactions per Traffic Node
1 1032 1032
2 2068 1034
4 4143 1036
8 8143 1017
16 16001 1000
Table 14: The summary of the benchmarking results of the HAS architecture prototype
For each testing scenario, we recorded the maximum number of requests per second that each configuration supported. When we divide this number by the number of processors, we get the maximum number of requests that each processor can process per second in each configuration. Table 14 presents the number of successful transactions per traffic node. The total transactions column is the total number of successful transactions of the full HAS cluster as reported by WebBench. The average transactions per traffic node column is the average number of successful transactions served by a single traffic node in the HAS cluster.
[Figure 118 chart: Number of transactions per second served by all the traffic nodes in the HAS cluster. Total number of transactions in the HAS cluster: 1032, 2068, 4143, 8143, 16001; average number of transactions per traffic node: 1032, 1034, 1036, 1017, 1000.]
Figure 118 presents the scalability of the prototyped HAS cluster architecture. Starting with one processor, we established the baseline performance to be 1,032 requests per second. Next, we set up the HAS prototype and performed benchmarking tests as we scaled the number of traffic nodes from two to 16. The HAS cluster maintained an average of 1,000 requests per second per traffic node. As we scaled by adding more traffic nodes to the SSA tier, we lost 3.1% of the baseline performance per traffic node (1 - 1000/1032 ≈ 3.1%), as defined in Section 5.4. These results represent an improvement compared to the 40% decrease in performance experienced with a cluster built using existing software that uses traditional methods of scaling (Section 3.9).
[Figure 119 chart: Transactions per second per processor versus the number of processors in the cluster (1, 2, 4, 8, 16), showing 1032, 1034, 1036, 1017, and 1000 transactions per second per processor, respectively.]
Figure 119 illustrates the scalability chart of the HAS architecture prototype. The results demonstrate close to linear scalability as we increased the number of traffic nodes up to 16. They show that we were able to scale from a standalone single node performing an average maximum of 1,032 requests per second to 18 nodes in the HAS cluster (16 of them serving traffic) with an average of 1,000 requests per second per traffic node. The scaling is achieved with a 3.11% decrease in performance compared to the baseline performance.
result matrix. However, since we do not have access to specialized HA testing tools, we performed basic test scenarios to ensure that the claimed HA support is provided and that it works as described. An important feature of a highly available system is its ability to continue providing service even when a cluster sub-system fails. Our testing strategy was based on provoking common faults and observing how they affect the service and whether they lead to service downtime. In our case, the system is the HAS architecture prototype, and the cluster nodes run web servers that provide service to web users. If there is service downtime, users do not get replies to their web requests.
The following sub-sections discuss the high availability testing for the HAS cluster and cover testing
the connectivity (Ethernet connection and routers), data availability (redundant NFS server), master
node, and traffic node availability.
[Figure 120 diagram: The high availability test experiments, showing a traffic node (web server application, system software with EthD, TCD, and TMD, interconnect protocol and technology, processor) and master node 1 (system software with EthD and HBD, with the heartbeat daemon also running on master node 2), routers 1 and 2, Ethernet cards 1 and 2, and LAN 1 and LAN 2, with numbered experiment markers on the affected components.]
The failures tested include Ethernet daemon failure, Ethernet adapter failure, router failure, discontinued communication between the traffic manager and the traffic client daemon, and discontinued communication between the heartbeat instances running on the master nodes. The experiments included provoking a failure to monitor how the HAS cluster reacts to the failure, how the failure affects the service provided, and how the cluster recovers from the failure.
5.10.1.4 Discontinued Communication between the TM and the TC
This test case examines the scenario where we disrupt the communication between the TM and the
TC (Figure 120 – experiment 4). As a result, the TM stops receiving load alerts from the TC. After a
predefined timeout, the TM removes the traffic node from its list of available traffic nodes and stops
forwarding traffic to it. When we restore communication between the TM and the TC, the TC starts
sending its load messages to the TM. The TM then adds the traffic node to its list of available nodes
and starts forwarding traffic to it. Section 4.27.9 examines a similar scenario.
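The behavior described above is essentially a per-node liveness timeout driven by the load alerts. The sketch below is a minimal illustration of that logic, not the prototype's actual code; the names (node_t, LOAD_TIMEOUT_SEC, tm_on_load_alert, tm_check_nodes) and the timeout value are assumptions made for the example.

#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define LOAD_TIMEOUT_SEC 5    /* illustrative value, not the prototype's setting */

/* Per-traffic-node record kept by the traffic manager (TM). */
typedef struct {
    int    id;           /* traffic node identifier                  */
    time_t last_alert;   /* time of the last load alert from the TC  */
    bool   available;    /* currently in the TM's distribution list  */
} node_t;

/* Called whenever a load alert arrives from the traffic client (TC). */
void tm_on_load_alert(node_t *n)
{
    n->last_alert = time(NULL);
    n->available  = true;     /* node (re-)added: TM resumes forwarding */
}

/* Called periodically by the TM to expire nodes that stopped reporting. */
void tm_check_nodes(node_t nodes[], size_t count)
{
    time_t now = time(NULL);

    for (size_t i = 0; i < count; i++) {
        if (nodes[i].available &&
            now - nodes[i].last_alert > LOAD_TIMEOUT_SEC)
            nodes[i].available = false;   /* stop forwarding traffic to it */
    }
}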
server daemons running on master nodes were to crash. For this purpose, we have implemented
redundancy in the NFS server code. The two test cases we experimented with are shutting down the
NFS server daemon on a master node and disconnecting the master node from the network. In both
scenarios, there was no interruption to the service provided. Instead, there was a delay ranging
between 450 ms and 700 ms to receive the requested document.
[Diagram: The redundant NFS setup, with the primary NFS server daemon running on master node A and the secondary NFS server daemon running on master node B, both exporting /mnt/CommonNFS and connected to LAN 1 and LAN 2.]
language for SPNP called CSPL (C-based SPN Language) which is an extension of the C
programming language with additional constructs that facilitate easy description of SPN models.
Additionally, if the user does not want to describe his model in CSPL, a graphical user interface is
available to specify all the characteristics as well as the parameters of the solution method chosen to
solve the model [138].
[Figure 122 diagram panels: server sub-model, network connection sub-model, and clients sub-model]
Figure 122: The modeled HA-OSCAR architecture, showing the three sub-models
Figure 123 shows a screen shot of the SPNP modeling tool. The HA-OSCAR team also studied the
overall cluster uptime and the impact of different polling interval sizes in the fault monitoring
mechanism.
E[X(t)] = \sum_{k \in \tau} r_k \, \pi_k(t)

where r_k represents the reward rate assigned to state k of the SRN, τ is the set of tangible markings, and π_k(t) is the probability of being in marking k at time t [138][139].
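One standard way this reward measure is used in availability studies (stated here as an illustration, not as the exact reward assignment used in the HA-OSCAR model) is to assign a reward rate of one to operational markings and zero to failed markings, so that the expected reward at time t reduces to the instantaneous availability:

r_k = \begin{cases} 1, & \text{if marking } k \text{ is operational} \\ 0, & \text{otherwise} \end{cases}
\quad\Longrightarrow\quad
A(t) = \sum_{k \in \mathcal{U}} \pi_k(t),

where \mathcal{U} \subseteq \tau denotes the set of operational (up) markings.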
System Configuration (N)    Quorum Value (Q)    System Availability (A)    Mean cluster down time (t)
4     3    0.999933475091    34.9654921704
6     4    0.999933335485    35.0388690840
8     5    0.999933335205    35.0390162520
16    9    0.999933335204    35.0390167776
We notice that the system availabilities for the various configurations are very close, within a small range of difference. After we introduce the quorum voting mechanism in the client sub-model, the system availability is not sensitive to changes in the client configuration. When we add more clients to improve the system performance, the availability of the system remains almost unchanged. In Table 16, as N increases in the first column and we keep the value of Q at N/2+1, the system availability in the third column remains almost the same as we increase the number of nodes in the system.
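As a quick consistency check on the table, assuming the quorum Q is simply a strict majority of the N clients:

Q = \left\lfloor \tfrac{N}{2} \right\rfloor + 1,
\qquad
N = 8 \;\Rightarrow\; Q = \left\lfloor \tfrac{8}{2} \right\rfloor + 1 = 5,

which matches the N = 8 row of the table; the other rows (N = 4, 6, 16 giving Q = 3, 4, 9) follow the same rule.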
Figure 124 illustrates the instantaneous availabilities of the system when it has eight clients and the quorum is five. The modeling and availability measurements using SPNP provided the calculated instantaneous availabilities of the system and its parameters.
Figure 125 illustrates the total availability (including planned and unplanned downtime) improvement analysis of the HA-OSCAR architecture versus the single head node Beowulf architecture [140]. The results show a steady-state system availability of 99.9968%, compared to 92.387% availability for a Beowulf cluster with a single head node [140]. Additional benefits include higher serviceability, such as the ability to upgrade incrementally and hot-swap cluster nodes, the operating system, services, applications, and hardware, which further reduces planned downtime and benefits the overall aggregate performance.
[Figure 125 chart: The HA-OSCAR vs. the Beowulf architecture, total availability impacted by service nodes, plotted against mean time to failure (hr). Beowulf availability: 0.905797, 0.915751, 0.920810, 0.922509, 0.923361, 0.923873 and HA-OSCAR availability: 0.999684, 0.999896, 0.999951, 0.999962, 0.999966, 0.999968 for MTTF values of 1000, 2000, 4000, 6000, 8000, and 10000 hours. Model assumptions: scheduled downtime = 200 hrs, nodal MTTR = 24 hrs, failover time = 10 s, and during maintenance on the head node the standby node acts as primary.]
Figure 125: Availability improvement analysis of HA-OSCAR versus the Beowulf architecture
5.11.4 Discussion
The HA-OSCAR architecture proof-of-concept implementation and the experimental and analysis results suggest that the HA-OSCAR architecture offers a significant enhancement and a promising solution for providing a highly available Beowulf-class cluster architecture [140][142][143]. The availability of the experimental system improves substantially from 92.387% to 99.9968%. The polling interval for failure detection shows a linear relationship with the total cluster availability.
The goal of the HA-OSCAR project is to enhance a Beowulf cluster system for mission critical
applications, to achieve high availability and eliminate single points of failure, and to incorporate
self-healing mechanisms, failure detection and recovery, automatic failover and failback.
On March 23, 2004, the HA-OSCAR group announced the HA-OSCAR 1.0 release, with over 5000
downloads within the first 24 hours of the announcement. It provides an installation wizard and a
web-based administration tool that allows a user to create and configure a multi-head Beowulf cluster.
Furthermore, the HA-OSCAR 1.0 release supports high availability capabilities for Linux Beowulf
clusters. To achieve high availability, the HA-OSCAR architecture adopts component redundancy to
eliminate SPOF, especially at the head node. The HA-OSCAR architecture also incorporates a self-
healing mechanism, failure detection and recovery, automatic failover and failback [146]. In addition,
it includes a default set of monitoring services to ensure that critical services, hardware components,
and important resources are always available at the control node.
[Figure 126 diagram: A Beowulf cluster, showing the clients, the head node, the router, and the compute nodes.]
Figure 126 illustrates the architecture of a Beowulf cluster. However, the single head node of the Beowulf cluster is a single point of failure, as is the cluster communication, where an outage of either can render the entire cluster unusable. There are various techniques to implement a cluster architecture with high availability. These techniques include active/active, active/standby (hot standby), and active/cold standby. In the active/active model, both head nodes simultaneously provide services to external requests. If one head node goes down, the other node takes over total control. A hot-standby head node, on the other hand, monitors system health and only takes over control if there is an outage at the primary head node. The cold standby architecture is similar to the hot standby, except that the backup head node is activated from a cold start.
The key effort focused on simplicity by supporting self-cloning of the cluster master node (redundancy and automatic failover). While the aforementioned failover concepts are not new, HA-OSCAR's effortless installation and combined HA and HPC architecture are unique, and its 1.0 release is the first known field-grade HA Beowulf cluster release [34]. The HA-OSCAR experimental and analysis results, discussed in Section 5.11, suggested a significant improvement in availability from the dual-head architecture [145].
Figure 127 illustrates the HA-OSCAR architecture. The HA-OSCAR architecture deploys duplicate master nodes to offer server redundancy, following the active/standby approach, where one primary master node is active and the second master node is on standby [35]. Each node in the HA-OSCAR architecture has two network interface cards (NIC): one has a public network address, and the other is attached to a private local network.
The HA-OSCAR project uses the SystemImager [147] utility for building and storing system images, as well as providing a backup for disaster recovery purposes. The HA-OSCAR 1.0 release supports high availability capabilities for Linux Beowulf clusters. It provides a graphical installation wizard and a web-based administration tool that allow the administrator of an HA-OSCAR cluster to create and configure a multi-head Beowulf cluster. In addition, HA-OSCAR includes a default set of monitoring services to ensure that critical services, hardware components, and certain resources are always available at the master node. The current version of HA-OSCAR, 1.0, supports active/standby for the head nodes.
[Figure 127 diagram: The HA-OSCAR architecture, showing users on the public network, the primary and standby head nodes connected by heartbeat, optional reliable storage with redundant image servers sitting outside the cluster, redundant routers (Router 1 and Router 2) and network connections, and the compute nodes.]
other hand, the HAS architecture focuses on client/server applications that run over the web and are characterized by short transactions, short response times, a thin control path, and static data delivery.
5.14.6 Failure Discovery and Recovery Mechanisms
Both the HA-OSCAR architecture and the HAS architecture support failure detection and recovery mechanisms. However, these mechanisms target different system components with different failure detection and recovery times. The failure recovery in HA-OSCAR takes between 3 seconds and 5 seconds [148], compared to a failure recovery ranging between 200 ms and 700 ms in the HAS cluster, depending on the type of failure.
5.15 HAS Architecture Impact on Industry
The Open Source Development Labs (OSDL) is a not-for-profit organization founded in 2000 by IT and telecommunication companies to accelerate the growth and adoption of Linux-based platforms and standardized platform architectures. The Carrier Grade Linux (CGL) initiative at OSDL aims to standardize the architecture of telecommunication servers and enhance the Linux operating system for such platforms.
The CGL Working Group has identified three main categories of application areas into which they
expect the majority of applications implemented on CGL platforms to fall. These application areas
include gateways, signaling, and management servers, and have different characteristics. A gateway,
for instance, processes a large number of small requests that it receives and transmits them over a
large number of physical interfaces. Gateways perform in a timely manner, close to hard real time.
Signaling servers require soft real time response capabilities, and manage tens of thousands of
simultaneous connections. A signaling server application is context switch and memory intensive,
because of the quick switching and capacity requirements to manage large numbers of connections.
Management applications are data and communication intensive. Their response time requirements
are less stringent compared to those of signaling and gateway applications.
Figure 128: The CGL cluster architecture based on the HAS architecture
The OSDL released version 2.0 of the Carrier Grade specifications in October 2003. Version 2.0 of
the specifications introduced support for clustering requirements and the cluster architecture is based
on the work presented in this dissertation. Figure 128 illustrates the CGL architecture, which is based
on the HAS architecture. In June 2005, the OSDL released version 3.1 of the specification. The
Carrier Grade architecture is a standard for the type of communication applications presented earlier.
Chapter 6
Contributions, Future Work, and Conclusion
This chapter presents the contributions of the work, future work, and the conclusions.
6.1 Contributions
The initial goal of this dissertation was to design and prototype the necessary technology to demonstrate the feasibility of a web cluster architecture that is highly available and able to scale linearly for up to 16 processors to meet increasing web traffic.
We achieved our goal with the HAS architecture, which supports continuous service through its high availability capabilities and provides close to linear scalability through the combination of multiple parameters, including efficient traffic distribution, the cluster virtual IP layer, and the connection synchronization mechanism. Figure 129 provides an illustration of other contributions grouped into distinct areas: application availability, network availability, data availability, master node availability, connection synchronization, the single cluster IP interface, and traffic distribution. Since the HAS architecture prototype follows the building block approach, these contributions can be reused in different environments outside of the HAS architecture and can function completely independently outside of a cluster environment.
The HAS architecture is based on loosely coupled nodes and provides a building block approach for designing and implementing software components that can be reused in other environments and architectures. It provides the infrastructure for cluster membership, cluster storage, fault management, recovery mechanisms, and traffic distribution. It supports various redundancy models for each tier of the architecture and allows seamless software and hardware upgrades without interruption of service. In addition, the HAS architecture is able to maintain the baseline performance for up to 18 cluster nodes (16 traffic nodes), validating close to linear scaling.
The HAS architecture integrates these contributions within a framework that allows us to build scalable and highly available web clusters. The following sub-sections examine these contributions.
redundancy model, the architecture does not force us to deploy traffic nodes in pairs. As a result, we
can deploy exactly the right number of traffic nodes to meet our traffic demands without having
traffic nodes sitting idle. In addition, we are able to scale each tier of the architecture independently of
the other tiers.
6.1.5 High Availability
The architecture tiers support two essential redundancy models: the 1+1 (active/standby and active/active) and the N-way redundancy models. As a result, the HAS cluster architecture achieves high availability through redundancy at various levels of the architecture: network, processors, application servers, and data storage. This allows us to perform actions such as reconfiguring network settings and upgrading the hardware, the software, and the operating system without service downtime. Other areas of contribution include mechanisms to detect and recover from Ethernet failures, master node failures, NFS failures, and application (web server software) failures.
more in the weeks and months after. This public rush is an indication of the community of users who are using, testing, and deploying the HA-OSCAR architecture for their specific needs. It is also important to note that there is an active community of users on the HA-OSCAR project discussion board and mailing list. The HA-OSCAR architecture is based on the HAS architecture and provides an open source implementation that is freely available for download with a substantial user community.
Section 5.15 described our contributions to the Carrier Grade specifications that define an architectural model for telecommunication platforms providing voice and data communication services. The Carrier Grade architecture model is an industry standard largely based on the HAS architecture.
rsync utility provides the synchronization between the two NFS servers. Table 17 lists the Linux kernel files modified to support the NFS redundancy.
The implementation of the HA NFS server is stable; however, it requires upgrading to the latest stable Linux kernel release, version 2.6.
Furthermore, the HA NFS implementation requires a new implementation of the mount program to support mounting multi-host NFS servers, instead of a single file server mount. This functionality is provided: the addresses of the two redundant NFS servers are passed as parameters to the new mount program, and then to the kernel. The new command line for mounting two NFS servers looks as follows:
% mount -t nfs server1,server2:/nfs_mnt_point
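As an illustration of the extended device-string syntax (this is a hypothetical sketch, not the thesis's actual mount implementation), the new mount program has to split the server1,server2:/export form into the two redundant server addresses and the exported path before handing them to the kernel:

#include <stdio.h>
#include <string.h>

/* Split "server1,server2:/export/path" into its three parts.
 * Returns 0 on success, -1 if the string is not in that form. */
static int parse_multihost_nfs(const char *spec,
                               char *srv1, char *srv2, char *path,
                               size_t len)
{
    char buf[256];
    if (strlen(spec) >= sizeof(buf))
        return -1;
    strcpy(buf, spec);

    char *colon = strchr(buf, ':');
    char *comma = strchr(buf, ',');
    if (!colon || !comma || comma > colon)
        return -1;

    *comma = '\0';
    *colon = '\0';
    snprintf(srv1, len, "%s", buf);           /* primary NFS server   */
    snprintf(srv2, len, "%s", comma + 1);     /* secondary NFS server */
    snprintf(path, len, "%s", colon + 1);     /* exported directory   */
    return 0;
}

int main(void)
{
    char s1[64], s2[64], p[64];
    if (parse_multihost_nfs("server1,server2:/nfs_mnt_point",
                            s1, s2, p, sizeof(s1)) == 0)
        printf("primary=%s secondary=%s export=%s\n", s1, s2, p);
    return 0;
}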
port. When the link goes up again, the daemon waits to make sure the connection does not drop again,
and then switches back to the primary Ethernet port.
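The failback wait can be pictured as a simple stabilization loop. The sketch below is illustrative only; it assumes the daemon can poll link state through the Linux carrier flag under /sys/class/net and that a hypothetical switch_active_port() performs the actual switchover, and the poll interval and stabilization window are made-up values rather than the daemon's real settings.

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define POLL_INTERVAL_SEC 1    /* illustrative polling period                 */
#define STABLE_WINDOW_SEC 10   /* link must stay up this long before failback */

/* Read the carrier flag of an interface: returns true when the link is up. */
static bool link_is_up(const char *ifname)
{
    char path[128];
    int  carrier = 0;

    snprintf(path, sizeof(path), "/sys/class/net/%s/carrier", ifname);
    FILE *f = fopen(path, "r");
    if (!f)
        return false;
    if (fscanf(f, "%d", &carrier) != 1)
        carrier = 0;
    fclose(f);
    return carrier == 1;
}

/* Placeholder for the real switchover performed by the Ethernet daemon. */
static void switch_active_port(const char *ifname)
{
    printf("failing back to primary port %s\n", ifname);
}

/* Wait until the primary link has stayed up for a full stabilization
 * window, then switch traffic back to it; any drop restarts the wait. */
void failback_when_stable(const char *primary)
{
    int stable = 0;

    while (stable < STABLE_WINDOW_SEC) {
        sleep(POLL_INTERVAL_SEC);
        if (link_is_up(primary))
            stable += POLL_INTERVAL_SEC;
        else
            stable = 0;
    }
    switch_active_port(primary);
}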
In addition to this contribution, smaller supporting contributions included fixes and rewrites of the Ethernet device driver; all of these supporting contributions are now integrated into the original Ethernet device driver code in the Linux kernel.
Further improvements to the current implementation include stabilizing the source code and optimizing the performance of the Ethernet daemon, which includes optimizing the failure detection time of the Ethernet driver. In addition, the source code of the Ethernet redundancy daemon is to be upgraded to run on the latest release of the Linux kernel, version 2.6.
6.1.9.6 Cluster virtual IP Interface (CVIP)
The CVIP interface is a cluster virtual IP interface that presents the HAS cluster as a single entity to
the outside world, making all nodes inside the cluster transparent to end users. Section 4.21 discusses
the CVIP interface. Sections 6.2.7 and 6.2.8 discuss the future work items for CVIP.
6.1.9.7 LDirectord
The improvements and adaptations to the LDirectord module include capabilities to connect to the
traffic manager and the traffic client. The current implementation is not fully optimized; rather, the
implementation is a working prototype that requires further testing and stabilization. Furthermore, as
future work, we would like to minimize the number of sequential steps to improve the performance,
and investigate the possibility of integrating the LDirectord module with the traffic client running on
the traffic node.
- We ported the Apache and Tomcat web servers to support IPv6 and performed benchmarking tests to compare with benchmarking tests of Apache and Tomcat running over IPv4.
- As part of the dissertation, we needed a flexible cluster installation infrastructure that would help us build and set up clusters within hours instead of days, and that would accommodate nodes with disks, diskless nodes, and network boot. This infrastructure did not exist, and we had to design and build it from scratch. This cluster installation infrastructure is now being used at the Ericsson Research lab in Montréal, Canada.
- Other contributions include the influence of the work on the industry. The Carrier Grade Linux specifications are industry standards with a defined architecture for telecom platforms and applications running on telecom servers in mission critical environments. The Carrier Grade architecture relies on the work proposed in this thesis, with minor modifications to accommodate specific types of telecommunication applications. The author of the thesis is publicly recognized as a contributor to the Carrier Grade specification. Furthermore, since January 2005, he has been employed by the OSDL to focus on advancing the specifications and the architecture.
traffic distribution mechanism. The goal of this activity is to investigate the source of bottlenecks and explore solutions.
We expect to achieve higher performance levels when both master nodes receive incoming traffic and forward it to traffic nodes. Furthermore, we would like to benchmark the HAS architecture prototype using specialized storage nodes and compare the results to those obtained when using the HA NFS implementation to provide storage. These tests will give us insights into the most efficient storage solution.
6.2.4 Redundancy Configuration Manager
The current prototype of the HAS architecture does not support dynamic changes to the redundancy configurations, nor transitioning from one redundancy configuration to another. This feature would be very useful when the nodes reach a certain pre-defined threshold: the redundancy configuration manager would then, for example, transition the HA tier from the 1+1 active/standby model to the 1+1 active/active model, allowing both master nodes to share and service incoming traffic. Such a transition in the current HAS architecture prototype requires stopping all services on master nodes, updating the configuration files, and restarting all software components running on master nodes.
The redundancy configuration manager would be the entity responsible for switching the redundancy configuration of the cluster tiers from one redundancy model to another, as illustrated in the sketch below. For instance, when the SSA tier is in the N+M redundancy model, the redundancy configuration manager would be responsible for activating a standby traffic node when an active traffic node becomes unavailable. As such, the configuration manager should be aware of the active traffic nodes, the states of their components, and their corresponding standby traffic nodes.
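To make the proposal concrete, the fragment below sketches one possible representation of the tier redundancy models and of the standby-activation step in the N+M case. It is an illustrative design sketch only, since no such component exists in the current prototype, and all type and function names are invented for the example.

#include <stdbool.h>
#include <stddef.h>

/* Redundancy models supported by the HAS tiers (see Chapter 4). */
typedef enum {
    RED_1PLUS1_ACTIVE_STANDBY,
    RED_1PLUS1_ACTIVE_ACTIVE,
    RED_N_WAY,
    RED_N_PLUS_M
} redundancy_model_t;

typedef struct {
    int  id;
    bool active;      /* currently serving traffic */
    bool available;   /* healthy and reachable     */
} tnode_t;

typedef struct {
    redundancy_model_t model;
    tnode_t           *nodes;
    size_t             count;
} tier_config_t;

/* In the N+M model, replace a failed active node with a standby one.
 * Returns true if a standby was activated. */
bool rcm_handle_node_failure(tier_config_t *tier, int failed_id)
{
    if (tier->model != RED_N_PLUS_M)
        return false;

    for (size_t i = 0; i < tier->count; i++)
        if (tier->nodes[i].id == failed_id)
            tier->nodes[i].active = tier->nodes[i].available = false;

    for (size_t i = 0; i < tier->count; i++) {
        if (!tier->nodes[i].active && tier->nodes[i].available) {
            tier->nodes[i].active = true;   /* promote the standby */
            return true;
        }
    }
    return false;   /* no standby left: raise an alarm instead */
}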
6.2.7 Merging the Functionalities of the CVIP and Traffic Management Scheme
One possibility for further investigation is to couple the functionalities of the cluster virtual IP interface with the traffic management scheme. With the current implementation, incoming traffic arrives at the cluster through the cluster virtual IP interface and is then handled by the traffic manager before it reaches its final destination on one of the traffic nodes. A future enhancement is to eliminate the traffic management scheme and incorporate traffic distribution within the CVIP interface. Eliminating the traffic manager daemon and integrating its functionalities with the CVIP would result in increased performance and a faster response time, as we eliminate one serialized step in managing an incoming request.
In short, our proposal is to combine the functionalities of the cluster virtual IP interface and the traffic distribution mechanism to eliminate a forwarding step between web users and the application server running the web server.
have two main challenges in this area: the first is to provide the virtualization of the cluster zones, and the second is the ability to dynamically migrate cluster nodes among several zones based on traffic trends.
[Figure 131 diagram: Three snapshots (1, 2, 3) of a HAS cluster with a cluster VIP, master nodes A and B, and HTTP, FTP, streaming, and storage nodes connected to LAN 1 and LAN 2, showing nodes being moved from the HTTP cluster zone to the FTP cluster zone.]
Figure 131 illustrates the concept of cluster zones. The cluster in the figure consists of three zones: one provides HTTP service, another provides FTP service, and the third provides streaming service. In (1), the FTP cluster zone is receiving traffic, while some nodes in the HTTP cluster zone are sitting idle due to low traffic. The traffic manager running on the master node disconnects (2) two nodes from the HTTP cluster zone and transitions (3) them into the FTP cluster zone to accommodate the increase in FTP traffic. There are several possible areas of investigation, such as defining cluster zones as logical entities in a larger cluster, dynamically selecting nodes to be part of a specialized cluster zone, transitioning nodes into a new zone, and investigating queuing theories suitable for such usage models.
6.3 Conclusion
This dissertation covers a range of technologies for highly available and scalable web clusters. It addresses the challenges of designing a scalable and highly available web server architecture that is flexible, component-based, reliable, and robust under heavy loads.
The first chapter provides a background on Internet and web servers, scalability challenges, and
presents the objectives and scope of the study.
The second chapter looks at clustering technologies, scalability challenges, and related work. We
examine clustering technologies and techniques for designing and building Internet and web servers.
We argue that traditional standalone server architectures fail to address the scalability and high availability needs of large-scale Internet and web servers. We introduce software and hardware clustering technologies, their advantages and drawbacks, and discuss our experience prototyping a highly available and scalable clustered web server platform. We present and discuss the various
ongoing research projects in the industry and academia, their focus areas, results, and contributions.
In the third chapter, the thesis summarizes the preparatory technical work with the prototyped web
cluster that uses existing components and mechanisms.
Chapter four presents and discusses the HAS architecture, its components and their characteristics,
eliminating single points of failure, the conceptual, physical, and scenario architecture views,
redundancy models, cluster virtual interface and traffic distribution scheme. The HAS architecture
consists of a network of server nodes connected over highly available networks. A virtual IP interface
provides a single point of entry to the cluster. The software and hardware components of the
architecture do not present a SPOF. The HAS cluster architecture supports multiple redundancy models for each of its tiers, allowing the most suitable redundancy model to be chosen for each specific deployment scenario. The HAS architecture manages incoming traffic from web clients through a lightweight, efficient, and dynamic traffic distribution scheme that takes into consideration the capacity of each traffic node. Based on the performance testing we conducted, this approach has proven to be an effective method to distribute traffic.
In chapter five, we validate the scalability of the architecture. Our results demonstrate that the HAS
architecture is able to reach close to linear scaling for up to 16 processors and attain high performance
levels with robust behavior under heavy load. In addition, the chapter presents the results of the availability validation, which tests the availability features of the HAS architecture.
The final chapter illustrates the contributions and future work.
The HAS architecture brings together aspects of high availability, concurrency, dynamic resource
management and scalability into a coherent framework. Our experience and evaluation of the
architecture demonstrate that the approach is an effective way to build robust, highly available, and
scalable web clusters. We have developed an operational prototype based on the HAS architecture;
the prototype focused on building a proof-of-concept for the HAS architecture that consists of a set of
necessary system software components.
The HAS architecture relies on the integration of many system components into a well-defined and generic cluster platform. It provides the infrastructure for a cluster membership service that recognizes and manages node membership in the cluster, a cluster storage service, a fault management service that recognizes hardware and software faults and triggers recovery mechanisms, and a traffic distribution service that distributes incoming traffic across the nodes in the cluster. The HAS architecture represents a new design point for large-scale Internet and web servers that supports scalability, high availability, and high performance.
Bibliography
[1] K. Coffman, A. Odlyzko, The Growth Rate of the Internet, Technical Report, First Monday,
Volume 3 Number 10, October 1998, http://www.firstmonday.dk/issues/issue3_10/coffman
[2] E. Brynjolfsson, B. Kahin, Understanding the Digital Economy: Data, Tool, and Research, MIT
Press, October 2000
[6] E. Hansen, Email outage takes toll on Excite@Home, CNET News.com, June 28, 2000,
http://news.cnet.com/news/0-1005-200-2167721.html
[7] Bloomberg News, E*Trade hit by class-action suit, CNET News.com, February 9, 1999,
http://news.cnet.com/news/0-1007-200-338547.html
[8] W. LeFebvre, Facing a World Crisis, Invited talk at the 15th USENIX LISA System
Administration Conference, San Diego, California, USA, December 2-7, 2001
[9] British Broadcasting Corporation, Net surge for news sites, September 2001,
http://news.bbc.co.uk/hi/english/sci/tech/newsid_1538000/1538149.stm
[10] R. Lemos, Web worm targets White House, CNET News.com, July 2001,
http://news.com.com/2100-1001-270272.html
[11] The Hyper Text Transfer Protocol Standardization at the W3C, http://www.w3.org/Protocols
[13] J. Nielsen, The Need for Speed, Technical Report, March 1997,
http://www.useit.com/alertbox/9703a.html
[16] Inktomi Corporation, Web surpasses one billion documents, Press Release, January 2000
http://www.inktomi.com/new/press/2000/billion.html
[17] A. T. Saracevic, Quantifying the Internet, San Francisco Examiner, November 5, 2000,
http://www.sfgate.com
[28] G. Pfister, In Search of Clusters, Second Edition, Prentice Hall PTR, 1998
[29] The Open Group, The UNIX® Operating System: A Robust, Standardized Foundation for
Cluster Architectures, White Paper, June 2001, http://www.unix.org/whitepapers/cluster.htm
[30] I. Haddad, E. Paquin, MOSIX: A Load Balancing Solution for Linux Clusters, Linux Journal,
May 2001
[34] M. J. Brim, T. G. Mattson, and S. L. Scott, OSCAR: Open Source Cluster Application
Resources, Ottawa Linux Symposium 2001, Ottawa, Canada, July 2001
[35] J. Hsieh, T. Leng, and Y.C. Fang, OSCAR: A Turnkey Solution for Cluster Computing, Dell
Power Solutions, Issue 1, 2001, pp. 138-140
[43] Trillium Digital Systems, Distributed Fault-Tolerant and High-Availability Systems White Paper, http://www.trillium.com
[45] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC vs. LSNAT: Scalable
cluster-based Web servers, IEEE Cluster Computing, November 2000, pp. 175-185
[46] O. Damani, P. Chung, Y. Huang, C. Kintala, and Y. M. Wang, ONE-IP: Techniques for
Hosting a Service on a Cluster of Machines, IEEE Computer Networks, Volume 29, Numbers 8-
13, September 1997, pp. 1019-1027
[48] RFC 2391, Load Sharing using IP Network Address Translation (LSNAT),
http://www.faqs.org/rfcs/rfc2391.html
[49] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC and LSNAT: Two Approaches for Cluster-Based Scalable Web Servers, IEEE International Conference on Communications, June 2000, pp. 1164-1168
[56] M. Williams, EBay, Amazon, Buy.com hit by Internet attacks, Network World, February 9,
2000, http://www.nwfusion.com/news/2000/0209attack.html
[57] G. Sandoval and T. Wolverton, Leading Web Sites Under Attack, News.com, February 9,
2000, http://news.cnet.com/news/0-1007-200-1545348.html
[59] D. LaLiberte, and A. Braverman, A Protocol for Scalable Group and Public Annotations,
Computer Networks and ISDN Systems, Volume 27, Number 6, January 1995, pp. 911-918
[63] L. Aversa, and A. Bestavros, Load Balancing a Cluster of Web Servers Using Distributed
Packet Rewriting, Proceedings of IEEE International Performance Conference, Phoenix, Arizona,
USA, February 2000, pp. 24-29
[64] S. N. Budiarto, and S. Nishio, MASEMS: A Scalable and Extensible Multimedia Server, The
1999 International Symposium on Database Applications in Non-Traditional Environments,
Kyoto, Japan, November 1999, pp. 28-30
[65] C. Roe, and S. Gonik, Server-Side Design Principles for Scalable Internet Systems, IEEE
Software, Volume 19, Number 2, March/April 2002, pp. 34-41
[66] D. Norman, The Design of Everyday Things, Double-Day, New York, 1998
[67] D. Dias, W. Kish, R. Mukherjee, and R. Tewari, A Scalable and Highly Available Web
Server, Proceedings of the Forty-First IEEE Computer Society International Conference:
Technologies for the Information Superhighway, Santa Clara, California, USA, February 25-28,
1996, pp. 85-92
[68] E. Casalicchio, and S. Tucci, Static and Dynamic Scheduling Algorithms for Scalable Web
Server Farm, IEEE Network 2001, pp. 368-376
[70] H. Bryhni, E. Klovning, and O. Kure, A Comparison of Load Balancing Techniques for
Scalable Web Servers, IEEE Network, July/August 2000, pp. 58-64
[71] D. Kim, C. H. Park, and D. Park, Request Rate Adaptive Dispatching Architecture for
Scalable Internet Server, IEEE International Conference on Cluster Computing, Chemmnitz,
2000, pp. 289-296
[72] L. Aversa, and A. Bestavros, Load Balancing a Cluster of Web Servers Using Distributed
Packet Rewriting, Proceedings of the 2000 IEEE International Performance, Computing, and
Communications Conference, February 2000, pp. 24 - 29
[73] B. Ramamurthy, LSMAC vs. LSNAT: Scalable Cluster-based Web Servers, Seminar presented
at Rice University, http://www-ece.rice.edu/ece/colloq/00-01/Oct23br-00.html, October 23, 2000
[74] A. N. Murad, and H. Liu, Scalable Web Server Architectures, Technical Report BL0314500-
961216TM, Bell Labs, Lucent Technologies, December 1996
[75] E. D. Katz, M. Butler, and M. McGrath, A Scalable HTTP Server: The NCSA Prototype,
Proceedings of the 1st International WWW Conference, Geneva, Switzerland, May 25-27, 1994,
pp. 155-164
[77] D. Kim, C. H. Park, and D. Park, Request Rate Adaptive Dispatching Architecture for
Scalable Internet Server, IEEE Network 2000, pp. 289-296
[78] D. Anderson, T. Yang, V. Holmedahl, and O. Jbarra, SWEB: Towards a Scalable World Wide
Web Server on Multicomputers, Proceedings of the 10th International Parallel Processing
Symposium, Honolulu, Hawaii, USA, April 15-19, 1996, pp. 850-856
[79] E. Casalicchio, and M. Colajanni, Scalable Web Clusters with Static and Dynamic Contents,
Proceedings of the IEEE Conference on Cluster Computing, Chemnitz, Germany, November 28 –
December 1, 2000, pp. 170-177
[80] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC and LSNAT: Two
Approaches for Cluster-based Scalable Web Servers, Proceedings of the 2000 IEEE International
Conference on Communications, New Orleans, USA, June 18-22, 2000, pp. 1164-1168
[81] X. Zhang, M. Barrientos, B. Chen, and M. Seltzer, HACC: An Architecture for Cluster-Based
Web Servers, Proceedings of the 3rd USENIX Windows NT Symposium, Seattle, Washington,
USA, July 12-15, 1999, pp. 155-164
[88] A. Tucker, and A. Gupta, Process Control and Scheduling Issues for Multiprogrammed
Shared-Memory Multiprocessors, Proceedings of the 12th Symposium on Operating Systems
Principles, ACM, Litchfield Park, Arizona, USA, December 1989, pp. 159-166
[90] Microsoft Developer Network Platform SDK, Performance Data Helper, Microsoft, July
1998
[91] W. B. Ligon III, and R. Ross, Server-Side Scheduling in Cluster Parallel I/O Systems, The
Calculateurs Parallèles Journal, October 2001
[92] W.B. Ligon III, and R. Ross, PVFS: Parallel Virtual File System, Beowulf Cluster
Computing with Linux, MIT Press, November 2001, pp. 391-430
[93] B. Nishio, and S. Nishio, MASEMS: A Scalable and Extensible Multimedia Server, IEEE
Network 2000, pp. 443-450
[94] A. Mourad, and H. Liu, Scalable Web Server Architectures, Proceedings of IEEE
International Symposium on Computers and Communications, Alexandria, Egypt, July 1997, pp.
12-16
[95] M. Andreolini, V. Cardellini, and M. Colajanni, Benchmarking, Models and Tools for
Distributed Web-Server System, Proceedings of the Performance 2002, Rome, Italy, July 24-26,
2002, pp. 208-235
[98] B. Laurie, P. Laurie, and R. Denn, Apache: The Definitive Guide, O'Reilly & Associates,
1999
[103] I. Haddad, W. Hassan, and L. Tao, XWPT: An X-based Web Servers Performance Tool, the
18th International Conference on Applied Informatics, Innsbruck, Austria, February 2000, pp. 50-
55
[106] A. Ching, A. Choudhary, W. Liao, R. Ross, and W. Gropp, Noncontiguous I/O through
PVFS, Proceedings of the 2002 IEEE International Conference on Cluster Computing, September
23-26, 2002, Chicago, Illinois, USA, pp. 405-414
[107] I. Haddad, and M. Pourzandi, Open Source Web Servers Performance on Carrier-Class
Linux Clusters, Linux Journal, April 2001, pp. 84-90
[108] I. Haddad, PVFS: A Parallel Virtual File System for Linux Clusters, Linux Journal,
December 2000
[109] P. Barford, and M. Crovella, Generating Representative Web Workloads for Network and
Server Performance Evaluation, Proceedings of the ACM Sigmetrics Conference, Madison,
Wisconsin, USA, June 1998, pp. 151-160
[114] M. Andreolini, V. Cardellini, and M. Colajanni, Benchmarking Models and Tools for
Distributed Web-Server Systems, Proceedings of Performance 2002, Rome, Italy, July 24-26,
2002, pp. 208-235
[115] E. Marcus, and H. Sten, BluePrints for High Availability: Designing Resilient Distributed
Systems, Wiley, 2000
[117] I. Haddad, C. Leangsuksun, R. Libby, and S. Scott, HA-OSCAR: Towards Non-stop Services
in High End and Grid computing Environments, Poster Presentation at the Fifth Los Alamos
Computer Science Institute Symposium, New Mexico, USA, October 12-14, 2004
[118] The Open Cluster Group, How to Install an OSCAR Cluster, Technical Report, November 3,
2005, http://oscar.openclustergroup.org/public/docs/oscar4.2/oscar4.2-install.pdf
[119] A. S. Tanenbaum, and M. van Steen, Distributed Systems: Principles and Paradigms, Prentice
Hall, July 2001, pp. 371-375
[126] L. Marowsky-Brée, A New Cluster Resource Manager for Heartbeat, UKUUG LISA/Winter
Conference High Availability and Reliability, Bournemouth, UK, February 2004
[127] A. L. Robertson, The Evolution of the Linux-HA Project, UKUUG LISA/Winter Conference
High-Availability and Reliability, Bournemouth, UK, February 25-26, 2004
[128] A. L. Robertson, Linux-HA Heartbeat Design, Proceedings of the 4th International Linux
Showcase and Conference, Atlanta, October 10-14, 2000
[129] S. Horman, Connection Synchronisation (TCP Fail-Over), Technical Paper, November 2003
[131] D. Gordon, and I. Haddad, Apache talking IPv6, Linux Journal, January 2003
[133] I. Haddad, IPv6 on Linux: Ongoing Development Effort and Tutorial, Linux User and
Developer, June 2003
[134] D. Gamerman, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference,
Chapman & Hall/CRC Press, Boca Raton, Fl., USA, 1997
[135] G. Ciardo, and P. Darondeau, Applications and Theory of Petri Nets 2005, 26th International
Conference, ICATPN 2005, Miami, USA, June 20-25, 2005
[136] G. Ciardo, J. Muppala, and K. Trivedi, SPNP: Stochastic Petri Net Package, Proceedings of
the International Workshop on Petri Nets and Performance Models, IEEE Computer Society
Press, Los Alamitos, Ca., USA, December 1989, pp. 142-150
[137] H. Choi, Markov Regenerative Stochastic Petri Nets, Computer Performance Evaluation,
Vienna 1994, pp. 337-357
[138] C. Hirel, R. Sahner, X. Zang, and K. S. Trivedi, Reliability and Performability Modeling
using SHARPE 2000, Computer Performance Evaluation/TOOLS 2000, Schaumburg, US, March
2000, pp. 345-349
[139] C. Hirel, B. Tuffin, and K. S. Trivedi, SPNP: Stochastic Petri Nets Version 6.0, Computer
Performance Evaluation/TOOLS 2000, Schaumburg, US, March 2000, pp. 354-357
[140] C. Leangsuksun, L. Shen, T. Liu, H. Song, and S. Scott, Availability Prediction and Modeling
of High Availability OSCAR Cluster, IEEE International Conference on Cluster Computing, Hong
Kong, China, December 2-4, 2003, pp. 227-230
[142] C. Leangsuksun, L. Shen, T. Liu, H. Song, and S. Scott, Dependability Prediction of High
Availability OSCAR Cluster Server, The 2003 International Conference on Parallel and
Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA, 2003, pp. 23-26
[143] C. Leangsuksun, L. Shen, T. Lui, and S. L. Scott, Achieving High Availability and
Performance Computing with an HA-OSCAR Cluster, Future Generation Computer System,
Volume 21, Number 1, January 2005, pp. 597-606
[145] C. Leangsuksun, L. Shen, H. Song, S. Scott, and I. Haddad, The Modeling and Dependability
Analysis of High Availability OSCAR Cluster System, The 17th Annual International Symposium
on High Performance Computing Systems and Applications, Sherbrooke, Quebec, Canada, May
11-14, 2003
[146] I. Haddad, C. Leangsuksun, R. Libby, T. Liu, Y. Liu, and S. Scott, Highly Reliable Linux
HPC Clusters: Self-awareness Approach, Proceedings of the 2nd International Symposium on
Parallel and Distributed Processing and Applications, Hong Kong, China, December 13-15, 2004,
pp. 217-222
[148] I. Haddad, and C. Leangsuksun, Building Highly Available HPC Clusters with HA-OSCAR,
Tutorial Presentation, the 6th LCI International Conference on Clusters: The HPC Revolution
2005, Chapel Hill, NC, USA, April 2005
[149] I. Haddad, and G. Butler, Experimental Studies of Scalability in Clustered Web Systems,
Proceedings of the International Parallel and Distributed Processing Symposium 2004, Santa Fe,
New Mexico, USA, April 2004
[150] I. Haddad, Keeping up with Carrier Grade, Linux Journal, August 2004
[151] I. Haddad, Carrier Grade Server Requirements, Linux User and Developer, August 2004
[152] I. Haddad, Moving Towards Open Platforms, LinuxWorld Magazine, May 2004
[153] I. Haddad, Linux Gains Momentum in Telecom, LinuxWorld Magazine, May 2004
[154] I. Haddad, OSDL Carrier Grade Linux, O'Reilly Network, April 2004
[155] I. Haddad, CGL Platforms: Characteristics and Development Efforts, Euro-Par 2003,
Klagenfurt, Austria, August 2003
[156] I. Haddad, C. Leangsuksun, M. Pourzandi, and A. Tikotekar, Feasibility Study and Early
Experimental Results Toward Cluster Survivability, Proceedings of Cluster Computing and Grid
2005, Cardiff, UK, May 9-12, 2005
[157] D. Gordon, and I. Haddad, Building an IPv6 DNS Server Node, Linux Journal, October 2003
[158] I. Haddad, C. Leangsuksun, M. Pourzandi, and A. Tikotekar, Experimental Results in
Survivability of Secure Clusters, Proceedings of the 6th International Conference on Linux
Clusters, Chapel Hill, NC, USA 2005
[159] I. Haddad, IPv6: The Essentials You Must Know, Linux User and Developer, May 2003
[160] I. Haddad, Using Freenet6 Service to Connect to the IPv6 Internet, Linux User and
Developer, July 2003
[161] I. Haddad, C. Leangsuksun, R. Libby, T. Liu, Y. Liu, and S. L. Scott , High-Availability and
Performance Clusters: Staging Strategy, Self-Healing Mechanisms, and Availability Analysis,
Proceedings of the IEEE Cluster Conference 2004, San Diego, USA, September 20-23, 2004
[162] I. Haddad, Streaming Video on Linux over IPv6, Linux User and Developer, August 2003
[163] I. Haddad, Voice over IPv6 on Linux, Linux User and Developer, September 2003
[164] I. Haddad, NAT-PT: IPv4/IPv6 and IPv6/IPv4 Address Translation, Linux User and
Developer, October 2003
[165] I. Haddad, Design and Implementation of HA Linux Clusters, IEEE Cluster 2001, Newport
Beach, USA, October 8-11, 2001
[166] I. Haddad, Designing Large Scale Benchmarking Environments, ACM Sigmetrics 2002,
Marina Del Rey, USA, June 2002
[167] I. Haddad, Supporting IPv6 on Linux Clusters, IEEE Cluster 2002, Chicago, USA, September
2002
[168] I. Haddad, IPv6: Characteristics and Ongoing Research, Internetworking 2003, San Jose,
USA, June 2003
[169] I. Haddad, CGL Platforms: Characteristics and Development Efforts, Euro-Par 2003,
Klagenfurt, Austria, August 2003
[170] I. Haddad, Carrier Grade Linux: Status and Ongoing Work, Real World Linux 2004,
Toronto, Canada, April 2004
[171] I. Haddad, Carrier Grade Platforms: Characteristics and Ongoing Efforts, ICETE 2004,
Setúbal, Portugal, August 2004
[172] I. Haddad, and C. Leangsuksun, Building HA/HPC Clusters with HA-OSCAR, Tutorial
Presentation at the IEEE Cluster Conference, San Diego, USA, September 2004
[173] I. Haddad, C. Leangsuksun, and S. Scott, Towards Highly Available, Scalable, and Secure
HPC Clusters with HA-OSCAR, the 6th International Conference on Linux Clusters, Chapel Hill,
NC, USA, April 2005
[174] I. Haddad, and S. Scott, HA Linux Clusters: Towards Platforms Providing Continuous
Service, Linux Symposium, Ottawa, Canada, July 2005
[175] I. Haddad, and C. Leangsuksun, HA-OSCAR: Highly Available Linux Cluster at your
Fingertips, IEEE Cluster 2005, Boston, USA, September 2005
[176] I. Haddad, HA Linux Clusters, Open Cluster Group 2001, Illinois, USA, March 2001
[177] I. Haddad, Combining HA and HPC, Open Cluster Group 2002, Montréal, Canada, June 2002
[178] I. Haddad, Towards Carrier Grade Linux Platforms, USENIX 2004, Boston, USA, June 2004
[179] I. Haddad, Towards Unified Clustering Infrastructure, Linux World Expo and Conference,
San Francisco, USA, August 2005
[180] I. Haddad, Carrier Grade Linux: Status and Ongoing Work, Real World Linux 2004,
Toronto, Canada, April 2004
[181] I. Haddad, Carrier Grade Platforms: Characteristics and Ongoing Efforts, ICETE 2004,
Setúbal, Portugal, August 2004
[182] C. Leangsuksun, A Failure Predictive and Policy-Based High Availability Strategy for Linux
High Performance Computing Cluster, The 5th LCI International Conference on Linux Clusters:
The HPC Revolution 2004, Austin, USA, May 18-20, 2004
Glossary
The definitions of the terms appearing in this glossary are referenced from [183].
AAA Authentication, authorization, and accounting (AAA) is a term for a framework for
intelligently controlling access to computer resources, enforcing policies, auditing usage, and
providing the information necessary to bill for services. These combined processes are considered
important for effective network management and security. Authentication, authorization, and
accounting services are often provided by a dedicated AAA server, a program that performs these
functions. A current standard by which network access servers interface with the AAA server is the
Remote Authentication Dial-In User Service (RADIUS).
Active/active A redundancy configuration where all servers in the cluster run their own
applications but are also ready to take over for failed server if needed.
Active/standby A redundancy configuration where one server is running the application while
another server in the cluster is idle but ready to take over if needed.
Availability Availability is the amount of time that a system or service is provided in relation to
the amount of time the system or service is not provided. Availability is commonly expressed as a
percentage.
C++ C++ is an object-oriented programming language.
C C is a structured, procedural programming language that has been widely used for both
operating systems and applications and that has had a wide following in the academic community.
CGI The common gateway interface (CGI) is a standard way for a Web server to pass a
Web user's request to an application program and to receive data back to forward to the user.
Client/server Client/server describes the relationship between two computer programs in which one program, the client, makes a service request from another program, the server, which fulfills the request.
Cluster A cluster is a collection of cluster nodes that may change dynamically as nodes join or leave the cluster.
COTS Commercial off-the-shelf describes ready-made products that can easily be obtained.
Cluster Two or more computer nodes in a system used as a single computing entity to
provide a service or run an application for the purpose of high availability, scalability, and
distribution of tasks.
CMS Cluster Management System (CMS) is a management layer that allows the whole
cluster to be managed as a single entity.
DRAM Dynamic random access memory (DRAM) is the most common random access
memory (RAM) for personal computers and workstations.
DRBD Disk Replication Block Device
DNS The domain name system (DNS) is the way that Internet domain names are located and translated into IP addresses. A domain name is a meaningful and easy-to-remember "handle" for an Internet address.
DIMM A DIMM (dual in-line memory module) is a double SIMM (single in-line memory
module). Like a SIMM, it is a module containing one or several random access memory (RAM) chips
on a small circuit board with pins that connect it to the computer motherboard.
Failure The inability of a system or system component to perform a required function within
specified limits. A failure may be produced when a fault is encountered. Examples of failures include
invalid data being provided, slow response time, and the inability for a service to take a request.
Causes of failure can be hardware, firmware, software, network, or anything else that interrupts the
service.
FTP File Transfer Protocol (FTP) is a standard Internet protocol that defines one way of
exchanging files between computers on the Internet.
Gateways Gateways are bridges between two different technologies or administration domains.
A media gateway performs the critical function of converting voice messages from a native
telecommunications time-division-multiplexed network, to an Internet protocol packet-switched
network.
High Availability The state of a system having a very high ratio of service uptime compared to service downtime. Highly available systems are typically rated in terms of the number of nines, such as five nines or six nines.
HLR The Home Location Register (HLR) is the main database of permanent subscriber
information for a mobile network.
HTML Hypertext Markup Language (HTML) is the set of markup symbols or codes inserted
in a file intended for display on a World Wide Web browser page. The markup tells the web browser
how to display a web page's words and images for the user.
HTTP Hypertext Transfer Protocol (HTTP) is the set of rules for exchanging files (text,
graphic images, sound, video, and other multimedia files) on the World Wide Web.
IP The Internet Protocol (IP) is the method or protocol by which data is sent from one
computer to another on the Internet.
iptables iptables is a Linux command used to set up, maintain, and inspect the tables of IP packet filter rules in the Linux kernel. Several different tables may be defined; each table contains a number of built-in chains and may also contain user-defined chains. Each chain is a list of rules that can match a set of packets, and each rule specifies what to do with a packet that matches. This action is called a "target", which may be a jump to a user-defined chain in the same table.
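For illustration only (these commands are not taken from the thesis, and the back-end address is hypothetical), the following sketch shows how iptables rules might be added to the filter and nat tables:
    # Accept incoming HTTP traffic on the filter table's INPUT chain
    iptables -A INPUT -p tcp --dport 80 -j ACCEPT
    # Redirect incoming HTTP traffic to a back-end server (hypothetical address)
    iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination 192.168.1.10:80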
IPv6 Internet Protocol Version 6 (IPv6) is the latest version of the Internet Protocol. IPv6
is a set of specifications from the Internet Engineering Task Force (IETF) that was designed as an
evolutionary set of improvements to the current IP Version 4.
ISDN Integrated Services Digital Network (ISDN) is a set of standards for digital
transmission over ordinary telephone copper wire as well as over other media.
I/O I/O (input/output) describes any operation, program, or device that transfers data to or from a computer.
ISP Internet service provider (ISP) is a company that provides individuals and other
companies access to the Internet and other related services such as web site building and virtual
hosting.
LAN A local area network (LAN) is a group of computers and associated devices that
share a common communications line or wireless link and typically share the resources of a single
processor or server within a small geographic area.
MPP Massively Parallel Processing (MPP) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. Typically, MPP processors communicate using some messaging interface.
MP3 MP3 (MPEG-1 Audio Layer-3) is a standard technology and format for compressing a sound sequence into a very small file (about one-twelfth the size of the original file) while preserving the original level of sound quality when it is played.
MTTF Mean Time To Failure (MTTF) is the mean interval of time during which the system can provide service without failure.
MTTR Mean Time To Repair (MTTR) is the mean interval of time it takes to resume service after a failure has been experienced.
NAS Network-attached storage (NAS) is hard disk storage that is set up with its own
network address rather than being attached to the department computer that is serving applications to
a network's workstation users.
NAT NAT (Network Address Translation) is the translation of an Internet Protocol address
(IP address) used within one network to a different IP address known within another network. One
network is designated the inside network and the other is the outside.
Network A connection of nodes which facilitates communication among them. Usually, the connected nodes in a network use a well-defined network protocol to communicate with each other.
Network Protocols Rules for determining the format and transmission of data. Examples of network protocols include TCP/IP and UDP.
NIC A network interface card (NIC) is a computer circuit board or card that is installed in
a computer so that it can be connected to a network.
Node A single computer unit, in a network, that runs with one instance of a real or virtual operating system.
NTP Network Time Protocol (NTP) is a protocol that is used to synchronize computer
clock times in a network of computers.
OSI The Open Systems Interconnection (OSI) model defines a networking framework for implementing protocols in seven layers. Control is passed from one layer to the next, starting at the application layer in one station, proceeding to the bottom layer, over the channel to the next station, and back up the hierarchy.
Perl Perl is a script programming language that is similar in syntax to the C language and that includes a number of popular Unix facilities such as sed, awk, and tr.
PDA Personal digital assistant (PDA) is a term for any small mobile hand-held device that
provides computing and information storage and retrieval capabilities for personal or business use,
often for keeping schedule calendars and address book information handy.
Proxy Server A computer network service that allows clients to make indirect network connections
to other network services. A client connects to the proxy server, and then requests a connection, file,
or other resource available on a different server. The proxy provides the resource either by connecting
to the specified server or by serving it from a cache. In some cases, the proxy may alter the client's
request or the server's response for various purposes.
RAID Redundant array of independent disks (RAID) is a way of storing the same data in
different places (thus, redundantly) on multiple hard disks.
RTT Round-Trip Time (RTT) is the time required for a network communication to travel from the source to the destination and back. RTT is used by routing algorithms to aid in calculating optimal routes.
SAN Storage Area Network (SAN) is a high-speed special-purpose network (or sub-
network) that interconnects different kinds of data storage devices with associated data servers on
behalf of a larger network of users.
SCP A Service Control Point (SCP) is an entity in the intelligent network that implements the service control function, that is, operations that affect the recording, processing, transmission, or interpretation of data.
SCSI The Small Computer System Interface (SCSI) is a set of ANSI standard electronic interfaces that allow personal computers to communicate with peripheral hardware such as disk
drives, tape drives, CD-ROM drives, printers, and scanners faster and more flexibly than previous
interfaces.
Session A series of consecutive page requests to the web server from the same user.
Signaling Servers Signaling servers handle call control, session control, and radio resource control. A signaling server handles the routing and maintains the status of calls over the network. It takes the requests of user agents who want to connect to other user agents and routes them to the appropriate signaling server.
SLA Service Level Agreement (SLA) is a contract between a network service provider and
a customer that specifies, usually in measurable terms, what services the network service provider
will furnish.
SPOF Single point of failure (SPOF) is any component or communication path within a computer system whose failure would result in an interruption of the service.
SSI Single System Image (SSI) is a form of distributed computing in which multiple networks, distributed databases, or servers appear to the user as one system through a common interface. In SSI systems, all nodes share the operating system environment.
Standby Not currently providing service but prepared to take over the active state.
System A computer system that consists of one computer node or of many nodes connected via a computer network.
Switch-over The term switch-over is used to designate circumstances where the cluster moves the
active state of a particular component/node from one component/node to another, after the failure of
the active component/node. Switch-over operations are usually the consequence of administrative
operations or escalation of recovery procedures.
Tcl Tcl is an interpreted script language developed by Dr. John Ousterhout at the
University of California, Berkeley, and now developed and maintained by Sun Laboratories.
TCP TCP (Transmission Control Protocol) is a set of rules (protocol) used with the
Internet Protocol (IP) to send data in the form of message units between computers over the Internet.
While IP takes care of handling the actual delivery of the data, TCP takes care of keeping track of the
individual units of data (called packets) that a message is divided into for efficient routing through the
Internet.
TFTP Trivial File Transfer Protocol (TFTP) is an Internet software utility for transferring
files that is simpler to use than the File Transfer Protocol (FTP) but less capable. It is used where user
authentication and directory visibility are not required.
TTL Time-to-live (TTL) is a value in an Internet Protocol (IP) packet that tells a network
router whether the packet has been in the network too long and should be discarded.
QoS Quality of Service (QoS) is the idea that transmission rates, error rates, and other
characteristics can be measured, improved, and, to some extent, guaranteed in advance.
URI To paraphrase the World Wide Web Consortium, Internet space is inhabited by many
points of content. A URI (Uniform Resource Identifier; pronounced YEW-AHR-EYE) is the way you
identify any of those points of content, whether it be a page of text, a video or sound clip, a still or
animated image, or a program. The most common form of URI is the web page address, which is a
particular form or subset of URI called a Uniform Resource Locator (URL).
USB USB (Universal Serial Bus) is a plug-and-play interface between a computer and
add-on devices (such as audio players, joysticks, keyboards, telephones, scanners, and printers). With
USB, a new device can be added to your computer without having to add an adapter card or even
having to turn the computer off.
User An external entity that acquires service from a computer system. It can be a human
being, an external device, or another computer system.
Web Service Web services are loosely coupled software components delivered over Internet
standard technologies. A web service can also be defined as a self-contained, modular application that
can be described, published, located, and invoked over the web.