Haddad PhD Thesis
Ibrahim Haddad
March 2006
By: _______________________________________________________________________
Entitled:____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
complies with the regulations of the University and meets the accepted standards with respect to
originality and quality.
__________________________________________ Chair
__________________________________________ Examiner
__________________________________________ Examiner
Approved by
_________________________________________
Chair of Department or Graduate Program Director
____________2006 _______________________
Dr. Nabil Esmail, Dean
Faculty of Engineering and Computer Science
Abstract
The HAS Architecture: A Highly Available and Scalable Cluster Architecture for Web Servers
Ibrahim Haddad, Ph.D.
Concordia University, 2006
This dissertation proposes a novel architecture, called the HAS architecture, for scalable and highly
available web server clusters. The prototype of the Highly Available and Scalable Web Server
Architecture was validated for scalability and high availability. It provides non-stop service and is
able to maintain the baseline performance of approximately 1000 requests per second per processor,
for up to 16 traffic processors in the cluster, achieving close to linear scalability. The architecture
supports dynamic traffic distribution using a lightweight distribution scheme, and supports connection
synchronization to ensure that web connections survive software or hardware failures. Furthermore,
the architecture supports different redundancy models and high availability capabilities, such as Ethernet and NFS redundancy, that contribute to increasing the availability of the service and to eliminating single points of failure.
This dissertation presents current methods for scaling web servers, discusses their limitations, and investigates how clustering technologies can help overcome some of these challenges and enable the design of scalable web servers based on a cluster of workstations. It examines various ongoing research projects in academia and industry that are investigating scalable and highly available architectures for web servers. It discusses their scope and architecture, provides a critical analysis of their work, and presents their advantages, drawbacks, and contributions to this dissertation.
The proposed Highly Available and Scalable Web Server Architecture builds on current knowledge,
and provides contributions in areas such as scalability, availability, performance, traffic distribution,
and cluster representation.
Acknowledgments
The work that has gone into this thesis has been thoroughly enjoyable largely because of the
interaction that I have had with my supervisors and colleagues. I would like to express my gratitude
to my supervisor Professor Greg Butler, whose expertise, understanding, and patience, added
considerably to my graduate experience. I appreciate his vast knowledge and skills in many areas, and
his encouragement that provided me with much support, guidance, and constructive criticism.
I would like to thank the other members of my committee, Professor J. William Atwood, Dr. Ferhat
Khendek, and Professor Thiruvengadam Radhakrishnan for the assistance they provided at all levels
of the project. The feedback I received from members of my committee as early as during my
doctoral proposal was very important and had influence on the direction of the work.
I would also like to acknowledge the support I received from Ericsson Research, which granted me unlimited access to their remarkable research lab in Montréal, Canada.
I would also like to thank and express my gratitude to my wife, parents, brother, and sister for their
love, encouragement, and support.
Ibrahim Haddad
March 2006
Table of Contents
Abstract .................................................................................................................................................iii
Acknowledgments .................................................................................................................................iv
Table of Contents ................................................................................................................................... v
List of Figures .....................................................................................................................................viii
List of Tables.........................................................................................................................................xi
Chapter 1 Introduction and Motivation .................................................................................................. 1
1.1 Internet and Web Servers ............................................................................................................. 1
1.2 The Need for Scalability............................................................................................................... 2
1.3 Web Servers Overview................................................................................................................. 3
1.4 Properties of Internet and Web Applications ............................................................................... 8
1.5 Study Objectives......................................................................................................................... 10
1.6 Scope of the Study...................................................................................................................... 11
1.7 Thesis Contributions................................................................................................................... 13
1.8 Dissertation Roadmap ................................................................................................................ 14
Chapter 2 Background and Related Work............................................................................................ 16
2.1 Cluster Computing ..................................................................................................................... 16
2.2 SMP versus Clusters................................................................................................................... 22
2.3 Cluster Software Components.................................................................................................... 23
2.4 Cluster Hardware Components................................................................................................... 23
2.5 Benefits of Clustering Technologies .......................................................................................... 23
2.6 The OSI Layer Clustering Techniques ....................................................................................... 26
2.7 Clustering Web Servers.............................................................................................................. 32
2.8 Scalability in Internet and Web Servers ..................................................................................... 36
2.9 Overview of Related Work......................................................................................................... 43
2.10 Related Work: In-depth Examination....................................................................................... 46
Chapter 3 Preparatory Work................................................................................................................. 65
3.1 Early Work ................................................................................................................................. 65
3.2 Description of the Prototyped Web Cluster................................................................................ 65
3.3 Benchmarking Environment....................................................................................................... 67
3.4 Web Server Performance............................................................................................................ 69
3.5 LVS Traffic Distribution Methods ............................................................................................. 70
3.6 Benchmarking Scenarios............................................................................................................ 74
3.7 Apache Performance Test Results ............................................................................................. 74
3.8 Tomcat Performance Test Results ............................................................................................. 79
3.9 Scalability Results...................................................................................................................... 81
3.10 Discussion ................................................................................................................................ 83
3.11 Contributions of the Preparatory Work.................................................................................... 84
Chapter 4 The Architecture of the Highly Available and Scalable Web Server Cluster ..................... 85
4.1 Architectural Requirements ....................................................................................................... 85
4.2 Overview of the Challenges....................................................................................................... 87
4.3 The HAS Architecture ............................................................................................................... 88
4.4 HAS Architecture Components ................................................................................................. 91
4.5 HAS Architecture Tiers ............................................................................................................. 94
4.6 Characteristics of the HAS Cluster Architecture ....................................................................... 96
4.7 Availability and Single Points of Failures ................................................................................. 99
4.8 Overview of Redundancy Models............................................................................................ 102
4.9 HA Tier Redundancy Models .................................................................................................. 103
4.10 SSA Tier Redundancy Models............................................................................................... 107
4.11 Storage Tier Redundancy Models.......................................................................................... 109
4.12 Redundancy Model Choices .................................................................................................. 109
4.13 The States of a HAS Cluster Node......................................................................................... 111
4.14 Example Deployment of a HAS Cluster ................................................................................ 113
4.15 The Physical View of the HAS Architecture ......................................................................... 116
4.16 The Physical Storage Model of the HAS Architecture .......................................................... 118
4.17 Types and Characteristics of the HAS Cluster Nodes ........................................................... 123
4.18 Local Network Access ........................................................................................................... 126
4.19 Master Nodes Heartbeat......................................................................................................... 127
4.20 Traffic Nodes Heartbeat using the LDirectord Module ......................................................... 128
4.21 CVIP: A Cluster Virtual IP Interface for the HAS Architecture............................................ 130
4.22 Connection Synchronization .................................................................................................. 136
4.23 Traffic Management............................................................................................................... 140
4.24 Access to External Networks and the Internet ....................................................................... 149
4.25 Ethernet Redundancy ............................................................................................................. 150
4.26 Dependencies and Interactions between Software Components ............................................ 151
4.27 Scenario View of the Architecture ......................................................................................... 155
4.28 Network Configuration with IPv6 .......................................................................................... 172
Chapter 5 Architecture Validation...................................................................................................... 176
5.1 Introduction .............................................................................................................................. 176
5.2 Validation of Performance and Scalability............................................................................... 176
5.3 The Benchmarked HAS Architecture Configurations.............................................................. 178
5.4 Test-0: Experiments with One Standalone Traffic Node ......................................................... 180
5.5 Test-1: Experiments with a 4-nodes HAS Cluster.................................................................... 183
5.6 Test-2: Experiments with a 6-nodes HAS Cluster.................................................................... 186
5.7 Test-3: Experiments with a 10-nodes HAS Cluster.................................................................. 188
5.8 Test-4: Experiments with an 18-nodes HAS Cluster................................................................ 191
5.9 Scalability Charts...................................................................................................................... 192
5.10 Validation of High Availability.............................................................................................. 194
5.11 HA-OSCAR Architecture: Modeling and Availability Prediction......................................... 199
5.12 Impact of the HAS Architecture on Open Source .................................................................. 204
5.13 HA-OSCAR versus Beowulf Architecture............................................................................. 205
5.14 The HA-OSCAR Architecture versus the HAS Architecture................................................. 207
5.15 HAS Architecture Impact on Industry.................................................................................... 210
Chapter 6 Contributions, Future Work, and Conclusion .................................................................... 212
6.1 Contributions ............................................................................................................................ 212
6.2 Future Work ............................................................................................................................. 220
6.3 Conclusion................................................................................................................................ 226
Bibliography....................................................................................................................................... 228
Glossary.............................................................................................................................................. 241
List of Figures
Figure 1: Web server components ......................................................................................................... 4
Figure 2: Request handling inside a web server.................................................................................... 5
Figure 3: Analysis of a web request....................................................................................................... 5
Figure 4: The SMP architecture ........................................................................................................... 17
Figure 5: The MPP architecture ........................................................................................................... 18
Figure 6: Generic cluster architecture .................................................................................................. 19
Figure 7: Cluster architectures with and without shared disks ............................................................ 19
Figure 8: A cluster node stack.............................................................................................................. 20
Figure 9: The L4/2 clustering model.................................................................................................... 26
Figure 10: Traffic flow in an L4/2 based cluster ................................................................................. 27
Figure 11: The L4/3 clustering model.................................................................................................. 28
Figure 12: The traffic flow in an L4/3 based cluster........................................................................... 29
Figure 13: The process of content-based dispatching – L7 clustering model...................................... 30
Figure 14: A web server cluster ........................................................................................................... 32
Figure 15: Using a router to hide the web cluster ................................................................................ 33
Figure 16: Hierarchical redirection-based web server architecture ..................................................... 47
Figure 17: Redirection mechanism for HTTP requests........................................................................ 48
Figure 18: The web farm architecture with the dispatcher as the central component.......................... 51
Figure 19: The SWEB architecture...................................................................................................... 53
Figure 20: The functional modules of a SWEB scheduler in a single processor ................................. 54
Figure 21: The LSMAC implementation ............................................................................................. 56
Figure 22: The LSNAT implementation .............................................................................................. 56
Figure 23: The architecture of the IP sprayer....................................................................................... 58
Figure 24: The architecture with the HACC smart router.................................................................... 58
Figure 25: The two-tier server architecture.......................................................................................... 61
Figure 26: The flow of the web server router ...................................................................................... 62
Figure 27: The architecture of the prototyped web cluster .................................................................. 66
Figure 28: The architecture of the WebBench benchmarking tool ...................................................... 68
Figure 29: The architecture of the LVS NAT method ......................................................................... 71
Figure 30: The architecture of the LVS DR method............................................................................ 72
Figure 31: Benchmarking results of NAT versus DR.......................................................................... 73
Figure 32: Benchmarking results of the Apache web server running on a single processor ............... 75
Figure 33: Apache reaching a peak of 5,903 KB/s before the Ethernet driver crashes ....................... 75
Figure 34: Benchmarking results of Apache on one processor – post Ethernet driver update ............ 76
Figure 35: Results of a two-processor cluster (requests per second) ................................................... 77
Figure 36: Results of a four-processor cluster (requests per second) .................................................. 77
Figure 37: Results of eight-processor cluster (requests per second).................................................... 78
Figure 38: Results of Tomcat running on two processors (requests per second)................................. 79
Figure 39: Results of a four-processor cluster running Tomcat (requests per second)........................ 80
Figure 40: Results of an eight-processor cluster running Tomcat (requests per second).................... 80
Figure 41: Scalability chart for clusters consisting of up to 12 nodes running Apache....................... 82
Figure 42: Scalability chart for clusters consisting of up to 12 nodes running Tomcat....................... 82
Figure 43: The HAS architecture ......................................................................................................... 90
Figure 44: Built-in redundancy at different layers of the HAS architecture ...................................... 101
Figure 45: The process of the network adapter swap......................................................................... 102
Figure 47: The 1+1 active/standby redundancy model ...................................................................... 104
Figure 48: Illustration of the failure of the active node ...................................................................... 104
Figure 49: The 1+1 active/active redundancy model ......................................................................... 106
Figure 50: The N+M and N-way redundancy models........................................................................ 107
Figure 51: The N+M redundancy model with support for state replication ....................................... 108
Figure 52: The N+M redundancy model, after the failure of an active node ..................................... 108
Figure 53: The redundancy models at the physical level of the HAS architecture.............................. 109
Figure 54: The state diagram of the state of a HAS cluster node ....................................................... 112
Figure 55: The state diagram including the standby state .................................................................... 113
Figure 56: A HAS cluster using the HA NFS implementation .......................................................... 114
Figure 57: The HA-OSCAR prototype with dual active/standby head nodes.................................... 114
Figure 58: The physical view of the HAS architecture ...................................................................... 117
Figure 59: The no-shared storage model ........................................................................................... 119
Figure 60: The HAS storage model using a distributed file system ................................................... 120
Figure 61: The NFS server redundancy mechanism .......................................................................... 120
Figure 62: DRBD disk replication for two nodes in the 1+1 active/standby redundancy model ....... 122
Figure 63: A HAS cluster with two specialized storage nodes .......................................................... 123
Figure 64: The master node stack....................................................................................................... 124
Figure 65: The traffic node stack........................................................................................................ 125
Figure 66: The redundant LAN connections within the HAS architecture ........................................ 126
Figure 67: The topology of the heartbeat Ethernet broadcast............................................................. 128
Figure 68: The CVIP generic configuration ....................................................................................... 131
Figure 69: Level of distribution.......................................................................................................... 132
Figure 70: Network termination concept............................................................................................ 133
Figure 71: The CVIP framework........................................................................................................ 134
Figure 72: Step 1 - Connection Synchronization................................................................................ 138
Figure 73: Step 2 - Connection Synchronization................................................................................ 138
Figure 74: Step 3 - Connection Synchronization................................................................................ 139
Figure 75: Step 4 - Connection Synchronization................................................................................ 139
Figure 76: Peer-to-peer approach ....................................................................................................... 140
Figure 77: The CPU information available in /proc/cpuinfo.............................................................. 143
Figure 78: The memory information available in /proc/meminfo ...................................................... 144
Figure 79: Example list of traffic nodes and their load index ............................................................ 146
Figure 80: Illustration of the interaction between the traffic client and the traffic manager .............. 147
Figure 81: The direct routing approach – traffic nodes reply directly to web clients......................... 149
Figure 82: The restricted access approach – traffic nodes reply to master nodes, who in turn reply to
the web clients ........................................................................................................................... 150
Figure 83: The dependencies and interconnections of the HAS architecture system software .......... 152
Figure 84: The sequence diagram of a successful request with one active master node.................... 157
Figure 85: The sequence diagram of a successful request with two active master nodes .................. 158
Figure 86: A traffic node reporting its load index to the traffic manager........................................... 159
Figure 87: A traffic node joining the HAS cluster ............................................................................. 160
Figure 88: The boot process of a diskless node.................................................................................. 161
Figure 89: The boot process of a traffic node with disk – no software upgrades are performed ....... 162
Figure 90: The process of rebuilding a node with disk ...................................................................... 163
Figure 91: The process of upgrading the kernel and application server on a traffic node.................. 164
Figure 92: The sequence diagram of upgrading the hardware on a master node ............................... 165
Figure 93: The sequence diagram of a master node becoming unavailable ....................................... 166
Figure 94: The NFS synchronization occurs when a master node becomes unavailable ................... 166
Figure 95: The sequence diagram of a traffic node becoming unavailable ....................................... 167
Figure 96: The scenario assumes that node C has lost network connectivity .................................... 168
Figure 97: The scenario of an Ethernet port becoming unavailable .................................................. 169
Figure 98: The sequence diagram of a traffic node leaving the HAS cluster .................................... 169
Figure 99: The LDirectord restarting an application process............................................................. 171
Figure 100: The network becomes unavailable ................................................................................. 172
Figure 101: The sequence diagram of the IPv6 autoconfiguration process ....................................... 173
Figure 102: A functional HAS cluster supporting IPv4 and IPv6...................................................... 175
Figure 103: A screen capture of the WebBench software showing 379 connected clients................ 177
Figure 104: The network setup inside the benchmarking lab ............................................................ 178
Figure 105: The benchmarked HAS cluster configurations showing Test-[1..4] .............................. 179
Figure 106: The results of benchmarking a standalone processor -- transactions per second ........... 181
Figure 107: The throughput benchmarking results of a standalone processor................................... 182
Figure 108: The number of failed requests per second on a standalone processor ........................... 182
Figure 109: The number of successful requests per second on a HAS cluster with four nodes ........ 184
Figure 110: The throughput results (KB/s) on a HAS cluster with four nodes.................................. 185
Figure 111: The number of failed requests per second on a HAS cluster with four nodes................ 185
Figure 112: The number of successful requests per second on a HAS cluster with six nodes .......... 187
Figure 113: The throughput results (KB/s) on a HAS cluster with six nodes ..................................... 187
Figure 114: The number of failed requests per second on a HAS cluster with six nodes.................. 188
Figure 115: The number of successful requests per second on a HAS cluster with 10 nodes ........... 190
Figure 116: The throughput results (KB/s) on a HAS cluster with 10 nodes .................................... 190
Figure 117: The number of successful requests per second on a HAS cluster with 18 nodes ........... 191
Figure 118: The throughput results (KB/s) on a HAS cluster with 18 nodes .................................... 192
Figure 119: The results of benchmarking the HAS architecture prototype ....................................... 193
Figure 120: The scalability chart of the HAS architecture prototype ................................................ 194
Figure 121: The possible connectivity failure points......................................................................... 195
Figure 122: The tested setup for data redundancy ............................................................................ 198
Figure 123: The modeled HA-OSCAR architecture, showing the three sub-models ........................ 200
Figure 124: A screen shot of the SPNP modeling tool ...................................................................... 201
Figure 125: System instantaneous availabilities ................................................................................ 203
Figure 126: Availability improvement analysis of HA-OSCAR versus the Beowulf architecture.... 204
Figure 127: The architecture of a Beowulf cluster............................................................................. 205
Figure 128: The architecture of HA-OSCAR .................................................................................... 207
Figure 129: The CGL cluster architecture based on the HAS architecture........................................ 210
Figure 130: The contributions of the HAS architecture..................................................................... 212
Figure 131: The untested configurations of the HAS architecture..................................................... 221
Figure 133: The architecture logical view with specialized nodes .................................................... 224
List of Tables
Table 1: Classification of clusters by usage and functionality ............................................................. 21
Table 2: Characteristics of SMP and cluster systems........................................................................... 22
Table 3: Expected service availability per industry type...................................................................... 24
Table 4: Advantages and drawbacks of clustering techniques operating at the OSI layer................... 31
Table 5: Web performance metrics ...................................................................................................... 69
Table 6: The results of benchmarking with Apache............................................................................. 78
Table 7: The results of benchmarking with Tomcat............................................................................. 81
Table 8: The possible redundancy models per each tier of the HAS architecture.............................. 110
Table 9: The supported redundancy models per each tier in the HAS architecture prototype ........... 111
Table 10: The performance results of one standalone processor running the Apache web server..... 180
Table 11: The results of benchmarking a four-nodes HAS cluster .................................................... 183
Table 12: The results of benchmarking a HAS cluster with six nodes............................................... 186
Table 13: The results of benchmarking a HAS cluster with 10 nodes ............................................... 189
Table 14: The summary of the benchmarking results of the HAS architecture prototype ................. 192
Table 15: Input parameters for the HA-OSCAR model ..................................................................... 201
Table 16: System availability for different configurations................................................................. 202
Table 17: The changes made to the Linux kernel to support NFS redundancy.................................. 217
Chapter 1
Introduction and Motivation
and are robust enough to accommodate rapid changes in load. Furthermore, the variations in load
experienced by web servers intensify the challenges of building scalable and highly available web
servers. It is not uncommon to experience more than 100-fold increases in demand when a web site
becomes popular [8].
When the terrorist attacks on New York City and Washington DC occurred on September 11, 2001,
Internet news services reached unprecedented levels of demand. CNN.com, for instance, experienced
a two-and-a-half hour outage with load exceeding 20 times the expected peak [8]. Although the site
team managed to grow the server farm by a factor of five by borrowing machines from other sites,
this arrangement was not sufficient to deliver adequate service during the load spike. CNN.com came
back online only after replacing the front page with a text-only summary in order to reduce the load
[9]. Web sites are also subject to sophisticated denial-of-service attacks, often launched
simultaneously from thousands of servers, which can knock a service out of commission. Denial-of-
service attacks have had a major impact on the performance of sites such as Yahoo! and
whitehouse.gov [10]. The number of concurrent sessions and hits per day to Internet sites translates
into a large number of I/O and network requests, placing enormous demands on underlying resources.
[Figures 1 to 3: web server components, request handling inside a web server, and the analysis of a web request, showing the web clients, the DNS lookup, and the Apache web server software with its disk storage, CGI programs, and web objects (steps 1 through 5).]
Figure 3 illustrates the two main phases that a web request goes through from outside the web server:
the lookup phase includes steps (1), (2), and (3), and the request phase includes steps (4) and (5).
When the user requests a web site from the browser in the form of a URL, the request arrives (1) at the local DNS server, which consults (2) the authoritative DNS server responsible for the requested web site. The local domain name system (DNS) server then sends back (3) the IP address of the web server hosting the requested web site to the client. The client requests (4) the document from the web server using the web server's IP address, and the web server responds (5) to the client with the requested web document.
Web servers should cope with numerous incoming requests using minimal system resources. They have to multitask to deal with more than one request at a time. They provide mechanisms to control access authorization and to ensure that incoming requests are not a threat to the host system on which the web server software runs. In addition, web servers respond to error messages they receive, negotiate a style and language of response with the client, and in some cases run as a proxy server. Web servers also generate logs of all connections for statistics and security purposes.
handle an incoming request, reading from the network, looking up the requested document, reading
the document from disk, and writing the document onto the network.
reliability and availability. Therefore, web servers should deploy hardware and software fault-
tolerance and redundancy mechanisms to ensure reliability, to prevent single points of failure, and
to maintain availability in case of a hardware or software failure.
- Ability to sustain a guaranteed number of connections: This requirement obliges the web server to maintain a minimum number of connections per second and to process these connections simultaneously. The ability to sustain a guaranteed number of connections, also described as maintaining the base performance, has a direct effect on the total number of requests the web server can process at any point in time.
- High storage capacity: Web servers provide I/O and storage capacity to hold the data and the variety of information they host. In addition, with the increased demand for multimedia content, fast data retrieval has become an essential requirement.
- Cost effectiveness: An important requirement governing the future of web servers is their cost effectiveness. If the cost per transaction grows as the number of transactions grows, cost becomes a decisive factor in choosing which server architecture and software to use in any given deployment.
Designing a high performance and scalable web and Internet server is a challenging task. This dissertation aims to understand what causes scalability problems in web server clusters and explores how we can scale a web server cluster. The dissertation focuses on the design of a next generation cluster architecture that meets the requirements discussed above. The architecture needs to be able to scale linearly for up to 16 processors and to support service availability and reliability. The architecture will inherently meet other requirements such as better cluster resource utilization and the ability to handle different types of traffic.
1.4.1 High Concurrency
The growth in popularity and functionality of Internet and web services has been astounding. While the world wide web itself is growing in size, with recent estimates placing it anywhere between 1 billion and 2.5 billion unique documents, the number of users on the web is also growing at a staggering rate [16][17]. In April 2002, Nielsen NetRatings estimated that there were over 422 million Internet users worldwide [18]. Consequently, Internet and web applications need to support unprecedented concurrency demands, and these demands are increasing over time.
not overcommit its resources and degrade in a way that makes all clients suffer. Rather, the service needs to be aware of overload conditions and attempt to adapt to them, by degrading the quality of service delivered to clients, or by predictably shedding load, such as by giving users some indication that the service is saturated. It is far better for an overloaded service to inform users of the overload than to silently drop requests.
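A minimal sketch of this load-shedding idea, assuming an operator-chosen threshold on in-flight requests; the threshold value and the 503/Retry-After response are illustrative choices, not part of the HAS prototype.

```python
# Sketch of explicit load shedding: reply with 503 Service Unavailable instead
# of silently dropping requests once in-flight requests exceed a limit.
# MAX_IN_FLIGHT is an assumed, operator-chosen threshold.
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_IN_FLIGHT = 100
_in_flight = 0
_lock = threading.Lock()

class SheddingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global _in_flight
        with _lock:
            overloaded = _in_flight >= MAX_IN_FLIGHT
            if not overloaded:
                _in_flight += 1
        if overloaded:
            # Tell the client the service is saturated instead of stalling.
            self.send_response(503)
            self.send_header("Retry-After", "30")
            self.end_headers()
            return
        try:
            body = b"OK\n"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        finally:
            with _lock:
                _in_flight -= 1

if __name__ == "__main__":
    ThreadingHTTPServer(("", 8080), SheddingHandler).serve_forever()
```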
and validate the scalability of the architecture in the lab without resorting to building a theoretical
model and simulating it.
1.6.1 Goal
The goal of this study is to propose an architecture for scalable and highly available web server clusters. The architecture provides the following properties: fast access, linear scalability for up to 16 processors, architecture transparency, high availability, and robustness of the offered services.
This dissertation does not address multimedia servers, streams, sessions, states, or application servers. It also does not address, nor try to fix, problems with networking protocols. In addition, high performance computing (HPC) is not in the scope of the study. HPC is a branch of computer science that concentrates on developing software to run on supercomputers. HPC research focuses on developing parallel processing algorithms that divide a large computational task into small pieces so that separate processors can execute them simultaneously. Architectures in this category focus on maximizing compute performance for floating point operations. This branch of computing is unrelated to the dissertation.
The architecture targets servers providing services over the Internet with the characteristics previously mentioned. The architecture applies to systems with short response times, such as, but not exclusively, web servers, Authentication, Authorization and Accounting (AAA) servers, Policy servers, Home Location Register (HLR) servers, and Service Control Point (SCP) servers, without requiring specialized extensions at the architectural level.
1.6.5 Scalability
Existing server scaling methods rely on adding more hardware, upgrading processors and memory, or distributing the incoming load and traffic by partitioning users or data across several servers. These schemes, discussed in Chapter 2, suffer from different shortcomings, which can cause uneven load distribution, create bottlenecks, and obstruct the scalability of a system. These scaling methods are out of our scope and we do not aim to improve them.
Our goal with the architecture is to achieve scalability through clustering, where we dynamically direct incoming web requests to the appropriate cluster node and scale the number of serving nodes with as little overhead or drop in baseline performance as possible. Therefore, our focus is scalability while maintaining a high throughput. The cluster needs to be able to distribute the application load across N cluster nodes with linear or close-to-linear scalability.
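To make the close-to-linear goal concrete, scaling efficiency can be expressed as the measured cluster throughput divided by N times the single-processor baseline. The sketch below computes this ratio using the roughly 1000 requests per second per processor baseline cited in the abstract; the cluster throughput values are illustrative placeholders, not measured results.

```python
# Scaling efficiency relative to a single-processor baseline.
# BASELINE_RPS reflects the ~1000 requests/s per processor figure cited in the
# abstract; the per-cluster throughputs below are illustrative placeholders.
BASELINE_RPS = 1000.0

def scaling_efficiency(nodes: int, cluster_rps: float) -> float:
    """Return achieved throughput as a fraction of perfect linear scaling."""
    return cluster_rps / (nodes * BASELINE_RPS)

for nodes, cluster_rps in [(4, 3900.0), (8, 7700.0), (16, 15200.0)]:
    eff = scaling_efficiency(nodes, cluster_rps)
    print(f"{nodes:2d} nodes: {cluster_rps:7.0f} req/s, efficiency {eff:.0%}")
```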
categories: HA stateless, with no saved state information, and HA stateful, with state information that allows the web application to maintain sessions across a failover. Our scope focuses on HA stateless web applications, although we can apply the same principles to HA stateful web applications.
address. It provides a single entry to the cluster by hiding the complexity of the cluster, and provides
address location transparency, to address a resource in the cluster without knowing or specifying the
processor location. Section 4.21 presents this contribution.
Application availability: The architecture provides the capabilities to monitor the health of the
application server running on the traffic nodes, and dynamically exclude the node from the cluster in
the event the application process fails. Section 4.20 discusses these capabilities. Furthermore, with
connection synchronization between the two master nodes, in the event of failure of a master node,
the standby node is able to continue serving the established connections. Section 4.22 discusses
connection synchronization.
Contribution to Open Source: This work has resulted in several contributions to the HA-OSCAR
project [21], whose architecture is based on the HAS architecture. Section 5.12 discusses these
contributions.
Contributions to the industry: The Carrier Grade industry initiative [26] at the Open Source
Development Labs [27] has adopted the HAS architecture as the base standard architecture for carrier
grade clusters running telecommunication applications. Section 5.15 discusses this contribution.
Other contributions include benchmarking current solutions, providing enhancements to their capabilities, adding functionality to existing system software, and providing best practices for building benchmarking environments for large-scale systems.
scaling Internet and web servers. It presents a survey of academic and industry research projects and discusses their focus areas, results, and contributions. It also presents the contributions of these projects to this dissertation and how they help us achieve our goal of a scalable and highly available web server platform.
Chapter 3 summarizes the technical preparatory work we completed in the laboratory prior to
designing the Highly Available and Scalable (HAS) architecture. This chapter describes the
prototyped web cluster that uses existing components and mechanisms. It also describes the
benchmarking environment we built specifically to test the performance and scalability of web
clusters and presents the benchmarking results of the tests we conducted on the prototyped cluster.
Chapter 4 focuses on describing and discussing the HAS architecture. It presents the architecture, its
components, and their characteristics. The chapter then discusses the conceptual, physical, and
scenario architecture views, the supported redundancy models, the traffic distribution scheme, and the
dependencies between the various components. It also covers the architecture characteristics as
related to eliminating single points of failure.
Chapter 5 presents the validation of the architecture and illustrates how it scales for up to 16
processors without performance degradation. The validation covers two aspects: scalability and
availability. The chapter presents the results of the benchmarking tests we conducted on the HAS
architecture prototype. It also presents the results of experiments we conducted to test the availability
features in a HAS cluster.
Chapter 6 presents the contributions and future work in the areas of scalability and performance of
Internet and web servers.
Chapter 2
Background and Related Work
more) processor(s) and manages access to the shared resources among all the processors. A single
copy of the operating system is in charge of all the processors. SMP systems available on the market
(at the time of writing) do not exceed 16 processors, with configurations available in two, four, eight,
and 16 processors.
[Figure 4: The SMP architecture, with processors sharing memory and I/O over a system bus.]
SMP systems are not scalable because all processors have to access the same shared resources. In addition, SMP systems have a limit on the number of processors they can support. They require considerable investment in upgrades, or an entire replacement of the system, to accommodate a larger capacity. Furthermore, an SMP system runs a single copy of the operating system, where all processors share the same copy of the operating system data. If one processor becomes unavailable because of either a hardware or a software error, it can leave locks held, data structures in partially updated states, and potentially I/O devices in partially initialized states. As a result, the entire system becomes unavailable on account of a single processor. In addition, SMP architectures are not highly available. SMP systems have several single points of failure (cache, memory, processor, bus); if one subsystem becomes unavailable, it brings the system down and makes the service unavailable to the end users.
use of fully distributed memory. In an MPP system, each processor is self-contained with its own
cache and memory chips.
[Figure 5: The MPP architecture, with self-contained processors connected through an interconnecting network.]
Another distinct characteristic of MPP systems is the job scheduling subsystem; job scheduling is achieved through a single run queue. MPP systems tackle one very large computational problem at a time and are used to solve HPC problems. In addition, MPP systems suffer from the same issues as SMP systems in the areas of scalability, single points of failure, and their impact on high availability, as well as the need to shut down the system to perform software or hardware upgrades.
Figure 6 illustrates the generic cluster architecture, which consists of multiple standalone nodes that are connected through redundant links and provide a single entry point to the cluster.
[Figure 6: Generic cluster architecture, showing nodes A through N connected over redundant LANs, with a single entry point between the users on the Internet and the cluster.]
Cluster nodes interconnect in different ways. Figure 7 illustrates two common variations. In the first
variation, Figure 7-A, cluster nodes share a common disk repository; in the second variation, Figure
7-B, cluster nodes do not share common resources and use their own local disk for storage.
[Figure 7: Cluster architectures with (A) and without (B) shared disks.]
The phrase single, unified computing resource in Greg Pfister's definition of a cluster evokes a wide variety of possible applications and uses, and is deliberately vague in describing the services provided
by the cluster. At one end of the spectrum, a cluster is nothing more than the collection of whole
computers available for use by a sophisticated distributed application. At the other end, the cluster
creates an environment where existing non-distributed programs can benefit from increased
availability because of cluster-wide fault masking, and increased performance because of the
increased computing capacity.
A cluster is a group of independent COTS servers interconnected through a network. The servers, called cluster nodes, appear as a single system, and they share access to cluster resources such as shared disks, network file systems, and the network. A network interconnects all the nodes in a cluster and is separate from the cluster's external environment, such as the local intranet or the Internet. The interconnection network employs local area network or system area network technology.
Clusters can be highly available because of the built-in redundancy that prevents the presence of a single point of failure (SPOF). As a result, failures are contained within a single node. Monitoring software continually runs checks, by sending signals also called heartbeats, to ensure that the cluster node and the application running on it are up and available. If these signals stop, the system software initiates a failover to recover from the failure. The presumably dead or unavailable system or application is then isolated from I/O access, disks, and other resources such as access to the network; furthermore, incoming traffic is redirected to other available nodes within the cluster. As for performance, clusters make it possible to add nodes and scale up the performance, the capacity, and the throughput of the cluster as the number of users or the traffic increases.
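A simplified sketch of such heartbeat monitoring follows, assuming hypothetical probe() and failover() hooks rather than any specific cluster software: a node that misses several consecutive heartbeats is removed from the active set and its traffic is failed over.

```python
# Sketch of heartbeat-based failure detection: a node that misses several
# consecutive heartbeats is declared failed and removed from the active set.
# probe() and failover() are assumed hooks, not a real cluster API.
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between checks (assumed value)
MAX_MISSED = 3             # missed heartbeats before declaring failure

def monitor(nodes, probe, failover):
    """nodes: list of node names; probe(node) -> bool; failover(node) -> None."""
    missed = {node: 0 for node in nodes}
    active = set(nodes)
    while True:
        for node in list(active):
            if probe(node):
                missed[node] = 0
            else:
                missed[node] += 1
                if missed[node] >= MAX_MISSED:
                    active.discard(node)   # isolate the failed node
                    failover(node)         # redirect its traffic elsewhere
        time.sleep(HEARTBEAT_INTERVAL)
```

In practice the probe would be a UDP heartbeat message or an HTTP health check, and the failover hook would update the traffic distribution tables.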
[Figure 8: A cluster node stack, consisting of applications, operating system, interconnect protocol, interconnect technology, and nodes.]
Table 1: Classification of clusters by usage and functionality

High performance computing clusters (2nd column)
- Goal: maximize floating point computation performance.
- Description: many nodes working together on a single compute-based problem. Performance is measured as the number of floating point operations (FLOP) per second.
- Examples: Beowulf-class clusters such as MOSIX [30][31], Rocks [32], OSCAR [33][34][35], and Ganglia [36].

Clusters for scalability and load balancing (3rd column)
- Goal: maximize throughput and performance.
- Description: many nodes working on similar tasks, distributed in a defined fashion based on system load characteristics. Performance is measured as throughput in terms of KB/s. Capacity is increased by adding more nodes to the cluster. These clusters can be network oriented (network throughput) or data oriented (data transactions).
- Examples: the Linux Virtual Server [23] and TurboLinux [37], in addition to commercial database products.

Clusters for high availability (4th column)
- Goal: maximize service availability.
- Description: redundancy and failover provide fault tolerance of services, for both stateless and stateful applications. Availability is measured as the percentage of time the system is up and providing service; Section 4.7 presents the formula for calculating the availability.
- Examples: the HA-OSCAR project [21], in addition to commercial clustering products.

Clusters for server consolidation (5th column)
- Goal: maximize ease of management of multiple computing resources.
- Description: also called Single System Image (SSI) clusters; they provide central management of cluster resources and treat the cluster as a single management unit.
- Examples: the OpenSSI project [38], the OpenGFS project [39], and the Oracle Cluster File System [40].
Clustering for scalability (Table 1, 3rd column) focuses on distributing web traffic among cluster
nodes using distribution algorithms such as round robin DNS.
Clustering for high availability (Table 1, 4th column) relies on redundant servers to ensure that critical
applications remain available if a cluster node fails. There are two methods for failover solutions:
software-based failover solutions discussed in Section 2.7.1.2, and hardware-based failover devices
discussed in Section 2.7.1.1. Software-based failover detects when a server has failed and
automatically redirects new incoming HTTP requests to the cluster members that are available.
Hardware-based failover devices have limited built-in intelligence and require an administrator's
intervention when they detect a failure.
Many of the clustering products available fit into more than one of the above categories. For instance,
some products include both failover and load-balancing components. In addition, SSI products that fit
into the server consolidation category (Table 1, 5th column) provide certain HA failover capabilities.
Our goal with this dissertation is a cluster architecture that targets both scalability and high
availability.
SMP systems have limited scalability, while clusters have virtually unlimited scaling capabilities
since we can always continue to add more nodes to the cluster. As for high availability, an SMP
system has several single points of failure, where a single error can lead to system downtime. In a cluster, by contrast, functionality is redundant and spread across multiple cluster nodes. As for
management, an SMP system is a single system, while a cluster is composed of several nodes, some
of which can be SMP machines.
2.5.1 High Availability
High availability (HA) refers to the availability of resources in a computer system [41]. We achieve HA through redundant hardware, specialized software, or both [41][42][43]. With clusters, we can provide service continuity by isolating or reducing the impact of a failure in a node, resource, or device through redundancy and failover techniques. Table 3 presents the various levels of HA, the annual downtime, and the types of applications for various classes of systems [44].
9's Availability Downtime per year Example Areas for Deployments
1 90.00% 36 days 12 hours Personal clients
2 99.00% 87 hours 36 minutes Entry-level businesses
3 99.90% 8 hours 46 minutes ISPs, mainstream businesses
4 99.99% 52 minutes 33 seconds Data centers
5 99.999% 5 minutes 15 seconds Telecom system, medical, banking
6 99.9999% 31.5 seconds Military defense, carrier grade routers
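The downtime column in Table 3 follows directly from the availability percentage. The short sketch below reproduces these figures; the dissertation's own availability formula, based on failure and repair times, appears in Section 4.7 and is not shown here.

```python
# Annual downtime implied by an availability level, as in Table 3.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the expected minutes of downtime per year for a given availability."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR

for nines, a in [(2, 99.0), (3, 99.9), (4, 99.99), (5, 99.999), (6, 99.9999)]:
    minutes = downtime_minutes_per_year(a)
    print(f"{nines} nines ({a}%): about {minutes:.1f} minutes of downtime per year")
```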
It is important not only that a service be down for no more than N minutes a year, but also that the length of outages be short enough, and the frequency of outages low enough, that the end user does not perceive them as a problem. Therefore, the goal is to have a small number of failures and a prompt recovery time. This concept is termed Service Availability, meaning that whatever services the user wants are available in a way that meets the user's expectations.
2.5.2 Scalability
Clusters provide means to reach high levels of scalability by expanding the capacity of a cluster in
terms of processors, memory, storage, or other resources, to support users and traffic growth [1].
2.5.5 Manageability
Clusters require a management layer that allows us to manage all cluster nodes as a single entity [28].
Such cluster management facilities help reduce system management costs. A significant number of cluster management software packages exist; almost all of them originated in research projects and have since been adopted by commercial vendors.
2.5.8 Transparency
The SSI layer represents the nodes that make up the cluster as a single server. It allows users to use a
cluster easily and effectively without the knowledge of the underlying system architecture or the
number of nodes inside the cluster. This transparency frees the end-user from having to know where
an application runs.
cluster structure not only benefits the end user but the cluster vendor as well, yielding a wide array of
system capabilities and cost tradeoffs to meet customer demands.
[Figure 9: The L4/2 clustering model, in which client requests pass through the dispatcher to servers 1 through n, and the servers send their replies directly to the clients.]
In L4/2 based clusters, the dispatcher and all the servers in the cluster share the cluster network-layer
address using primary and secondary IP addresses. While the primary address of the dispatcher is the
same as the cluster address, each cluster server is configured with the cluster address as a secondary address, either through interface aliasing or by changing the address of the loopback device on the cluster servers. The nearest gateway is configured such that all packets arriving for the cluster address
are addressed to the dispatcher at layer two using a static Address Resolution Protocol (ARP) cache
entry. If the packet received corresponds to a TCP/IP connection initiation, the dispatcher selects one
of the servers in the server pool to service the request (Figure 9).
The selection of the server to respond to the incoming request relies on a traffic distribution algorithm
such as round robin. When an incoming request arrives at the dispatcher, the dispatcher creates an
entry in a connection map that includes information such as the origin of the connection and the
chosen cluster server. The layer two destination address is then rewritten to the hardware address of
the chosen cluster server, and the frame is placed back on the network. If the incoming packet is not
for a connection initiation, the dispatcher examines its connection map to determine if it belongs to a
currently established connection. If it does, the dispatcher rewrites the layer two destination address
to be the address of the cluster server previously selected, and forwards the packet to the cluster
server as before. In the event that the received packet does not correspond to an established
connection and is not a connection initiation packet, then the dispatcher drops it.
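The L4/2 dispatching logic just described can be summarized in the following Python sketch. The frame object, its fields, and the round robin selector are illustrative assumptions rather than part of any cited implementation; the essential point is that only the layer two destination address is rewritten.

import itertools

class L42Dispatcher:
    """Sketch of L4/2 dispatching: only the layer two destination address is rewritten."""

    def __init__(self, server_macs):
        self.next_server = itertools.cycle(server_macs)    # simple round robin traffic distribution
        self.connection_map = {}                           # (client IP, client port) -> server MAC

    def handle(self, frame):
        key = (frame.src_ip, frame.src_port)
        if frame.is_syn:                                   # connection initiation: select a server
            self.connection_map[key] = next(self.next_server)
        elif key not in self.connection_map:
            return None                                    # not a SYN and not an established connection: drop
        frame.dst_mac = self.connection_map[key]           # rewrite only the layer two destination
        return frame                                       # the frame is placed back on the network

Because the servers answer the clients directly, the sketch needs no return path; the dispatcher only ever sees the incoming stream.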
Figure 10 illustrates the traffic flow in an L4/2 clustered environment [45]. A web client sends an
HTTP packet (1) with A as the destination IP address. The immediate router sends the packet to the
dispatcher at IP address A (2). Based on the traffic distribution algorithm and the session table, the
dispatcher decides which back-end server will handle this packet, server 2 for instance, and sends the
packet to server 2 by changing the MAC address of the packet to server 2's MAC address and
forwarding it (3). Server 2 accepts the packet and replies directly to the web client.
[Figure 10: Traffic flow in an L4/2 clustered environment, showing the router, the dispatcher, and the back-end servers.]
L4/2 clustering has a performance advantage over L4/3 clustering because of the downstream bias of
web transactions. Since the network address of the cluster server to which the packet is delivered is
identical to the one the web client used originally in the request packet, the cluster server handling
that connection may respond directly to the client rather than through the dispatcher. As a result, the
dispatcher processes only the incoming data stream, which is a fraction of the entire transaction.
Moreover, the dispatcher does not need to re-compute expensive integrity codes (such as the IP
checksums) in software since only layer two parameters are modified. Therefore, the two parameters that limit the scalability of the cluster are the network bandwidth and the sustainable request rate of the dispatcher for the incoming stream, which is the only portion of the transaction actually processed by the dispatcher.
One restriction on L4/2 clustering is that the dispatcher must have a direct physical connection to all
network segments that house servers (due to layer two frame addressing). This contrasts with L4/3
clustering (Section 2.6.2), where the server may be anywhere on any network with the sole constraint
that all client-to-server and server-to-client traffic must pass through the dispatcher. In practice, this
restriction on L4/2 clustering has little appreciable impact since servers in a cluster are likely to be
connected via a single high-speed LAN.
Among the research and commercial products implementing layer two clustering are ONE-IP, developed at Bell Laboratories [46], IBM's eNetwork Dispatcher [47], and LSMAC from the University of Nebraska-Lincoln (Section 2.10.4).
2.6.2 L4/3 Clustering
[Figure: L4/3 clustering. Both requests and replies pass through the dispatcher between the clients and Server 1 through Server n.]
Similar to L4/2 clustering, the selection of the cluster server relies on a traffic distribution algorithm.
The dispatcher then creates an entry in the connection map noting the origin of the connection, the
chosen server, and other relevant information. However, unlike the L4/2 approach, the dispatcher
rewrites the destination IP address of the packet as the address of the cluster server selected to service
this request. Furthermore, the dispatcher re-calculates any integrity codes affected such as packet
checksums, cyclic redundancy checks, or error correction checks. The dispatcher then sends the
modified packet to the cluster server corresponding to the new destination address of the packet. If the
incoming web client traffic is not a connection initiation, the dispatcher examines its connection map
to determine if it belongs to a currently established connection. If it does, the dispatcher rewrites the
destination address as the server previously selected, re-computes the checksums, and forwards the
packet to the cluster server as we described earlier. In the event that the packet does not correspond to
an established connection and it is not a connection initiation packet, then the dispatcher drops the
packet.
The traffic sent from the cluster servers to the web clients travels through the dispatcher since the
source address on the response packets is the address of the particular server that serviced the request,
not the cluster address. The dispatcher rewrites the source address to the cluster address, re-computes
the integrity codes, and forwards the packet to the web client.
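For comparison, the sketch below outlines the L4/3 path in the same style. The packet object and its recompute_checksums method are illustrative assumptions; the point to note is that both directions traverse the dispatcher and that the integrity codes must be recomputed after every rewrite.

import itertools

class L43Dispatcher:
    """Sketch of L4/3 dispatching: network-layer addresses are rewritten in both directions."""

    def __init__(self, cluster_ip, server_ips):
        self.cluster_ip = cluster_ip
        self.next_server = itertools.cycle(server_ips)
        self.connection_map = {}                           # (client IP, client port) -> server IP

    def inbound(self, packet):
        key = (packet.src_ip, packet.src_port)
        if packet.is_syn:
            self.connection_map[key] = next(self.next_server)
        elif key not in self.connection_map:
            return None                                    # unknown, non-initiation packet: drop
        packet.dst_ip = self.connection_map[key]           # rewrite the destination IP address
        packet.recompute_checksums()                       # IP and TCP integrity codes must be redone
        return packet

    def outbound(self, packet):
        packet.src_ip = self.cluster_ip                    # replies are rewritten to the cluster address
        packet.recompute_checksums()
        return packet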
[Figure 12: Traffic flow in an L4/3 clustered environment, with the dispatcher owning IP address A.]
Figure 12 illustrates the traffic flow in an L4/3 clustered environment [45]. A web client sends an
HTTP packet with A as the destination IP address (1). The immediate router sends the packet to the
dispatcher (2), since the dispatcher machine is the owner of the IP address A. Based on the traffic
distribution algorithm and the session table, the dispatcher decides to forward this packet to the back-
end server, Server 2 (3). The dispatcher then rewrites the destination IP address as B2, recalculates
the IP and TCP checksums, and sends the packet to B2 (3). Server 2 accepts the packet and replies to
the client via the dispatcher (4), which the back-end server sees as a gateway. The dispatcher rewrites
the source IP address of the replying packet as A, recalculates the IP and TCP checksums, and sends
the packet to the web client (5).
RFC 2391, Load Sharing using IP Network Address Translation, presents the L4/3 clustering
approach [48]. The LSNAT from the University of Nebraska-Lincoln provides a non-kernel space
implementation of the L4/3 clustering approach [49]. Section 2.10.4 discusses the project and the
implementation.
L4/2 clustering theoretically outperforms L4/3 clustering because of the overhead that L4/3 clustering imposes: the necessary integrity code recalculation, coupled with the fact that all traffic must flow through the dispatcher, means that an L4/3 dispatcher processes more traffic than an L4/2 dispatcher does. Therefore, the total data throughput of the dispatcher, more than the sustainable request rate, limits the scalability of the system.
2.6.3 L7 Clustering
A layer 7 (L7) web switch works at the application level. The web switch establishes a connection with the web client and inspects the HTTP request content to decide about dispatching. The L7 clustering technique is also known as content-based dispatching since it operates based on the contents of the client request. The Locality-Aware Request Distribution (LARD) dispatcher developed by researchers at Rice University is an example of L7 clustering. LARD partitions a web document tree into disjoint sub-trees. The dispatcher then allocates to each server in the cluster one of these sub-trees to serve. As such, LARD provides content-based dispatching as the dispatcher receives web client requests.
[Figure 13: L7 clustering. The dispatcher classifies incoming requests by type (a, b, c) and forwards each type to the server responsible for it.]
Figure 13 presents an overview of the processing with the L7 clustering approach [45]. Server 1 processes requests of type a; Server 2 processes requests of types b and c. The dispatcher separates the stream of requests into two streams: one stream for Server 1 with requests of type a, and one stream for Server 2 with requests of types b and c. As requests arrive from clients for the web cluster, the dispatcher accepts the connection and the request. It then classifies the requested document and dispatches the request to the appropriate server. The dispatching of requests requires support from a modified kernel that enables the connection handoff protocol. After establishing the
connection, identifying the request, and choosing the cluster server, the dispatcher informs the cluster
server of the status of the network connection, and the cluster server takes over that connection, and
communicates directly with the web client. Following this approach, LARD allows the file system cache of each cluster server to cache a separate part of the web tree rather than having to cache the entire tree, as is the case with L4/2 and L4/3 clustering. Additionally, it is possible to have specialized server nodes where, for instance, dynamically generated content is offloaded to special compute servers while other requests are dispatched to servers with less processing power. LARD requires modifications to the operating system on the servers to support the TCP handoff protocol.
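A minimal sketch of content-based dispatching in the LARD style follows. The URL-prefix partitioning and server names are hypothetical, and the actual connection transfer relies on the modified-kernel TCP handoff protocol described above.

class ContentDispatcher:
    """Sketch of L7 dispatching: requests are routed by URL prefix, one sub-tree per server."""

    def __init__(self, subtree_to_server):
        # Hypothetical partitioning, e.g. {"/stocks/": "server1", "/weather/": "server2"}.
        self.subtree_to_server = subtree_to_server

    def choose_server(self, url):
        for prefix, server in self.subtree_to_server.items():
            if url.startswith(prefix):
                return server
        return None          # unmapped content; a default or least loaded server could be chosen

# Usage sketch: after accepting the connection and reading the request, the dispatcher calls
# choose_server() and then hands the established connection off to the selected cluster server.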
HA and fault tolerance:
  L4/2: Varies; several single points of failure
  L4/3: Varies; several single points of failure
  L7: Varies; several single points of failure
Restrictions:
  L4/2: Incoming traffic passes through the dispatcher
  L4/3: The dispatcher lies between client and server; all incoming and outgoing traffic passes through the dispatcher
  L7: Incoming traffic passes through the dispatcher
Table 4: Advantages and drawbacks of clustering techniques operating at different layers of the OSI model
Each of the approaches creates bottlenecks that limit scalability and presents several single points of failure. For L4/2 dispatchers, system performance is constrained by the ability of the dispatcher to set up, look up, and tear down connection entries; hence, the most telling performance metric is the sustainable request rate. The limitation of L4/3 dispatchers is their ability to rewrite packets and recalculate the checksums for the massive numbers of packets they process; hence, the most telling performance metric is the throughput of the dispatcher. Lastly, the L7 clustering approach has limitations related to the complexity of the content-based routing algorithm and the size of its cache.
[Figure: Web clients accessing the web server cluster through the Internet.]
The following sub-sections explore the software and hardware techniques used to build web clusters.
[Figure: A hardware-based clustering solution. Web clients reach the web cluster nodes and the network attached storage through the Internet and a router.]
Hardware-based clustering solutions use routers to provide a single IP interface to the cluster and to
distribute traffic among various cluster nodes. These solutions are a proven technology; they are
neither complicated nor complex by design. However, they have certain limitations, such as limited intelligence, unawareness of the applications running on the cluster nodes, and the presence of a SPOF.
Limited intelligence: Packet routers can load balance in a round robin fashion, and some can detect
failures and automatically remove failed servers from a cluster and redirect traffic to other nodes.
These routers are not fully intelligent network devices. They do not provide application-aware traffic
distribution. While they can redirect requests upon discovering a failure, they do not allow
configuring redirection thresholds for individual servers in a cluster, and therefore, they are unable to
manage load to prevent failures.
Lack of Dynamism: A router cannot measure the performance of a web application server or make an
intelligent decision on where to route the request based on the load of the cluster node and its
hardware characteristics.
Single point of failure: The packet router constitutes a SPOF for the entire cluster. If the router fails, the
cluster is not accessible to end users and the service becomes unavailable.
A newer version of MSCS promises to support larger clusters and to include enhanced services to simplify the creation of highly scalable, cluster-aware applications [54]. The current version of MSCS suffers from scalability issues, as it only supports two servers that require upgrading as the traffic increases.
The Linux Virtual Server (LVS) is an open source project that aims to provide a high performance
and highly available software clustering implementation for Linux [23]. It implements layer 4
switching in the Linux kernel, providing a virtual server layer built on a cluster of real servers and
allowing TCP and UDP sessions to be load balanced across multiple real servers. The virtual service is defined by an IP address, port, and protocol. The front-end of the real servers is a load balancer, which schedules requests to the different servers and makes the parallel services of the cluster appear as a virtual service on a single IP address. The architecture of the cluster is transparent to end users, who only see the address of the virtual server. The LVS is available in three
different implementations [55]: Network Address Translation (NAT), Direct Routing (DR), and IP
tunneling. Sections 3.5.1, 3.5.2, and 3.5.3, present and discuss the NAT, DR, and IP tunneling
methods, respectively.
We have experimented with both the NAT and DR methods. Section 3.5.4 presents the benchmarking
results comparing the performance of both methods. Each of these techniques for providing a virtual
interface to a web cluster has its own advantages and disadvantages. Based on our lab experiments
discussed in Chapter 3, we concluded that the common disadvantage among these schemes is their
limited scalability (Figure 31). When the traffic load increases, the load balancer becomes a
bottleneck for the whole cluster and the local director crashes under heavy load or stops accepting
new incoming requests. In both cases, the local director replies very slowly to ongoing requests.
Software clustering solutions have three main advantages that make them a better alternative to hardware clustering solutions: flexibility, intelligence, and availability. First, software clustering solutions can augment existing hardware devices, thereby providing a more robust traffic distribution and failover solution. Additionally, by integrating hardware with software, an organization diminishes, if not eliminates, losses on capital expenditures it has already made. Secondly, they provide a level of intelligence that enables preventive traffic distribution measures that minimize the chance of servers becoming unavailable. In the event that a server becomes overloaded or actually fails, some software can automatically detect the problem and reroute HTTP requests to other nodes in the cluster.
Thirdly, with software clustering solutions, we can support high availability capabilities to avoid
single points of failure. An individual server failure does not affect the service availability since
functionalities and failover capabilities are distributed among the cluster servers.
However, we need to consider several issues when evaluating software clustering solutions, mainly the differences among feature sets, platform constraints, and HA and scaling capabilities. Software clustering solutions have different capabilities and features, such as their capability of providing automatic failure detection, notification, and recovery. Some solutions have significantly delayed
failure detection; others allow the configuration of the load thresholds to enable preventive measures.
In addition, they can support different redundancy models such as the 1+1 active/standby, 1+1
active/active, N+M and N-way. Therefore, we need to determine the needs or requirements for
scalability and failover and pick the solution accordingly. In addition, software solutions have limited platform compatibility; they are available to run only on specific operating systems or computing environments. Furthermore, the capability of the clustering solution to scale is important. Some
solutions have limited capabilities restricted to four, eight, or 16 nodes, and therefore have scaling
limitations.
As such, scalability presents itself as a crucial factor for the success or failure of online services and it
is certainly one important challenge faced when designing servers that provide interactive services for
a wide clientele.
Many factors can negatively affect the scalability of systems [59]. The first common factor is the growth of the user base, which causes serious capacity problems for servers that can only serve a certain number of transactions per second. If the server is not able to cope with the increased number of users and traffic, the server starts rejecting requests. A second key factor negatively affecting the scalability of servers is the number and size of data objects; in particular, large audio and video files strain the network and I/O capacity, causing scalability problems. The increasing amount of accessible data makes data search, access, and management more difficult, which causes processing problems and eventually leads to rejecting incoming requests. Finally, the non-uniform request distribution imposes strains on the servers and network at certain times of the day or for certain requested data. These
factors can cause servers to suffer from bottlenecks, and run out of network, processing, and I/O
resources.
Mobile Internet servers host next generation interactive and multimedia services. These servers suffer from scalability problems as the number of mobile subscribers is increasing at a fast pace [1]. To cope with the increased number of
users and traffic, mobile operators are resorting to upgrading servers or buying new servers with more
processing power [59], a process that proved to be expensive and iterative. According to Ericsson
Research, the growth rate of mobile subscribers in 2004 was approximately 500,000 users per day
[62]. This raises the question of whether the Mobile Internet servers and the applications running on
those servers will be able to cope with such growth.
When servers are not able to cope with increased traffic, the result is a failure to meet the high expectations of paying customers, who expect services to be available at all times with acceptable performance levels [63][64], and a failure to meet and manage service level agreements. Service level
agreements dictate the percentage of the time services will be available, the number of users that can
be served simultaneously, specific performance benchmarks to which actual performance will be
periodically compared, and access availability. If ISPs, for instance, are not able to cope with the increasing number of users, they will break their service level agreements, causing them to lose money and potentially lose customers. Similarly, mobile operators face large financial losses if their servers are not available to their subscribers.
The response time is the time that elapses between the moment a user gives an input, or posts a request, and the moment the user receives an answer from the server. Total response time includes
the time to connect, the time to process the request on the server, and the time to transmit the response
back to the client:
Total response time = connect time + process time + response transit time
When throughput is low, the response transit time is insignificant. However, as throughput
approaches the limit of network bandwidth, the server has to wait for bandwidth to become available
before it can transmit the response.
The response time in a distributed system consists of all the delays created at the source site, in the
network, and at the receiver site. The possible reasons for the delays and their length depend on the
system components and the characteristics of the transport media. The response time consists of the
delays in both directions.
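As a simple numerical illustration of the formula above, the sketch below computes a total response time, treating the response transit time as the object size divided by the available bandwidth. The numbers are illustrative, not measurements.

def total_response_time(connect_time, process_time, object_bytes, bandwidth_bytes_per_second):
    """Total response time = connect time + process time + response transit time."""
    transit_time = object_bytes / bandwidth_bytes_per_second
    return connect_time + process_time + transit_time

# Example: a 10 KB page with 10 ms to connect, 5 ms of server processing,
# and 1 MB/s of spare bandwidth gives roughly 0.025 seconds in total.
print(total_response_time(0.010, 0.005, 10_000, 1_000_000))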
The widely deployed scaling methods for clustered web servers are round robin DNS and packet routing devices that distribute incoming traffic.
2.8.4 Principles of Scalable Architecture
This section discusses the principles of scalable architectures. After presenting the HAS architecture
in Chapter 4, we discuss how the HAS architecture design meets the architectural scaling principles presented in this section.
We can characterize the applications by their consumption of four primary system resources:
processor, memory, file system bandwidth, and network bandwidth. We can achieve scalability by
simultaneously optimizing the consumption of these resources and designing an architecture that can
grow modularly by adding more resources.
Several design principles are required to design scalable systems. The list includes divide and
conquer, asynchrony, encapsulation, concurrency, and parsimony [65]. Each of these principles
presents a concept that is important in its own right when designing a scalable system. There are also
tensions between these principles; we can sometimes apply one principle at the cost of another. The
root of a solid system design is to strike the right balance among these principles. In the following subsections, we present each of these principles.
2.8.4.2 Asynchrony
The asynchrony principle means that the system carries out the work based on available resources
[65]. Synchronization constrains a system under load because application components cannot process
work in random order, even if resources do exist to do so. Asynchrony decouples functions and lets
the system schedule resources more freely and thus potentially more completely. This principle
allows us to implement strategies that effectively deal with stress conditions such as peak load.
2.8.4.3 Encapsulation
The encapsulation principle is the concept of building the system using loosely coupled components,
with little or no dependence among components [65]. This principle often, but not always, correlates
with asynchrony. Highly asynchronous systems tend to have well encapsulated components and vice
versa. Loose coupling means that components can pursue work without waiting for work from others.
2.8.4.4 Concurrency
The concurrency principle means that there are many moving parts in a system and the goal is to split
the activities across hardware, processes, and threads [65]. Concurrency aids scalability by ensuring
that the maximum possible work is active at all times and addresses system load by spawning new
resources on demand within predefined limits. Concurrency also maps directly to the ability to scale
by rolling in new hardware. The more concurrency applications exploit, the better the possibilities to
expand by adding new hardware.
2.8.4.5 Parsimony
The parsimony principle indicates that the designer of the system needs to be economical in what he
or she designs [65]. Each line of code and each piece of state information has a cost, and, collectively,
the costs can increase exponentially. A developer has to ensure that the implementation is as efficient
and lightweight as possible. Paying attention to thousands of micro details in a design and
implementation can eventually pay off at the macro level with improved system throughput.
Parsimony also means that designers carefully use scarce or expensive resources. No matter what
design principle a developer applies, a parsimonious implementation is appropriate. Some examples
include algorithms, I/O, and transactions. Parsimony ensures that algorithms are optimal to the task
since several small inefficiencies can add up and kill performance. Furthermore, performing I/O is
one of the more expensive operations in a system and we need to keep I/O activities to the bare
minimum. Moreover, transactions constrain access to costly resources by imposing locks that prohibit
read or write operations. Applications should work outside of transactions whenever feasible and exit each transaction in the shortest time possible.
2.8.5 Strategies for Achieving Scalability
Section 2.8.4 presented the five principles of scalable architectures. This section presents the design strategies to achieve a scalable architecture.
The researchers at the Korea Advanced Institute of Science and Technology have developed an adaptive load balancing method that changes the number of scheduling entities according to the workload [71]. It behaves exactly like a dispatcher-based scheme under low or intermediate workload, taking advantage of fine-grained load balancing. When the dispatcher is overloaded, the DNS servers distribute the dispatching jobs to other entities such as the back-end servers. In this way, they relieve the hot spot at the dispatcher. Based on simulation results, they demonstrated that the adaptive dispatching method improves the overall performance under a realistic workload simulation.
In [72], the authors present and evaluate an implementation of a prototype scalable web server
consisting of a balanced cluster of hosts that collectively accept and service TCP connections. The
host IP addresses are advertised using the round robin DNS technique, allowing any host to receive requests from a client. They use a low-overhead technique called distributed packet rewriting (DPR) to
redirect TCP connections. Each host keeps information about the remaining hosts in the system. Their
performance measurements suggest that their prototype outperforms round robin DNS. However,
their benchmarking was limited to a five-node cluster, where each node reached a peak of 632
requests per second, compared to the over 1,000 requests per second per node that we achieved with our early prototype (Section 3.7).
In [45], the authors discuss clustering as a preferred technique to build scalable web servers. The
authors examine early products and a sample of contemporary commercial offerings in the field of
transparent web server clustering. They broadly classify transparent server clustering into three
categories: L4/2, L4/3, and L7 clustering, and discuss their advantages and disadvantages.
In [73], the authors present their two implementations for traffic manipulation inside a web cluster:
MAC-based dispatching (LSMAC) and IP-based dispatching (LSNAT). The authors discuss their
results, and the advantages and disadvantages of both methods. Section 2.10.4 discusses those
approaches.
The researchers from Lucent Technologies and the University of Texas at Austin present in [74] their
architecture for a scalable web cluster. The distributed architecture consists of independent servers
sharing the load through a round robin traffic distribution mechanism.
In [75], the authors present optimizations to the NCSA HTTP server [76] to make it more scalable and allow it to serve more requests.
2.10 Related Work: In-depth Examination
This section discusses six projects that share the common goal of increasing the performance and
scalability of web clusters. These projects had different focus areas, such as traffic distribution algorithms, new architectures, and presenting the cluster as a single server through a virtual IP layer. This section examines these research projects, presents their respective areas of research and their architectures, highlights their status and plans, and discusses the contributions of their research to our work. The works discussed are the following:
- “Redirectional-based Web Server Architecture” at University of Texas (Austin): The goal of this
project is to design and prototype a redirectional-based hierarchical architecture that eliminates
bottlenecks in the cluster and allows the administrator to add hardware seamlessly to handle
increased traffic [77]. Section 2.10.1 discusses this project.
- “Scalable policies for Scalable Web clusters” at the University of Roma: The goal of the project
is to provide scalable scheduling policies for web clusters [68][78]. Section 2.10.2 discusses this
project.
- “The Scalable Web Server (SWEB)” at the University of California (Santa Barbara): The project
investigates the issues involved in developing a scalable web server on a cluster of workstations.
The objective is to strengthen the processing capabilities of such servers by utilizing the power of clustered computers to match the huge demand for simultaneous access requests from the Internet [78].
Section 2.10.3 discusses this project.
- “LSMAC and LSNAT”: The project at the University of Nebraska-Lincoln investigates server
responsiveness and scalability in clustered systems and client/server network environments [79].
The project is focusing on different server infrastructures to provide a single entry into the cluster
and traffic distribution among the cluster nodes [73]. Section 2.10.4 examines the project and its
results.
- “Harvard Array of Clustered Computers (HACC)”: The HACC project aims to design and prototype a cluster architecture for scalable web servers [81]. The focus of the project is on a
technology called “IP Sprayer”, a router component that sits between the Internet and the cluster
and is responsible for traffic distribution among the nodes of the cluster [82]. Section 2.10.5
discusses this project.
- “IBM Scalable and Highly Available Web Server”: This project is investigating scalable and highly available web clusters. The goal of the project is to develop a scalable web cluster that
will host web services on IBM proprietary SP-2 and RS/6000 systems [83]. Section 2.10.6
discusses this project.
2.10.1 Redirection-Based Web Server Architecture
[Figure 16: The hierarchical redirection-based web server architecture. Clients are distributed by round robin DNS across redirectional servers 1 through k, which redirect requests to the HTTP servers.]
Figure 16 illustrates the architecture of the hierarchical redirection based web server approach. Each
HTTP server stores a portion of the data available at the site. The round robin DNS distributes the
load among the redirection servers [75]. The redirection servers in turn redirect the requests to the
HTTP servers where a subset of the data resides. The redirection mechanism is part of the HTTP
protocol and it is completely transparent to the user. The browser automatically recognizes the
redirection message, derives the new URL from it, and connects to the new server to fetch the file.
The original goal of the redirection mechanism supported in HTTP was to facilitate moving files from
one server to another. When a client later uses the old URL from its cache or bookmarks, and the file referenced by the old URL has moved to a new server, the old server returns a redirection message, which contains the new URL. The cluster administrator partitions the documents stored at
the site among the different servers based on their content. For instance, server 1 could store stock
price data, while server 2 stores weather information and server 3 stores movie clips and reviews. All
requests for stock quotes are directed to server 1, while requests for weather information are directed to server 2.
It is possible to implement the architecture described with server software modifications. However, in
order to provide more flexibility in load balancing and additional reliability, there is a need to
replicate contents on multiple servers. Implementing data replication requires modifying the data
structure containing the mapping information. If there is replication of data, a logical file name is
mapped to multiple URLs on different servers. In this case, the redirection server has to choose one of
the servers containing the relevant information data. Intelligent strategies for choosing the servers can
be implemented to better balance the load among the HTTP servers. Many approaches are possible
including round robin and weighted round robin.
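As an illustration of the redirection mechanism, the sketch below implements a minimal redirection server on top of Python's standard library. The prefix-to-server table, host names, and port are hypothetical, and replicas of the same content are selected in round robin order; it is a sketch of the idea, not the authors' implementation.

import itertools
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from URL prefix to the replicas that store that content.
CONTENT_MAP = {
    "/stocks/":  itertools.cycle(["http://server1.example.com"]),
    "/weather/": itertools.cycle(["http://server2.example.com", "http://server3.example.com"]),
}

class RedirectionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        for prefix, replicas in CONTENT_MAP.items():
            if self.path.startswith(prefix):
                self.send_response(302)                                  # HTTP redirection message
                self.send_header("Location", next(replicas) + self.path)
                self.end_headers()
                return
        self.send_error(404)                                             # content not mapped to any server

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectionHandler).serve_forever()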
[Figure 17: Steps of a web request in the redirection-based architecture, involving the browser, the DNS, the redirectional server, and the target HTTP server.]
Figure 17 illustrates the steps a web request goes through until the client gets a response back from
the HTTP server. The web user types a web request into the web browser (1). The DNS server
resolves the address and returns the IP address of the server, which in this case is the address of the
redirectional server (2). When the request arrives at the redirectional server (3), it is examined and
forwarded to the appropriate HTTP server (4,5). The HTTP server processes the request and replies to
the web client (6).
The authors implement load balancing by having each HTTP server report its load periodically to a
load monitoring coordinator. If the load on a particular server exceeds a certain threshold, the load
balancing procedure is triggered. Some portions of the content on the overloaded server are then
moved to another server with lower load. Next, the redirection information is updated in all
redirection servers to reflect the data move.
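The threshold-triggered rebalancing loop described above can be sketched as follows. The threshold value, the data structures, and the update_mapping call are assumptions standing in for the mechanisms of the actual prototype.

LOAD_THRESHOLD = 0.8             # assumed utilization level that triggers rebalancing

def rebalance(server_loads, content_map, redirection_servers):
    """server_loads: {server: load}; content_map: {server: [documents]} (illustrative structures)."""
    overloaded = [s for s, load in server_loads.items() if load > LOAD_THRESHOLD]
    for server in overloaded:
        target = min(server_loads, key=server_loads.get)    # server with the lowest reported load
        if target == server or not content_map[server]:
            continue
        document = content_map[server].pop()                # move a portion of the content
        content_map[target].append(document)
        for redirector in redirection_servers:              # keep every redirection server consistent
            redirector.update_mapping(document, target)     # hypothetical update call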
The authors implemented a prototype of the redirection-based server architecture using one
redirectional server and three HTTP servers. Measurements using the WebStone [86] benchmark
demonstrate that the throughput scales up with the number of machines added. Measurements of
connection times to various sites on the Internet indicate that the additional connection to the redirection server accounts for approximately a 20% increase in latency [74].
This architecture is implemented using COTS hardware and server software. Web clients see a single
logical web server without knowing the actual location of the data, or the number of current servers
providing the service. The administrator of the system partitions the document store among the available cluster nodes; however, this is a tedious, manual process. The architecture does not provide dynamic load balancing; rather, it requires the intervention of the system administrator to move data to different servers and to update the redirection rules manually.
One important characteristic of the implementation is the size of the mapping table. The HTTP server
stores the redirection information in a table that is created when the server is started and stored in
main memory. This mapping table is searched on every access to the redirection server. If the table grows too large, it increases the lookup time on the redirection server.
The architecture assumes that all HTTP servers have disk storage, which is not very realistic as many
real deployments take advantage of diskless nodes and network storage. The maintenance and update
of all copies of the data is difficult. In addition, web requests require an additional connection
between the redirection server and the HTTP server.
Other drawbacks of the architecture include the lack of redundancy at the main redirectional server.
The authors did not focus on incorporating high availability capabilities within the architecture. In
addition, since the architecture assumes a single redirectional server, there was no effort to investigate
a single IP interface to hide all the redirectional servers. As a result, the redirectional server poses a
SPOF and limits the performance and scalability of the architecture. Furthermore, the authors did not
investigate the scaling limitations of the architecture. Overall, the architecture offers only a limited level of scalability.
We classify the main inputs from this project into four essential points. First, the research provided us
with a confirmation that a distributed architecture is the right way to proceed forward. A distributed
architecture allows us to add more servers to handle the increase in traffic in a transparent fashion.
The second input is the concept of specialization. Although very limited in this study, node
specialization can be beneficial where different nodes within the same cluster handle different traffic
depending on the application running on the cluster nodes. The third input to our work relates to load
balancing and moving data between servers. The redirectional architecture achieves load balancing by
manually moving data to different servers, and then updating the redirection information stored on the
redirectional server. This load balancing scheme is an interesting concept for small configurations;
however, it is not practical for large web clusters and we do not consider this approach for our
architecture. The fourth input to our work is the need for a dynamic traffic distribution mechanism
that is efficient and lightweight.
2.10.2 Scalable Policies for Scalable Web Clusters
Figure 18: The web farm architecture with the dispatcher as the central component
Figure 18 presents the architecture of the web cluster with n servers connected to the same local
network and providing service to incoming requests. The dispatcher server connects to the same
network as the cluster servers, provides an entry point to the web cluster, and retains transparency of
the distributed architecture for the users [84]. The dispatcher receives the incoming HTTP requests and distributes them to the back-end cluster servers.
Although web clusters consist of several servers, all servers use one site hostname to provide a single interface to all users. Moreover, to have a mechanism that controls the totality of the requests reaching the site and to mask the service distribution among multiple back-end servers, the web server farm provides a single virtual IP address that corresponds to the address of the front-end server(s). This entity is the dispatcher, which acts as a centralized global scheduler that receives incoming requests and routes them among the back-end servers of the web cluster. To distribute the load among the web servers, the dispatcher identifies each server in the web cluster uniquely through a private address.
The researchers argue that the dispatcher cannot use highly sophisticated algorithms for traffic distribution because it has to make fast decisions for hundreds of requests per second. Static algorithms are the fastest solution because they do not rely on the current state of the system at the time of making the distribution decision. Dynamic distribution algorithms have the potential to outperform static algorithms by using some state information to help dispatching decisions. However, they require a mechanism that collects, transmits, and analyzes that information, thereby incurring overhead.
The research project considered three scheduling policies that the dispatcher can execute [84]:
random (RAN), round robin (RR) and weighted round robin (WRR). The project does not consider
sophisticated traffic distribution algorithms to prevent the dispatcher from becoming the primary
bottleneck of the web farm.
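The three policies considered, RAN, RR, and WRR, reduce to a very small selection step. The sketch below is our own illustration of that step and does not reproduce the project's implementation.

import itertools
import random

def make_random_policy(servers):
    return lambda: random.choice(servers)                   # RAN: uniform random selection

def make_round_robin_policy(servers):
    cycle = itertools.cycle(servers)
    return lambda: next(cycle)                              # RR: strict circular assignment

def make_weighted_round_robin_policy(server_weights):
    # WRR: each server appears in the cycle in proportion to its statically assigned weight.
    expanded = [server for server, weight in server_weights.items() for _ in range(weight)]
    cycle = itertools.cycle(expanded)
    return lambda: next(cycle)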
Based on modeling simulations, the project observed that bursts of arrivals and skewed service times alone do not motivate the use of sophisticated global scheduling algorithms. Instead, an important feature to consider for the choice of the dispatching algorithm is the type of services provided by the web site. If the dispatcher mechanism has full control over client requests and clients require HTML pages or submit light queries to a database, system scalability is achieved even without sophisticated scheduling algorithms. In these instances, straightforward static policies are as effective as their more complex dynamic counterparts. Scheduling based on dynamic state information appears to be necessary only for sites where the majority of client requests are three or more orders of magnitude more expensive to serve than a static HTML page with some embedded objects.
The project observes that for web sites characterized by a large percentage of static information, a static dispatching policy such as round robin provides satisfactory performance and load balancing. Their interpretation of this result is that a light-to-medium load is implicitly balanced by the fully controlled circular assignment among the server nodes that is guaranteed by the dispatcher of the web
farm. When the workload characteristics change significantly, so that very long services dominate,
the system requires dynamic routing algorithms such as WRR to achieve a uniform distribution of the
workload and a more scalable web site. However, in high traffic web sites, dynamic policies become
a necessity.
The researchers did not prototype the architecture as a real system or run benchmarking tests on it to validate its performance, scalability, and high availability. In addition, the project did not design or
prototype new traffic distribution algorithms for web servers; instead, it relied on existing distribution
algorithms such as the DNS routing and RAN, RR, and WRR distribution. The architecture presents
several single points of failure. In the event of the dispatcher failure, the cluster becomes unreachable.
Furthermore, if a cluster node becomes unavailable, there is no mechanism in place to notify the dispatcher of the failure of individual nodes. Moreover, the dispatcher presents a bottleneck to the cluster under heavy traffic load.
The main input from this project is that dynamic routing algorithms are a core technology to achieve a uniform distribution of the workload and to reach a scalable web cluster. The key is the simplicity of the dynamic scheduling algorithms.
2.10.3 The Scalable Web Server (SWEB)
[Figure 19: The SWEB architecture. The DNS routes HTTP requests from users on the Internet to the SWEB processors; each processor runs a scheduler, a load information module, and httpd with its disk, connected over an internal network.]
Figure 19 illustrates the SWEB architecture. The DNS routes the user requests to the SWEB
processors using round robin distribution. The DNS assigns the requests without consulting the
dynamically changing system load information. Each processor in the SWEB architecture contains a
scheduler, and the SWEB processors collaborate with each other to exchange system load
information. After the DNS sends a request to a processor, the scheduler on that processor decides whether to process this request or assign it to another SWEB processor. The architecture uses URL
redirection to achieve re-assignment. The SWEB architecture does not allow SWEB servers to
redirect HTTP requests more than once to avoid the ping-pong effect.
[Figure 20: Functional structure of the SWEB scheduler. A broker accepts each request and either handles it locally or reroutes it to the chosen server, consulting an oracle module that characterizes requests and a loadd module that manages distributed load information alongside httpd.]
Figure 20 illustrates the functional structure of the SWEB scheduler. The SWEB scheduler contains an HTTP daemon based on the source code of the NCSA HTTP server [76] for handling HTTP requests, in addition to the broker module that determines the best possible processor to handle a given request.
The broker consults with two other modules, the oracle module and the loadd module. The oracle
module is a miniature expert system, which uses a user-supplied table that characterizes the processor
and disk demands for a particular task. The loadd module is responsible for updating the system processor, network, and disk load information periodically (every 2 to 3 seconds), and for marking as unavailable the processors that have not responded within the time limit. When a processor leaves or joins the resource pool, the loadd module is aware of the change as long as the processor is in the original list of processors set up by the administrator of the SWEB system.
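The broker decision of Figure 20 can be paraphrased in the sketch below. The oracle and loadd interfaces, the request attributes, and the additive cost model are simplified assumptions, and the single-redirection rule reflects the ping-pong avoidance mentioned earlier.

def choose_processor(request, me, processors, oracle, loadd):
    """Sketch of the SWEB broker decision: serve locally or redirect once to a better processor."""
    demand = oracle.estimate(request)                        # estimated processor and disk demand
    cost = {p: loadd.current_load(p) + demand
            for p in processors if loadd.is_available(p)}    # skip processors that stopped responding
    best = min(cost, key=cost.get)
    if best == me or request.already_redirected:             # redirect at most once to avoid ping-pong
        return ("serve_locally", me)
    return ("redirect", best)                                # answered with a URL redirection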
The SWEB architecture investigates several concepts. It supports a limited flavor of dynamism while
monitoring the processor and disk usage on processors. The loadd module collects processor and disk usage information and feeds this information back to the broker to make better distribution decisions.
The drawback of this mechanism is that it does not report available memory as part of the metrics,
which is as important as the processor information; instead it reports local disk information for an
architecture that relies on a network file system for storage.
The SWEB architecture does not provide high availability features, making it vulnerable to single
points of failures. The oracle module expects as input from the administrator a list of processors in
the SWEB system and the processor and disk demands for a particular task. It is not able to collect
this information automatically. As a result, the administrator of the cluster must intervene every time a processor is added or removed.
The SWEB implementation modified the source code of the web server and created two additional
software modules [78]. The implementation is not flexible and does not allow the use of those modules outside the SWEB-specific architecture.
The researchers have benchmarked the SWEB architecture built using a maximum of four processors
with an in-house benchmarking tool, not using a standardized tool such as WebBench with a
standardized workload. The results of the tests demonstrate a maximum of 76 requests per second for a 1 KB request size, and 11 requests per second for a 1.5 MB request size, which ranks low compared to our initial benchmarking results (Section 3.7).
The project contributes to our work by providing a how-to on actively monitoring processor usage, I/O channels, and network load. This information allows us to distribute HTTP requests effectively across cluster nodes. Furthermore, the concept of a web cluster without master nodes,
and having the cluster nodes provide the services master nodes usually provide, is a very interesting
concept.
2.10.4 LSMAC and LSNAT
Figure 21 presents the LSMAC approach [80]. A client sends an HTTP packet (1) with A as the
destination IP address. The immediate router sends the packet to the dispatcher at IP address A (2).
Based on the load sharing algorithm and the session table, the dispatcher decides that this packet
should be handled by the back-end server, Server 2, and sends the packet to Server 2 by changing the
MAC address and forwarding it (3). Server 2 accepts the packet and replies directly to the client (4).
[Figures 21 and 22: The LSMAC and LSNAT approaches. HTTP requests with destination IP address A pass through the router to the dispatcher and are forwarded to the selected back-end server.]
Figure 22 illustrates the LSNAT approach [73][80]. The LSNAT implementation follows RFC 2391
[48]. A client (1) sends an HTTP packet with A as the destination IP address. The immediate router
sends the packet to the dispatcher (2) on A, since the dispatcher machine is assigned the IP address A.
Based on the load sharing algorithm and the session table, the dispatcher decides that this packet
should be handled by the back-end server, Server 2. It then rewrites the destination IP address as that of Server 2, recalculates the IP and TCP checksums, and sends the packet to Server 2 (3). Server 2 accepts the packet (4) and replies to the client via the dispatcher, which the back-end servers see as a gateway. The dispatcher rewrites the source IP address of the reply packet as A, recalculates the IP and TCP checksums, and sends the packet to the client (5).
The dispatcher in both approaches, LSMAC and LSNAT, is not highly available and presents a SPOF
that can lead to service discontinuity. The work did not focus on providing high availability capabilities; therefore, in the event of a node failure, the failed node continues to receive traffic. Moreover,
the architecture does not support scaling the number of servers. The largest setup tested was a cluster
that consists of four nodes. The authors did not demonstrate the scaling capabilities of the proposed
architecture beyond four nodes [73][80]. The performance measurements were performed using the
benchmarking tool WebStone [86]. The LSMAC implementation running on a four-node cluster averaged 425 transactions per second per traffic node. The LSNAT implementation running on a four-node cluster averaged 200 transactions per second per traffic node [79].
Furthermore, the architecture does not provide adaptive optimized distribution. The dispatcher does
not take into consideration the load of the traffic nodes nor their heterogeneous nature to optimize its
traffic distribution. It assumes that all the nodes have the same hardware characteristics such as the
same processor speed and memory capacity.
2.10.5 Harvard Array of Clustered Computers (HACC)
[Figures 23 and 24: Distribution of the document store across cluster nodes. In a conventional cluster every node serves documents a, B, and C, while with the HACC smart router each node is responsible for only one part of the document store.]
Figure 24 illustrates the concept of the HACC smart router. Instead of being responsible for the entire
working set, each node in the cluster is responsible for only a fraction of the document store. The size
of the working set of each node decreases each time we add a node to the cluster, resulting in a more
efficient use of resources per node. The smart router uses an adaptive scheme to tune the load
presented to each node in the cluster based on that node’s capacity, so that it can assign each node a
fair share of the load. The idea of HACC bears some resemblance to the affinity based scheduling
schemes for shared memory multiprocessor systems [88][89], which schedule a task on a processor
where relevant data already resides.
2.10.5.1 HACC Implementation
The main challenge in realizing the potential of the HACC design is building the Smart Router, and
within the Smart Router, designing the adaptive algorithms that direct requests at the cluster nodes
based on the locality properties and capacity of the nodes [81].
The smart router implementation consists of two layers: the low smart router (LSR) and the high
smart router (HSR). The LSR corresponds to the low-level kernel resident part of the system and the
HSR implements the high-level user-mode brain of the system. The authors conceived this
partitioning to create a separation of mechanism and policy, with the mechanism implemented in the
LSR and the policy implemented in the HSR.
The Low Smart Router: The LSR encapsulates the networking functionality. It is responsible for
TCP/IP connection setup and termination, for forwarding requests to cluster nodes, and for forwarding the results back to clients. The LSR listens on the web server port for a connection request. When the LSR receives a connection request, TCP passes a buffer to the LSR containing the HTTP request. The
HSR extracts and copies the URL from the request. The LSR queues all data from this incoming
request and waits for the HSR to indicate which cluster node should handle the request. When the
HSR identifies the node, the LSR establishes a connection with it and forwards the queued data over
this connection. The LSR continues to ferry data between the client and the cluster node serving the
request until either side closes the connection.
The High Smart Router: The HSR monitors the state of the document store, the nodes in the cluster,
and properties of the documents passing through the LSR. It uses this information to decide how to
distribute requests over the HACC cluster nodes. The HSR maintains a tree that models the structure
of the document store. Leaves in the tree represent documents and nodes represent directories. As the
HSR processes requests, it annotates the tree with information about the document store to be applied in load balancing. This information could include node assignment, document sizes, request latency
for a given document, and in general, sufficient information to make an intelligent decision about
which node in the cluster should handle the next document request. When a request for a particular
file is received for the first time, the HSR adds nodes representing the file and newly reached
directories to its model of the document store, initializing the file’s node with its server assignment.
In the current prototype, incoming new documents are assigned to the least loaded server node. After the first request for a document, subsequent requests go to the same server and thus improve the locality of reference.
Dynamic Load Balancing: Dynamic load balancing is implemented using Windows NT’s
performance data helper (PDH) interface [90]. The PDH interface allows collecting a machine’s
performance statistics remotely. When the smart router initializes, it spawns a performance
monitoring thread that collects performance data from each cluster node at a fixed interval. The HSR
then uses the performance data for load balancing in two ways. First, it identifies a least loaded node
and assigns new requests to it. Second, when a node becomes overloaded, the HSR tries to offload a
portion of the documents for which the overloaded node is responsible to the least loaded node.
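A compact sketch of the assignment and offloading rules just described follows. The dictionary bookkeeping, the load figures, and the offload fraction are simplified assumptions standing in for the HSR's annotated document tree and the collected performance statistics.

def assign_node(url, assignments, node_loads):
    """First request for a document goes to the least loaded node; later requests reuse the assignment."""
    if url not in assignments:
        assignments[url] = min(node_loads, key=node_loads.get)
    return assignments[url]

def offload(overloaded_node, assignments, node_loads, fraction=0.1):
    """Move a portion of an overloaded node's documents to the least loaded node."""
    target = min(node_loads, key=node_loads.get)
    documents = [url for url, node in assignments.items() if node == overloaded_node]
    for url in documents[: max(1, int(len(documents) * fraction))]:
        assignments[url] = target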
2.10.6 IBM Scalable and Highly Available Web Server
IBM Research is investigating the concept of a scalable and highly available web server that offers
web services via a Scalable Parallel (SP-2) system, a cluster of RS/6000 workstations. The goal is to
support a large number of concurrent users, high bandwidth, real time multimedia delivery, fine-
grained traffic distribution, and high availability. The server will provide support for large multimedia
files such as audio and video, real time access to video data with high access bandwidth, fine-grained
traffic distribution across nodes, as well as efficient back-end database access. The project is focusing
on providing efficient traffic distribution mechanisms and high availability features. The server
achieves traffic distribution by striping data objects across the back-end nodes and disks. It achieves
high availability by detecting node failures and reconfiguring the system appropriately. However,
there is no mention of the time to detect the failure and to recover.
[Figure 25: Architecture of the IBM scalable and highly available web server. Users connect over the external network to load-balancing front-end nodes, which reach the application-specific software on the back-end nodes and their disks through a communication switch.]
Figure 25 illustrates the architecture of the web cluster. The architecture consists of a group of nodes
connected by a fast interconnect. Each node in the cluster has a local disk array attached to it. The disks of a node can either maintain a local copy of the web documents or share them among the nodes.
The nodes of the cluster are of two types: front-end (delivery) nodes and back-end (storage) nodes.
The round robin DNS is used to distribute incoming requests from the external network to the front-
end nodes, which also run httpd daemons. The logical front-end node then forwards the required
command to the back-end nodes that have the data (document), using a shared file system. Next, the back-end nodes send the results to the front-end nodes through the switch, and then the results are
transmitted to the user. The front-end nodes run the web daemons and connect to the external network. To balance the load among them, clients spread the load across the front-end nodes using RR DNS [51]. All the front-end nodes are assigned a single logical name, and the RR DNS maps the name to multiple IP addresses.
[Figure 26: TCP router-based traffic distribution. Requests from the Internet reach the TCP router nodes, which forward them through the switch to the web server nodes.]
Figure 26 illustrates another approach for achieving traffic distribution. One or more nodes of the cluster serve as TCP routers, forwarding client requests to the different front-end nodes in the cluster in round robin order. The name and IP address of the router are public, while the addresses of the other nodes in the cluster are private. If there is more than one router node, a single name is used and the round robin DNS maps the name to the multiple router nodes. The flow of the web server router (Figure 26) is as follows. When a client sends requests to the router node (1), the router node forwards (2) all packets belonging to a particular TCP connection to one of the server front-end nodes. The router can use different algorithms to select which node to route to, or use a round robin scheme. The server nodes
directly reply to the client (3) without using the router. However, the server nodes change the source
address on the packets sent back to the client to be that of the router node. The back-end nodes host
the shared file system used by the front-ends to access the data.
There are several main drawbacks preventing the architecture from achieving a scalable and highly available web cluster: limited traffic distribution performance, limited scalability, lack of high availability capabilities, the presence of several SPOF, and the lack of a dynamic feedback mechanism.
The architecture relies on round robin DNS to distribute traffic among server nodes. The scheme is
static, does not adjust based on the load of the cluster nodes, and does not accommodate the
heterogeneous nature of the cluster nodes. The authors proposed an improved traffic distribution
mechanism [83] that involves changing packet headers but still relies on round robin DNS to
distribute traffic among router server nodes. The concept was prototyped with four front-end nodes and four back-end nodes. The project did not demonstrate whether the architecture is capable of scaling beyond four traffic nodes or whether failures at the node level are detected and accommodated dynamically. The architecture does not provide features that allow service continuity. The switch as
shown in Figure 25 and Figure 26 is a SPOF. The network file system where data resides is also
vulnerable to failures and presents another SPOF. Furthermore, the architecture does not support a
dynamic feedback loop that allows the router to forward traffic depending on the capabilities of each
traffic node.
2.10.7 Discussion
The surveyed projects share some common results and conclusions. Current server architectures do not provide the scalability needed to handle large traffic volumes and a large number of web users. There seems to be a consensus in the surveyed literature on the need to design a new server architecture that is able to meet the vision for next generation Internet servers. A distributed architecture that consists of independent servers sharing the load is more appropriate than single server architectures for implementing a scalable web server.
Surveyed projects such as [63], [64], [65], [68], [72], [91], [92], [93], [94], and [95] show that using clustering technologies helps increase the performance and scalability of the web server. Clustering is the dominant technology that will help us achieve better scalability and higher performance. The focus is not on clustering as a technology; rather, the focus is on using this technology as a means to achieve scalability, high capacity, and high availability. Several of the surveyed projects, such as [68], [79], [81], and [82], focused on providing efficient traffic distribution mechanisms and high availability features that enable continuous service availability. However, looking at current benchmarking results, adding high availability and fault tolerance features negatively affects the performance and scalability of the cluster. Traffic distribution is an essential aspect of achieving a highly scalable platform. Hardware-based traffic distribution solutions are not scalable; they constitute both a performance bottleneck and a SPOF.
There is a clear need to provide a software single virtual IP layer that hides the cluster nodes and makes them transparent to end users. There is also a need for an unsophisticated design and consequently a simple implementation. An uncomplicated design allows a smooth integration of many components into a well-defined architecture. The benefits are a lightweight design and a faster and more robust system. One interesting observation is that the surveyed works, with the exception of the HACC project [81], use the Linux operating system to prototype and implement Internet and web servers.
A highly available and scalable web cluster requires a smart and efficient traffic distribution mechanism that distributes incoming traffic from the cluster interface to the least busy nodes in the cluster based on a dynamic feedback loop. The distribution mechanism and the cluster interface should constitute neither a bottleneck nor a SPOF. To scale such a cluster, we would like to have the ability to add nodes into the cluster without disruption of the provided services, and the capability to increase the number of nodes to meet traffic demands while achieving close to linear scalability.
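To make this requirement concrete, the following minimal Python sketch (the function names and the load-reporting mechanism are hypothetical illustrations, not part of any surveyed system) shows a dispatcher that keeps the load indexes reported by traffic nodes and always selects the least busy node that has reported recently:

    import time

    node_loads = {}        # traffic node address -> (load index, time of last report)
    STALE_AFTER = 5.0      # ignore nodes that have not reported for this many seconds

    def report_load(node, load_index):
        # Called whenever a traffic node reports its current load index.
        node_loads[node] = (load_index, time.time())

    def pick_node():
        # Return the least busy node with a recent report, or None if none is available.
        now = time.time()
        fresh = {n: load for n, (load, ts) in node_loads.items() if now - ts <= STALE_AFTER}
        return min(fresh, key=fresh.get) if fresh else None

    report_load("10.0.0.11", 0.35)
    report_load("10.0.0.12", 0.80)
    assert pick_node() == "10.0.0.11"   # the less busy node is selected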
Chapter 3
Preparatory Work
This chapter describes the preparatory work conducted as part of the early investigations. It describes
the prototyped web server cluster, presents the benchmarking environment, and reports the
benchmarking results.
eight processors has three SCSI disks with a capacity of 54 GB in total; the other eight processors are
diskless. All processors have access to a common storage volume provided by the master nodes.
Although we used NFS, we also experimented with the Parallel Virtual File System (PVFS) to provide a shared disk space among all nodes. PVFS supports high performance I/O over the web [91][92][105][106]. To provide high availability, we implemented redundant Ethernet connections and redundant network file systems, in addition to software RAID [104]. The LVS [23] provides a single IP interface for the cluster and provides the HTTP traffic distribution mechanism among the servers in the cluster. As for the web server software, we run Apache release 2.08 [24] and Tomcat release 3.1 [25] on the traffic nodes. Other studies and benchmarking we performed in this area include [99], [100], [103], [107], and [108].
[Figure: The prototyped cluster: the LVS virtual IP interface provides a single entry point to the cluster, two master nodes provide storage (HA NFS) and cluster services, and redundant LANs (LAN 1 and LAN 2) interconnect the nodes]
As for the local network, all cluster nodes are interconnected using redundant dedicated links. Each
node connects to two networks through two Ethernet ports over two redundant network switches.
External network access is restricted to master nodes unless traffic nodes require direct access, which
is configurable for some scenarios such as direct routing. The dotted connection indicates that the
processor connects to LAN 1. The solid connection indicates that the processor connects to LAN 2.
The components and parameters of the prototyped system include two master nodes and 12 traffic
nodes, a traffic distribution mechanism, a storage sub-system, and local and external networks. The
master nodes provide a single entry point to the system through the virtual IP address receiving
incoming web traffic and distributing it to the traffic nodes using the traffic distribution algorithm.
The master nodes provide cluster-wide services such as I/O services through NFS, DHCP, and NTP
services. They respond to incoming web traffic and distribute it among the traffic nodes. The traffic
nodes run the Apache web server and their primary responsibility is to respond to web requests. They
rely on master nodes for cluster services. We used the LVS in different configurations to distribute
incoming traffic to the traffic nodes. Section 3.5 discusses the network address translation and direct
routing methods, and demonstrates their capabilities.
The workload tree provided by WebBench contains the test files that the WebBench clients access when we execute a test suite. The WebBench workload tree is the result of studying real-world sites such as Microsoft, USA Today, and the Internet Movie Database. The tree uses multiple directories and different directory depths. It contains over 6,200 static pages and executable test files. WebBench provides static (STATIC.TST) and dynamic (WBSSL.TST) test suites. The static suites use HTML and GIF files. The dynamic suites use applications that run on the server.
WebBench keeps all the transaction information at run time and uses this information to compute the final metrics presented when the tests are completed. The standard test suites of WebBench begin with one client and add clients incrementally until they reach a maximum of 60 clients per client machine. WebBench provides numerous standardized test suites. For our testing purposes, we executed a mix of the STATIC.TST (90%) and WBSSL.TST (10%) tests. Each run of this combination of test suites takes on average two and a half hours.
3.4 Web Server Performance
In considering the performance of a web server, we should pay special regard to its software, operating system, and hardware environment, because each of these factors can dramatically influence the results. In a distributed web server, this environment is complicated further by the presence of multiple components, which require connection handoffs, process activations, and request dispatching. A complete performance evaluation of all layers and components of a distributed web server system is very complex, if not impossible. Hence, a benchmarking study needs to define its goals and scope clearly. In our case, the goal is to evaluate the end-to-end performance of a web cluster. Our main interest is not in the hardware and operating system, which in many cases are given.
Web server performance refers to the efficiency of a server when responding to user requests according to defined benchmarks. Many factors affect a server's performance, such as application design and construction, database connectivity, network capacity and bandwidth, and hardware server resources. In addition, the number of concurrent connections to the web server has a direct impact on
its performance. Therefore, the performance objectives include two dimensions: the speed of a single
user's transaction and the amount of performance degradation related to the increasing number of
concurrent connections.
Metric name: Description
Throughput: The rate at which data is sent through the network, expressed in Kbytes per second (KB/s)
Connection rate: The number of connections per second
Request rate: The number of client requests per second
Reply rate: The number of server responses per second
Error rate: The percentage of errors of a given type
DNS lookup time: The time to translate the hostname into the IP address
Connect time: The time interval between sending the initial SYN and the last byte of a client request, and the receipt of the first byte of the corresponding response
Latency time: In a network, latency, a synonym for delay, is an expression of how much time it takes for a packet of data to get from one designated point to another
Transfer time: The time interval between the receipt of the first response byte and the last response byte
Web object response time: The sum of the latency time and the transfer time
Web page response time: The sum of the web object response times pertaining to a single web page, plus the connect time
Session time: The sum of all web page response times and user think time in a user session
Table 5 presents the common metrics for web system performance. In reporting the results of the
benchmarking tests, we report the connection rate, number of successful transactions per second, and
the throughput as KB/s.
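As a minimal illustration of how the timing metrics in Table 5 compose (assuming the definitions above; the function names are ours and purely illustrative), the web object and web page response times can be derived as follows:

    def web_object_response_time(latency_time, transfer_time):
        # Web object response time = latency time + transfer time
        return latency_time + transfer_time

    def web_page_response_time(object_response_times, connect_time):
        # Web page response time = sum of the object response times of the page + connect time
        return sum(object_response_times) + connect_time

    # Example: a page composed of three objects fetched over one connection (times in seconds).
    objects = [web_object_response_time(0.020, 0.015),
               web_object_response_time(0.018, 0.040),
               web_object_response_time(0.025, 0.010)]
    print(web_page_response_time(objects, connect_time=0.005))   # 0.133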
virtual server service (according to the virtual server rule table), then the scheduling algorithm, round
robin by default, chooses a real server from the cluster to serve the request, and adds the connection
into the hash table which records all established connections. The load balancer server rewrites the
destination address and the port of the packet to match those of the chosen real server, and forwards
the packet to the real server. The real server processes the request (3) and returns the reply to the load
balancer. When an incoming packet belongs to this connection and the established connection exists
in the hash table, the load balancer rewrites and forwards the packet to the chosen server. When the
reply packets come back from the real server to the load balancer, the load balancer rewrites the
source address and port of the packets (4) to those of the virtual service, and submits the response
back to the client (5). The LVS removes the connection record from the hash table when the connection terminates or times out.
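The following Python sketch is a conceptual model of the NAT forwarding logic described above (simplified packet dictionaries rather than real kernel structures; the addresses are examples): a new connection is scheduled round robin, requests are rewritten toward the chosen real server, and replies are rewritten to carry the virtual service address.

    import itertools

    VIRTUAL_IP, VIRTUAL_PORT = "192.168.1.100", 80
    real_servers = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
    scheduler = itertools.cycle(real_servers)   # round robin, the default scheduling algorithm
    connections = {}                            # hash table: (client ip, client port) -> real server

    def forward_request(packet):
        # Rewrite the destination of a client packet and forward it to a real server.
        key = (packet["src_ip"], packet["src_port"])
        if key not in connections:              # new connection: choose a real server
            connections[key] = next(scheduler)
        packet["dst_ip"], packet["dst_port"] = connections[key], 80
        return packet                           # sent on to the chosen real server

    def rewrite_reply(packet):
        # Rewrite the source of a reply so that it appears to come from the virtual service.
        packet["src_ip"], packet["src_port"] = VIRTUAL_IP, VIRTUAL_PORT
        return packet                           # sent back to the client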
[Figure: The LVS NAT method: (1) client requests arrive from the Internet/intranet at the load balancer (a Linux box), (2) the load balancer schedules a real server and rewrites the packets, (3) the real server processes the request, (4) the load balancer rewrites the replies, and (5) the replies are returned to the client]
When a user accesses a virtual service provided by the server cluster (1), the packet destined for the virtual IP address arrives at the load balancer. The load balancer examines (2) the packet's destination address and port. If it matches a virtual service, the scheduling algorithm chooses a real server (3) from the cluster to serve the request and adds the connection into the hash table that records connections. Next, the load balancer forwards the request to the chosen real server. If new incoming packets belong to this connection and the chosen server is available in the hash table, the load balancer directly routes the packets to the real server. When the real server receives the forwarded packet, the server finds that the packet is for the address on its alias interface or for a local socket, so it processes the request (4) and returns the result directly to the user (5). The LVS removes the connection record from the hash table when the connection terminates or times out.
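A corresponding sketch of the direct routing idea (again conceptual; the MAC addresses and helper names are invented for illustration): the director rewrites only the link-layer destination, and the real server accepts the packet because the virtual IP address is configured on a local alias interface, replying directly to the client.

    VIRTUAL_IP = "192.168.1.100"
    real_server_macs = {"10.0.0.11": "00:11:22:33:44:01",
                        "10.0.0.12": "00:11:22:33:44:02"}

    def director_forward(frame, chosen_server):
        # The IP packet is left untouched (its destination is still the virtual IP);
        # only the Ethernet destination is rewritten to the chosen real server.
        frame["eth_dst"] = real_server_macs[chosen_server]
        return frame                      # handed to the internal network; no NAT rewriting

    def real_server_accepts(packet, alias_ips=(VIRTUAL_IP,)):
        # The real server has the virtual IP configured on an alias (non-ARP) interface,
        # so it treats the packet as local, serves the request, and replies directly
        # to the client using the virtual IP as the source address.
        return packet["dst_ip"] in alias_ips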
[Figure: The LVS direct routing method: (1) requests arrive from the Internet/intranet at the Linux director over the virtual IP address, (2) the director examines the packet destination, (3) forwards the request to a real server over the internal network, and (4) the real server processes the request and replies directly to the user]
The main advantage of using tunneling is that the real servers (i.e. the traffic nodes) can be on a different network. We did not experiment with the IP tunneling method because of the unstable status of its implementation, and because it does not provide additional capabilities over the DR method. However, we present it for completeness.
[Figure: Requests per second versus the number of WebBench clients (1 to 60 clients) for the LVS traffic distribution benchmarking tests]
In both tests, the bottleneck occurs at the load balancer node that was unable to accept more traffic
and distribute it to the traffic servers. Instead, the LVS director was rejecting incoming connections
resulting in unsuccessful requests. This test demonstrates that the DR approach is more efficient than
the NAT approach and allows better performance and scalability. In addition, it demonstrates the
bottleneck at the director level of the LVS.
[Figure 32: Benchmarking results of the Apache web server running on a single processor (requests per second versus number of clients)]
[Figure 33: Apache reaching a peak throughput of 5,903 KB/s before the Ethernet driver crashes]
Apache served 1,053 requests per second before it suddenly stopped servicing incoming requests; in fact, as far as the WebBench tool could tell, Apache had crashed (Figure 32). We initially thought that the Apache server had crashed under heavy load. However, that was not the case. The Apache server process was still running when we logged locally into the machine. It turned out that the Ethernet device driver had crashed, which caused the processor to disconnect from the network and become unreachable. Figure 33 illustrates the throughput achieved on one processor (5,903 KB/s) before the processor disconnects from the network due to the device driver crash. We investigated the device driver problem, fixed it, and made the updated source code publicly available. We did not face the driver crash problem in further testing. Figure 34 presents the benchmarking result of Apache on a single processor after fixing the device driver problem. Apache served an average of 1,043 requests per second.
Figure 34: Benchmarking results of Apache on one processor – post Ethernet driver update
Next, we set up the cluster in several configurations with two, four, six, eight, 10, and 12 processors and performed the benchmarking tests. In these benchmarks, the LVS forwarded the HTTP traffic to the traffic nodes following the DR distribution method.
Figure 35 presents the results of the benchmarking test we performed on a cluster with two processors
running Apache. The average number of requests per second per processor is 945.
[Figure 35: Results of a two-processor cluster running Apache (requests per second versus number of clients)]
Figure 36 presents the results of the benchmarking test we performed on a cluster with four
processors running Apache. The average number of requests per second per processor is 1,003.
[Figure 36: Results of a four-processor cluster running Apache (requests per second versus number of clients)]
Figure 37 presents the results of the benchmarking test we performed on a cluster with eight
processors running Apache. The average number of requests per second per processor is 892.
Table 6 presents the results of Apache benchmarking for all the cluster configurations including the
single standalone node.
Processors in the cluster    Maximum requests per second    Transactions per second per processor
1                            1053                           1053
2                            1890                           945
4                            4012                           1003
6                            5847                           974
8                            7140                           892
10                           7640                           764
12                           8230                           685
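To make the scalability trend in Table 6 explicit, the following short Python computation derives the per-processor rate and the efficiency relative to the single-processor baseline (the values are taken directly from the table):

    results = {1: 1053, 2: 1890, 4: 4012, 6: 5847, 8: 7140, 10: 7640, 12: 8230}
    baseline = results[1]                       # single-processor rate in requests per second

    for processors, max_rps in results.items():
        per_processor = max_rps / processors
        efficiency = per_processor / baseline   # 1.0 would be perfectly linear scaling
        print(f"{processors:2d} processors: {per_processor:6.0f} req/s per processor "
              f"({efficiency:.0%} of the baseline)")
    # The 12-processor configuration delivers about 686 req/s per processor,
    # roughly 65% of the single-processor baseline.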
Figure 38: Results of Tomcat running on two processors (requests per second)
Figure 39 presents the results of the benchmarking test we performed on a system with four processors running Tomcat. The average number of requests per second per processor is 75. Figure 40 presents the results of the benchmarking test we performed on a system with eight processors running Tomcat. The average number of requests per second per processor is 71.
[Figure 39: Results of a four-processor cluster running Tomcat (requests per second versus number of clients)]
[Figure 40: Results of an eight-processor cluster running Tomcat (requests per second versus number of clients)]
Table 7 presents the results of testing the prototyped cluster running the Tomcat application server. For each cluster configuration, we present the number of processors in the cluster, the maximum performance achieved by the cluster in terms of requests per second, and the average number of requests per second per cluster processor.
Number of processors in the cluster    Cluster maximum requests per second    Transactions per second per processor
1                                      81                                     81
2                                      152                                    76
4                                      300                                    75
6                                      438                                    73
8                                      568                                    71
10                                     700                                    70
12                                     804                                    67
Figure 41: Scalability chart for clusters consisting of up to 12 nodes running Apache
Figure 42: Scalability chart for clusters consisting of up to 12 nodes running Tomcat
3.10 Discussion
Web servers have a limited capacity in serving incoming requests. In the case of Apache, the capacity
limit is around 1,000 requests per second when running on a single processor. Beyond this threshold,
the server starts rejecting incoming requests.
We have demonstrated non-linear scalability with clusters of up to 12 nodes running the Apache and Tomcat web servers. In the case of Apache, for instance, when we scale the cluster from a single processor to 12 processors, the number of successful requests per second per processor drops from 1,053 to 685, a decrease of 35%. These results reveal major performance degradation. Theoretically, as we add more processors into the cluster, we would like to achieve linear scalability and maintain the baseline performance of 1,000 requests per second per processor.
We experimented with the NAT and DR traffic distribution approaches. The NAT approach, although
widely used, has limited performance and scalability compared to the DR approach as demonstrated
in Section 3.5.4.
Our results demonstrate that the bottleneck occurs at the master node level, largely because of inefficiencies in the traffic distribution mechanism. We plan to propose an enhanced method based on the DR approach for our highly available and scalable architecture. The planned improvements include a daemon running on all traffic nodes that reports the node load to the distribution mechanism, allowing it to perform a more efficient and dynamic distribution.
We experienced some problems with the Ethernet device drivers crashing under high traffic load (a throughput of 5,903 KB/s – Figure 33). We improved the device driver code and, as a result, it is now able to sustain a higher throughput. We contributed the improvements back to the Ethernet card provider (ZNYX Networks) and to the open source community.
As far as the benchmarking tests are concerned, it would have helped to include other metrics, such as processor utilization as well as file system and disk performance metrics, to provide more insight on bottlenecks. These metrics are not available in WebBench, and therefore the only way to obtain them is to either implement a separate tool ourselves or use an existing tool.
We acknowledge that the performance of the network file system has certain effects on the total
performance of the system since the network file system hosts the web documents shared between all
traffic nodes. However, the performance of the file system is out of our scope.
3.11 Contributions of the Preparatory Work
We examined current ways of solving scalability challenges and demonstrated scalability problems in
a real system through building a web cluster and benchmarking it for performance and scalability.
Many factors differentiate our early experimental work from others. We did not rely on simulation models to define our system and benchmark it. Instead, we followed a systematic approach, building the web cluster using existing system components and following best practices. In many instances, we contributed system software, such as Ethernet and NFS redundancy, and introduced enhancements to existing implementations. Similarly, we built our benchmarking environment from
scratch and we did not rely on simulation models to get performance and scalability results. This
approach gave us much flexibility and allowed us to test many different configurations in a real world
setting. One unique aspect of our experiments, which we did not see in related work (Sections 2.9 and
2.10), is the scale of the benchmarking environment and the tests we conducted. Our early prototyped
cluster consisted of 12 processors and our benchmarking environment consisted of 17 machines.
Surveyed projects were limited in their resources and were not able to demonstrate the negative scalability effects we experienced when reaching 12 processors, simply because they only tested with up to eight processors. In addition, other works relied on simulation to get a sense of how their architecture would perform. In contrast, we performed our benchmarking using an industry-standardized tool and workload. The benchmarking tool, WebBench, uses standardized workloads and is capable of generating more traffic and compiling more test results and metrics than the other available tools used in the surveyed work, such as SURGE [109], S-Clients [110], WebStone [111], SPECweb99 [112], and TPC-W [113]. A paper comparing these benchmarking tools is available from [114]. Furthermore, the entire server environment uses open source technologies, allowing us access to the source code and granting the freedom to modify it and introduce changes to suit our needs.
Chapter 4
The Architecture of the Highly Available and Scalable Web Server
Cluster
This chapter describes the architecture of the highly available and scalable (HAS) web server cluster. The chapter reviews the initial requirements discussed in previous chapters, and then presents the HAS cluster architecture, its tiers, and its characteristics. It discusses the architecture components, presents how they interact with each other, illustrates the supported redundancy models, and discusses the various types of cluster nodes and their characteristics. Furthermore, the chapter includes examples of sample deployments of the HAS architecture as well as case studies that demonstrate how the architecture scales to support increased traffic. The chapter also addresses the subject of fault tolerance and the high availability features. It discusses the traffic management scheme responsible for dynamic traffic distribution and discusses the cluster virtual IP interface, which presents the HAS architecture as a single entity to the outside world. The chapter concludes with the scenario view of the HAS architecture and examines several use cases.
4.1.1 Scalability
The expansion of the user base requires scaling the capacity of the infrastructure to be in line with
demand. However, with a single server, this means upgrading the server vertically by adding more
memory, or replacing the processor(s) with a faster one. Each upgrade brings the service down for the
duration of the upgrade, which is not desirable. In some cases, the upgrade involves a full deployment
on entirely new hardware resulting in extended downtime. Clustering an application allows us to scale
it horizontally by adding more servers into the service cluster without necessitating service downtime.
However, even when utilizing clustering techniques to scale, the performance gain is not linear.
4.1.2 Minimal Response Time
Achieving a minimal response time is a crucial factor for the success of distributed web servers.
Many studies argue that 0.1 second is about the limit for having the user feel that the system is
reacting instantaneously [13][14][15]. Web users expect the system to process their requests and to
provide responses quickly and with high data access rates. Therefore, the server needs to minimize
response times to live up to the expectations of the users. The response time consists of the connect time, the process time, and the response transit time. The goal is to minimize all three of these parameters, resulting in a faster total response time.
monitoring the availability of the web server application running on the traffic nodes and ensuring
that the application is up and running. Otherwise, the master nodes will forward traffic to a node that
has an unresponsive web server application. Section 4.20 discusses this functionality.
without performance degradation and while maintaining the level of performance for up to 16
processors (Chapter 5).
Figure 43 illustrates the conceptual model of the architecture showing the three tiers of the
architecture and the software components inside each tier. In addition, it shows the supported
redundancy models per tier. For instance, the high availability (HA) tier supports the 1+1 redundancy
model (active/active and active/standby) and can be expanded to support the N+M redundancy model,
where N nodes are active and M nodes are standby. Similarly, the scalability and service availability
(SSA) tier supports the N-way redundancy model where all traffic nodes are active and servicing
requests. This tier can be expanded as well to support the N+M redundancy model. Sections 4.9, 4.10,
and 4.11 discuss the supported redundancy models.
The HAS architecture is composed of three logical tiers: the high availability (HA) tier, the scalability
and service availability (SSA) tier, and the storage tier. This section presents the architecture tiers at a
high level.
The high availability tier: This tier consists of front-end systems called master nodes. Master nodes
provide an entry-point to the cluster acting as dispatchers, and provide cluster services for all HAS
cluster nodes. They forward incoming web traffic to the traffic nodes in the SSA tier according to the
scheduling algorithm. Section 4.5.1 covers the characteristics of this tier. Section 4.17.1 presents the
characteristics of the master nodes. Section 4.9 discusses the supported redundancy models of the HA
tier.
The scalability and service availability tier: This tier consists of traffic nodes that run application
servers. In the event that all servers are overloaded, the cluster administrator can add more nodes to
this tier to handle the increased workload. As the number of nodes increases in this tier, the cluster
throughput increases and the cluster is able to respond to more traffic. Section 4.6.2 describes the
characteristics of this tier. Section 4.17.2 presents the characteristics of the traffic nodes. Section 4.10
discusses the supported redundancy models of the SSA tier.
The storage tier: This tier consists of nodes that provide storage services for all cluster nodes so that
web servers share the same set of content. Section 4.6.3 describes the characteristics of this tier and
Section 4.17.3 describes the characteristics of the storage nodes. Section 4.11 presents the supported
redundancy models of the storage tier. The HAS cluster prototype did not utilize specialized storage
nodes. Instead, it utilized a contributed extension to the NFS to support HA storage.
[Figure 43: The conceptual model of the HAS architecture, showing the High Availability (HA) tier, the Scalability and Service Availability (SSA) tier, and the optional Storage tier. Legend: CCP1/CCP2 = Cluster Communication Paths 1 and 2 (LAN 1 and LAN 2), VIP = cluster virtual IP layer, DHCP = Dynamic Host Configuration Protocol, RAD = IPv6 router advertisement daemon, EthD = Ethernet redundancy daemon, RCM = redundancy configuration manager, NTP = Network Time Protocol, TM = traffic manager, NFS = network file server, TCD = traffic client daemon, HBD = heartbeat mechanism, CCM = cluster configuration manager, LDirectorD = Linux director daemon]
4.4 HAS Architecture Components
Each of the HAS architecture tiers consists of several nodes, and each node runs specific software components. A software component (or system software) is a stand-alone set of code that provides
service either to users or to other system software. A component can be internal to the cluster and
represents a set of resources contained on the cluster physical nodes; a component can also be
external to the cluster and represents a set of resources that are external to the cluster physical nodes.
Components can be either software or hardware components. Core components are essential to the
operation of the cluster. On the other hand, optional components are used depending on the usage and
deployment model of the HAS cluster.
The HAS architecture is flexible and allows administrators of the cluster to add their own software
components. The following sub-sections present the components of the HAS architecture, categorize
the components as internal or external, discuss their capabilities, functions, input, output, interfaces,
and describe how they interact with each other.
as how many nodes exist or where the applications run. The CVIP allows a virtually infinite number
of clients to reach a virtually infinite number of servers presented as a single virtual IP address,
without impact on client or server applications. The CVIP operates at the IP level, enabling
applications that run on top of IP to take advantage of the transparency it provides. The CVIP
supports IPv6 and is capable of handling incoming IPv6 traffic. Section 4.21 discusses the cluster
virtual IP interface.
The traffic manager daemon (TM) is a core system component that runs on the cluster master nodes.
The traffic manager receives the load index of the traffic nodes from the traffic client daemons,
maintains the list of available traffic nodes and their load indexes, and distributes traffic to the traffic nodes based on the distribution policy defined in its configuration file. The current traffic manager implementation supports round robin and the HAS distribution policy; however, the traffic manager can support additional policies. The traffic manager supports IPv6. Section 4.23.3 discusses the
traffic manager.
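A minimal sketch of this behaviour (the class and method names are hypothetical and do not correspond to the prototype's source code) keeps the reported load indexes and selects a node either round robin or by lowest load index:

    import itertools

    class TrafficManager:
        # Conceptual sketch of the traffic manager's node table and distribution policies.
        def __init__(self, policy="round_robin"):
            self.policy = policy              # "round_robin" or "least_loaded"
            self.load_index = {}              # traffic node -> last reported load index
            self._cycle = None

        def update_load(self, node, load):
            # Record the load index reported by the traffic client daemon on a node.
            self.load_index[node] = load
            self._cycle = itertools.cycle(sorted(self.load_index))

        def remove_node(self, node):
            # Drop a node reported as unavailable so no traffic is forwarded to it.
            self.load_index.pop(node, None)
            self._cycle = itertools.cycle(sorted(self.load_index)) if self.load_index else None

        def select_node(self):
            if not self.load_index:
                return None
            if self.policy == "round_robin":
                return next(self._cycle)
            return min(self.load_index, key=self.load_index.get)   # least loaded node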
The cluster configuration manager (CCM) is system software that manages all the configuration files that control the operation of the HAS architecture software components. It provides a centralized single access point for editing and managing all the configuration files. For the purpose of this work, we did not implement the cluster configuration manager; however, it is a high priority future work item. At the time of writing, with the HAS architecture prototype, we maintain the configuration files of the various software components on the network file system.
The redundancy configuration manager (RCM) is responsible for switching the redundancy configuration of each cluster tier from one redundancy configuration to another, such as from the 1+1 active/standby to the 1+1 active/active. It is also responsible for switching service between components when the cluster tiers follow the N+M redundancy model. Therefore, it should be aware of the active nodes in the cluster and their corresponding standby nodes. For the purpose of this work, we did not implement the redundancy configuration manager. Section 6.2.4 discusses the RCM as a future work item.
The IPv6 router advertisement daemon is an optional system software component that is used only when the HAS architecture needs to support IPv6. It offers automatic IPv6 configuration of the network interfaces of all cluster server nodes. It ensures that all the HAS cluster nodes can communicate with each other and with network elements outside the HAS architecture over IPv6.
The cluster administrator has the option of using the DHCP daemon (an optional server service) to assign IPv4 addresses to the cluster server nodes.
The NTP service is a required system service used to synchronize the time on all cluster server nodes. It is essential to the operation of other software components that rely on time stamps to verify whether a node is in service. Alternatively, we can use a time synchronization service provided by an external server located on the Internet; however, this poses security risks and is not recommended.
As for storage, we provide an enhanced implementation of the Linux kernel NFS server
implementation that supports NFS redundancy and eliminates the NFS server as a SPOF. Section 4.16
discusses the storage models and the various available possibilities.
The TFTP service daemon is an optional software component used in collaboration with the DHCP
service to provide the functionalities of an image server. The image server provides an initial kernel
and ramdisk image for diskless server nodes within the HAS system. The TFTP daemon supports
IPv6 and is capable of receiving requests to download kernel and ramdisk images over IPv6.
The heartbeat service (HBD) runs on master nodes and sends heartbeat packets across the network to
the other instances of Heartbeat (running on other master nodes) as a keep-alive type message. When
the standby master node no longer receives heartbeat packets, it assumes that the active master node
is dead, and then the standby node becomes primary. The heartbeat mechanism is a contribution from
the Linux-HA project [20]. We have contributed enhancements to the heartbeat service to
accommodate for the HAS architecture requirements. Section 4.19 discusses the heartbeat service and
its integration with the HAS architecture.
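A conceptual sketch of the keep-alive logic on the standby node (the interval and threshold values are illustrative assumptions, not taken from the Linux-HA implementation):

    import time

    HEARTBEAT_INTERVAL = 1.0                    # seconds between heartbeat packets (assumed)
    DEAD_AFTER = 3 * HEARTBEAT_INTERVAL         # declare the active node dead after 3 missed beats

    last_heartbeat = time.time()

    def on_heartbeat_received():
        # Called on the standby master node whenever a heartbeat packet arrives.
        global last_heartbeat
        last_heartbeat = time.time()

    def standby_should_take_over():
        # The standby becomes primary when the active node stops sending heartbeats.
        return time.time() - last_heartbeat > DEAD_AFTER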
The Linux director daemon (LDirectord) is responsible for monitoring the availability of the web
server application running on the traffic nodes by connecting to them, making an HTTP request, and
checking the result. If the LDirectord module discovers that the web server application is not
available on a traffic node, it communicates with the traffic manager to ensure that the traffic manager
does not forward traffic to that specific traffic node. Section 4.20 presents the functionalities of the
LDirectord.
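As an illustration of this kind of application-level check (a sketch only, using Python's standard http.client; the path, port, and timeout are assumptions rather than the LDirectord configuration):

    import http.client

    def web_server_alive(node, path="/index.html", timeout=2):
        # Return True if the web server on the traffic node answers an HTTP request.
        try:
            connection = http.client.HTTPConnection(node, 80, timeout=timeout)
            connection.request("GET", path)
            response = connection.getresponse()
            connection.close()
            return response.status == 200
        except (OSError, http.client.HTTPException):
            return False

    # A monitor would call web_server_alive() periodically for every traffic node and
    # tell the traffic manager to stop forwarding traffic to any node that fails the check.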
4.5.1 The High Availability Tier
The HA tier consists of master nodes that act as a dispatcher for the SSA tier. The role of the master
node is similar to a connection manager or a dispatcher. The HA tier does not tolerate service
downtime. If the master nodes are not available, the traffic nodes in the SSA tier become unreachable
and as a result, the HAS cluster cannot accept incoming traffic. The primary functions of the nodes in
this tier are to handle incoming traffic and distribute it to traffic nodes located in the SSA tier, and to
provide cluster infrastructure services to all cluster nodes.
The HA tier consists of two nodes configured following the 1+1 active/standby redundancy model.
The architecture supports the extension redundancy model of this tier to the 1+1 active/active
redundancy model. With the 1+1 active/active redundancy model, master nodes share servicing
incoming traffic to avoid bottlenecks at the HA tier level. Another possible extension to the
redundancy model is the support of the N-way and the N+M redundancy models, which would allow the HA tier to scale the number of master nodes one at a time. However, this requires a complex implementation and is not yet supported. Section 4.8 describes the supported redundancy models.
Furthermore, the master nodes provide cluster wide services to nodes located in the SSA and the
storage tiers. The HA tier controls the activity in the SSA tier, since it forwards incoming requests to
the traffic nodes. Therefore, the HA tier needs to determine the status of traffic nodes and be able to
reliably communicate with each traffic node. The HA tier uses traffic managers to receive load
information from traffic nodes. Section 4.6.1 presents the characteristics of the HA tier. Section 4.8
discusses the redundancy models supported by this tier.
way model: all nodes are active and there are no standby nodes. As such, redundancy is at node level.
Section 4.6.2 presents the characteristics of the SSA tier. Section 4.10 discusses the redundancy
models supported by this tier.
updated kernel and the new version of the ramdisk (if the node is diskless) or a new disk image (if the node has a disk) from the image server. Section 4.27.5 presents this upgrade scenario with a sequence diagram.
- Hosting application servers: Traffic nodes run application servers that can be stateful or stateless.
If an application requires state information, then the application saves the state information on the
shared storage and makes it available to all cluster nodes.
A = (MTBF / (MTBF + MTTR)) * 100,
where A is the percentage of availability, MTBF is the mean time between failures and MTTR is the
mean time to repair or resolve a particular problem. According to the formula, we calculate
availability A as the percentage of uptime for a given period, taking into account the time it requires
for the system to recover from unplanned failures and planned upgrades. As MTTR approaches zero,
the availability percentage A increases towards 100 percent. As the MTBF value increases, MTTR
has less impact on A. Following this formula, there are two possible ways to increase availability:
increasing MTBF and decreasing MTTR. Increasing MTBF involves improving the quality or
robustness of the software and using redundancy to remove single points of failure. As for decreasing MTTR, our focus in the implementation of the system software is to streamline and accelerate fail-over, respond quickly to fault conditions, and make faults more granular in time and scope, so that we have many short faults rather than a smaller number of long ones, and so that the scope of faults is limited to smaller components.
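As a small numerical illustration of the formula (the MTBF and MTTR values below are arbitrary examples, not measurements of the prototype):

    def availability(mtbf_hours, mttr_hours):
        # A = MTBF / (MTBF + MTTR) * 100, expressed as a percentage.
        return mtbf_hours / (mtbf_hours + mttr_hours) * 100

    # A component that fails on average every 1,000 hours and takes one hour to repair
    # is about 99.9% available; cutting MTTR to 0.1 hours raises this to roughly 99.99%.
    print(availability(1000, 1.0))   # ~99.90
    print(availability(1000, 0.1))   # ~99.99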
To increase the MTBF of the HAS architecture components, we need to avoid any SPOF. The following subsections discuss eliminating SPOFs at the level of master nodes, traffic nodes, application servers, networks and network interfaces, and storage nodes.
The HAS architecture supports fault tolerance through features such as the hot-standby data
replication to enable node failure recovery, storage mirroring to enable disk fault recovery, and LAN
redundancy to enable network failure recovery. The topology of the architecture enables failure
tolerance because of the various built-in redundancies within all layers of the HAS architecture.
Figure 44 illustrates the supported redundancy at the different layers of the HAS architecture. The cluster virtual IP interface (1) provides a transparent layer that hides the internals of the cluster. We can add or remove master nodes from the cluster without interruptions to the services (2). Each cluster node has two connections to the network (3), ensuring network connectivity. Many factors contribute towards achieving network and connection availability, such as the availability of redundant routers and switches, redundant network connections, and redundant Ethernet cards. We contributed an Ethernet redundancy mechanism to ensure high availability for network connections. As for traffic nodes (4), redundancy is at the node level, allowing us to add and remove traffic nodes transparently and without service interruption. We can guarantee service availability by providing multiple instances of the application running on multiple redundant traffic nodes. The HAS architecture supports storage redundancy (5) through a customized HA implementation of the NFS server; alternatively, we can also use redundant specialized storage nodes.
[Figure 44: The redundancy supported at the different layers of the HAS architecture: (1) the cluster virtual IP interface to the outside world, (2) redundant master nodes, (3) redundant network connections, (4) redundant traffic nodes, and (5) storage redundancy, including NFS redundancy and RAID 5, provided by Storage Node A and Storage Node B]
The following sub-sections discuss eliminating SPOF at each of the HAS architecture layers.
4.8.1 The 1+1 Redundancy Model
There are two types of the 1+1 redundancy model (also called two-node redundancy model): the
active/standby, which is also called the asymmetric model, and the active/active or the symmetric
redundancy model [115]. With the 1+1 active/standby redundancy model, one cluster node is active
performing critical work, while the other node is a dedicated standby, ready to take over should the
active master node fails. In the 1+1 active/active redundancy model, both nodes are active and doing
critical work. In the event that either node should fail, the survivor node steps in to service the load of
the failed node until the first node is back to service.
[Figure: The 1+1 active/standby redundancy model: the active and standby master nodes exchange heartbeat messages and connect to the shared network storage over dual redundant data paths; the standby node's data path is physically connected but not logically in use, and clients reach the pair over the public network]
[Figure 47: The 1+1 active/standby pair after failover: the master node that was standby is now active, and its data path to the shared network storage is physically connected and in use due to the failure of the other master node]
Figure 47 illustrates the 1+1 active/standby pair after the failover has completed. The active/standby
redundancy model supports connection synchronization between the two master nodes. Section 4.22
discusses connection synchronization.
The HA tier can transition from the 1+1 active/standby to the 1+1 active/active redundancy model
through the redundancy configuration manager, which is responsible for switching from one
redundancy model to another. The 1+1 active/standby redundancy model provides high availability;
however, it requires a master node to sit idle waiting for the active node to fail so it can take over. The
active/standby model leads to a waste of resources and limits the capacity of the HA tier.
The 1+1 active/active redundancy model, discussed in the following section, addresses this problem
by allowing the two master nodes to be active and to serve incoming requests for the same virtual
service.
[Figure: The 1+1 active/active redundancy model: both master nodes are active, exchange heartbeat messages, and each node's data path to the shared network storage is physically connected and in use, providing a redundant data path for the other master node; clients reach the pair over the public network]
4.10 SSA Tier Redundancy Models
The SSA tier supports the N+M and the N-way redundancy models. In the N+M model, N is the
number of active traffic nodes hosting the active web server application, and M is the number of
standby traffic nodes. When M=0, it is the N-way redundancy model where all traffic nodes are
active. Following the N-way redundancy model, traffic nodes operate without standby nodes. Upon
the failure of an active traffic node, the traffic manager running on the master node removes the failed
traffic node from its list of available traffic nodes (Section 4.27.9) and redirects traffic to available
traffic nodes.
Figure 49 illustrates the N-way redundancy model. The state information of the web server
application running on the active nodes is saved (1) on the HA shared storage (2). When the
application running on the active node fails, the application on the standby node accesses the saved
state information on the shared storage (3) and provides continuous service.
[Figure 49: State information is saved (1) by the active process on the HA shared storage (2); after a failure, the process on another node accesses the saved state (3) from the shared storage]
In Figure 50, the state information of the application running on the active node is saved (1) on the
HA shared storage.
Figure 50: The N+M redundancy model with support for state replication
In Figure 51, when the application running on the active node fails, the application on the standby
node accesses the saved state info on the shared storage and provides a continuous service.
Figure 51: The N+M redundancy model, after the failure of an active node
Supporting applications that require maintaining state information is not within the scope of this dissertation. However, the WebBench tool provides dynamic test suites. Therefore, in our testing, we used a combination of both static (STATIC.TST) and dynamic (WBSSL.TST) test suites. The static test suite contains over 6,200 static pages and executable test files. The dynamic test suites use applications that run on the server and require maintaining state. Although supporting applications with state is not in the scope of the work, the HAS architecture still handles them.
4.11 Storage Tier Redundancy Models
Although storage is outside the scope of our work, the redundancy models of the storage tier depend
on the physical storage model described in Section 4.16.
Figure 52: the redundancy models at the physical level of the HAS architecture
Master nodes follow the 1+1 redundancy model. The HA tier hosts two master nodes that interact
with each other following the active/standby model or the active/active (load sharing) model. Traffic
nodes follow one of two redundancy models: N+M (N active and M standby) or N-way (all nodes are
active). In the N+M active/standby redundancy model, N is the number of active traffic nodes
available to service requests. We need at least two active traffic nodes, N ≥ 2. M is the number of
standby traffic nodes, available to replace an active traffic node as soon as it becomes unavailable.
The N-way redundancy model is the N+M redundancy model with M = 0. In the N-way redundancy
model, all traffic nodes are in the active mode and servicing requests with no standby traffic nodes.
When a traffic node becomes unavailable, the traffic manager stops sending traffic to the unavailable
node and redistributes incoming traffic among the remaining available traffic nodes. However, when standby nodes are available, the throughput of the cluster does not suffer from the loss of a traffic node, since a standby node takes over from the unavailable traffic node.
As for the storage tier, the redundancy model depends on various possibilities ranging from hosting
data on the master nodes to having separate and redundant nodes that are responsible for providing
storage to the cluster. The redundancy configuration manager is responsible for switching from one
redundancy model to another. For the purpose of the work, we did not implement the redundancy
configuration manager (Section 6.2.4). Rather, we relied on re-starting the cluster nodes with
modified configuration files when we wanted to experiment with a different redundancy model.
Table 8 provides a summary of the possible redundancy models. The HAS architecture allows the
support for all redundancy models and supporting them is an implementation issue.
Storage tier: 1+1 active/active, 1+1 active/standby, N+M, N-way, and no redundancy
Table 8: The possible redundancy models per each tier of the HAS architecture
Table 9 illustrates the implemented redundancy models for the HAS architecture prototype. At the
HA tier, both the 1+1 active/standby and the 1+1 active/active redundancy models are supported. At
the SSA tier, the HAS architecture supports the N-way redundancy model. The storage tier supports
the 1+1 active/active redundancy model.
HA tier: 1+1 active/active, 1+1 active/standby, and no redundancy (one master node)
SSA tier: N-way and no redundancy (one traffic node)
Storage tier: 1+1 active/active and no redundancy (one NFS server)
Table 9: The supported redundancy models per each tier in the HAS architecture prototype
[Figure 53: The state diagram of a HAS cluster node. From the Active state (accepting and servicing traffic), a node that encounters problems stops accepting new requests (In-Transition) and moves to the Out of Cluster state, where it neither serves traffic nor provides services to cluster nodes; once the problem is fixed, the node re-joins the cluster and transitions back through the In-Transition state to Active.]
Figure 54 represents the state diagram after we expand it to include the standby state, in which the node is not currently providing service but is prepared to take over the active state. This scenario is only applicable to nodes in the HA tier, which supports the 1+1 active/standby redundancy model. When the node is in the active, in-transition, or standby state and it encounters software or hardware problems, it becomes unstable and will no longer be a member of the HAS cluster. Its state becomes out-of-cluster and it is no longer available to service traffic. If the transition is from active to standby, the node stops receiving new requests and providing services, but keeps providing service to ongoing requests until their termination, when possible; otherwise, ongoing requests are terminated. The system software components that manage the state transitions are the traffic manager and the heartbeat daemon running on the master nodes, and the traffic client and the LDirectord running on the traffic nodes.
[Figure 54: The expanded state diagram of a HAS cluster node, including the Standby state; an error in the Active, Standby, or In-Transition state takes the node out of the cluster, and the In-Transition state covers both switching from standby to active and ceasing to accept new requests]
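A minimal sketch of this state machine (the state and event names follow the figures, while the dictionary encoding is our illustration, not the prototype's implementation):

    # Allowed transitions of a HAS cluster node (illustrative encoding of Figures 53 and 54).
    TRANSITIONS = {
        ("active", "problem detected"): "in-transition",        # stop accepting new requests
        ("in-transition", "left cluster"): "out-of-cluster",
        ("out-of-cluster", "problem fixed"): "in-transition",   # re-joining the cluster
        ("in-transition", "switch complete"): "active",
        ("standby", "active node failed"): "in-transition",     # switching from standby to active
        ("active", "error"): "out-of-cluster",
        ("standby", "error"): "out-of-cluster",
        ("in-transition", "error"): "out-of-cluster",
    }

    def next_state(state, event):
        # Return the next node state; an event that does not apply leaves the state unchanged.
        return TRANSITIONS.get((state, event), state)

    assert next_state("active", "error") == "out-of-cluster"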
[Figure: The HA and SSA tiers: master nodes and DFS nodes exchange heartbeat messages and synchronize data, while traffic nodes with local disks connect through LAN 1 and LAN 2 behind the cluster virtual IP]
Figure 56 illustrates the hardware configuration of an HA-OSCAR cluster [117]. The system consists
of a primary server, a standby server, two LAN connections, and multiple compute clients, where all
the compute clients have homogeneous hardware.
Figure 56: The HA-OSCAR prototype with dual active/standby head nodes
A server is responsible for serving user requests and distributing the requests to specified clients. A compute client is dedicated to computation [118]. Each server has three network interface cards: one interface card connects to the Internet through a public network address, and the other two connect to a private LAN, which consists of a primary Ethernet LAN and a standby LAN. Each LAN consists of network interface cards, a switch, and network wires, and provides communication between servers and clients, and between the primary server and the standby server. The primary server provides the services and processes all user requests. The standby server activates its services and waits to take over from the primary server when a failure is detected. Heartbeat messages are transmitted periodically across the Ethernet LAN between the two servers and act as a health check of the primary server. When a primary server failure occurs, the heartbeat detection on the standby server no longer receives any response message from the primary server. After a prescribed time, the standby server takes over the alias IP address of the primary server, and control of the cluster transfers from the primary server to the standby server. User requests are processed on the standby server from then on. From the user's point of view, the transfer is almost seamless except for the short prescribed time. The failed primary server is repaired after the standby server takes over control. Once the repair is completed, the primary server activates its services, takes over the alias IP address, and begins to process user requests. The standby server releases its alias IP address and goes back to its initial state.
At a regular interval, the running server polls all the LAN components specified in the cluster
configuration file, including the primary LAN cards, the standby LAN cards, and the switches.
Network connection failures are detected in the following manner. The standby LAN interface is assigned to be the poller. The polling interface sends packet messages to all other interfaces on the LAN and receives packets back from all other interfaces on the LAN. If an interface cannot receive or send a message, the count of packets sent and received on that interface does not increment for a certain amount of time, at which point the interface is considered down. When the primary LAN goes down, the standby LAN takes over the network connection. When the primary LAN is repaired, it takes back the connection from the standby LAN.
Whenever a client fails while the cluster is in operation, the cluster undergoes a reconfiguration to
remove the corresponding failing client node or to admit a new node into the cluster. This process is
referred to as a cluster state transition. The HA-OSCAR cluster uses a quorum voting scheme to maintain the system performance requirement, where the quorum Q is the minimum number of functioning clients required for the HPC system. Consider a system with N clients, and assume that each client is assigned one vote. The minimum number of votes required for a quorum is given by (N+2)/2 [119].
Whenever the total number of votes contributed by all the functioning clients falls below the quorum value, the system suspends operation. Upon the availability of a sufficient number of clients to satisfy the quorum, a cluster resumption process takes place and brings the system back to an operational state.
The HAS architecture requires the availability of two routers (or switches) to provide a highly
available and reliable communication path.
An image server is a machine that holds the operating system and ramdisk images of the cluster nodes. This machine (or two of them for redundancy purposes) is responsible for propagating the images over the network to the cluster nodes every time there is an upgrade or a new node joining the cluster. Master nodes can provide the functionalities of the image server; however, for large deployments this might slow down the performance of the master nodes. Image servers are external and optional components of the architecture.
[Figure: A sample HAS deployment: users reach the cluster through the cluster virtual IP, master nodes A and B distribute traffic to four traffic nodes over LAN 1 and LAN 2, two storage nodes provide shared storage, and an image server is attached to the cluster]
We divide the cluster components into the following functional units: master nodes, traffic nodes, storage nodes, local networks, external networks, network paths, and routers.
The m master nodes form the HA tier of the HAS architecture and implement the 1+1 redundancy model. The number of master nodes is m = 2. These nodes can run in active/standby or active/active mode. When m > 2, the redundancy model becomes the N+M model; however, we did not implement this redundancy model in the HAS prototype. The t traffic nodes are located in the SSA tier, where t ≥ 2. If t = 1, then the single traffic node constitutes a SPOF. Let s denote the number of storage nodes. If s = 0, then the cluster does not include specialized storage nodes; instead, the master nodes in the HA tier provide shared storage using a highly available distributed file system. The HA file system uses the disk space available on the master nodes to host application data. When s ≥ 2, at least two specialized nodes provide storage. When s ≥ 2, we introduce the notion of d shared disks, where d ≥ 2 × s; d is the total number of shared disks in the cluster. The l local networks provide connectivity between cluster nodes. For redundancy purposes, l ≥ 2 so that redundant network paths are available. However, this depends on two parameters: the number of routers r available (r ≥ 2, one router per network path) and the number of network interfaces eth available on each node (one eth interface per network path). The HAS architecture requires a minimum of two Ethernet cards per cluster node; therefore eth ≥ 2. The cluster can be connected to outside networks, identified as e, where e ≥ 1 to reflect that the cluster is connected to at least one external network.
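The following short sketch (ours; the function and parameter names are illustrative, not part of the prototype) restates these structural constraints as executable checks:

def validate_has_config(m, t, s, d, l, r, eth, e):
    """Check the structural constraints on a HAS cluster configuration:
    m master nodes, t traffic nodes, s storage nodes, d shared disks,
    l local networks, r routers, eth Ethernet interfaces per node,
    e external networks."""
    assert m == 2, "the prototype implements the 1+1 model with two master nodes"
    assert t >= 2, "a single traffic node would constitute a SPOF"
    assert s == 0 or s >= 2, "storage nodes, when present, must be redundant"
    if s >= 2:
        assert d >= 2 * s, "at least two shared disks per storage node"
    assert l >= 2 and r >= 2 and eth >= 2, "redundant network paths are required"
    assert e >= 1, "the cluster connects to at least one external network"

# Example: a configuration similar to the 18-node prototype of Chapter 5,
# where the master nodes provide the shared storage (s = 0).
validate_has_config(m=2, t=16, s=0, d=0, l=2, r=2, eth=2, e=1)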
[Figure: The HAS storage model using the local disks of the traffic nodes; the HA tier (master nodes A and B, exchanging heartbeat) and the SSA tier (traffic nodes with local disks) are connected over LAN 1 and LAN 2]
However, it is worth mentioning that other research projects (Section 2.10.5) have adopted this model as their preferred way of handling data and dividing it across multiple traffic nodes [82]. In their architectures, a traffic node receives a connection only if it stores the requested data locally.
Figure 59: The HAS storage model using a distributed file system
Figure 60 illustrates how the HAS architecture achieves NFS server redundancy. In Figure 60-A, master-a is the name of the Master Node A server, and master-b is the name of the Master Node B server. Both master nodes run the modified HA version of the network file system server. Using the modified mount program, we mount a common storage repository on both master nodes:
% mount -t nfs master-a,master-b:/mnt/CommonNFS
[Figure 60: The two master nodes (master-a as the primary NFS server, master-b as the secondary NFS server) on LAN 1 and LAN 2. (Figure 60-A) Storage view from outside the cluster: one storage repository. (Figure 60-B) Changes in content are synchronized with the rsync utility.]
When the rsync utility detects a change in the contents, it performs the synchronization to ensure that
both repositories are identical. If the NFS server on master-a becomes unavailable, data requests to
/mnt/CommonNFS will not be disturbed because the secondary NFS server on master-b is still running
and hosting the /mnt/CommonNFS network file system.
The rsync utility is open source software that provides incremental file transfer between two sets of files across a network connection, using an efficient checksum-search algorithm [120]. It brings remote files into sync by sending just the differences in the files across the network link [121].
The rsync utility can update whole directory trees and file systems, preserves symbolic links, hard
links, file ownership, permissions, devices and times, and uses pipelining of file transfers to minimize
latency costs. It uses ssh or rsh for communication, and can run in daemon mode, listening on a
socket, which is used for public file distribution.
We used the rsync utility to synchronize data on both NFS servers running on the two master nodes in
the HA tier of the HAS architecture.
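As an illustration only (the exact synchronization job of the prototype is not reproduced here; the interval and rsync options are assumptions), a periodic rsync pass from the active to the standby repository could look as follows:

import subprocess
import time

# Assumed values for illustration: the shared repository path follows the
# /mnt/CommonNFS example above, and master-b is the peer master node.
SRC = "/mnt/CommonNFS/"
DEST = "master-b:/mnt/CommonNFS/"
SYNC_INTERVAL = 30  # seconds between synchronization passes

while True:
    # -a preserves permissions, ownership, times, devices, and symbolic links;
    # --delete removes files that no longer exist in the source repository.
    subprocess.run(["rsync", "-a", "--delete", SRC, DEST], check=False)
    time.sleep(SYNC_INTERVAL)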
Figure 61: DRBD disk replication for two nodes in the 1+1 active/standby redundancy model
The DRBD utility provides intelligent resynchronization as it only resynchronizes those parts of the
device that have changed, which results in less synchronization time. It grants read-write access only
to one node at a time, which is sufficient for the usual fail-over HA cluster.
The drawback of the DRBD approach is that it does not work when we have two active nodes, because multiple writes may target the same block. If more than one node concurrently modifies the distributed devices, we face the problem of deciding which part of the device is up to date on which node, and which blocks need to be resynchronized in which direction.
[Figure: A HAS cluster with a dedicated storage tier: the HA tier (master nodes A and B), the SSA tier (traffic nodes 1 through 4), and the storage tier (specialized storage nodes 1 and 2), connected over LAN 1 and LAN 2 behind the cluster virtual IP]
Figure 63 illustrates the software and hardware stack of a master node in the HAS architecture.
[Figure 63: The master node software and hardware stack: the master node system software runs over the interconnect protocol (IPv4 and IPv6), the interconnect technology (Ethernet, TCP/IP/UDP), and the processors]
Master nodes provide an IP layer abstraction hiding all cluster nodes and provide transparency
towards the end user. Master nodes have a direct connection to external networks. They do not run
server applications; instead, they receive incoming traffic through the cluster virtual IP interface and
distribute it to the traffic nodes using the traffic manager and a dynamic distribution mechanism
(Section 4.23). Master nodes provide cluster-wide services for the traffic nodes such as DHCP server,
IPv6 router advertisement, time synchronization, image server, and network file server. Master nodes
run a redundant and synchronized copy of the DHCP server; DHCP is a communications protocol that allows network administrators to centrally manage and automate the assignment of IP addresses. The
configuration files of this service are available on the HA shared storage. The router advertisement
daemon (radvd) [123] runs on the master nodes and sends router advertisement messages to the local
Ethernet LANs periodically and when requested by a node sending a router solicitation message.
These messages are specified by RFC 2461 [124], Neighbor Discovery for IP Version 6, and are
required for IPv6 stateless autoconfiguration. The time synchronization server, running on the master
nodes, is responsible for maintaining a synchronized system time. In addition, master nodes provide
the functionalities of an image server.
When the SSA tier consists of diskless traffic nodes, there is a need for an image server to provide
operating system images, application images, and configuration files. The image server propagates
this data to each node in the cluster and solves the problem of coordinating operating system and
application patches by putting in place and enforcing policies that allow operating system and
software installation and upgrade on multiple machines in a synchronized and coordinated fashion.
Master nodes can optionally provide this service. In addition, master nodes provide shared storage via
a modified, highly available version of the network file server. We also prototyped a modified mount program to allow master nodes to mount multiple servers over the same mount point. Master nodes can optionally provide this service.
[Figure: The traffic node software and hardware stack: the TCD, LDirectord, and Ethd daemons run over the interconnect protocol (IPv4 and IPv6), the interconnect technology (Ethernet, TCP/IP/UDP), and the processors]
Traffic nodes run the Apache web server application. They reply to incoming requests forwarded to them by the traffic manager running on the master nodes. Each traffic node runs a copy of the traffic client, LDirectord, and the Ethernet redundancy daemon.
Traffic nodes rely on cluster storage to access application data and configuration files, as well as for cluster services such as DHCP, FTP, NTP, and NFS. Traffic nodes have the option to boot from the local disk (available for nodes with disks), the network (two networks for redundancy purposes), flash disk (for CompactPCI architectures), CDROM, DVDROM, or floppy. The default booting method is through the network. Traffic nodes also run the NTP client daemon, which continually keeps the system time in step with the master nodes. With the HAS prototype, we experimented with booting traffic nodes from the local disk, the network, and the flash disk, which we mostly used for troubleshooting purposes.
4.17.3 Storage Nodes
Cluster storage nodes provide storage that is accessible to all cluster nodes. Section 4.16 presents the
physical storage model of the HAS architecture.
Figure 65: The redundant LAN connections within the HAS architecture
This connectivity model ensures highly available access to the network and prevents the network from being a SPOF. The HAS architecture supports both the IPv4 and IPv6 Internet Protocols. Supporting IPv4 does not imply additional implementation considerations. However, supporting IPv6 requires a router advertisement daemon that is responsible for the automatic configuration of IPv6 Ethernet interfaces. The router advertisement daemon also acts as an IPv6 router: it sends router advertisement messages, specified by RFC 2461 [124], to a local Ethernet LAN periodically and when requested by a node sending a router solicitation message. These messages are required for IPv6 stateless autoconfiguration. As a result, in the event that we need to reconfigure the network addressing of the cluster nodes, this is achievable in a transparent fashion and without disturbing the service provided to end users.
[Figure: Master node 1 and master node 2 connected through redundant paths (1 and 2) and a router, over which they exchange heartbeat messages]
With heartbeat, master nodes are able to coordinate their role (active and standby) and track their
availability. Heartbeat discussions are presented in [20], [127], and [128].
on the Linux Director daemon (LDirectord) to monitor the health of the applications running on the
traffic nodes. Each traffic node runs a copy of the LDirectord daemon.
The LDirectord daemon performs a connect check of the services on the traffic nodes by connecting to them and making an HTTP request to the communication port where the service is running. This check ensures that it can open a connection to the web server application. When the application check fails, LDirectord connects to the traffic manager and sets the load index of that specific traffic node to zero. As a result, existing connections to the traffic node may continue; however, the traffic manager stops forwarding new connections to it. Section 4.27.12 discusses this scenario. This method is also useful for gracefully taking a traffic node offline.
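The following minimal sketch (our own illustration, not LDirectord's actual implementation) shows the idea behind such a negotiate check: open a connection to the service port, request a known page, and verify that the expected string comes back, mirroring the request and receive directives of the configuration file shown below.

import http.client

def negotiate_check(host, port=80, request="index.html",
                    receive="Home Page", timeout=10.0):
    """Return True if the web server on the traffic node answers the check
    request and the reply contains the expected string."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", "/" + request)
        body = conn.getresponse().read().decode(errors="replace")
        conn.close()
        return receive in body
    except (OSError, http.client.HTTPException):
        return False

# Example: when the check fails, the traffic node's load index is set to zero
# so that no new connections are forwarded to it.
if not negotiate_check("142.133.69.33"):
    print("check failed: set this traffic node's load index to zero")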
The LDirectord module loads its configuration from the ldirectord.cf configuration file, which
contains the configuration options. An example configuration file is presented below. It corresponds
to a virtual web server available at address 192.68.69.30 on port 80, with round robin distribution
between the two nodes: 142.133.69.33 and 142.133.69.34.
# Global Directives
checktimeout=10
checkinterval=2
autoreload=no
logfile="local0"
quiescent=yes

# Virtual Server for HTTP
virtual=192.68.69.30:80
        fallback=127.0.0.1:80
        real=142.133.69.33:80 masq
        real=142.133.69.34:80 masq
        service=http
        request="index.html"
        receive="Home Page"
        scheduler=rr
        protocol=tcp
        checktype=negotiate
Once the LDirectord module starts, the virtual server table in the kernel is populated. The listing below uses the ipvsadm command-line tool, which is used to set up, maintain, or inspect the virtual server table in the Linux kernel, to display that table. The listing shows the virtual service, with the virtual address on port 80 and the two hosts providing this virtual service.
% ipvsadm -L -n
IP Virtual Server version 1.0.7 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 192.68.69.30:80 rr
-> 142.133.69.33:80 Masq 1 0 0
-> 142.133.69.34:80 Masq 1 0 0
-> 127.0.0.1:80 Local 0 0 0
By default, the LDirectord module uses the quiescent feature to add and remove traffic nodes. When a traffic node is to be removed from the virtual service, its weight is set to zero and it remains part of the virtual service. As such, existing connections to the traffic node may continue, but the traffic node is not allocated any new connections. This mechanism is particularly useful for gracefully taking real servers offline. This behavior can be changed to remove the real server from the virtual service by setting the global configuration option quiescent=no.
network terminations. The following sub-sections present the CVIP framework and discuss the
architecture and the various concepts.
Figure 68 illustrates the level of distribution within the HAS cluster using the CVIP as the interface towards the outside networks. There are two distribution points: the network termination, which distributes packets based on IP address to the correct master (or front-end) node, and the traffic manager, which distributes network connections to the applications on the traffic nodes. Section 4.21.1.1 discusses the network termination concept.
[Figure: Distribution of IP packets from the network terminations to the processors running the Apache web server (or application) software]
[Figure: The CVIP as seen from the network terminations. All Linux processors have their own IP address. In the 1+1 active/standby model, one front-end (master) node owns the virtual IP address; in the 1+1 active/active model, each master node claims to be an IP router for the CVIP address. More front ends can be added at runtime, and the OSPF protocol is used to monitor the router links.]
[Figure: Traffic nodes running HTTPD and the TCD, reached from the Intra/Internet through the CVIP]
4.21.3.2 Scalable
The CVIP offers a unique scalability advantage. We can increase the number of network terminations, master nodes, or traffic nodes independently and without affecting how the cluster is presented to the outside world.
With the CVIP, we can cluster multiple servers to use the same virtual IP address and port numbers over a number of processors to share the load. As we add new nodes to the HA and SSA tiers, we increase the capacity of the system and its scalability. The number of clients or servers using the virtual IP address is not limited; the framework is scalable, and we can add more servers to increase the system capacity. In addition, although we have only presented HTTP servers, the applications on top may include any server application that runs over IP, such as an FTP server for file transfer.
4.21.3.4 Availability
Since the CVIP is supported by multiple servers, it does not constitute a SPOF. In the HAS architecture prototype, the CVIP was provided by the two master nodes in the HA tier. If one master node crashes, the web clients and web servers are not affected.
4.21.3.6 Support for multiple application servers
Since the CVIP interface operates at the IP level and is transparent to the application servers running on the traffic nodes, it is independent of the type of traffic it accepts and forwards. As a result, with the CVIP, the HAS architecture supports all types of application servers that work at the IP level.
cluster can minimize, and even eliminate, lost connections caused by the failure of the active master node. When the information about ongoing connections is synchronized between the master nodes, then if the standby master node becomes the active master node, it retains the information about the currently established and active connections, and as a result the new active master node continues to forward their packets to the traffic nodes in the SSA tier.
[Figure: Step 1 of connection synchronization: a web user opens connection-1; the active master node A (sync-master) forwards it to a traffic node, and connection-1 is synchronized to the standby master node]
In step 2 (Figure 72), a fail-over occurs and the master node B becomes the active master node.
Connection-1 is able to continue because the connection synchronization took place in step 1.
The master/slave implementation of the connection synchronization works with two master nodes: the
active master node sends synchronization information for connections to the standby master node,
and the standby master node receives the information and updates its connection table accordingly.
The synchronization of a connection takes place when the number of packets passes a predefined
threshold and then at a certain configurable frequency of packets. The synchronization information
for the connections is added to a queue and periodically flushed. The synchronization information for
up to 50 connections can be packed into a single packet that is sent to the standby master node using
multicast. A kernel thread, started through an init script, is responsible for sending and receiving
synchronization information between the active and standby master nodes.
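As a rough sketch of this mechanism (ours; the multicast group, port, and record format are assumptions rather than the prototype's actual wire format), queued connection records could be packed and multicast to the standby master node as follows:

import socket
import struct

SYNC_GROUP = "224.0.0.81"     # assumed multicast group for sync messages
SYNC_PORT = 8848              # assumed port
MAX_CONNS_PER_PACKET = 50     # up to 50 connection records per packet

def flush_sync_queue(queue):
    """Pack and send the queued connection records to the standby master node."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    while queue:
        batch, queue = queue[:MAX_CONNS_PER_PACKET], queue[MAX_CONNS_PER_PACKET:]
        # Each record: client IP and port, traffic node IP and port.
        payload = b"".join(
            socket.inet_aton(cip) + struct.pack("!H", cport) +
            socket.inet_aton(rip) + struct.pack("!H", rport)
            for cip, cport, rip, rport in batch)
        sock.sendto(payload, (SYNC_GROUP, SYNC_PORT))
    sock.close()

# Example: two established connections queued for synchronization.
flush_sync_queue([("10.0.0.7", 45120, "142.133.69.33", 80),
                  ("10.0.0.9", 51200, "142.133.69.34", 80)])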
the current standby node (previously active) becomes active again. To illustrate this drawback, we
continue discussing the example of connection synchronization from the previous section.
In Step 3 (Figure 73), a web user opens connection-2. Master node B receives this connection, and
forwards it to a traffic node. Connection synchronization does not take place because master node B
is a sync-slave.
[Figure: Step 3: a web user opens connection-2; the active master node B (sync-slave) forwards it to a traffic node, and no connection synchronization takes place]
In step 4 (Figure 74), another fail-over takes place and master node A is again the active master node.
Connection-2 is unable to continue because it was not synchronized.
[Figure: Master node A, active again as sync-master, forwards connections to the traffic nodes and synchronizes them to the standby master node]
Our survey of similar work (Sections 2.9 and 2.10) identified that the added performance from complex algorithms is negligible. The recommendation was to focus on a distribution algorithm that is uncomplicated, has low overhead, and minimizes serialized computing steps to allow for faster execution.
Scalable web server clusters require three core components: a scheduling mechanism, a scheduling
algorithm, and an executor. The scheduling mechanism directs clients’ requests to the best web
server. The scheduling algorithm defines the best web server to handle the specific request. The
executor carries out the scheduling algorithm using the scheduling mechanism. The following sub-sections present these three core components in the HAS architecture.
a configuration file that lists the addresses of all traffic nodes, the traffic distribution policy, the communication port, the timeout limit, and the addresses of the master nodes.
The /proc file system is a real-time, memory-resident file system that tracks the processes running on the machine and the state of the system, and maintains highly dynamic data on the state of the operating system. The information in the /proc file system is continuously updated to match the current state of the operating system. The contents of the /proc file system are used by many utilities, which read the data from a particular /proc entry and display it.
The traffic client uses two parameters from /proc to compute the load_index of the traffic node:
the processor speed and free memory. The /proc/cpuinfo file provides information about the
processor, such as its type, make, model, cache size, and processor speed in BogoMIPS [128]. The
BogoMIPS parameter is an internal representation of the processor speed in the Linux kernel.
Figure 76 illustrates the contents of the /proc/cpuinfo file at a given moment in time and
highlights the BogoMIPS parameter used to compute the load_index of the traffic node. The
processor speed is a constant parameter; therefore, we only read the /proc/cpuinfo file once when
the TC starts.
% more /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 1.70GHz
stepping : 6
cpu MHz : 598.186
cache size : 2048 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr mce cx8 sep mtrr pge mca
cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe est tm2
bogomips : 1185.43
The /proc/meminfo file reports a large amount of valuable information about RAM usage. It describes the current state of physical RAM in the system, including a full breakdown of total, used, free, shared, buffered, and cached memory utilization in kilobytes, in addition to information on swap space.
Figure 77 illustrates the contents of the /proc/meminfo file at a given moment in time and highlights the MemFree parameter used to compute the load_index of the traffic node. Since MemFree is a dynamic parameter, it is read from the /proc/meminfo file every time the TC calculates the load_index.
% more /proc/meminfo
MemTotal: 775116 kB
MemFree: 6880 kB
Buffers: 98748 kB
Cached: 305572 kB
SwapCached: 2780 kB
Active: 300348 kB
Inactive: 286064 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 775116 kB
LowFree: 6880 kB
SwapTotal: 1044184 kB
SwapFree: 1040300 kB
Dirty: 16 kB
Writeback: 0 kB
Mapped: 237756 kB
Slab: 171064 kB
Committed_AS: 403120 kB
PageTables: 1768 kB
VmallocTotal: 245752 kB
VmallocUsed: 11892 kB
VmallocChunk: 232352 kB
HugePages_Total: 0
HugePages_Free: 0
# TC Configuration File
# List of master nodes to which the TC daemon reports load
master1 <IP Address of Master Node 1>
master2 <IP Address of Master Node 2>

# Port to connect to at the master - this port number can be anywhere
# between 1024 and 49151. We can also use ports 49152 through 65535
port <port_number>

# Frequency of load updates in ms.
updates <frequency_of_updates>

# Reporting errors -- needed for troubleshooting purposes
ErrorLog <full_path_to_error_log_file>

# Amount of RAM in the node with the least RAM in the cluster
RAM <num_of_ram>
Load_a = (3,358.72 × 524,288) / 262,144 ≈ 6,717
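The sketch below (ours) reads the two /proc values and applies the load formula implied by the worked example above, namely the BogoMIPS processor speed multiplied by the free memory and divided by the RAM of the least-equipped node in the cluster:

def read_bogomips(path="/proc/cpuinfo"):
    """Read the processor speed (BogoMIPS); done once when the TC starts."""
    with open(path) as f:
        for line in f:
            if line.lower().startswith("bogomips"):
                return float(line.split(":")[1])
    raise RuntimeError("bogomips entry not found")

def read_memfree_kb(path="/proc/meminfo"):
    """Read the current amount of free memory (MemFree, in kB)."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1])
    raise RuntimeError("MemFree entry not found")

def load_index(bogomips, memfree_kb, min_ram_kb):
    """Load index as implied by the worked example:
    (BogoMIPS x MemFree) / RAM of the least-equipped cluster node."""
    return (bogomips * memfree_kb) / min_ram_kb

# Reproducing the figures above: 3,358.72 BogoMIPS, 524,288 kB free memory,
# and 262,144 kB of RAM on the smallest node give a load index of about 6,717.
print(load_index(3358.72, 524288, 262144))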
The traffic manager maintains a list of nodes and their loads. Figure 78 illustrates a case example of a
HAS cluster that consists of eight traffic nodes.
Figure 78: Example list of traffic nodes and their load index
When the traffic manager receives an incoming request, it examines the list of nodes and forwards the
request to the least busy node on the list. The list of traffic nodes is a sorted linked list that allows us
to maintain an ordered list of nodes without having to know ahead of time how many nodes we will
be adding. To build this data structure, we used two class modules: one for the list head and another
for the items in the list. The list is a sorted linked list; as we add nodes into the list, the code finds the
correct place to insert them and adjusts the links around the new nodes accordingly.
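A minimal sketch of this data structure (ours, not the prototype's class modules) keeps the traffic nodes ordered by load index, assuming, as the formula above suggests, that a higher load index indicates more spare capacity, so that the least busy node is always at the head of the list:

class TrafficNodeItem:
    """One item in the list: a traffic node and its current load index."""
    def __init__(self, ip, load_index):
        self.ip = ip
        self.load_index = load_index
        self.next = None

class TrafficNodeList:
    """Sorted linked list of traffic nodes; the least busy node comes first."""
    def __init__(self):
        self.head = None

    def insert(self, ip, load_index):
        item = TrafficNodeItem(ip, load_index)
        # Find the correct place for the new node and adjust the links.
        if self.head is None or load_index > self.head.load_index:
            item.next, self.head = self.head, item
            return
        cur = self.head
        while cur.next is not None and cur.next.load_index >= load_index:
            cur = cur.next
        item.next, cur.next = cur.next, item

    def least_busy(self):
        return None if self.head is None else self.head.ip

# Example: the traffic manager forwards the next request to the head of the list.
nodes = TrafficNodeList()
nodes.insert("142.133.69.33", 6717)
nodes.insert("142.133.69.34", 4210)
print(nodes.least_busy())   # 142.133.69.33, the node with the most spare capacity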
Figure 79: Illustration of the interaction between the traffic client and the traffic manager
(1) The traffic client reads the /proc file system and retrieves the BogoMIPS processor speed from
/proc/cpuinfo, and then retrieves the amount of free memory from /proc/meminfo.
(2) The traffic client computes the node load index based on the formula in Section 4.23.6.
(3) The traffic client reports the load index to the traffic manager as a string that consists of pair
parameters: the traffic node IP address and the load index (traffic_node_IP, load_index).
(4) The traffic manager receives the load index and updates its internal list of traffic nodes to reflect
the new load_index of the traffic_node_IP.
(5) For illustration purposes, we assume that an incoming request reaches the virtual interface.
(6) The routed daemon forwards the request to the traffic manager. The traffic manager examines the list of traffic nodes and chooses a traffic node as the target for this request.
(7) The traffic manager forwards the request to the traffic node.
(8) The web server running on the traffic node receives the request and retrieves the requested document from the distributed storage.
(9) The web server sends the requested document directly to the web user.
This scenario also illustrates several drawbacks, such as the number of steps involved and the communication overhead between the system software components. One item of future work is to reduce the number of system software components and, consequently, the communication overhead.
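As a rough sketch of the reporting path in steps (1) through (4) above (ours; the message format, the use of UDP, and the example addresses are assumptions, and in the prototype these values come from the TC configuration file), the traffic client could periodically push its load index to both master nodes:

import socket
import time

MASTERS = [("192.0.2.1", 9000), ("192.0.2.2", 9000)]   # placeholder master addresses
NODE_IP = "142.133.69.33"
UPDATE_INTERVAL = 1.0    # seconds between load updates (assumed)

def report_load(load_index):
    """Send the (traffic_node_IP, load_index) pair to both master nodes."""
    msg = "{},{:.0f}".format(NODE_IP, load_index).encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for master in MASTERS:
        sock.sendto(msg, master)
    sock.close()

# Example loop, reusing read_bogomips(), read_memfree_kb(), and load_index()
# from the earlier sketch:
# while True:
#     report_load(load_index(read_bogomips(), read_memfree_kb(), 262144))
#     time.sleep(UPDATE_INTERVAL)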
4.24 Access to External Networks and the Internet
The two classical methods to access external networks are the direct access method and the restricted method. In the direct access method, traffic nodes reply to web clients directly. In the restricted method, traffic nodes forward their responses to one of the master nodes, which then rewrites the response header and forwards it to the web client. With the latter method, access to external networks is restricted by the master nodes in the HA tier, which is achieved using forwarding, filtering, and masquerading mechanisms. As a result, master nodes monitor and filter all accesses to the outside world. The HAS architecture supports both methods, as the choice is independent of the architecture and depends on configuration.
Based on the survey of similar work (Sections 2.9 and 2.10), the direct access method is the most efficient traffic distribution method and helps improve system scalability. The HAS cluster prototype supports this model, and we used it to perform our benchmarking tests.
Figure 80 illustrates the scenario of direct access. When a request arrives at the cluster (1), the master node examines it (2), decides where the request should be forwarded, and forwards it (3) to the appropriate traffic node. The traffic node processes the request and replies directly (4) to the web client. The HAS architecture also supports the restricted access method, as this is implementation specific.
Figure 80: The direct routing approach – traffic nodes reply directly to web clients
Figure 81 illustrates the scenario of restricted access, which requires re-writing of IP packets. When a request arrives at the cluster (1), the master node examines it (2), decides where the request should be forwarded, rewrites the packets, and forwards them (3) to the appropriate traffic node. The traffic node processes the request and replies (4) to the master node. The master node then re-writes the packets (5) and sends the final reply to the web client (6).
Figure 81: The restricted access approach – traffic nodes reply to master nodes, who in turn
reply to the web clients
The example shown above indicates that eth0 has eth1 as its backup link. If we do not specify
parameters in the command, it defaults to the equivalent of "erd eth0 eth1". We automated this
command on system startup.
The servers in our prototype use Tulip Ethernet cards. We patched the tulip.c driver to make the MAC addresses for ports 0 and 1 identical. Alternatively, we were able to get the same result by issuing the following commands (on Linux) to set the MAC address for an Ethernet port, where <MAC_address> is the desired address:
% ifconfig eth[X] down
% ifconfig eth[X] hw ether <MAC_address>
% ifconfig eth[X] up
We also modified the source code of the Ethernet device driver tulip.c to toggle the RUNNING bit in the dev->flags variable, which allows ifconfig to report the state of an Ethernet link. The state of the RUNNING bit for the primary link is accessed by erd via the ioctl system call.
Figure 82: The dependencies and interconnections of the HAS architecture system software
4.26.2 The Saru Module and the Heartbeat Daemon
When the HA tier is in the active/active redundancy model, the saru module runs in coordination with
heartbeat on each of the master nodes. The saru module is responsible for dividing the incoming
connections between the two master nodes. The heartbeat daemon provides a mechanism to
determine which master node is available and the saru module uses this information to divide the
space of all possible incoming connections between both active master nodes.
4.26.9 The saru Module and routed Process
The saru module relies on the routed process to receive incoming traffic from the cluster virtual IP
interface.
The traffic client daemon depends on the /proc file system to retrieve memory and processor usage, which are the metrics needed to compute the load index of a traffic node.
- Traffic node becomes unavailable: In some cases, the traffic node can become unavailable
because of hardware or software error. This scenario illustrates how the cluster reacts to the
unresponsiveness of a traffic node.
- Ethernet port becomes unavailable: A cluster node can face networking problems because of Ethernet card or Ethernet driver issues. This scenario examines how a HAS cluster node reacts when it faces Ethernet problems.
- Traffic node leaving the cluster: When a traffic node is not available to serve traffic, the traffic
manager disconnects it from the cluster. This scenario illustrates how a traffic node leaves the
cluster.
- Application server process dies on a traffic node: When the application becomes unresponsive, it
stops serving traffic. This scenario examines how to recover from such a situation.
- Network becomes unavailable: This scenario presents the chain of events that takes place when the network to which the cluster is connected becomes unavailable.
Figure 83: The sequence diagram of a successful request with one active master node
Figure 84: The sequence diagram of a successful request with two active master nodes
Figure 85: A traffic node reporting its load index to the traffic manager
[Figure 85 shows master node 1, master node 2, and traffic node B, with their traffic managers (TM) and the traffic client daemon (TCD). The traffic client daemon is aware of the master nodes since the IP addresses of those nodes are provided in its configuration file. Once the load index is reported, traffic node B is added to the list of available traffic nodes, and the traffic manager starts forwarding incoming traffic to traffic node B.]
mounts other file systems, and starts the init process. The init process brings up the customized Linux
services for the node, and the node is now fully booted and all initial processes are started.
[Figure: The boot process of a diskless traffic node from the DHCP/image server; step 6 is the TFTP transfer of the diskless_node_ramdisk image]
1. Ensure that the MAC address of the NIC on the diskless node is associated with a traffic node and configured on the master nodes as a diskless traffic node. The notion of diskless is important since the traffic node will download a kernel and ramdisk image from the image server. Traffic nodes have their BIOS configured to do a network boot. When the administrator starts the traffic nodes, the PXE client that resides in the NIC ROM sends a DHCP_DISCOVER message.
2. The DHCP server, running on the master node, sends the IP address for the node with the address
of the TFTP server and the name of the PXE bootloader file that the diskless traffic node should
download.
3. The NIC PXE client then uses TFTP to download the PXE bootloader.
4. The diskless traffic node receives the kernel image (diskless_node) and boots with it.
5. Next, the diskless traffic node sends a TFTP request to download a ramdisk.
6. The image server sends the ramdisk to the diskless traffic node. The diskless traffic node
downloads the ramdisk and executes it.
When the diskless traffic node executes the ramdisk, the traffic client daemon starts and periodically reports the load of the node to the master nodes. The traffic manager, running on the master nodes, adds the traffic node to its list of available traffic nodes and starts forwarding traffic to it.
Figure 88: The boot process of a traffic node with disk – no software upgrades are performed
Figure 89 illustrates the process of upgrading the ramdisk on a traffic node. To rebuild a traffic node
or upgrade the operating system and/or the ramdisk image, we re-point the symbolic link in the
DHCP configuration to execute a specific script, which results in the desired upgrade. At boot time,
the DHCP server checks if the traffic node requires an upgrade and if so, it executes the
corresponding script.
[Figure 89: Upgrading a traffic node with disk from the DHCP/image server; step 7 is the TFTP transfer of the node_with_disk image and step 9 is the FTP transfer of the node_disk_ramdisk image]
Figure 90: The process of upgrading the kernel and application server on a traffic node
Figure 91: The sequence diagram of upgrading the hardware on a master node
[Figure: The sequence diagram of a master node failover: (1) master node 1 becomes unavailable due to a major failure; (2) heartbeat on master node 1 stops sending heartbeat messages to the heartbeat instance running on master node 2, and the timeout limit is exhausted; (3) the heartbeat instance on master node 2 declares master node 1 unavailable and makes master node 2 the primary; (4)(5) master node 2 becomes the owner of the virtual services; (6) new requests from web users arrive at master node 2.]
Figure 93 illustrates the sequence diagram of synchronizing storage when one of the master nodes
fails.
Figure 93: The NFS synchronization occurs when a master node becomes unavailable
When the initial active master node becomes available again for service, there is no need for a
switchback to active status between the two master nodes. The new master node acts as a hot standby
for the current active master node. As a future work, we would like the standby master node to switch
to the load sharing mode (1+1 active/active), helping the active master node to direct traffic to the
traffic nodes once the active master node reaches a pre-defined threshold limit. When master node 1 becomes available again (3), its NFS server is re-started; it mounts the storage and re-syncs its local content with master node 2 using the rsync utility.
When a traffic node becomes unavailable (1), the traffic client daemon (running on that node)
becomes unavailable and does not report the load index to the master nodes. As a result, the traffic
manager daemons do not receive the load index from the traffic node (2). After a timeout, the traffic
managers remove the traffic node from their list of available traffic nodes (3). However, if the traffic node becomes available again (4), the traffic client daemon reports the load index to the traffic manager running on the master node (5). When the traffic manager receives the load index from the traffic node, it is an indication that the node is up and ready to provide service. The traffic manager then adds the traffic node (6) to the list of available traffic nodes. A traffic node is declared unavailable if it does not send its load statistics to the master nodes within a specific, configurable time.
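A minimal sketch of this timeout-based bookkeeping on the traffic manager side (ours; the timeout value is an assumed configuration parameter) could look as follows:

import time

AVAILABILITY_TIMEOUT = 5.0   # seconds without a load report (assumed value)

class TrafficNodeTable:
    """Track traffic node availability from their periodic load reports."""
    def __init__(self):
        self.last_report = {}   # traffic node IP -> time of last load report
        self.load = {}          # traffic node IP -> last reported load index

    def on_load_report(self, node_ip, load_index):
        # Receiving a load index indicates the node is up and ready for service.
        self.last_report[node_ip] = time.monotonic()
        self.load[node_ip] = load_index

    def available_nodes(self):
        # Drop nodes that have not reported within the configurable timeout.
        now = time.monotonic()
        for ip in list(self.last_report):
            if now - self.last_report[ip] > AVAILABILITY_TIMEOUT:
                del self.last_report[ip]
                del self.load[ip]
        return list(self.last_report)

# Example: a node that stops reporting is removed after the timeout expires.
table = TrafficNodeTable()
table.on_load_report("142.133.69.33", 6717)
print(table.available_nodes())   # ['142.133.69.33']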
Figure 95: The scenario assumes that node C has lost network connectivity
Figure 95 illustrates a traffic node losing network connectivity. The scenario assumes that traffic node C has lost network connectivity and, as a result, is no longer a member of the HAS cluster. The traffic manager now forwards incoming traffic to the remaining traffic nodes.
[Figure: The Ethernet redundancy scenario: (1) Ethernet port 1 becomes unavailable; (2) the Ethernet redundancy daemon detects the failure of Ethernet port 1 and performs a failover to Ethernet port 2]
Figure 97: The sequence diagram of a traffic node leaving the HAS cluster
When traffic managers stop receiving messages from the traffic node reporting its load index (1)(2),
after a defined timeout, the traffic managers remove the node from the list of available traffic nodes
(3). The scenario of a traffic node leaving the HAS cluster is similar to the scenario Traffic Node
Becomes Unavailable presented in Section 4.27.9.
(4) The traffic manager updates its list of available traffic nodes and stops forwarding traffic to the
traffic node.
(5) The LDirectord needs to ensure that the traffic client does not update the load_index while the
application is not responsive. The LDirectord sets the load_index_report_flag to 0.
(6) When the load_index_report_flag = 0, the traffic client stops reporting its load to the
traffic managers.
(7) On the next loop cycle, LDirectord checks whether the application is still unresponsive. If the application is still not available, then no action is required from LDirectord.
(8) If the application check returns positive, then LDirectord connects to the traffic manager and resets the load_index_report_flag to 1. When the load_index_report_flag = 1, the traffic client resumes reporting its load to the traffic manager.
(9) The traffic client reports the new load_index that overwrites the 0 value.
(10) The traffic manager updates its list of available traffic nodes and starts forwarding traffic to
the traffic node.
4.27.13 Network Becomes Unavailable
In the event that one network becomes unavailable, the HAS cluster needs to survive such a failure
and switch traffic to the redundant available network. Figure 99 illustrates this scenario.
[Figure 99: The network failure scenario: (1) the user sends a request; (2) the request is forwarded to Ethernet port 1 of the traffic node; (3) the switch/router becomes unavailable; (4) the reply is sent; (5) a timeout occurs; (6) the reply is re-sent through switch/router 2; (7)(8) the user receives the reply]
When the router becomes unavailable (3), the reply sent through Ethernet port 1 times out (5). At this point, Ethernet port 1 uses its secondary route through router 2. We could use the heartbeat mechanism to monitor the availability of routers; however, since routers are outside our scope, we do not pursue how to use heartbeat to discover and recover from router failures.
advertisement message also includes an indication of whether the host should use a stateful address
configuration protocol.
There are two types of auto-configuration. Stateless configuration requires the receipt of router
advertisement messages. These messages include stateless address prefixes and preclude the use of a
stateful address configuration protocol. Stateful configuration uses a stateful address configuration
protocol, such as DHCPv6, to obtain addresses and other configuration options. A host uses stateful
address configuration when it receives router advertisement messages that do not include address
prefixes and require that the host use a stateful address configuration protocol. A host also uses a
stateful address configuration protocol when there are no routers present on the local link. By default,
an IPv6 host can configure a link-local address for each interface. The main idea behind IPv6
autoconfiguration is the ability of a host to auto-configure its network setting without manual
intervention.
Autoconfiguration requires the routers of the local network to run a program that answers the autoconfiguration requests of the hosts. The radvd (Router ADVertisement Daemon) provides this functionality: it listens to router solicitations and answers with router advertisements.
[Figure 100: The IPv6 autoconfiguration exchange between a traffic node and the master node (or router): (1) the node boots; (2) it generates its link-local address; (3) it sends a router solicitation message; (4) the router advertisement is returned, specifying the subnet prefix, lifetimes, and default router]
Figure 100 illustrates the process of auto-configuration. This scenario assumes that the router
advertisement daemon is started on at least one master node, and that cluster nodes support the IPv6
protocol at the operating system level, including its auto-configuration feature. The node starts (1). As
the node is booting, it generates its link local address (2). The node sends a router solicitation
message (3). The router advertisement daemon receives the router solicitation message from the
cluster node (4); it replies with the router advertisement, specifying subnet prefix, lifetimes, default
router, and all other configuration parameters. Based on the received information, the cluster node
generates its IP address (5). The last step is when the cluster node verifies the usability of the address
by performing the Duplicate Address Detection process. As a result, the cluster node has now fully
configured its Ethernet interfaces for IPv6.
[Figure: The IPv6 connectivity setup: an upstream IPv6 provider and a DNS server connected to the traffic nodes over LAN 1 and LAN 2]
Chapter 5
Architecture Validation
5.1 Introduction
The initial goal of this work was to propose an architecture that allows web clusters to scale for up to
16 nodes while maintaining the baseline performance of each individual cluster node. The validation
of the architecture is an important activity that allows us to determine and verify if the architecture
meets our initial requirements. Network and telecom equipment providers use professional services of
specialized validation test centers to test and validate their products.
This chapter presents three types of validation for the HAS architecture. The first is the validation of scalability and high availability. It presents the benchmarking results that demonstrate the ability to scale the HAS architecture to 18 nodes (2 master nodes and 16 traffic nodes) while maintaining the baseline performance across all traffic nodes. In addition, this chapter presents the results of the
HA testing to validate the HA capabilities. The second validation is the external validation by open
source projects. It describes the impact of the work on the HA-OSCAR project. The third validation is
the adoption of the architecture by the industry as the base architecture for communication platforms
that run telecom applications providing mission critical services.
returns to the client. These client machines simulate web browsers. When the server replies to a client
request, the client records information such as how long the server took and how much data it
returned and then sends a new request. When the test ends, WebBench calculates two overall server
scores, requests per second and throughput in bytes per second, as well as individual client scores.
WebBench maintains at run-time all the transaction information and uses this information to compute
the final metrics presented when the tests are completed.
Figure 102: A screen capture of the WebBench software showing 379 connected clients
Figure 102 is a screen capture from the WebBench controller that shows 379 connected clients from
the client machines that are ready to generate traffic.
The benchmarking tests took place at the Ericsson Research lab in Montréal, Canada. Although the
lab connects to the Ericsson Intranet, our LAN segment is isolated from the rest of the Ericsson
network and therefore our measurement conditions are under well-defined control.
Figure 103 illustrates the network setup in the lab. The client computers run WebBench to generate
web traffic with one computer running WebBench as the test manager. These computers connect to a
fiber capable Cisco switch (2) through 100 MB/s links. The Cisco switch connects to the HAS cluster
(3) through a 1 Gbps fiber link. We conducted most benchmarking tests over IPv4, with some additional tests conducted over IPv6. The IPv6 results demonstrate that we are able to achieve results similar to IPv4, albeit with a slight decrease in performance [131].
[Figure 103: The lab network setup: (1) 31 client machines running WebBench to generate web traffic and one machine running WebBench as the test manager, connected over 100 Mbps links to (2) a fiber-capable Cisco c2948g switch, which connects over a 1 Gbps fiber link to (3) the HAS cluster; permanent 100 Mbps links connect the switch to the rest of the lab backbone]
We experienced a decrease in the number of successful transactions per second per processor ranging
between -2% and -4% [133]. We believe that this is the direct result of the immaturity of the IPv6
networking stack compared to the mature IPv4 networking stack.
[Figure: The benchmarked HAS cluster configuration: the HA tier with two master nodes in the 1+1 active/hot-standby model, and the SSA tier with traffic nodes 1 through 16]
5.4 Test-0: Experiments with One Standalone Traffic Node
This test consists of generating web traffic to a single standalone server running the Apache web
server software. This test reveals the performance limitation of a single node. We use the results of
this test to define the baseline performance. Apache 2.0.35 was running on this node, and the NFS server hosting the document repository was running on the same network segment.
Table 10 presents the results of the benchmark with a single server node. The results of Test-0 are consistent with the tests conducted in 2002 and 2003 with an older version of Apache (Section 3.7). The main lesson to learn from this benchmark is that the maximum capacity of a standalone server is an average of 1,033 requests per second. If the server receives requests beyond its baseline capacity, it becomes overloaded and unable to respond to all of them; hence the high number of failed requests illustrated in the table below. Table 10 presents the number of clients generating web traffic, the number of requests per second the server completed, and the throughput. WebBench generates this table automatically as it collects the results of the benchmarking test.
Table 10: The performance results of one standalone processor running the Apache web server
Figure 105: The results of benchmarking a standalone processor -- transactions per second
In Figure 105, we plot the results from Table 10: the number of clients versus the number of requests per second. We notice that as we reach 16 clients, Apache is unable to process additional incoming web requests and the scalability curve levels off. Even though we are increasing the number of clients generating traffic, the application server has reached its maximum capacity and is unable to process more requests. From this exercise, we conclude that the maximum number of requests per second we can achieve with a single processor is 1,035. We use this number to measure how our cluster scales as we add more processors.
Figure 106 presents the throughput achieved with one processor. We plot the results from Table 10: the number of clients versus the throughput in KB/s. The maximum throughput possible with a single processor averages around 5,800 KB/s. In addition, WebBench provides statistics about failed requests. Table 10 presents the number of clients generating traffic and the number of failed requests. Apache starts rejecting incoming requests when we reach 16 simultaneous WebBench clients generating over 1,300 requests per second.
Figure 106: The throughput benchmarking results of a standalone processor
Figure 107: The number of failed requests per second on a standalone processor
Figure 107 illustrates the curve of successful requests per second combined with the curve of failed
requests per second. As we increase the number of clients generating traffic to the processor, the
number of failed requests increases. Based on the benchmarks with a single node, we can draw two
main conclusions. The first is that a single processor can process up to one thousand requests per
second before it reaches its threshold. The second conclusion is that after reaching the threshold, the
application server starts rejecting incoming requests.
The traffic nodes in the HAS cluster start rejecting new incoming requests once WebBench generates 2,073 requests per second (1,976 successful requests versus 97 failed requests). As WebBench adds more web clients to generate traffic, we notice an increase in failed requests, while the number of successful requests remains almost constant, ranging between 2,030 and 2,089 requests per second.
Figure 108: The number of successful requests per second on a HAS cluster with four nodes
Figure 108 presents the results with a 4-processor cluster, showing the number of transactions per second versus the number of clients. Figure 109 shows the throughput curve, illustrating the throughput achieved with a 4-processor cluster as we increase the number of client machines generating traffic to the cluster.
Figure 109: The throughput results (KB/s) on a HAS cluster with four nodes
Figure 110: The number of failed requests per second on a HAS cluster with four nodes
Figure 110 shows the curve of successful requests per second combined with the curve of failed requests per second. As we increase the number of clients generating traffic to the cluster, the number of failed requests increases.
Table 12: The results of benchmarking a HAS cluster with six nodes
The maximum number of successful requests per second is 4,220, and the maximum throughput
reached is 26,491 KB/s.
Figure 111: The number of successful requests per second on a HAS cluster with six nodes
Figure 112: The throughput results (KB/s) on a HAS cluster with six nodes
Figure 113: The number of failed requests per second on a HAS cluster with six nodes
Figure 113 presents the curve of successful requests per second combined with the curve of failed requests per second. As we increase the number of clients generating traffic to the cluster, the number of failed requests increases.
Number of Clients    Requests Per Second    Throughput (Bytes/Sec)    Throughput (KBytes/Sec)
32_clients 3339 21372529 20872
36_clients 3660 23498460 22948
40_clients 4042 25416830 24821
44_clients 4340 26272234 25656
48_clients 4560 27561631 26916
52_clients 4800 29637244 28943
56_clients 5090 31580773 30841
60_clients 5352 33656387 32868
64_clients 5674 35681682 34845
68_clients 5930 37298145 36424
72_clients 6324 39770012 38838
76_clients 6641 41770148 40791
80_clients 6910 43462088 42443
84_clients 7211 45550281 44483
88_clients 7460 46789359 45693
92_clients 7680 48292606 47161
96_clients 7871 49192039 48039
100_clients 8052 50833660 49642
104_clients 8158 51116698 49919
100_clients 8209 51311680 50109
104_clients 8278 52060159 50840
92_clients 8293 52154505 50932
96_clients 8310 52267720 51043
100_clients 8312 52280300 51055
104_clients 8307 52248851 51024
108_clients 8316 52299169 51073
112_clients 8313 52280300 51055
114_clients 8311 52274010 51049
118_clients 8310 52261431 51037
122_clients 8306 52242561 51018
126_clients 8319 52318038 51092
130_clients 8302 52217403 50994
134_clients 8308 52255141 51030
138_clients 8311 52267720 51043
142_clients 8312 52280300 51055
Figure 114 presents the curve of performance illustrating the number of successful requests per
second achieved with a HAS cluster that consists of two master nodes and eight traffic nodes. The
master nodes are in the 1+1 active/standby model and the traffic nodes follow the N-way redundancy
model, where all traffic nodes are active. Figure 115 shows the throughput curve of the 10-processor
HAS cluster.
Figure 114: The number of successful requests per second on a HAS cluster with 10 nodes
Figure 115: The throughput results (KB/s) on a HAS cluster with 10 nodes
5.8 Test-4: Experiments with an 18-node HAS Cluster
This test consists of generating web traffic to a HAS cluster made up of two master nodes and 16 traffic nodes. It is the largest test we conducted and consists of 18 nodes in the HAS cluster and 32 machines in the benchmarking environment, 31 of which generate traffic. Figure 116 presents the number of successful transactions per second achieved with the 18-processor HAS cluster.
[Chart: Requests Per Second; x-axis: Number of Clients; y-axis: Requests per Second (0 to 18,000)]
Figure 116: The number of successful requests per second on a HAS cluster with 18 nodes
In this test, the HAS cluster with 18 nodes achieved 16,001 successful requests per second, an
average of 1,000 successful requests per second per traffic node in the HAS cluster.
[Chart: Throughput (KBytes/Sec), 2 Master Nodes and 16 Traffic Nodes; x-axis: Number of Clients; y-axis: Throughput (0 to 120,000 KB/s)]
Figure 117: The throughput results (KB/s) on a HAS cluster with 18 nodes
Traffic Cluster Nodes Total Transactions Average Transactions per Traffic Node
1 1032 1032
2 2068 1034
4 4143 1036
8 8143 1017
16 16001 1000
Table 14: The summary of the benchmarking results of the HAS architecture prototype
For each testing scenario, we recorded the maximum number of requests per second that each configuration supported. When we divide this number by the number of processors, we get the maximum number of requests that each processor can process per second in each configuration. Table 14 presents the number of successful transactions per traffic node. The total transactions column is the total number of successful transactions of the full HAS cluster as reported by WebBench. The average transactions per traffic node column is the average number of successful transactions served by a single traffic node in the HAS cluster.
[Figure 118 chart: Number of transactions per second served by all the traffic nodes in the HAS cluster. Total number of transactions in the HAS cluster: 1032, 2068, 4143, 8143, 16001; average number of transactions per traffic node: 1032, 1034, 1036, 1017, 1000.]
Figure 118 presents the scalability of the prototyped HAS cluster architecture. Starting with one processor, we established the baseline performance to be 1,032 requests per second. Next, we set up the HAS prototype and performed benchmarking tests as we scaled the number of traffic nodes from two to 16. The HAS cluster maintained an average of 1,000 requests per second per traffic node. As we scaled by adding more traffic nodes to the SSA tier, we lost 3.1% of the baseline performance per traffic node (1 - 1000/1032 ≈ 3.1%), as defined in Section 5.4. These results represent an improvement compared to the 40% decrease in performance experienced with a cluster built using existing software that uses traditional methods of scaling (Section 3.9).
[Figure 119 chart: Transactions per second per processor versus the number of processors in the cluster (1, 2, 4, 8, 16), showing 1032, 1034, 1036, 1017, and 1000 transactions per second per processor, respectively.]
Figure 119 illustrates the scalability chart of the HAS architecture prototype. The results demonstrate close to linear scalability as we increased the number of traffic nodes up to 16. They show that we were able to scale from a standalone single node performing an average maximum of 1,032 requests per second to 18 nodes in the HAS cluster (16 of them serving traffic) with an average of 1,000 requests per second per traffic node. The scaling is achieved with a 3.11% decrease in performance compared to the baseline performance.
result matrix. However, since we do not have access to specialized HA testing tools, we performed basic test scenarios to ensure that the claimed HA support is provided and that it works as described. An important feature of a highly available system is its ability to continue providing service even when a cluster sub-system fails. Our testing strategy was based on provoking common faults and observing how they affect the service and whether they lead to service downtime. In our case, the system is the HAS architecture prototype, and the cluster nodes run web servers that provide service to web users. If there is service downtime, users do not get replies to their web requests.
The following sub-sections discuss the high availability testing for the HAS cluster and cover testing
the connectivity (Ethernet connection and routers), data availability (redundant NFS server), master
node, and traffic node availability.
[Figure 120 diagram: The high availability test experiments, showing a traffic node (web server application, system software with EthD, TCD, and TMD, interconnect protocol and technology, processor) and master node 1 (system software with EthD and HBD, with the heartbeat daemon also running on master node 2), routers 1 and 2, Ethernet cards 1 and 2, and LAN 1 and LAN 2, with numbered experiment markers on the affected components.]
The failures tested include Ethernet daemon failure, Ethernet adapter failure, router failure, discontinued communication between the traffic manager and the traffic client daemon, and discontinued communication between the heartbeat instances running on the master nodes. The experiments included provoking a failure to monitor how the HAS cluster reacts to the failure, how the failure affects the service provided, and how the cluster recovers from the failure.
5.10.1.4 Discontinued Communication between the TM and the TC
This test case examines the scenario where we disrupt the communication between the TM and the
TC (Figure 120 – experiment 4). As a result, the TM stops receiving load alerts from the TC. After a
predefined timeout, the TM removes the traffic node from its list of available traffic nodes and stops
forwarding traffic to it. When we restore communication between the TM and the TC, the TC starts
sending its load messages to the TM. The TM then adds the traffic node to its list of available nodes
and starts forwarding traffic to it. Section 4.27.9 examines a similar scenario.
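The behavior described above is essentially a per-node liveness timeout driven by the load alerts. The sketch below is a minimal illustration of that logic, not the prototype's actual code; the names (node_t, LOAD_TIMEOUT_SEC, tm_on_load_alert, tm_check_nodes) and the timeout value are assumptions made for the example.

#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define LOAD_TIMEOUT_SEC 5    /* illustrative value, not the prototype's setting */

/* Per-traffic-node record kept by the traffic manager (TM). */
typedef struct {
    int    id;           /* traffic node identifier                  */
    time_t last_alert;   /* time of the last load alert from the TC  */
    bool   available;    /* currently in the TM's distribution list  */
} node_t;

/* Called whenever a load alert arrives from the traffic client (TC). */
void tm_on_load_alert(node_t *n)
{
    n->last_alert = time(NULL);
    n->available  = true;     /* node (re-)added: TM resumes forwarding */
}

/* Called periodically by the TM to expire nodes that stopped reporting. */
void tm_check_nodes(node_t nodes[], size_t count)
{
    time_t now = time(NULL);

    for (size_t i = 0; i < count; i++) {
        if (nodes[i].available &&
            now - nodes[i].last_alert > LOAD_TIMEOUT_SEC)
            nodes[i].available = false;   /* stop forwarding traffic to it */
    }
}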
server daemons running on master nodes were to crash. For this purpose, we have implemented
redundancy in the NFS server code. The two test cases we experimented with are shutting down the
NFS server daemon on a master node and disconnecting the master node from the network. In both
scenarios, there was no interruption to the service provided. Instead, there was a delay ranging
between 450 ms and 700 ms to receive the requested document.
[Diagram: The redundant NFS setup, with the primary NFS server daemon running on master node A and the secondary NFS server daemon running on master node B, both exporting /mnt/CommonNFS and connected to LAN 1 and LAN 2.]
language for SPNP called CSPL (C-based SPN Language) which is an extension of the C
programming language with additional constructs that facilitate easy description of SPN models.
Additionally, if the user does not want to describe his model in CSPL, a graphical user interface is
available to specify all the characteristics as well as the parameters of the solution method chosen to
solve the model [138].
[Figure 122 diagram panels: server sub-model, network connection sub-model, and clients sub-model]
Figure 122: The modeled HA-OSCAR architecture, showing the three sub-models
Figure 123 shows a screen shot of the SPNP modeling tool. The HA-OSCAR team also studied the
overall cluster uptime and the impact of different polling interval sizes in the fault monitoring
mechanism.
E[X(t)] = \sum_{k \in \tau} r_k \, \pi_k(t)

where r_k represents the reward rate assigned to state k of the SRN, τ is the set of tangible markings, and π_k(t) is the probability of being in marking k at time t [138][139].
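One standard way this reward measure is used in availability studies (stated here as an illustration, not as the exact reward assignment used in the HA-OSCAR model) is to assign a reward rate of one to operational markings and zero to failed markings, so that the expected reward at time t reduces to the instantaneous availability:

r_k = \begin{cases} 1, & \text{if marking } k \text{ is operational} \\ 0, & \text{otherwise} \end{cases}
\quad\Longrightarrow\quad
A(t) = \sum_{k \in \mathcal{U}} \pi_k(t),

where \mathcal{U} \subseteq \tau denotes the set of operational (up) markings.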
System Configuration (N)    Quorum Value (Q)    System Availability (A)    Mean cluster down time (t)
4     3    0.999933475091    34.9654921704
6     4    0.999933335485    35.0388690840
8     5    0.999933335205    35.0390162520
16    9    0.999933335204    35.0390167776
We notice that the system availabilities for the various configurations are very close, within a small range of difference. After we introduce the quorum voting mechanism in the client sub-model, the system availability is not sensitive to changes in the client configuration. When we add more clients to improve the system performance, the availability of the system remains almost unchanged. In Table 16, as N increases in the first column and we keep the value of Q at N/2+1, the system availability in the third column remains almost the same as we increase the number of nodes in the system.
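As a quick consistency check on the table, assuming the quorum Q is simply a strict majority of the N clients:

Q = \left\lfloor \tfrac{N}{2} \right\rfloor + 1,
\qquad
N = 8 \;\Rightarrow\; Q = \left\lfloor \tfrac{8}{2} \right\rfloor + 1 = 5,

which matches the N = 8 row of the table; the other rows (N = 4, 6, 16 giving Q = 3, 4, 9) follow the same rule.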
Figure 124 illustrates the instantaneous availabilities of the system when it has eight clients and the quorum is five. The modeling and availability measurements using SPNP provided the calculated instantaneous availabilities of the system and its parameters.
Figure 125 illustrates the total availability (including planned and unplanned downtime) improvement analysis of the HA-OSCAR architecture versus the single head node Beowulf architecture [140]. The results show a steady-state system availability of 99.9968%, compared to 92.387% availability for a Beowulf cluster with a single head node [140]. Additional benefits include higher serviceability, such as the ability to upgrade incrementally and hot-swap cluster nodes, the operating system, services, applications, and hardware, which further reduces planned downtime and benefits the overall aggregate performance.
[Figure 125 chart: The HA-OSCAR vs. the Beowulf architecture, total availability impacted by service nodes, plotted against mean time to failure (hr). Beowulf availability: 0.905797, 0.915751, 0.920810, 0.922509, 0.923361, 0.923873 and HA-OSCAR availability: 0.999684, 0.999896, 0.999951, 0.999962, 0.999966, 0.999968 for MTTF values of 1000, 2000, 4000, 6000, 8000, and 10000 hours. Model assumptions: scheduled downtime = 200 hrs, nodal MTTR = 24 hrs, failover time = 10 s, and during maintenance on the head node the standby node acts as primary.]
Figure 125: Availability improvement analysis of HA-OSCAR versus the Beowulf architecture
5.11.4 Discussion
The HA-OSCAR architecture proof-of-concept implementation and the experimental and analysis results suggest that the HA-OSCAR architecture offers a significant enhancement and a promising solution for providing a highly available Beowulf-class cluster architecture [140][142][143]. The availability of the experimental system improves substantially from 92.387% to 99.9968%. The polling interval for failure detection shows a linear relationship with the total cluster availability.
The goal of the HA-OSCAR project is to enhance a Beowulf cluster system for mission critical
applications, to achieve high availability and eliminate single points of failure, and to incorporate
self-healing mechanisms, failure detection and recovery, automatic failover and failback.
On March 23, 2004, the HA-OSCAR group announced the HA-OSCAR 1.0 release, with over 5000
downloads within the first 24 hours of the announcement. It provides an installation wizard and a
web-based administration tool that allows a user to create and configure a multi-head Beowulf cluster.
Furthermore, the HA-OSCAR 1.0 release supports high availability capabilities for Linux Beowulf
clusters. To achieve high availability, the HA-OSCAR architecture adopts component redundancy to
eliminate SPOF, especially at the head node. The HA-OSCAR architecture also incorporates a self-
healing mechanism, failure detection and recovery, automatic failover and failback [146]. In addition,
it includes a default set of monitoring services to ensure that critical services, hardware components,
and important resources are always available at the control node.
[Figure 126 diagram: A Beowulf cluster, showing the clients, the head node, the router, and the compute nodes.]
Figure 126 illustrates the architecture of a Beowulf cluster. However, the single head node of the Beowulf cluster is a single point of failure, as is the cluster communication, where an outage of either can render the entire cluster unusable. There are various techniques to implement a cluster architecture with high availability. These techniques include active/active, active/standby (hot standby), and active/cold standby. In the active/active model, both head nodes simultaneously provide services to external requests. If one head node goes down, the other node takes over total control. A hot-standby head node, on the other hand, monitors system health and only takes over control if there is an outage at the primary head node. The cold standby architecture is similar to the hot standby, except that the backup head node is activated from a cold start.
The key effort focused on simplicity by supporting self-cloning of the cluster master node (redundancy and automatic failover). While the aforementioned failover concepts are not new, HA-OSCAR's effortless installation and combined HA and HPC architecture are unique, and its 1.0 release is the first known field-grade HA Beowulf cluster release [34]. The HA-OSCAR experimental and analysis results, discussed in Section 5.11, suggested a significant improvement in availability from the dual-head architecture [145].
Figure 127 illustrates the HA-OSCAR architecture. The HA-OSCAR architecture deploys duplicate master nodes to offer server redundancy, following the active/standby approach, where one primary master node is active and the second master node is on standby [35]. Each node in the HA-OSCAR architecture has two network interface cards (NIC): one has a public network address, and the other is attached to a private local network.
The HA-OSCAR project uses the SystemImager [147] utility for building and storing system images, as well as providing a backup for disaster recovery purposes. The HA-OSCAR 1.0 release supports high availability capabilities for Linux Beowulf clusters. It provides a graphical installation wizard and a web-based administration tool that allow the administrator of an HA-OSCAR cluster to create and configure a multi-head Beowulf cluster. In addition, HA-OSCAR includes a default set of monitoring services to ensure that critical services, hardware components, and certain resources are always available at the master node. The current version of HA-OSCAR, 1.0, supports active/standby for the head nodes.
[Figure 127 diagram: The HA-OSCAR architecture, showing users on the public network, the primary and standby head nodes connected by heartbeat, optional reliable storage with redundant image servers sitting outside the cluster, redundant routers (Router 1 and Router 2) and network connections, and the compute nodes.]
other hand, the HAS architecture focuses on client/server applications that run over the web and are characterized by short transactions, short response times, a thin control path, and static data delivery.
5.14.6 Failure Discovery and Recovery Mechanisms
Both the HA-OSCAR architecture and the HAS architecture support failure detection and recovery mechanisms. However, these mechanisms target different system components with different failure detection and recovery times. The failure recovery in HA-OSCAR takes between 3 seconds and 5 seconds [148], compared to a failure recovery ranging between 200 ms and 700 ms in the HAS cluster, depending on the type of failure.
5.15 HAS Architecture Impact on Industry
The Open Source Development Labs (OSDL) is a not-for-profit organization founded in 2000 by IT and telecommunication companies to accelerate the growth and adoption of Linux-based platforms and standardized platform architectures. The Carrier Grade Linux (CGL) initiative at OSDL aims to standardize the architecture of telecommunication servers and enhance the Linux operating system for such platforms.
The CGL Working Group has identified three main categories of application areas into which they
expect the majority of applications implemented on CGL platforms to fall. These application areas
include gateways, signaling, and management servers, and have different characteristics. A gateway,
for instance, processes a large number of small requests that it receives and transmits them over a
large number of physical interfaces. Gateways perform in a timely manner, close to hard real time.
Signaling servers require soft real time response capabilities, and manage tens of thousands of
simultaneous connections. A signaling server application is context switch and memory intensive,
because of the quick switching and capacity requirements to manage large numbers of connections.
Management applications are data and communication intensive. Their response time requirements
are less stringent compared to those of signaling and gateway applications.
Figure 128: The CGL cluster architecture based on the HAS architecture
The OSDL released version 2.0 of the Carrier Grade specifications in October 2003. Version 2.0 of
the specifications introduced support for clustering requirements and the cluster architecture is based
on the work presented in this dissertation. Figure 128 illustrates the CGL architecture, which is based
on the HAS architecture. In June 2005, the OSDL released version 3.1 of the specification. The
Carrier Grade architecture is a standard for the type of communication applications presented earlier.
Chapter 6
Contributions, Future Work, and Conclusion
This chapter presents the contributions of the work, future work, and the conclusions.
6.1 Contributions
The initial goal of this dissertation was to design and prototype the necessary technology to demonstrate the feasibility of a web cluster architecture that is highly available and able to scale linearly for up to 16 processors to meet increasing web traffic.
We achieved our goal with the HAS architecture, which supports continuous service through its high availability capabilities and provides close to linear scalability through the combination of multiple parameters, including efficient traffic distribution, the cluster virtual IP layer, and the connection synchronization mechanism. Figure 129 provides an illustration of other contributions grouped into distinct areas: application availability, network availability, data availability, master node availability, connection synchronization, the single cluster IP interface, and traffic distribution. Since the HAS architecture prototype follows the building block approach, these contributions can be reused in different environments outside of the HAS architecture and can function completely independently outside of a cluster environment.
The HAS architecture is based on loosely coupled nodes and provides a building block approach for designing and implementing software components that can be reused in other environments and architectures. It provides the infrastructure for cluster membership, cluster storage, fault management, recovery mechanisms, and traffic distribution. It supports various redundancy models for each tier of the architecture and allows seamless software and hardware upgrades without interruption of service. In addition, the HAS architecture is able to maintain the baseline performance for up to 18 cluster nodes (16 traffic nodes), validating close to linear scaling.
The HAS architecture integrates these contributions within a framework that allows us to build scalable and highly available web clusters. The following sub-sections examine these contributions.
redundancy model, the architecture does not force us to deploy traffic nodes in pairs. As a result, we
can deploy exactly the right number of traffic nodes to meet our traffic demands without having
traffic nodes sitting idle. In addition, we are able to scale each tier of the architecture independently of
the other tiers.
6.1.5 High Availability
The architecture tiers support two essential redundancy models: the 1+1 (active/standby and active/active) and the N-way redundancy models. As a result, the HAS cluster architecture achieves high availability through redundancy at various levels of the architecture: network, processors, application servers, and data storage. This allows us to perform actions such as reconfiguring network settings and upgrading the hardware, the software, and the operating system without service downtime. Other areas of contribution include mechanisms to detect and recover from Ethernet failures, master node failures, NFS failures, and application (web server software) failures.
more in the weeks and months after. This public rush is an indication of the community of users who are using, testing, and deploying the HA-OSCAR architecture for their specific needs. It is also important to note that there is an active community of users on the HA-OSCAR project discussion board and mailing list. The HA-OSCAR architecture is based on the HAS architecture and provides an open source implementation that is freely available for download with a substantial user community.
Section 5.15 described our contributions to the Carrier Grade specifications that define an architectural model for telecommunication platforms providing voice and data communication services. The Carrier Grade architecture model is an industry standard largely based on the HAS architecture.
rsync utility provides the synchronization between the two NFS servers. Table 17 lists the Linux kernel files modified to support the NFS redundancy.
The implementation of the HA NFS server is stable; however, it requires upgrading to the latest stable Linux kernel release, version 2.6.
Furthermore, the HA NFS implementation requires a new implementation of the mount program to support mounting multi-host NFS servers, instead of a single file server mount. This functionality is provided: the addresses of the two redundant NFS servers are passed as parameters to the new mount program, and then to the kernel. The new command line for mounting two NFS servers looks as follows:
% mount -t nfs server1,server2:/nfs_mnt_point
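As an illustration of the extended device-string syntax (this is a hypothetical sketch, not the thesis's actual mount implementation), the new mount program has to split the server1,server2:/export form into the two redundant server addresses and the exported path before handing them to the kernel:

#include <stdio.h>
#include <string.h>

/* Split "server1,server2:/export/path" into its three parts.
 * Returns 0 on success, -1 if the string is not in that form. */
static int parse_multihost_nfs(const char *spec,
                               char *srv1, char *srv2, char *path,
                               size_t len)
{
    char buf[256];
    if (strlen(spec) >= sizeof(buf))
        return -1;
    strcpy(buf, spec);

    char *colon = strchr(buf, ':');
    char *comma = strchr(buf, ',');
    if (!colon || !comma || comma > colon)
        return -1;

    *comma = '\0';
    *colon = '\0';
    snprintf(srv1, len, "%s", buf);           /* primary NFS server   */
    snprintf(srv2, len, "%s", comma + 1);     /* secondary NFS server */
    snprintf(path, len, "%s", colon + 1);     /* exported directory   */
    return 0;
}

int main(void)
{
    char s1[64], s2[64], p[64];
    if (parse_multihost_nfs("server1,server2:/nfs_mnt_point",
                            s1, s2, p, sizeof(s1)) == 0)
        printf("primary=%s secondary=%s export=%s\n", s1, s2, p);
    return 0;
}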
port. When the link goes up again, the daemon waits to make sure the connection does not drop again,
and then switches back to the primary Ethernet port.
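The failback wait can be pictured as a simple stabilization loop. The sketch below is illustrative only; it assumes the daemon can poll link state through the Linux carrier flag under /sys/class/net and that a hypothetical switch_active_port() performs the actual switchover, and the poll interval and stabilization window are made-up values rather than the daemon's real settings.

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define POLL_INTERVAL_SEC 1    /* illustrative polling period                 */
#define STABLE_WINDOW_SEC 10   /* link must stay up this long before failback */

/* Read the carrier flag of an interface: returns true when the link is up. */
static bool link_is_up(const char *ifname)
{
    char path[128];
    int  carrier = 0;

    snprintf(path, sizeof(path), "/sys/class/net/%s/carrier", ifname);
    FILE *f = fopen(path, "r");
    if (!f)
        return false;
    if (fscanf(f, "%d", &carrier) != 1)
        carrier = 0;
    fclose(f);
    return carrier == 1;
}

/* Placeholder for the real switchover performed by the Ethernet daemon. */
static void switch_active_port(const char *ifname)
{
    printf("failing back to primary port %s\n", ifname);
}

/* Wait until the primary link has stayed up for a full stabilization
 * window, then switch traffic back to it; any drop restarts the wait. */
void failback_when_stable(const char *primary)
{
    int stable = 0;

    while (stable < STABLE_WINDOW_SEC) {
        sleep(POLL_INTERVAL_SEC);
        if (link_is_up(primary))
            stable += POLL_INTERVAL_SEC;
        else
            stable = 0;
    }
    switch_active_port(primary);
}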
In addition to this contribution, smaller supporting contributions included fixes and rewrites of the Ethernet device driver; all of these supporting contributions are now integrated into the original Ethernet device driver code in the Linux kernel.
Further improvements to the current implementation include stabilizing the source code and optimizing the performance of the Ethernet daemon, which includes optimizing the failure detection time of the Ethernet driver. In addition, the source code of the Ethernet redundancy daemon is to be upgraded to run on the latest release of the Linux kernel, version 2.6.
6.1.9.6 Cluster virtual IP Interface (CVIP)
The CVIP interface is a cluster virtual IP interface that presents the HAS cluster as a single entity to
the outside world, making all nodes inside the cluster transparent to end users. Section 4.21 discusses
the CVIP interface. Sections 6.2.7 and 6.2.8 discuss the future work items for CVIP.
6.1.9.7 LDirectord
The improvements and adaptations to the LDirectord module include capabilities to connect to the
traffic manager and the traffic client. The current implementation is not fully optimized; rather, the
implementation is a working prototype that requires further testing and stabilization. Furthermore, as
future work, we would like to minimize the number of sequential steps to improve the performance,
and investigate the possibility of integrating the LDirectord module with the traffic client running on
the traffic node.
- We ported the Apache and Tomcat web servers to support IPv6 and performed benchmarking tests to compare with benchmarking tests of Apache and Tomcat running over IPv4.
- As part of the dissertation, we needed a flexible cluster installation infrastructure that would help us build and set up clusters within hours instead of days, and that would accommodate nodes with disks, diskless nodes, and network boot. This infrastructure did not exist, and we had to design and build it from scratch. This cluster installation infrastructure is now being used at the Ericsson Research lab in Montréal, Canada.
- Other contributions include the influence of the work on the industry. The Carrier Grade Linux specifications are industry standards with a defined architecture for telecom platforms and applications running on telecom servers in mission critical environments. The Carrier Grade architecture relies on the work proposed in this thesis, with minor modifications to accommodate specific types of telecommunication applications. The author of the thesis is publicly recognized as a contributor to the Carrier Grade specification. Furthermore, since January 2005, he has been employed by the OSDL to focus on advancing the specifications and the architecture.
traffic distribution mechanism. The goal of this activity is to investigate the source of bottlenecks and explore solutions.
We expect to achieve higher performance levels when both master nodes receive incoming traffic and forward it to traffic nodes. Furthermore, we would like to benchmark the HAS architecture prototype using specialized storage nodes and compare the results to those obtained when using the HA NFS implementation to provide storage. These tests will give us insights into the most efficient storage solution.
6.2.4 Redundancy Configuration Manager
The current prototype of the HAS architecture does not support dynamic changes to the redundancy configurations, nor transitioning from one redundancy configuration to another. This feature would be very useful when the nodes reach a certain pre-defined threshold: the redundancy configuration manager would then, for example, transition the HA tier from the 1+1 active/standby model to the 1+1 active/active model, allowing both master nodes to share and service incoming traffic. Such a transition in the current HAS architecture prototype requires stopping all services on master nodes, updating the configuration files, and restarting all software components running on master nodes.
The redundancy configuration manager would be the entity responsible for switching the redundancy configuration of the cluster tiers from one redundancy model to another, as illustrated in the sketch below. For instance, when the SSA tier is in the N+M redundancy model, the redundancy configuration manager would be responsible for activating a standby traffic node when an active traffic node becomes unavailable. As such, the configuration manager should be aware of the active traffic nodes, the states of their components, and their corresponding standby traffic nodes.
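To make the proposal concrete, the fragment below sketches one possible representation of the tier redundancy models and of the standby-activation step in the N+M case. It is an illustrative design sketch only, since no such component exists in the current prototype, and all type and function names are invented for the example.

#include <stdbool.h>
#include <stddef.h>

/* Redundancy models supported by the HAS tiers (see Chapter 4). */
typedef enum {
    RED_1PLUS1_ACTIVE_STANDBY,
    RED_1PLUS1_ACTIVE_ACTIVE,
    RED_N_WAY,
    RED_N_PLUS_M
} redundancy_model_t;

typedef struct {
    int  id;
    bool active;      /* currently serving traffic */
    bool available;   /* healthy and reachable     */
} tnode_t;

typedef struct {
    redundancy_model_t model;
    tnode_t           *nodes;
    size_t             count;
} tier_config_t;

/* In the N+M model, replace a failed active node with a standby one.
 * Returns true if a standby was activated. */
bool rcm_handle_node_failure(tier_config_t *tier, int failed_id)
{
    if (tier->model != RED_N_PLUS_M)
        return false;

    for (size_t i = 0; i < tier->count; i++)
        if (tier->nodes[i].id == failed_id)
            tier->nodes[i].active = tier->nodes[i].available = false;

    for (size_t i = 0; i < tier->count; i++) {
        if (!tier->nodes[i].active && tier->nodes[i].available) {
            tier->nodes[i].active = true;   /* promote the standby */
            return true;
        }
    }
    return false;   /* no standby left: raise an alarm instead */
}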
6.2.7 Merging the Functionalities of the CVIP and Traffic Management Scheme
One possibility for further investigation is to couple the functionalities of the cluster virtual IP interface with the traffic management scheme. With the current implementation, incoming traffic arrives at the cluster through the cluster virtual IP interface and is then handled by the traffic manager before it reaches its final destination on one of the traffic nodes. A future enhancement is to eliminate the traffic management scheme and incorporate traffic distribution within the CVIP interface. Eliminating the traffic manager daemon and integrating its functionalities with the CVIP would result in increased performance and a faster response time, as we eliminate one serialized step in managing an incoming request.
In short, our proposal is to combine the functionalities of the cluster virtual IP interface and the traffic distribution mechanism to eliminate a forwarding step between web users and the application server running the web server.
have two main challenges in this area: the first is to provide the virtualization of the cluster zones, and the second is the ability to dynamically migrate cluster nodes among several zones based on traffic trends.
[Figure 131 diagram: Three snapshots (1, 2, 3) of a HAS cluster with a cluster VIP, master nodes A and B, and HTTP, FTP, streaming, and storage nodes connected to LAN 1 and LAN 2, showing nodes being moved from the HTTP cluster zone to the FTP cluster zone.]
Figure 131 illustrates the concept of cluster zones. The cluster in the figure consists of three zones: one provides HTTP service, another provides FTP service, and the third provides streaming service. In (1), the FTP cluster zone is receiving traffic, while some nodes in the HTTP cluster zone are sitting idle due to low traffic. The traffic manager running on the master node disconnects (2) two nodes from the HTTP cluster zone and transitions (3) them into the FTP cluster zone to accommodate the increase in FTP traffic. There are several possible areas of investigation, such as defining cluster zones as logical entities in a larger cluster, dynamically selecting nodes to be part of a specialized cluster zone, transitioning nodes into a new zone, and investigating queuing theories suitable for such usage models.
6.3 Conclusion
This dissertation covers a range of technologies for highly available and scalable web clusters. It addresses the challenges of designing a scalable and highly available web server architecture that is flexible, component-based, reliable, and robust under heavy loads.
The first chapter provides a background on Internet and web servers, scalability challenges, and
presents the objectives and scope of the study.
The second chapter looks at clustering technologies, scalability challenges, and related work. We
examine clustering technologies and techniques for designing and building Internet and web servers.
We argue that traditional standalone server architectures fail to address the scalability and high availability needs of large-scale Internet and web servers. We introduce software and hardware clustering technologies, their advantages and drawbacks, and discuss our experience prototyping a highly available and scalable clustered web server platform. We present and discuss the various
ongoing research projects in the industry and academia, their focus areas, results, and contributions.
In the third chapter, the thesis summarizes the preparatory technical work with the prototyped web
cluster that uses existing components and mechanisms.
Chapter four presents and discusses the HAS architecture, its components and their characteristics,
eliminating single points of failure, the conceptual, physical, and scenario architecture views,
redundancy models, cluster virtual interface and traffic distribution scheme. The HAS architecture
consists of a network of server nodes connected over highly available networks. A virtual IP interface
provides a single point of entry to the cluster. The software and hardware components of the
architecture do not present a SPOF. The HAS cluster architecture supports multiple redundancy models for each of its tiers, allowing the most suitable redundancy model to be chosen for each specific deployment scenario. The HAS architecture manages incoming traffic from web clients through a lightweight, efficient, and dynamic traffic distribution scheme that takes into consideration the capacity of each traffic node. Based on the performance testing we conducted, this approach has proven to be an effective method to distribute traffic.
In chapter five, we validate the scalability of the architecture. Our results demonstrate that the HAS
architecture is able to reach close to linear scaling for up to 16 processors and attain high performance
levels with robust behavior under heavy load. In addition, the chapter presents the results of the availability validation, which tests the availability features of the HAS architecture.
The final chapter illustrates the contributions and future work.
The HAS architecture brings together aspects of high availability, concurrency, dynamic resource
management and scalability into a coherent framework. Our experience and evaluation of the
architecture demonstrate that the approach is an effective way to build robust, highly available, and
scalable web clusters. We have developed an operational prototype based on the HAS architecture;
the prototype focused on building a proof-of-concept for the HAS architecture that consists of a set of
necessary system software components.
The HAS architecture relies on the integration of many system components into a well-defined and generic cluster platform. It provides the infrastructure for a cluster membership service that recognizes and manages node membership in the cluster, a cluster storage service, a fault management service that recognizes hardware and software faults and triggers recovery mechanisms, and a traffic distribution service that distributes incoming traffic across the nodes in the cluster. The HAS architecture represents a new design point for large-scale Internet and web servers that supports scalability, high availability, and high performance.
Bibliography
[1] K. Coffman, A. Odlyzko, The Growth Rate of the Internet, Technical Report, First Monday,
Volume 3 Number 10, October 1998, http://www.firstmonday.dk/issues/issue3_10/coffman
[2] E. Brynjolfsson, B. Kahin, Understanding the Digital Economy: Data, Tool, and Research, MIT
Press, October 2000
[6] E. Hansen, Email outage takes toll on Excite@Home, CNET News.com, June 28, 2000,
http://news.cnet.com/news/0-1005-200-2167721.html
[7] Bloomberg News, E*Trade hit by class-action suit, CNET News.com, February 9, 1999,
http://news.cnet.com/news/0-1007-200-338547.html
[8] W. LeFebvre, Facing a World Crisis, Invited talk at the 15th USENIX LISA System
Administration Conference, San Diego, California, USA, December 2-7, 2001
[9] British Broadcasting Corporation, Net surge for news sites, September 2001,
http://news.bbc.co.uk/hi/english/sci/tech/newsid_1538000/1538149.stm
[10] R. Lemos, Web worm targets White House, CNET News.com, July 2001,
http://news.com.com/2100-1001-270272.html
[11] The Hyper Text Transfer Protocol Standardization at the W3C, http://www.w3.org/Protocols
[13] J. Nielsen, The Need for Speed, Technical Report, March 1997,
http://www.useit.com/alertbox/9703a.html
[16] Inktomi Corporation, Web surpasses one billion documents, Press Release, January 2000
http://www.inktomi.com/new/press/2000/billion.html
[17] A. T. Saracevic, Quantifying the Internet, San Francisco Examiner, November 5, 2000,
http://www.sfgate.com
[28] G. Pfister, In Search of Clusters, Second Edition, Prentice Hall PTR, 1998
[29] The Open Group, The UNIX® Operating System: A Robust, Standardized Foundation for
Cluster Architectures, White Paper, June 2001, http://www.unix.org/whitepapers/cluster.htm
[30] I. Haddad, E. Paquin, MOSIX: A Load Balancing Solution for Linux Clusters, Linux Journal,
May 2001
[34] M. J. Brim, T. G. Mattson, and S. L. Scott, OSCAR: Open Source Cluster Application
Resources, Ottawa Linux Symposium 2001, Ottawa, Canada, July 2001
[35] J. Hsieh, T. Leng, and Y.C. Fang, OSCAR: A Turnkey Solution for Cluster Computing, Dell
Power Solutions, Issue 1, 2001, pp. 138-140
[43] Trillium Digital Systems, Distributed Fault-Tolerant and High-Availability Systems White Paper, http://www.trillium.com
[45] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC vs. LSNAT: Scalable
cluster-based Web servers, IEEE Cluster Computing, November 2000, pp. 175-185
[46] O. Damani, P. Chung, Y. Huang, C. Kintala, and Y. M. Wang, ONE-IP: Techniques for
Hosting a Service on a Cluster of Machines, IEEE Computer Networks, Volume 29, Numbers 8-
13, September 1997, pp. 1019-1027
[48] RFC 2391, Load Sharing using IP Network Address Translation (LSNAT),
http://www.faqs.org/rfcs/rfc2391.html
[49] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC and LSNAT: Two Approaches for Cluster-Based Scalable Web Servers, IEEE International Conference on Communications, June 2000, pp. 1164-1168
[56] M. Williams, EBay, Amazon, Buy.com hit by Internet attacks, Network World, February 9,
2000, http://www.nwfusion.com/news/2000/0209attack.html
[57] G. Sandoval and T. Wolverton, Leading Web Sites Under Attack, News.com, February 9,
2000, http://news.cnet.com/news/0-1007-200-1545348.html
[59] D. LaLiberte, and A. Braverman, A Protocol for Scalable Group and Public Annotations,
Computer Networks and ISDN Systems, Volume 27, Number 6, January 1995, pp. 911-918
[63] L. Aversa, and A. Bestavros, Load Balancing a Cluster of Web Servers Using Distributed
Packet Rewriting, Proceedings of IEEE International Performance Conference, Phoenix, Arizona,
USA, February 2000, pp. 24-29
[64] S. N. Budiarto, and S. Nishio, MASEMS: A Scalable and Extensible Multimedia Server, The
1999 International Symposium on Database Applications in Non-Traditional Environments,
Kyoto, Japan, November 1999, pp. 28-30
[65] C. Roe, and S. Gonik, Server-Side Design Principles for Scalable Internet Systems, IEEE
Software, Volume 19, Number 2, March/April 2002, pp. 34-41
[66] D. Norman, The Design of Everyday Things, Double-Day, New York, 1998
[67] D. Dias, W. Kish, R. Mukherjee, and R. Tewari, A Scalable and Highly Available Web
Server, Proceedings of the Forty-First IEEE Computer Society International Conference:
Technologies for the Information Superhighway, Santa Clara, California, USA, February 25-28,
1996, pp. 85-92
[68] E. Casalicchio, and S. Tucci, Static and Dynamic Scheduling Algorithms for Scalable Web
Server Farm, IEEE Network 2001, pp. 368-376
[70] H. Bryhni, E. Klovning, and O. Kure, A Comparison of Load Balancing Techniques for
Scalable Web Servers, IEEE Network, July/August 2000, pp. 58-64
[71] D. Kim, C. H. Park, and D. Park, Request Rate Adaptive Dispatching Architecture for
Scalable Internet Server, IEEE International Conference on Cluster Computing, Chemmnitz,
2000, pp. 289-296
[72] L. Aversa, and A. Bestavros, Load Balancing a Cluster of Web Servers Using Distributed
Packet Rewriting, Proceedings of the 2000 IEEE International Performance, Computing, and
Communications Conference, February 2000, pp. 24 - 29
[73] B. Ramamurthy, LSMAC vs. LSNAT: Scalable Cluster-based Web Servers, Seminar presented
at Rice University, http://www-ece.rice.edu/ece/colloq/00-01/Oct23br-00.html, October 23, 2000
[74] A. N. Murad, and H. Liu, Scalable Web Server Architectures, Technical Report BL0314500-
961216TM, Bell Labs, Lucent Technologies, December 1996
[75] E. D. Katz, M. Butler, and M. McGrath, A Scalable HTTP Server: The NCSA Prototype,
Proceedings of the 1st International WWW Conference, Geneva, Switzerland, May 25-27, 1994,
pp. 155-164
[77] D. Kim, C. H. Park, and D. Park, Request Rate Adaptive Dispatching Architecture for
Scalable Internet Server, IEEE Network 2000, pp. 289-296
[78] D. Anderson, T. Yang, V. Holmedahl, and O. Jbarra, SWEB: Towards a Scalable World Wide
Web Server on Multicomputers, Proceedings of the 10th International Parallel Processing
Symposium, Honolulu, Hawaii, USA, April 15-19, 1996, pp. 850-856
[79] E. Casalicchio, and M. Colajanni, Scalable Web Clusters with Static and Dynamic Contents,
Proceedings of the IEEE Conference on Cluster Computing, Chemnitz, Germany, November 28 –
December 1, 2000, pp. 170-177
[80] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC and LSNAT: Two
Approaches for Cluster-based Scalable Web Servers, Proceedings of the 2000 IEEE International
Conference on Communications, New Orleans, USA, June 18-22, 2000, pp. 1164-1168
[81] X. Zhang, M. Barrientos, B. Chen, and M. Seltzer, HACC: An Architecture for Cluster-Based
Web Servers, Proceedings of the 3rd USENIX Windows NT Symposium, Seattle, Washington,
USA, July 12-15, 1999, pp. 155-164
[88] A. Tucker, and A. Gupta, Process Control and Scheduling Issues for Multiprogrammed
Shared-Memory Multiprocessors, Proceedings of the 12th Symposium on Operating Systems
Principles, ACM, Litchfield Park, Arizona, USA, December 1989, pp. 159-166
[90] Microsoft Developer Network Platform SDK, Performance Data Helper, Microsoft, July
1998
[91] W. B. Ligon III, and R. Ross, Server-Side Scheduling in Cluster Parallel I/O Systems, The
Calculateurs Parallèles Journal, October 2001
[92] W.B. Ligon III, and R. Ross, PVFS: Parallel Virtual File System, Beowulf Cluster
Computing with Linux, MIT Press, November 2001, pp. 391-430
[93] B. Nishio, and S. Nishio, MASEMS: A Scalable and Extensible Multimedia Server, IEEE
Network 2000, pp. 443-450
[94] A. Mourad, and H. Liu, Scalable Web Server Architectures, Proceedings of IEEE
International Symposium on Computers and Communications, Alexandria, Egypt, July 1997, pp.
12-16
[95] M. Andreolini, V. Cardellini, and M. Colajanni, Benchmarking, Models and Tools for
Distributed Web-Server System, Proceedings of the Performance 2002, Rome, Italy, July 24-26,
2002, pp. 208-235
[98] B. Laurie, P. Laurie, and R. Denn, Apache: The Definitive Guide, O'Reilly & Associates,
1999
[103] I. Haddad, W. Hassan, and L. Tao, XWPT: An X-based Web Servers Performance Tool, the
18th International Conference on Applied Informatics, Innsbruck, Austria, February 2000, pp. 50-
55
[106] A. Ching, A. Choudhary, W. Liao, R. Ross, and W. Gropp, Noncontiguous I/O through
PVFS, Proceedings of the 2002 IEEE International Conference on Cluster Computing, September
23-26, 2002, Chicago, Illinois, USA, pp. 405-414
[107] I. Haddad, and M. Pourzandi, Open Source Web Servers Performance on Carrier-Class
Linux Clusters, Linux Journal, April 2001, pp. 84-90
[108] I. Haddad, PVFS: A Parallel Virtual File System for Linux Clusters, Linux Journal,
December 2000
[109] P. Barford, and M. Crovella, Generating Representative Web Workloads for Network and
Server Performance Evaluation, Proceedings of the ACM Sigmetrics Conference, Madison,
Wisconsin, USA, June 1998, pp. 151-160
[114] M. Andreolini, V. Cardellini, and M. Colajanni, Benchmarking Models and Tools for
Distributed Web-Server Systems, Proceedings of Performance 2002, Rome, Italy, July 24-26,
2002, pp. 208-235
[115] E. Marcus, and H. Sten, BluePrints for High Availability: Designing Resilient Distributed
Systems, Wiley, 2000
[117] I. Haddad, C. Leangsuksun, R. Libby, and S. Scott, HA-OSCAR: Towards Non-stop Services
in High End and Grid computing Environments, Poster Presentation at the Fifth Los Alamos
Computer Science Institute Symposium, New Mexico, USA, October 12-14, 2004
[118] The Open Cluster Group, How to Install an OSCAR Cluster, Technical Report, November 3,
2005, http://oscar.openclustergroup.org/public/docs/oscar4.2/oscar4.2-install.pdf
[119] A. S. Tanenbaum, and M. van Steen, Distributed Systems: Principles and Paradigms, Prentice
Hall, July 2001, pp. 371-375
[126] L. Marowsky-Brée, A New Cluster Resource Manager for Heartbeat, UKUUG LISA/Winter
Conference High Availability and Reliability, Bournemouth, UK, February 2004
[127] A. L. Robertson, The Evolution of the Linux-HA Project, UKUUG LISA/Winter Conference
High-Availability and Reliability, Bournemouth, UK, February 25-26, 2004
[128] A. L. Robertson, Linux-HA Heartbeat Design, Proceedings of the 4th International Linux
Showcase and Conference, Atlanta, October 10-14, 2000
[129] S. Horman, Connection Synchronisation (TCP Fail-Over), Technical Paper, November 2003
[131] D. Gordon, and I. Haddad, Apache talking IPv6, Linux Journal, January 2003
[133] I. Haddad, IPv6 on Linux: Ongoing Development Effort and Tutorial, Linux User and
Developer, June 2003
[134] D. Gamerman, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference,
Chapman & Hall/CRC Press, Boca Raton, Fl., USA, 1997
[135] G. Ciardo, and P. Darondeau, Applications and Theory of Petri Nets 2005, 26th International
Conference, ICATPN 2005, Miami, USA, June 20-25, 2005
[136] G. Ciardo, J. Muppala, and K. Trivedi, SPNP: Stochastic Petri Net Package, Proceedings of
the International Workshop on Petri Nets and Performance Models, IEEE Computer Society
Press, Los Alamitos, Ca., USA, December 1989, pp. 142-150
[137] H. Choi, Markov Regenerative Stochastic Petri Nets, Computer Performance Evaluation,
Vienna 1994, pp. 337-357
[138] C. Hirel, R. Sahner, X. Zang, and K. S. Trivedi, Reliability and Performability Modeling
using SHARPE 2000, Computer Performance Evaluation/TOOLS 2000, Schaumburg, US, March
2000, pp. 345-349
[139] C. Hirel, B. Tuffin, and K. S. Trivedi, SPNP: Stochastic Petri Nets Version 6.0, Computer
Performance Evaluation/TOOLS 2000, Schaumburg, US, March 2000, pp. 354-357
[140] C. Leangsuksun, L. Shen, T. Liu, H. Song, and S. Scott, Availability Prediction and Modeling
of High Availability OSCAR Cluster, IEEE International Conference on Cluster Computing, Hong
Kong, China, December 2-4, 2003, pp. 227-230
[142] C. Leangsuksun, L. Shen, T. Liu, H. Song, and S. Scott, Dependability Prediction of High
Availability OSCAR Cluster Server, The 2003 International Conference on Parallel and
Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA, 2003, pp. 23-26
[143] C. Leangsuksun, L. Shen, T. Lui, and S. L. Scott, Achieving High Availability and
Performance Computing with an HA-OSCAR Cluster, Future Generation Computer System,
Volume 21, Number 1, January 2005, pp. 597-606
[145] C. Leangsuksun, L. Shen, H. Song, S. Scott, and I. Haddad, The Modeling and Dependability
Analysis of High Availability OSCAR Cluster System, The 17th Annual International Symposium
on High Performance Computing Systems and Applications, Sherbrooke, Quebec, Canada, May
11-14, 2003
[146] I. Haddad, C. Leangsuksun, R. Libby, T. Liu, Y. Liu, and S. Scott, Highly Reliable Linux
HPC Clusters: Self-awareness Approach, Proceedings of the 2nd International Symposium on
Parallel and Distributed Processing and Applications, Hong Kong, China, December 13-15, 2004,
pp. 217-222
[148] I. Haddad, and C. Leangsuksun, Building Highly Available HPC Clusters with HA-OSCAR,
Tutorial Presentation, the 6th LCI International Conference on Clusters: The HPC Revolution
2005, Chapel Hill, NC, USA, April 2005
[149] I. Haddad, and G. Butler, Experimental Studies of Scalability in Clustered Web Systems,
Proceedings of the International Parallel and Distributed Processing Symposium 2004, Santa Fe,
New Mexico, USA, April 2004
[150] I. Haddad, Keeping up with Carrier Grade, Linux Journal, August 2004
[151] I. Haddad, Carrier Grade Server Requirements, Linux User and Developer, August 2004
[152] I. Haddad, Moving Towards Open Platforms, LinuxWorld Magazine, May 2004
[153] I. Haddad, Linux Gains Momentum in Telecom, LinuxWorld Magazine, May 2004
[154] I. Haddad, OSDL Carrier Grade Linux, O'Reilly Network, April 2004
[155] I. Haddad, CGL Platforms: Characteristics and Development Efforts, Euro-Par 2003,
Klagenfurt, Austria, August 2003
[156] I. Haddad, C. Leangsuksun, M. Pourzandi, and A. Tikotekar, Feasibility Study and Early
Experimental Results Toward Cluster Survivability, Proceedings of Cluster Computing and Grid
2005, Cardiff, UK, May 9-12, 2005
[157] D. Gordon, and I. Haddad, Building an IPv6 DNS Server Node, Linux Journal, October 2003
[158] I. Haddad, C. Leangsuksun, M. Pourzandi, and A. Tikotekar, Experimental Results in
Survivability of Secure Clusters, Proceedings of the 6th International Conference on Linux
Clusters, Chapel Hill, NC, USA 2005
[159] I. Haddad, IPv6: The Essentials You Must Know, Linux User and Developer, May 2003
[160] I. Haddad, Using Freenet6 Service to Connect to the IPv6 Internet, Linux User and
Developer, July 2003
[161] I. Haddad, C. Leangsuksun, R. Libby, T. Liu, Y. Liu, and S. L. Scott , High-Availability and
Performance Clusters: Staging Strategy, Self-Healing Mechanisms, and Availability Analysis,
Proceedings of the IEEE Cluster Conference 2004, San Diego, USA, September 20-23, 2004
[162] I. Haddad, Streaming Video on Linux over IPv6, Linux User and Developer, August 2003
[163] I. Haddad, Voice over IPv6 on Linux, Linux User and Developer, September 2003
[164] I. Haddad, NAT-PT: IPv4/IPv6 and IPv6/IPv4 Address Translation, Linux User and
Developer, October 2003
[165] I. Haddad, Design and Implementation of HA Linux Clusters, IEEE Cluster 2001, Newport
Beach, USA, October 8-11, 2001
[166] I. Haddad, Designing Large Scale Benchmarking Environments, ACM Sigmetrics 2002,
Marina Del Rey, USA, June 2002
[167] I. Haddad, Supporting IPv6 on Linux Clusters, IEEE Cluster 2002, Chicago, USA, September
2002
[168] I. Haddad, IPv6: Characteristics and Ongoing Research, Internetworking 2003, San Jose,
USA, June 2003
[169] I. Haddad, CGL Platforms: Characteristics and Development Efforts, Euro-Par 2003,
Klagenfurt, Austria, August 2003
[170] I. Haddad, Carrier Grade Linux: Status and Ongoing Work, Real World Linux 2004,
Toronto, Canada, April 2004
[171] I. Haddad, Carrier Grade Platforms: Characteristics and Ongoing Efforts, ICETE 2004,
Setúbal, Portugal, August 2004
[172] I. Haddad, and C. Leangsuksun, Building HA/HPC Clusters with HA-OSCAR, Tutorial
Presentation at the IEEE Cluster Conference, San Diego, USA, September 2004
[173] I. Haddad, C. Leangsuksun, and S. Scott, Towards Highly Available, Scalable, and Secure
HPC Clusters with HA-OSCAR, the 6th International Conference on Linux Clusters, Chapel Hill,
NC, USA, April 2005
[174] I. Haddad, and S. Scott, HA Linux Clusters: Towards Platforms Providing Continuous
Service, Linux Symposium, Ottawa, Canada, July 2005
[175] I. Haddad, and C. Leangsuksun, HA-OSCAR: Highly Available Linux Cluster at your
Fingertips, IEEE Cluster 2005, Boston, USA, September 2005
[176] I. Haddad, HA Linux Clusters, Open Cluster Group 2001, Illinois, USA, March 2001
[177] I. Haddad, Combining HA and HPC, Open Cluster Group 2002, Montréal, Canada, June 2002
[178] I. Haddad, Towards Carrier Grade Linux Platforms, USENIX 2004, Boston, USA, June 2004
[179] I. Haddad, Towards Unified Clustering Infrastructure, Linux World Expo and Conference,
San Francisco, USA, August 2005
[180] I. Haddad, Carrier Grade Linux: Status and Ongoing Work, Real World Linux 2004,
Toronto, Canada, April 2004
[181] I. Haddad, Carrier Grade Platforms: Characteristics and Ongoing Efforts, ICETE 2004,
Setúbal, Portugal, August 2004
[182] C. Leangsuksun, A Failure Predictive and Policy-Based High Availability Strategy for Linux
High Performance Computing Cluster, The 5th LCI International Conference on Linux Clusters:
The HPC Revolution 2004, Austin, USA, May 18-20, 2004
Glossary
The definitions of the terms appearing in this glossary are referenced from [183].
AAA Authentication, authorization, and accounting (AAA) is a term for a framework for
intelligently controlling access to computer resources, enforcing policies, auditing usage, and
providing the information necessary to bill for services. These combined processes are considered
important for effective network management and security. Authentication, authorization, and
accounting services are often provided by a dedicated AAA server, a program that performs these
functions. A current standard by which network access servers interface with the AAA server is the
Remote Authentication Dial-In User Service (RADIUS).
Active/active A redundancy configuration where all servers in the cluster run their own
applications but are also ready to take over for failed server if needed.
Active/standby A redundancy configuration where one server is running the application while
another server in the cluster is idle but ready to take over if needed.
Availability Availability is the amount of time that a system or service is provided in relation to
the amount of time the system or service is not provided. Availability is commonly expressed as a
percentage.
C++ C++ is an object-oriented programming language.
C C is a structured, procedural programming language that has been widely used for both
operating systems and applications and that has had a wide following in the academic community.
CGI The common gateway interface (CGI) is a standard way for a Web server to pass a
Web user's request to an application program and to receive data back to forward to the user.
Client/server Client/server describes the relationship between two computer programs in which one program, the client, makes a service request from another program, the server, which fulfills the request.
Cluster A cluster is a collection of cluster nodes that may change dynamically as nodes join or leave the cluster.
COTS Commercial off-the-shelf describes ready-made products that can easily be obtained.
Cluster Two or more computer nodes in a system used as a single computing entity to
provide a service or run an application for the purpose of high availability, scalability, and
distribution of tasks.
CMS Cluster Management System (CMS) is a management layer that allows the whole
cluster to be managed as a single entity.
DRAM Dynamic random access memory (DRAM) is the most common random access
memory (RAM) for personal computers and workstations.
DRBD Disk Replication Block Device
DNS The domain name system (DNS) is the way that Internet domain names are located and translated into IP addresses. A domain name is a meaningful and easy-to-remember "handle" for an Internet address.
DIMM A DIMM (dual in-line memory module) is a double SIMM (single in-line memory
module). Like a SIMM, it is a module containing one or several random access memory (RAM) chips
on a small circuit board with pins that connect it to the computer motherboard.
Failure The inability of a system or system component to perform a required function within
specified limits. A failure may be produced when a fault is encountered. Examples of failures include
invalid data being provided, slow response time, and the inability for a service to take a request.
Causes of failure can be hardware, firmware, software, network, or anything else that interrupts the
service.
FTP File Transfer Protocol (FTP) is a standard Internet protocol that defines one way of
exchanging files between computers on the Internet.
Gateways Gateways are bridges between two different technologies or administration domains.
A media gateway performs the critical function of converting voice messages from a native
telecommunications time-division-multiplexed network, to an Internet protocol packet-switched
network.
High Availability The state of a system having a very high ratio of service uptime compared to service downtime. Highly available systems are typically rated in terms of the number of nines, such as five nines or six nines.
HLR The Home Location Register (HLR) is the main database of permanent subscriber
information for a mobile network.
HTML Hypertext Markup Language (HTML) is the set of markup symbols or codes inserted
in a file intended for display on a World Wide Web browser page. The markup tells the web browser
how to display a web page's words and images for the user.
HTTP Hypertext Transfer Protocol (HTTP) is the set of rules for exchanging files (text,
graphic images, sound, video, and other multimedia files) on the World Wide Web.
IP The Internet Protocol (IP) is the method or protocol by which data is sent from one
computer to another on the Internet.
iptables iptables is a Linux command used to set up, maintain, and inspect the tables of IP packet filter rules in the Linux kernel. Several different tables may be defined; each table contains a number of built-in chains and may also contain user-defined chains. Each chain is a list of rules that can match a set of packets, and each rule specifies what to do with a packet that matches. This action is called a "target", which may be a jump to a user-defined chain in the same table.
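For illustration only (these commands are not taken from the thesis, and the back-end address is hypothetical), the following sketch shows how iptables rules might be added to the filter and nat tables:
    # Accept incoming HTTP traffic on the filter table's INPUT chain
    iptables -A INPUT -p tcp --dport 80 -j ACCEPT
    # Redirect incoming HTTP traffic to a back-end server (hypothetical address)
    iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination 192.168.1.10:80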
IPv6 Internet Protocol Version 6 (IPv6) is the latest version of the Internet Protocol. IPv6
is a set of specifications from the Internet Engineering Task Force (IETF) that was designed as an
evolutionary set of improvements to the current IP Version 4.
ISDN Integrated Services Digital Network (ISDN) is a set of standards for digital
transmission over ordinary telephone copper wire as well as over other media.
I/O I/O (input/output) describes any operation, program, or device that transfers data to or from a computer.
ISP Internet service provider (ISP) is a company that provides individuals and other
companies access to the Internet and other related services such as web site building and virtual
hosting.
LAN A local area network (LAN) is a group of computers and associated devices that
share a common communications line or wireless link and typically share the resources of a single
processor or server within a small geographic area.
MPP Massively Parallel Processing (MPP) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. Typically, MPP processors communicate using some messaging interface.
MP3 MP3 (MPEG-1 Audio Layer-3) is a standard technology and format for compressing a sound sequence into a very small file (about one-twelfth the size of the original file) while preserving the original level of sound quality when it is played.
MTTF Mean Time To Failure (MTTF) is the mean interval of time during which the system can provide service without failure.
MTTR Mean Time To Repair (MTTR) is the mean interval of time it takes to resume service after a failure has been experienced.
NAS Network-attached storage (NAS) is hard disk storage that is set up with its own
network address rather than being attached to the department computer that is serving applications to
a network's workstation users.
NAT NAT (Network Address Translation) is the translation of an Internet Protocol address
(IP address) used within one network to a different IP address known within another network. One
network is designated the inside network and the other is the outside.
Network A connection of nodes which facilitates communication among them. Usually, the connected nodes in a network use a well-defined network protocol to communicate with each other.
Network Protocols Rules for determining the format and transmission of data. Examples of network protocols include TCP/IP and UDP.
NIC A network interface card (NIC) is a computer circuit board or card that is installed in
a computer so that it can be connected to a network.
Node A single computer unit, in a network, that runs with one instance of a real or virtual operating system.
NTP Network Time Protocol (NTP) is a protocol that is used to synchronize computer
clock times in a network of computers.
OSI The Open Systems Interconnection (OSI) model defines a networking framework for implementing protocols in seven layers. Control is passed from one layer to the next, starting at the application layer in one station, proceeding to the bottom layer, over the channel to the next station, and back up the hierarchy.
Perl Perl is a script programming language that is similar in syntax to the C language and that includes a number of popular Unix facilities such as sed, awk, and tr.
PDA Personal digital assistant (PDA) is a term for any small mobile hand-held device that
provides computing and information storage and retrieval capabilities for personal or business use,
often for keeping schedule calendars and address book information handy.
Proxy Server A computer network service that allows clients to make indirect network connections
to other network services. A client connects to the proxy server, and then requests a connection, file,
or other resource available on a different server. The proxy provides the resource either by connecting
to the specified server or by serving it from a cache. In some cases, the proxy may alter the client's
request or the server's response for various purposes.
RAID Redundant array of independent disks (RAID) is a way of storing the same data in
different places (thus, redundantly) on multiple hard disks.
RTT Round-Trip Time (RTT) is the time required for a network communication to travel from the source to the destination and back. RTT is used by routing algorithms to aid in calculating optimal routes.
SAN Storage Area Network (SAN) is a high-speed special-purpose network (or sub-
network) that interconnects different kinds of data storage devices with associated data servers on
behalf of a larger network of users.
SCP A Service Control Point (SCP) is an entity in the intelligent network that implements the service control function, that is, operations that affect the recording, processing, transmission, or interpretation of data.
SCSI The Small Computer System Interface (SCSI) is a set of ANSI standard electronic interfaces that allow personal computers to communicate with peripheral hardware such as disk
drives, tape drives, CD-ROM drives, printers, and scanners faster and more flexibly than previous
interfaces.
Session A series of consecutive page requests to the web server from the same user.
Signaling Servers Signaling servers handle call control, session control, and radio resource control. A signaling server handles the routing and maintains the status of calls over the network. It takes the requests of user agents who want to connect to other user agents and routes them to the appropriate signaling server.
SLA Service Level Agreement (SLA) is a contract between a network service provider and
a customer that specifies, usually in measurable terms, what services the network service provider
will furnish.
SPOF Single point of failure (SPOF) is any component or communication path within a computer system whose failure would result in an interruption of the service.
SSI Single System Image (SSI) is a form of distributed computing in which multiple networks, distributed databases, or servers appear to the user as one system through a common interface. In SSI systems, all nodes share the operating system environment.
Standby Not currently providing service but prepared to take over the active state.
System A computer system that consists of one computer node or of many nodes connected via a computer network.
Switch-over The term switch-over is used to designate circumstances where the cluster moves the
active state of a particular component/node from one component/node to another, after the failure of
the active component/node. Switch-over operations are usually the consequence of administrative
operations or escalation of recovery procedures.
Tcl Tcl is an interpreted script language developed by Dr. John Ousterhout at the
University of California, Berkeley, and now developed and maintained by Sun Laboratories.
TCP TCP (Transmission Control Protocol) is a set of rules (protocol) used with the
Internet Protocol (IP) to send data in the form of message units between computers over the Internet.
While IP takes care of handling the actual delivery of the data, TCP takes care of keeping track of the
individual units of data (called packets) that a message is divided into for efficient routing through the
Internet.
TFTP Trivial File Transfer Protocol (TFTP) is an Internet software utility for transferring
files that is simpler to use than the File Transfer Protocol (FTP) but less capable. It is used where user
authentication and directory visibility are not required.
TTL Time-to-live (TTL) is a value in an Internet Protocol (IP) packet that tells a network
router whether the packet has been in the network too long and should be discarded.
QoS Quality of Service (QoS) is the idea that transmission rates, error rates, and other
characteristics can be measured, improved, and, to some extent, guaranteed in advance.
URI To paraphrase the World Wide Web Consortium, Internet space is inhabited by many
points of content. A URI (Uniform Resource Identifier; pronounced YEW-AHR-EYE) is the way you
identify any of those points of content, whether it be a page of text, a video or sound clip, a still or
animated image, or a program. The most common form of URI is the web page address, which is a
particular form or subset of URI called a Uniform Resource Locator (URL).
USB USB (Universal Serial Bus) is a plug-and-play interface between a computer and
add-on devices (such as audio players, joysticks, keyboards, telephones, scanners, and printers). With
USB, a new device can be added to your computer without having to add an adapter card or even
having to turn the computer off.
User An external entity that acquires service from a computer system. It can be a human
being, an external device, or another computer system.
Web Service Web services are loosely coupled software components delivered over Internet
standard technologies. A web service can also be defined as a self-contained, modular application that
can be described, published, located, and invoked over the web.