Making Applications Scalable With Load Balancing
September 2006
Willy Tarreau
[email protected]
Revision 1.0
http://1wt.eu/articles/2006_lb/
Introduction
Originally, the Web consisted mostly of static content, quickly delivered to a few users who spent
most of their time reading between rare clicks. Now we see real applications which hold users
for tens of minutes or hours, with little content to read between clicks and a lot of work
performed on the servers.
Users often visit the same sites, which they know perfectly and don't spend much time
reading. They expect immediate delivery, while every single click unconsciously inflicts a huge
load on the servers.
This new dynamic has created a need for high performance and permanent availability.
The easiest way to perform load balancing is to dedicate servers to predefined groups of users.
This is easy on intranet servers, but not really on Internet servers. One common approach relies
on DNS round robin: if a DNS server has several entries for a given hostname, it will return all
of them in a rotating order.
This way, various users will see different addresses for the same name and will reach
different servers. This is very commonly used for multi-site load balancing, but it
requires that the application not be impacted by the lack of server context. For this reason, it
is generally used by search engines, POP servers, or to deliver static content.
This method does not provide any means of ensuring availability: it requires additional measures
to permanently check each server's status and to switch a failed server's IP to another server.
For this reason, it is generally used as a complementary solution, not as a primary one.
$ host -t a google.com
google.com. has address 64.233.167.99
google.com. has address 64.233.187.99
google.com. has address 72.14.207.99
$ host -t a google.com
google.com. has address 72.14.207.99
google.com. has address 64.233.167.99
google.com. has address 64.233.187.99
$ host -t a mail.aol.com
mail.aol.com. has address 205.188.162.57
mail.aol.com. has address 205.188.212.249
mail.aol.com. has address 64.12.136.57
mail.aol.com. has address 64.12.136.89
mail.aol.com. has address 64.12.168.119
mail.aol.com. has address 64.12.193.249
mail.aol.com. has address 152.163.179.57
mail.aol.com. has address 152.163.211.185
mail.aol.com. has address 152.163.211.250
mail.aol.com. has address 205.188.149.57
mail.aol.com. has address 205.188.160.249
Fig.1 : DNS round robin : multiple addresses assigned to the same name, returned in a rotating order
A more common approach is to split the population of users across multiple servers. This
involves a load balancer between the users and the servers. It can be a hardware
appliance, or software installed either on a dedicated front server or on the application
servers themselves. Note that deploying new components increases the risk of failure, so a
common practice is to have a second load balancer acting as a backup for the primary one.
Generally, a hardware load balancer works at the network packet level and acts on
routing, using one of the following methods :
• Direct routing : the load balancer routes the same service address to different
local, physical servers, which must be on the same network segment and must all share
the same service address. This has the huge advantage of not modifying anything at the IP
level, so the servers can reply directly to the users without passing through
the load balancer again. This is called "direct server return". As the processing power needed
for this method is minimal, it is often the one used on the front servers of very high
traffic sites. On the other hand, it requires solid knowledge of the TCP/IP model
to correctly configure all the equipment, including the servers.
• Tunnelling : this works exactly like direct routing, except that tunnels are established
between the load balancer and the servers, so the servers can be located on remote
networks. Direct server return is still possible.
• IP address translation (NAT) : the user connects to a virtual destination address, which
the load balancer translates to one of the servers' addresses. This is easier to deploy at
first glance, because there is less trouble in the server configuration, but it imposes
stricter programming rules. One common error is application servers indicating their
internal addresses in some responses. The load balancer also has more work to do: it has
to translate addresses back and forth and maintain a session table, and the return
traffic must pass through it as well.
Sometimes, too-short session timeouts on the load balancer induce side effects known
as ACK storms. In this case, the only solution is to increase the timeouts, at the risk of
saturating the load balancer's session table.
On the opposite side, we find software load balancers. They most often act like reverse
proxies, pretending to be the server and forwarding the traffic to the real ones. This implies
that the servers themselves cannot be reached directly by the users, and that some protocols
might never get load-balanced.
They need more processing power than solutions acting at the network level, but since they
splice the communication between the users and the servers, they provide a first level of
security by only forwarding what they understand. This is why we often find URL filtering
capabilities in those products.
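To make the reverse-proxy principle concrete, here is a minimal sketch of a proxy-based load
balancer in Python. It is only an illustration under assumed addresses and ports: it accepts
client connections on a front port, picks a backend in round-robin order, and relays bytes both
ways. A real product obviously does far more (health checks, persistence, protocol parsing).

import socket
import threading
from itertools import cycle

# Hypothetical backend farm; the addresses are examples only.
BACKENDS = cycle([("10.0.0.1", 80), ("10.0.0.2", 80)])

def relay(src, dst):
    # Copy bytes one way until the connection closes.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        src.close()
        dst.close()

listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 8080))
listener.listen(128)
while True:
    client, _ = listener.accept()
    server = socket.create_connection(next(BACKENDS))
    threading.Thread(target=relay, args=(client, server)).start()
    threading.Thread(target=relay, args=(server, client)).start()

Note that the users only ever talk to the proxy, never to the servers, which is exactly why some
protocols cannot be balanced this way.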
To select a server, a load balancer must know which ones are available. For this, it will
periodically send them pings, connection attempts, requests, or anything the administrator
considers a valid measure of their state. These tests are called "health checks".
A crashed server might respond to pings but not to TCP connections, and a hung server might
accept TCP connections but not answer HTTP requests. When a multi-layer Web application
is involved, some HTTP requests will return instant responses while others will fail.
So there is real value in choosing the most representative tests permitted by the
application and the load balancer. Some tests might retrieve data from a database to ensure
the whole chain is valid. The drawback is that those tests consume a certain amount of
server resources (CPU, threads, etc.).
They have to be spaced out enough in time to avoid loading the servers too much, but still be
close enough to quickly detect a dead server. Health checking is one of the most complicated
aspects of load balancing, and it is very common that after a few attempts, the application
developers finally implement a special request dedicated to the load balancer, which performs
a number of representative internal tests.
In this matter, software load balancers are by far the most flexible, because they often
provide scripting capabilities, and if a check requires code modifications, the software's
vendor can generally provide them within a short timeframe.
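As an illustration of such checks, here is a minimal sketch of a health-check loop. The "/alive"
URI, the port and the timing values are assumptions; the principle is simply a periodic request
whose success or failure flips the server's state.

import http.client
import time

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical farm
CHECK_URI = "/alive"    # assumed check page provided by the developers
INTERVAL = 2.0          # seconds between two checks of the same server

state = {s: True for s in SERVERS}

while True:
    for server in SERVERS:
        try:
            conn = http.client.HTTPConnection(server, 80, timeout=1)
            conn.request("GET", CHECK_URI)
            ok = conn.getresponse().status == 200
            conn.close()
        except OSError:
            ok = False     # connection refused or timed out
        if state[server] != ok:
            print(server, "is now", "UP" if ok else "DOWN")
        state[server] = ok
    time.sleep(INTERVAL)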
There are many ways for the load balancer to distribute the load. A common misconception is
to expect it to send a request to "the first server to respond".
This practice is wrong, because if a server has any reason to reply even slightly faster, it will
unbalance the farm by attracting most of the requests.
A second common idea is to send requests to "the least loaded server". Although this is useful in
environments involving very long sessions, it is not well suited to web servers, where the load
can vary by two-digit factors within seconds.
For homogeneous server farms, the "round robin" method is most often the best one:
it uses every server in turn. If the servers have unequal capacities, a "weighted round robin"
algorithm will assign traffic to the servers according to their configured relative capacities.
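A minimal sketch of weighted round robin, with made-up server names and weights: a weight-3
server receives three times the traffic of a weight-1 server.

from itertools import cycle

FARM = [("web1", 3), ("web2", 1)]    # hypothetical (server, weight) pairs

# Naive version: repeat each server as many times as its weight.
schedule = cycle([name for name, weight in FARM for _ in range(weight)])

for _ in range(8):
    print(next(schedule))    # web1, web1, web1, web2, web1, web1, web1, web2

Real implementations interleave the choices rather than sending bursts to the same server, but
the resulting distribution over time is the same.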
Round robin algorithms present a drawback: they are non-deterministic. This means that two
consecutive requests from the same user have a high chance of reaching two different
servers. If user contexts are stored on the application servers, they will be lost between two
consecutive requests. And when a complex session setup is involved (e.g. SSL key negotiation),
it will have to be performed again at each connection.
To work around this problem, a very simple algorithm is often used: address hashing. To
summarize, the user's IP address is hashed, and the remainder of its division by the number of
servers designates the server for this particular user. This works well as long as the number of
servers does not change and the user's IP address is stable, which is not always the case.
In the event of any server failure, all users are rebalanced and lose their sessions. And the few
percent of users browsing through proxy farms (usually 5-10%), whose visible address can change
from one request to the next, will not be able to use the application either. This method is not
always applicable anyway, because a good distribution requires a high number of source IP
addresses. This is true on the Internet, but not always within small companies or even ISPs'
infrastructures. It is, however, very efficient at avoiding too-frequent SSL session key
renegotiations.
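A sketch of address hashing with example addresses: the source address is reduced to an
integer, and the remainder of its division by the farm size designates the server. Note how
dropping one server remaps almost every user.

import ipaddress

SERVERS = ["web1", "web2", "web3"]    # hypothetical farm

def pick(client_ip, servers):
    # Keep the remainder of the division by the number of servers.
    return servers[int(ipaddress.ip_address(client_ip)) % len(servers)]

print(pick("192.0.2.41", SERVERS))        # always the same server...
print(pick("192.0.2.41", SERVERS[:2]))    # ...until the farm size changes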
COOKIE LEARNING
Cookie learning is the least intrusive solution. The load balancer is configured to learn the
application's cookie (e.g. "JSESSIONID"). When it receives the user's request, it checks whether
the request contains this cookie with a known value. If not, it directs the request to any
server, according to the load balancing algorithm, then extracts the cookie value from
the response and adds it, along with the server's identifier, to a local table. When the user
comes back, the load balancer sees the cookie, looks up the table and finds the server, to which
it forwards the request (a short sketch follows the list below). Although easily deployed, this
method has two minor drawbacks implied by its learning nature :
• The load balancer has finite memory, so it might saturate, and the only way to
avoid this is to limit the cookie's lifetime in the table. This implies that a user who comes
back after the cookie has expired will be directed to the wrong server.
• If the load balancer dies and its backup takes over, the backup will not know any of the
associations and will again direct users to the wrong servers. It is not possible to use such
load balancers in active-active combinations either. A solution to this is real-time
session-table synchronization, which is difficult to guarantee.
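Here is the short sketch announced above: the learning table in a few lines of Python. The
cookie name, the timeout and the two helper entry points are assumptions for the example.

import time

COOKIE = "JSESSIONID"    # assumed application cookie
TIMEOUT = 3600.0         # assumed lifetime of a table entry, in seconds
table = {}               # cookie value -> (server, expiry date)

def server_for(request_cookies, pick_any):
    # Request path: stick known users to their server.
    entry = table.get(request_cookies.get(COOKIE))
    if entry and entry[1] > time.time():
        return entry[0]
    return pick_any()    # unknown user: apply the balancing algorithm

def learn(response_cookies, server):
    # Response path: remember which server issued the cookie.
    value = response_cookies.get(COOKIE)
    if value:
        table[value] = (server, time.time() + TIMEOUT)

Both drawbacks are visible here: the table only shrinks through expiration, and it lives in this
process's memory alone, so a backup load balancer knows nothing of it.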
COOKIE INSERTION
If the users support cookies, why not add another one with a fixed value? This method, called
"cookie insertion", consists in inserting the server's identifier in a cookie added to the response.
This way, the load balancer has nothing to learn: it simply reuses the value presented by
the user to select the right server.
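A sketch of both sides of cookie insertion; the cookie name "SERVERID" is a common convention
but an assumption here:

# Response path: the load balancer appends the chosen server's identifier.
def insert_cookie(response_headers, server_id):
    response_headers.append(("Set-Cookie", "SERVERID=" + server_id))

# Request path: nothing was learned, the value is simply read back.
def server_from(cookie_header):
    for item in cookie_header.split(";"):
        name, _, value = item.strip().partition("=")
        if name == "SERVERID":
            return value
    return None    # no cookie yet: fall back to the balancing algorithm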
We have reviewed methods involving snooping on the user-server exchanges, and sometimes even
modifying them. But with the ever-growing number of applications relying on SSL for their
security, the load balancer will not always be able to access the HTTP contents. For this reason,
we see more and more software load balancers providing support for SSL.
The principle is rather simple : the load balancer acts as a reverse proxy and plays the SSL
server end. It holds the servers' certificates, deciphers the request, accesses the contents and
directs the request to the servers (either in clear text or lightly re-encrypted).
This gives a new performance boost to the application servers, which no longer have to manage
SSL at all.
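The principle fits in a short sketch: terminate SSL with the site's certificate, then relay
plain text to a backend. The file names and addresses are assumptions, and the two-way relay is
the same as in the earlier proxy sketch.

import socket
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain("site.pem", "site.key")   # assumed certificate files

listener = socket.socket()
listener.bind(("0.0.0.0", 443))
listener.listen(128)

while True:
    raw, _ = listener.accept()
    client = ctx.wrap_socket(raw, server_side=True)        # decipher here
    backend = socket.create_connection(("10.0.0.10", 80))  # clear text
    # ... relay bytes both ways as in the earlier proxy sketch ...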
However, the load balancer then becomes the bottleneck : a single software-based load balancer
will not process SSL faster than an eight-server farm. Because of this architectural error,
the load balancer will saturate before the application servers, and the only remedy will be to
put another level of load balancers in front of it and add more load balancers to handle the
SSL load. Obviously, this is wrong: maintaining a farm of load balancers is very difficult, and
the health checks sent by all those load balancers will induce a noticeable load on the servers.
The right solution is then to have a dedicated SSL farm.
The most scalable solution when applications require SSL is to dedicate a farm of reverse
proxies to this task alone. This solution has only advantages :
1. SSL reverse proxies are very cheap yet powerful. Anyone can build a powerful Apache-
based SSL reverse proxy for less than $1000. Compare this to the cost of an SSL-enabled
load balancer of the same performance (more than $15000), and to the cost of the
multi-processor servers used for the application, which no longer have to be wasted on
SSL.
2. SSL reverse proxies almost always provide caching, and sometimes even compression.
Most e-business applications present a cacheability rate between 80 and 90%, which
means that these reverse proxy caches will offload at least 80% of the requests from
the servers.
3. Adding to this the fact that SSL proxies can be shared between multiple applications,
the overall gain reduces the need for big and numerous application servers, which
lowers maintenance and licensing costs too.
4. The SSL farm can grow as the traffic increases, without requiring load balancer upgrades
or replacements. Load balancers' capacities are often well above most sites'
requirements, provided that the architecture is properly designed.
5. The first level of proxies provides a very good level of security by filtering invalid
requests, and often by offering the ability to add URL filters for known attacks.
6. It is very easy to replace the product used in the SSL farm when very specific needs
are identified (e.g. strong authentication). Conversely, replacing an SSL-enabled
load balancer for this purpose might have terrible impacts on the application's behaviour,
because of different health-check methods, load balancing algorithms and means of
persistence.
7. The load balancer can be chosen based only on its load balancing capabilities and not on
its SSL performance. It will always be better suited and far cheaper than an all-in-one
solution.
An SSL-proxy-cache farm consists of a set of identical servers, almost always running some
flavour of Apache + mod_ssl (even in disguised commercial solutions), which are load-balanced
either by an external network load balancer or by an internal software load balancer such as
LVS under Linux.
These servers require nearly no maintenance, as their only tasks are to turn the HTTPS traffic
into HTTP, check the cache, and forward the non-cacheable requests to the application server
farm, represented by the load balancer and the application servers. Interestingly, the
same load balancer may be shared between the reverse proxies and the application servers.
A load balancer is generally evaluated on three criteria :
• performance
• reliability
• features
Performance is critical, because when the load balancer becomes the bottleneck, nothing can
be done. Reliability is very important because the load balancer takes all the traffic, so its
reliability must be orders of magnitude above the servers', in terms of availability as well as in
terms of processing quality. Features are decisive when choosing a solution, but not critical;
it depends on what tradeoffs are acceptable regarding changes to the application servers.
One of the most common questions when comparing hardware-based and proxy-based load
balancers is why there is such a gap between their session counts: while proxies generally
announce thousands of concurrent sessions, hardware vendors speak in millions.
Obviously, few people have 20000 Apache servers to drain 4 million concurrent sessions! In
fact, it depends on whether the load balancer has to manage TCP or not. TCP requires that once
a session terminates, it stays in the table in TIME_WAIT state long enough to catch late
retransmits, which can be seen several minutes after the session has been closed. Only after
that delay is the session removed.
In practice, delays between 15 and 60 seconds are encountered, depending on the implementation.
This causes a real problem on systems supporting high session rates, because all terminated
sessions have to be kept in the table: at 25000 sessions per second, a 60-second delay alone
would consume 1.5 million entries.
Fortunately, sessions in this state no longer carry any data and are very cheap. Since they are
transparently handled by the OS, a proxy never sees them and only announces the number of
active sessions it supports. But a load balancer which manages TCP itself must dimension its
session table to hold those terminated sessions as well, and so it announces the global limit.
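The arithmetic above as a two-line sanity check:

# 25000 new sessions per second, each kept 60 s in TIME_WAIT after closing:
rate, time_wait = 25000, 60
print(rate * time_wait)    # 1500000 table entries holding dead sessions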
Fig.2 : sessions seen by a proxy : only the few active sessions are visible (0.1 s / 20 ms = 5 active sessions), while recently terminated sessions accumulate in TIME_WAIT inside the OS
Obviously, the best performance is reached by adding the lowest overhead, which means
processing the packets at the network level. For this reason, network-based load balancers can
reach very high bandwidths when it comes to layer-3 load balancing (e.g. direct routing with a
hashing algorithm). As layer 3 is almost never enough, a network load balancer has to
consider layer 4. Most of the cost lies in session setup and teardown, which can still be
performed at very high rates because they are cheap operations.
More complex operations such as address translation require more processing power, to
recompute checksums and perform table lookups. Hardware acceleration provides performance
boosts everywhere, and load balancing is no exception :
a dedicated ASIC can route network packets at wire speed and set up and tear down sessions at
very high rates, without the overhead imposed by a general-purpose operating system.
Conversely, using a software proxy for layer-4 processing leads to very limited performance,
because for simple tasks that could be performed at the packet level, the system has to
decode packets, peel off layers to find the data, allocate buffer memory to queue them,
pass them to the process, establish connections to the remote servers, manage system resources,
and so on. For this reason, a proxy-based layer-4 load balancer will be 5 to 10 times slower
than its network-based counterpart on the same hardware.
Memory is also a limited resource : a network-based load balancer needs memory to store in-
flight packets and sessions (which are basically addresses, protocols, ports and a few
parameters, globally requiring a few hundred bytes per session). A proxy-based load
balancer needs buffers to communicate with the system; per-session system buffers are
measured in kilobytes, which implies practical limits around a few tens of thousands of active
sessions.
Network-based load balancers also permit a lot of useful extras, such as virtual MAC
addresses or mappings of entire hosts or networks, which are not possible with standard
proxies relying on general-purpose operating systems.
One note on reliability, though. Network-based load balancers can do nasty things such as
forwarding invalid packets just as they received them, or confusing sessions (especially when
doing NAT). We talked earlier about the ACK storms which sometimes happen when doing NAT
with too-low session timeouts. They are caused by the too-early reuse of recently terminated
sessions, and cause network saturation between a given user and a server. This cannot
happen with proxies, because they rely on the standards-compliant TCP stack offered by the OS,
and they completely cut the session between the client and the server.
Overall, a well tuned network-based load balancer generally provides the best layer-3/4
solution in terms of performance and features.
Layer-7 load balancing involves cookie-based persistence, URL switching and other such useful
features. While a proxy relying on a general-purpose TCP/IP stack obviously has no problem
doing this job, network-based load balancers have to resort to a lot of tricks which don't
always play well with standards and often cause various problems. The most difficult problems
to solve are :
1. multi-packet headers : the load balancer looks for strings within packets. When the
awaited data is not in the first packet, the earlier packets have to be memorized, which
consumes memory. And when the string starts at the end of one packet and continues in the
next one, it is very difficult to extract, even though this is the normal mode of operation
for TCP streams. For reference, an 8 kB request, as supported by Apache, will commonly
span 6 packets (see the sketch after this list).
Fig.3 : a multi-packet header : one packet ends with "Cookie: profile=1; PHPSE" and the next one starts with "SSIONID=0123456789abcdef", so the cookie name spans two TCP packets
2. reverse-ordered packets : when a large and a small packet are sent over the Internet,
it is very common for the small one to reach the target before the large one. The load
balancer must handle this gracefully.
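The sketch announced in problem 1 above: per-packet string matching misses a header split
across two segments, so the byte stream has to be reassembled first. The two fragments below
are the ones from the figure.

packets = [b"Cookie: profile=1; PHPSE", b"SSIONID=0123456789abcdef\r\n"]

# Searching packet by packet never sees the full cookie name:
print(any(b"PHPSESSIONID" in p for p in packets))    # False

# A proxy reassembles the stream first, then searches:
print(b"PHPSESSIONID" in b"".join(packets))          # True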
The best combination is to use a first level of network-based load balancers (possibly
hardware-accelerated) to perform layer-4 balancing across both the SSL reverse proxies and a
second level of proxy-based layer-7 load balancers. First, the checks performed by the layer-4
load balancer on the layer-7 load balancers will always be more reliable than any self-analysis
performed by the proxies. Second, this provides the best overall scalability, because when the
proxies saturate, more can always be added, until the layer-4 load balancer saturates in turn.
At that stage, we are talking about filling multi-gigabit pipes, and it will then be time to use
the DNS round robin techniques described above to present multiple layer-4 load balancers,
preferably spread over multiple sites.
Fig.4 : overall architecture : Internet access (redundant using BGP, multiple sites possible with DNS), a scalable front end of layer-4 load balancers, a web acceleration and security layer (SSL reverse proxies and caches), and the application servers
Serving static content separately might sound obvious, but it is still rarely done. In many
applications, about 25% of the requests are for dynamic objects, the remaining 75% being for
static objects. Each Apache+PHP process on the application server consumes between 15 and
50 MB of memory depending on the application. It is absolutely abnormal to dedicate such a
monster to transferring a small icon to a user over the Internet, with all the latencies and
lost packets keeping it busy even longer. And it is even worse when this process is monopolized
for several minutes because the user is downloading a large file such as a PDF!
The easiest solution consists in placing a reverse proxy cache in front of the server farm,
which directly returns cacheable content to the users without querying the application
servers. The cleanest solution consists in dedicating a lightweight HTTP server to static
content. It can be installed on the same servers and run on another port, for instance.
Preferably, a single-threaded server such as Lighttpd or thttpd should be used, because their
per-session overhead is very low. The application then only has to classify static content
under an easily parsable directory such as "/static", so that the front load balancer can direct
that traffic to the dedicated servers. It is also possible to use a completely different host
name for the static servers, which makes it possible to install them at different locations,
sometimes closer to the users.
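A sketch of the front-end decision just described; the "/static" prefix and the pool names are
assumptions:

STATIC_POOL = ["thttpd1", "thttpd2"]    # lightweight static servers
APP_POOL = ["app1", "app2"]             # heavy Apache+PHP servers

def pool_for(uri):
    # Route by the easily parsable prefix agreed with the application.
    return STATIC_POOL if uri.startswith("/static/") else APP_POOL

print(pool_for("/static/logo.png"))    # served by the cheap farm
print(pool_for("/cart/checkout"))      # served by the application farm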
There are often a few tricks that can be performed on the servers which will dramatically
improve their user capacity. With the tricks explained below, it is fairly common to get a two-
to three-fold increase in the number of simultaneous users on an Apache + PHP server without
requiring any hardware upgrade.
First, disable keep-alive. This is the nastiest thing for performance. It was designed at a
time when sites were running NCSA httpd, forked off inetd at every request. All those forks were
killing the servers, and keep-alive was a neat solution to that. Things have changed now: the
servers do not fork at each connection and the cost of a new connection is minimal. Application
servers run a limited number of threads or processes, because of memory constraints, file
descriptor limits or locking overhead. Having a user monopolize a thread for seconds or even
minutes while doing nothing is pure waste.
The server will not use all of its CPU power, will consume insane amounts of memory, and users
will wait for a connection to become free. If the keep-alive time is too short to maintain the
session between two clicks, it is useless. If it is long enough, then it means the servers need
roughly one process per simultaneous user, not counting the fact that most browsers commonly
establish 4 simultaneous connections!
Simply speaking, a site running keep-alive on an Apache-like server has no chance of ever
serving more than a few hundred users at a time.
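A back-of-the-envelope check of that claim, with assumed numbers:

# Assumptions: 256 Apache processes, and browsers holding 4 parallel
# keep-alive connections per user between clicks.
processes, conns_per_user = 256, 4
print(processes // conns_per_user)    # only 64 users can stay connected at once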
http://www.zeus.com/news/pdf/white_papers/7_myths.pdf
http://www.foundrynet.com/pdf/wp-server-load-bal-web-enterprise.pdf