Capacity Planning for IT
Setting Goals for Capacity
Understand site’s requirements
• For example, serve your pages in less than three seconds
• Helps in determining the number of servers to meet this requirement
• Establishing acceptable speed and reliability is a huge undertaking
• But it pays off in the long run, both for growth and as a standard to maintain
• Requirements differ among stakeholders:
• Managers
• End-users
• Clients
Different types of requirements and
measurements
• Performance
• External service monitoring
• Business requirements
• User expectations
• Capacity
• System metrics
• Resource ceilings
External Monitoring – Interpreting formal
measurements
• The site should be reachable from within your facility as well as by real
visitors on other continents with limited bandwidth
• Can take the help of performance monitoring services:
• http://keynote.com
• http://gomez.com
• They provide a handy dashboard showing performance and uptime as they
appear from remote locations
• Deemed to be objective, and hence used to draw up SLAs
• Other less expensive services: pingdom.com, siteuptime.com,
alertra.com
What should they measure?
• Are they simulating human users?
• Are they caching web pages like normal web browser would?
• Can they determine how much time is spent due to network transfer vs
server time?
• Whether a failure or unexpected wait time is due to geographical network
issues or to measurement failures?
• These measurements may find their way onto an executive dashboard somewhere
• Double-edged sword: they can be used to ask for more resources, but you must
also be prepared to explain the reason for any failure
Service Level Agreement
• A metric to define how service should operate within agreed-upon
boundaries.
• Establishes a schedule of credits for meeting goals or possible
penalties if the service fails to achieve them
• Some SLAs guarantee availability of the service as a percentage, such as
99.99%
• That means the service may be unavailable 0.01% of the time and still be within the SLA
• Other SLAs establish limits on demand, such as storage and upload limits
Example
• Acme Hosting, Inc. will use commercially reasonable efforts to make
the SuperHostingPlan available with a Monthly uptime percentage
(defined below) of at least 99.9% during any monthly billing cycle. In
the event Acme Hosting, Inc. does not meet this commitment, you will
be eligible to receive a service credit as described below.
That is not as great a number as it sounds:
It means the service can go down for 43.2 minutes every month (0.1% of a 30-day month) without penalty.
If the site generates $3,000 worth of sales every minute, we lose approx. $0.13
million (43.2 × $3,000 = $129,600)
SLA percentage and acceptable downtime
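The relationship between SLA percentage and acceptable downtime can be sketched in a few lines of Python. This is a minimal illustration, assuming a 30-day month; the function names are made up for this example, and the $3,000 per minute figure is the one used above.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 30) -> float:
    """Minutes of downtime permitted per period while still meeting the SLA."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def downtime_cost(sla_percent: float, revenue_per_minute: float,
                  period_days: float = 30) -> float:
    """Revenue lost if all of the permitted downtime is actually used."""
    return allowed_downtime_minutes(sla_percent, period_days) * revenue_per_minute


if __name__ == "__main__":
    # Print a small table for some common SLA levels (30-day month assumed).
    for sla in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(sla)
        cost = downtime_cost(sla, revenue_per_minute=3000)
        print(f"{sla:>7}%  {minutes:9.2f} min/month  ${cost:,.0f} at risk")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why the difference between 99.9% and 99.99% matters so much for a revenue-generating site.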
Business Capacity Requirements
• Web services, such as an API for other application developers
• Revenue scheme for unfettered access
• For example, an API providing postal codes:
• One API call per minute for a non-commercial or regular user
• Contract with a shipping company for 10 calls per second
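One common way to enforce per-client limits like these is a token bucket. The sketch below is only an illustration, not how any particular API provider does it; the class name, tier names, and the choice of a token bucket are all assumptions made for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow about `rate` calls per second, with bursts of at most `capacity` calls."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if a call may proceed now, consuming one token."""
        now = self.clock()
        # Refill according to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical tiers matching the postal-code example above.
regular_user = TokenBucket(rate=1 / 60, capacity=1)    # one call per minute
shipping_company = TokenBucket(rate=10, capacity=10)   # ten calls per second
```

Knowing the contracted rate for each tier also gives the capacity planner an upper bound on the load each customer can generate.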
Capacity is just one part of a smooth user
experience
• Plenty of capacity, but page loading may still be slow.
• More to do with the construction of web page than capacity.
• Page may be heavy.
• Upon analysis, the problem can be solved either:
• By increasing capacity
• By reducing page weight
• The first solution proves to be the more expensive one
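A rough calculation shows why page weight matters regardless of server capacity. The sketch below considers only raw transfer time over the visitor's link and ignores latency, rendering, and parallel downloads; the page sizes and bandwidth are made-up figures.

```python
def transfer_seconds(page_bytes: int, bandwidth_bits_per_sec: float) -> float:
    """Time to move a page over a link, ignoring latency and protocol overhead."""
    return page_bytes * 8 / bandwidth_bits_per_sec


if __name__ == "__main__":
    link = 2_000_000  # assumed 2 Mbit/s visitor connection
    print("heavy page (3 MB):  ", transfer_seconds(3_000_000, link), "s")
    print("light page (500 KB):", transfer_seconds(500_000, link), "s")
```

Under these assumptions the heavy page cannot meet a three-second target no matter how many servers are added, while trimming the page brings it within range.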
Architecture Decisions
• Basic layout of how all the backend pieces – both hardware and software –
are joined
• Its design plays a crucial role in your ability to plan and manage capacity.
• Architecture affects nearly every part of performance, reliability, and
management.
• Establishing good architecture almost always translates to easier effort
when planning for capacity.
A simple, single-server web application
architecture
• To get the most bang for our buck, we have our web server and
our database residing on the same hardware server.
• You have to configure measurements for both system and
application-level statistics for your server
• But you can’t easily distinguish which system statistics correspond
to which piece of the architecture
• So you can’t answer basic questions that are likely to arise, such as:
• Is the disk utilization the result of the web server sending
out a lot of static content from the disk, or rather, the
database’s queries being disk-bound?
• How much of the filesystem cache, CPU, memory, and disk
utilization is being consumed by the web server, and how
much is being used for the database?
• Most of the time the processes don’t contend with each other, but this
becomes difficult as usage continues to grow.
Separation of web server and database
• Splitting the nodes in this fashion makes it easier to understand the
capacity demands, as the resources on each server are now dedicated
to each piece of the architecture.
• This division of labor also produces performance gains, such as
preventing frontend client-side traffic from interfering with database traffic
• If we’re recording system and application-level statistics, we can
quantify what each unit of capacity means in terms of usage.
• We can answer a few questions that we couldn’t before, such as:
Database server
How do increases in database queries-per-second affect the following?
• Disk utilization • I/O Wait • RAM usage • CPU usage
Web server
How do increases in web server requests-per-second affect the following?
• Disk utilization • I/O Wait • RAM usage • CPU usage
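One way to answer such questions is to fit a straight line through recorded measurements. The sketch below uses ordinary least squares on made-up samples of database queries-per-second versus CPU usage; real data will be noisier and may not be linear.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical samples: queries per second and the CPU usage (%) observed.
qps = [100, 200, 300, 400, 500]
cpu_percent = [12.0, 22.0, 32.0, 42.0, 52.0]

slope, intercept = fit_line(qps, cpu_percent)

if __name__ == "__main__":
    print(f"Each extra query/sec costs about {slope:.2f}% CPU "
          f"(baseline {intercept:.1f}%)")
```

The same fit can be repeated for disk utilization, I/O wait, and RAM usage, for both the database server and the web server.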
Providing scaling points
• With an idea of what this simple architecture requires, we can
estimate the hardware configuration
• For example, if the majority of operations are disk-bound, there is
no point in investing more in high-end CPUs.
• Instead, we spend money on more disk spindles and memory to help
with filesystem performance and caching.
• We have different configurations for our image serving machines, our
web servers, and our image processing machines
Resource Ceiling
• Is what drives capacity forecasting
• Question is - when will the database or web server die?
• Each server in our example possesses a finite amount of the following hardware
resources:
• Disk throughput; Disk storage; CPU; RAM; Network
• High loads will bump against the limits of one or more of those resources.
• Somewhere just below that critical level is where you’ll want to determine your
ceiling for each piece of your architecture.
• A ceiling is the critical level of a particular resource (or resources) that cannot be
crossed without failure.
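Once a ceiling is known, a first-cut forecast is simple arithmetic. The sketch below assumes linear growth and a fixed headroom below the hard limit; both are simplifying assumptions, and all figures are made up.

```python
import math


def planning_ceiling(hard_limit: float, headroom: float = 0.10) -> float:
    """Set the working ceiling a safety margin below the point of failure."""
    return hard_limit * (1 - headroom)


def days_until_ceiling(current: float, ceiling: float,
                       growth_per_day: float) -> float:
    """Days until usage reaches the ceiling, assuming linear growth."""
    if current >= ceiling:
        return 0.0            # already at or past the ceiling
    if growth_per_day <= 0:
        return math.inf       # usage is flat or shrinking
    return (ceiling - current) / growth_per_day


if __name__ == "__main__":
    # Hypothetical disk: fails at 1000 GB, 600 GB used, growing 5 GB per day.
    ceiling = planning_ceiling(1000, headroom=0.10)
    print(f"Ceiling {ceiling:.0f} GB reached in "
          f"{days_until_ceiling(600, ceiling, 5):.0f} days")
```

The same calculation applies to any of the resources listed above, as long as usage is measured in the same unit as the ceiling.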
“small pieces, loosely joined”
Vertical, horizontal and diagonal scaling
• Choosing the right hardware for each component of your architecture
can greatly affect costs.
• Before perusing your vendor’s current pricing, be aware of what you’re
trying to achieve.
• Will this server be required to do a lot of CPU work?
• Will it need to perform a lot of memory work?
• Is it a network-bound gateway?
• Being able to scale horizontally means having an architecture that
allows for adding capacity by simply adding similarly functioning nodes
to the existing infrastructure.
• For instance, a second web server to share the burden of website visits.
Vertical, horizontal and diagonal scaling (continued)
• Being able to scale vertically is the capability of adding capacity by
increasing the resources internal to a server, such as CPU, memory,
disk, and network.
• Horizontal vs Vertical scaling
• The danger of relying solely on vertical scaling is, as you continue to upgrade
components of a single computer, the cost rises dramatically. You also
introduce the risk of a single point of failure (SPOF).
• Horizontal scaling involves the more complex issue of increasing the potential
failure points as you expand the size of the server farm. In addition, there are
challenges surrounding any synchronization you’ll need between the nodes.
Diagonal Scaling
• is the process of vertically scaling the horizontally scaled nodes you
already have in your infrastructure
• Over time, CPU power and RAM become faster, cheaper, and cooler,
and disk storage becomes larger and less expensive, so it can be cost
effective to keep some vertical scaling as part of your plan, but
applied to horizontal nodes.
• For all of your nodes bound on CPU or RAM, you can “upgrade” to
fewer servers with more CPU and RAM.
• For disk-bound boxes, you may replace them with fewer machines
that have more disk spindles.
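The consolidation described above comes down to dividing total demand by the ceiling of each node. A minimal sketch, with made-up request rates and the assumption that every node of a given type has the same ceiling:

```python
import math


def servers_needed(total_demand: float, ceiling_per_server: float) -> int:
    """Smallest number of identical servers whose combined ceiling covers demand."""
    return math.ceil(total_demand / ceiling_per_server)


if __name__ == "__main__":
    # Hypothetical farm: 40 old nodes, each with a ceiling of 500 requests/sec.
    demand = 40 * 500
    # Assumed newer hardware with a ceiling of 2000 requests/sec per node.
    print("Old nodes needed:", servers_needed(demand, 500))
    print("New nodes needed:", servers_needed(demand, 2000))
```

In practice, keep at least enough nodes that losing one does not push the remaining ones past their ceiling.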
Disaster Recovery
• is saving business operations after a natural or human-induced catastrophe.
• Examples of such disasters include data center power or cooling outages, as well as
physical disasters, such as earthquakes. It can also include incidents, such as
construction accidents or explosions that affect the power, cooling, or network
connectivity relied upon by your site.
• Contingency planning clearly involves capacity management.
• In most cases, the solution is to deploy complete architectures in two (or more)
separate physical locations, which means multiplying your infrastructure costs.
• It also means multiplying the nodes you’ll need to manage, doubling all of the data
replication, code, and configuration deployment, and multiplying all of your
monitoring and measurement applications by the number of data centers you
deploy.
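The cost multiplication can be made concrete with a small calculation. The sketch below assumes each site runs a complete copy of the architecture; the node count and per-node cost are made up.

```python
def dr_footprint(nodes_per_site: int, sites: int, cost_per_node: float) -> dict:
    """Total nodes to manage and total hardware cost across all sites."""
    total_nodes = nodes_per_site * sites
    return {
        "total_nodes": total_nodes,
        "total_cost": total_nodes * cost_per_node,
    }


if __name__ == "__main__":
    # Hypothetical: 20 nodes per complete architecture at $5,000 each.
    for sites in (1, 2, 3):
        print(sites, "site(s):", dr_footprint(20, sites, 5000.0))
```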