Chapter Contents
7.1 Introduction, 77
7.2 Aspects Of Performance, 77
7.3 Items That Can Be Measured, 78
7.4 Measures Of Network Performance, 78
7.5 Application And Endpoint Sensitivity, 79
7.6 Degraded Service, Variance In Traffic, And Congestion, 80
7.7 Congestion, Delay, And Utilization, 81
7.8 Local And End-To-End Measurements, 81
7.9 Passive Observation Vs. Active Probing, 82
7.10 Bottlenecks And Future Planning, 83
7.11 Capacity Planning, 83
7.12 Planning The Capacity Of A Switch, 84
7.13 Planning The Capacity Of A Router, 84
7.14 Planning The Capacity Of An Internet Connection, 85
7.15 Measuring Peak And Average Traffic On A Link, 86
7.16 Estimated Peak Utilization And 95th Percentile, 87
7.17 Relationship Between Average And Peak Utilization, 87
7.18 Consequences For Management And The 50/80 Rule, 88
7.19 Capacity Planning For A Complex Topology, 89
7.20 A Capacity Planning Process, 89
7.21 Route Changes And Traffic Engineering, 94
7.22 Failure Scenarios And Availability, 94
7.23 Summary, 95
7
Performance Assessment
And Optimization
7.1 Introduction
Chapters in this part of the text provide general background and define the problem
of network management. The three previous chapters explain aspects of the FCAPS
model.
This chapter continues the discussion of FCAPS by focusing on evaluation of network performance. Unlike the discussion of monitoring in Chapter 6, which focuses on data needed for accounting and billing, our discussion considers performance in a
broader sense. In particular, we examine the important question of how measurements
are used for capacity assessment and planning.
Three aspects of performance are relevant to network managers. The first two pertain to assessment, and the third pertains to optimization. The aspects can be expressed
as determining:
• What to measure
• How to obtain measurements
• What to do with the measurements
77
78 Performance Assessment And Optimization Chap. 7
Although each aspect is important, the discussion in this chapter emphasizes concepts; a
later chapter discusses the SNMP technology that can be used to obtain measurements.
In the broadest sense, a network manager can choose to measure specific entities
such as:
• Individual links
• Network elements
• Network services
• Applications
Individual Links. A manager can measure the traffic on an individual link and calculate link utilization over time. Chapter 5 discusses monitoring individual links to detect anomalous behavior; the load on links can also be used in some forms of capacity planning†.
Network Elements. The performance of a network element, such as a switch or a router, is easy to assess. Even basic elements provide statistics about packets processed; more powerful devices include sophisticated internal performance monitoring mechanisms supplied by the vendor. For example, some elements keep a count of packets that travel between each pair of ports.
Network Services. Both enterprises and providers measure basic network services
such as domain name lookup, VPNs, and authentication service. As with applications,
providers are primarily concerned with how the service performs from a customer’s
point of view.
Applications. The measurement of applications is primarily of concern to enterprise networks. An enterprise manager assesses application performance for both internal and external users. Thus, a manager might measure the response time of a company database system when employees make requests as well as the response time of a company web server when outsiders submit requests.
†Although link performance is conceptually separate from network element performance, measurement of
link performance is usually obtained from network elements attached to the link.
7.4 Measures Of Network Performance

Managers typically quantify network performance using five measures:
• Latency
• Throughput
• Packet loss
• Jitter
• Availability
Network latency refers to the delay a packet experiences when traversing a network, and is measured in milliseconds (ms). The throughput of a network refers to the data transfer rate measured in bits transferred per unit time (e.g., megabits per second or gigabits per second). A packet loss statistic specifies the percentage of packets that the network drops. When a network is operating correctly, packet loss usually results from congestion, so loss statistics can reveal whether a network is overloaded†. The jitter a network introduces is a measure of the variance in latency. Jitter is important only for real-time applications, such as Voice over IP (VoIP), because accurate playback requires a stream of packets to arrive smoothly. Thus, if a network supports real-time audio or video applications, a manager is concerned with jitter. Finally, availability, which measures how long a network remains operational and how quickly a network can recover from problems, is especially important for a business that depends on the network (e.g., a service provider).
†Loss rates can also reveal problems such as interference on a wireless network.
D ≈ D0 / (1 − U)    (7.1)
where D is the effective delay, U is the utilization, and D0 is the hardware delay in the
absence of traffic (i.e., the delay when no packets are waiting to be sent).
Although it is a first-order approximation, Equation 7.1 helps us understand the relationship between congestion and delay: as utilization approaches 100%, congestion causes the effective delay to become arbitrarily large. For now, it is sufficient to understand that utilization is related to performance; in later sections, we will revisit the question and see how utilization can be used for capacity planning.
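Equation 7.1 is easy to explore numerically. The sketch below computes the effective delay for several utilization levels; the 2 ms no-load delay is an illustrative value, not from the text.

```python
# Effective delay as a function of utilization (Equation 7.1):
# D = D0 / (1 - U), where D0 is the no-load hardware delay.
# D0 = 2.0 ms is a hypothetical example value.

def effective_delay(d0_ms: float, utilization: float) -> float:
    """Return the first-order effective delay in milliseconds."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return d0_ms / (1.0 - utilization)

for u in (0.0, 0.5, 0.9, 0.99):
    print(f"U = {u:4.2f}  ->  D = {effective_delay(2.0, u):7.1f} ms")
```

Note how the delay doubles at 50% utilization and grows a hundredfold at 99%, which is why utilization alone is a useful proxy for congestion.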
Measurements can be divided into two broad categories, depending on the scope of
the measurement. A local measurement assesses the performance of a single resource
such as a single link or a single network element. Local measurements are the easiest
to perform, and many tools are available. For example, tools exist that allow a manager
to measure latency, throughput, loss, and congestion on a link.
Unfortunately, a local measurement is often meaningless to network users because
users are not interested in the performance of individual network resources. Instead,
users are interested in end-to-end measurements (e.g., end-to-end latency, throughput,
and jitter). That is, a user cares about the behavior observed from an application on an
end system. End-to-end measurements include the performance of software running on
end systems as well as the performance of data crossing an entire network. For example, end-to-end measurement of a web site includes measurement of the server as well
as the network.
It may seem that the end-to-end performance of a network could be computed from
measurements of local resources. Such a computation would be ideal because local
measurements are easier to make than end-to-end measurements. Unfortunately, the relationship between local and end-to-end performance is complex, which makes it impossible to draw conclusions about end-to-end performance even if the performance of each network element and link is known. For example, even if the packet loss rate is known for each link, the overall packet loss rate is difficult to compute. We can summarize:
• Passive observation
• Active probing
We use the term capacity planning to refer to network management activities concerned with estimating future needs. For large networks, the planning task is complex. A manager begins by measuring existing traffic and estimating future traffic increases. Once estimates have been generated, a manager must translate the predicted loads into effects on individual resources in the network. Finally, a manager considers possible scenarios for enhancing specific network resources, and chooses a plan as a tradeoff among performance, reliability, and cost.
To summarize:
A major challenge in capacity planning arises because the underlying networks can be enhanced in many ways. For example, a manager can: increase the number of ports on an existing network element, add new network elements, increase the capacity of existing links, or add additional links. Thus, a manager must consider many alternatives.
Estimating the capacity needed for a switch is among the most straightforward capacity planning tasks. The only variables are the number of connections needed and the speed of each connection. In many cases, the speed required for each connection is predetermined, either by policy, assumptions about traffic, or the equipment to be attached to the switch. For example, an enterprise might have a policy that specifies each desktop connection uses wired Ethernet running at 100 Mbps. As an example of equipment, the connection between a switch and a router might operate at 1 Gbps.
To plan the capacity of a switch, a manager estimates the number of connections, N, along with the capacity of each. For most switches, planning capacity is further simplified because modern switches employ 10/100/1000 hardware that allows each port to select a speed automatically. Thus, a manager merely needs to estimate N, the number of ports. A slow growth rate further simplifies planning switch capacity — additional ports are only needed when new users or new network elements are added to the network. The point is:
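As a sketch, estimating N reduces to simple arithmetic: project the user population forward and add a fixed allowance for uplink ports. The user count, growth rate, and uplink allowance below are hypothetical examples, not values from the text.

```python
import math

# Sketch: project the number of switch ports needed over a planning
# horizon. One port per user plus a fixed number of uplink ports.
# All input values are hypothetical.

def ports_needed(current_users: int, annual_growth: float, years: int,
                 uplinks: int = 2) -> int:
    """Project the port count: users grown at a compound rate, plus uplinks."""
    projected_users = current_users * (1 + annual_growth) ** years
    return math.ceil(projected_users) + uplinks

# A department with 40 users growing 5% per year, planned 3 years out:
print(ports_needed(40, 0.05, 3))   # → 49
```

Because 10/100/1000 ports negotiate speed automatically, the port count is usually the only quantity that needs projecting.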
Planning router capacity is more complex than planning switch capacity for three
reasons. First, because a router can provide services other than packet forwarding (e.g.,
DHCP), a manager needs to plan capacity for each service. Second, because the speed
of each connection between a router and a network can be changed, planning router
capacity entails planning the capacity of connections. Third, and most important, the
traffic a router must handle depends on the way a manager configures routing. Thus, to
predict the router capacity that will be needed, a manager must plan the capacity of surrounding systems and routing services. To summarize:
Another capacity planning task focuses on the capacity of a single link between an
organization and an upstream service provider. For example, consider a link between
an enterprise customer and an ISP, which can be managed by the ISP or the customer.
If the ISP manages the link, the ISP monitors traffic and uses traffic increases to encourage the customer to pay a higher fee to upgrade the speed of the link. If the customer manages the link, the customer monitors traffic and uses increases to determine if the link will soon become a bottleneck.
In theory, planning link capacity should be straightforward: use link utilization, an easy quantity to measure, in place of delay, throughput, loss, and jitter. That is, compute the percentage of the underlying link capacity that is currently being used, track utilization over many weeks, and increase link capacity when utilization becomes too high.
In practice, two difficult questions arise:
To understand why the questions are difficult, recall from the discussion in Chapter 5 that the amount of traffic on a link varies considerably (e.g., is often much lower during nights and weekends). Variations in traffic make measurement difficult because measurements are only meaningful when coordinated with external events such as holidays. Variations make the decision about increasing link capacity difficult because a manager must choose an objective. The primary question concerns packet loss: is the objective to ensure that no packets are lost at any time or to compromise by choosing a lower-cost link that handles most traffic, but may experience minor loss during times of highest traffic? The point is:
How should a link be measured? The idea is to divide a week into small intervals, and measure the amount of data sent on the link during each interval. From the measurements, it is possible to compute both the maximum and average utilization during the week. The measurements can be repeated during successive weeks to produce a baseline.
How large should measurement intervals be? Choosing a large interval size has the advantage of producing fewer measurements, which means management traffic introduces less load on links, intermediate routers, and the management system. Choosing a small interval size has the advantage of giving more accuracy. For example, choosing the interval size to be one minute allows a manager to assess very short bursts; choosing the interval size to be one hour produces much less data, but hides short bursts by averaging all the traffic in an hour.
As a compromise, a manager can choose an interval size of 5, 10, or 15 minutes, depending on how variable the manager expects traffic to be. For typical sites, where traffic is expected to be relatively smooth, using a 15-minute interval represents a reasonable choice. With 7 days of 24 hours per day and 4 intervals per hour, a week is divided into 672 intervals. Thus, measurement data collected over an entire week consists of only 672 values. Even if the procedure is repeated for an entire year, the total data consists of only 34,944 values. To further reduce the amount of data, progressively older data can be aggregated.
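The interval arithmetic above is easy to verify, and per-interval utilization follows directly from the byte count observed during the interval. The 9 GB sample and the 100 Mbps link are hypothetical values.

```python
# Arithmetic for 15-minute measurement intervals (as in the text).
intervals_per_hour = 60 // 15                  # 4 intervals per hour
intervals_per_week = 7 * 24 * intervals_per_hour
print(intervals_per_week)                      # → 672 values per week
print(intervals_per_week * 52)                 # → 34944 values per 52-week year

# Utilization for one interval: bits sent divided by what the link
# could carry in the interval (900 s). Sample values are hypothetical.
def interval_utilization(bytes_sent: int, capacity_bps: float,
                         interval_s: int = 900) -> float:
    return (bytes_sent * 8) / (capacity_bps * interval_s)

# e.g., 9 GB transferred in 15 minutes on a 100 Mbps link:
print(interval_utilization(9_000_000_000, 100e6))   # → 0.8
```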
We said that measurement data can be used to compute peak utilization. To be
precise, we should say that it is possible to compute the utilization during the 15-minute
interval with the most traffic. For capacity planning purposes, such an estimate is quite
adequate. If additional accuracy is needed, a manager can reduce the interval size. To
summarize:
Of course, the computation described above only estimates utilization for data traveling in one direction over a connection (e.g., from the Internet to an enterprise). To understand utilization in both directions, a manager must also measure traffic in the reverse direction. In fact, in a typical enterprise, managers expect traffic traveling between the enterprise and the Internet will be asymmetric, with more data flowing from the Internet to the enterprise than from the enterprise to the Internet.
7.16 Estimated Peak Utilization And 95th Percentile
Once a manager has collected statistics for average and peak utilization in each
direction over many weeks, how can the statistics be used to determine when a capacity
increase is needed? There is no easy answer. To prevent all packet loss, a manager
must track the change in absolute maximum utilization over time, and must upgrade the
capacity of the link before peak utilization reaches 100%.
Unfortunately, the absolute maximum utilization can be deceptive because packet
traffic tends to be bursty and peak utilization may only occur for a short period. Events
such as error conditions or route changes can cause small spikes that are not indicative
of normal traffic. In addition, legitimate short-lived traffic bursts can occur. Many sites
decide that minor packet loss is tolerable during spikes. To mitigate the effects of short
spikes, a manager can follow a statistical approach that smooths measurements and
avoids upgrading a link too early: instead of using the absolute maximum, use the 95th
percentile of traffic to compute peak utilization. That is, take traffic measurements in 15-minute intervals as usual, but instead of selecting one interval as the peak, sort the list, select intervals at the 95th percentile and higher, and use the selected intervals to
compute an estimated peak utilization. To summarize:
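The procedure just described can be sketched as follows. The weekly sample values are synthetic: mostly quiet intervals with a few short bursts, the situation in which the 95th percentile smooths out the spikes.

```python
# Sketch: estimate peak utilization as the average of the per-interval
# utilization values at or above the 95th percentile (672 values
# cover one week of 15-minute intervals). Sample data is synthetic.

def estimated_peak(utilizations: list[float], percentile: float = 95.0) -> float:
    """Average the utilization values at or above the given percentile."""
    ordered = sorted(utilizations)
    cutoff = int(len(ordered) * percentile / 100.0)
    top = ordered[cutoff:]
    return sum(top) / len(top)

# Mostly quiet week with a few bursts near saturation:
week = [0.30] * 600 + [0.60] * 62 + [0.95] * 10
print(round(estimated_peak(week), 3))   # → 0.703
```

Note that the handful of 95% bursts no longer dominates the estimate, so a link is not upgraded merely because of a few short spikes.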
Experience has shown that for backbone links, traffic grows and falls fairly steadily over time. A corresponding result holds for traffic on a link connecting a large organization to the rest of the Internet. If similar conditions are observed on a given link and absolute precision is not needed, a manager can simplify the calculation of peak utilization.
One more observation is required for the simplification: traffic on a heavily used link (e.g., a backbone link at a provider) follows a pattern where the ratio between the estimated peak utilization of a link, calculated as a 95th percentile, and the average utilization of the link is almost constant. The constant ratio depends on the organization. According to one Internet backbone provider, the ratio can be approximated by Equation 7.2:

estimated peak utilization ≈ 1.3 × average utilization    (7.2)
Figure 7.1 Interpretation of peak utilization for various values of average utilization assuming a constant ratio. Although utilization is limited to 100%, peak demand can exceed capacity.
As the figure shows, a link with average utilization of 50% is running at approximately two-thirds of capacity during peak times, a comfortable level that allows for unusual circumstances such as a national emergency that creates unexpected load. When the average utilization is less than 50% and extra link capacity is not reserved for backup, the link is underutilized. However, if the average utilization is 70%, only 9% of the link capacity remains available during peak times. Thus, a manager can track changes in average utilization over time, and use the result to determine when to upgrade the link. In particular, by the time the average utilization climbs to 80%, a manager can assume the peak utilization has reached 100% (i.e., the link is saturated). Thus, the goal is to maintain average utilization between 50% and 80%. The bounds are known as the 50/80 Rule:
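The rule can be sketched in code, assuming the roughly constant peak-to-average ratio of about 1.3 that the surrounding discussion implies (50% average corresponds to roughly two-thirds peak, and 80% average to a saturated peak). The ratio and thresholds are the discussion's approximations, not exact values.

```python
# Sketch of the 50/80 Rule. PEAK_TO_AVG = 1.3 is the approximately
# constant peak-to-average ratio implied by the chapter's figure.

PEAK_TO_AVG = 1.3

def assess_link(avg_utilization: float) -> str:
    """Classify a link by its long-term average utilization."""
    est_peak = min(avg_utilization * PEAK_TO_AVG, 1.0)
    if avg_utilization < 0.50:
        return "underutilized (below 50%)"
    if avg_utilization >= 0.80:
        return "saturated at peak; upgrade now"
    return f"healthy; estimated peak {est_peak:.0%}"

for avg in (0.40, 0.50, 0.70, 0.80):
    print(f"avg {avg:.0%}: {assess_link(avg)}")
```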
Capacity planning for a large network is much more complex than capacity planning for individual elements and links. That is, because links can be added and routes
can be changed, capacity planning must consider the performance of the entire network
and not just the performance of each individual link or element. The point is:
Capacity planning in a large network requires six steps. Figure 7.2 summarizes the
overall planning process by listing the steps a network management team performs.
The next sections explain the steps.
hand, a manager can obtain estimates of new users (either internal users or external cus-
tomers), and calculate additional increases in load that will result.
Estimating new traffic patterns is difficult, especially in a rapidly expanding provider business. The load introduced by new service offerings may differ from the past load and may grow to dominate quickly. An ISP sets sales targets for both existing and new services, so a manager should be able to use the estimates to calculate the resulting increase in load. Before a manager can do so, however, the manager must translate from the marketing definition of a service to a meaningful statement of network load. In particular, sales quotas cannot be used in capacity planning until the quantity, types, and destinations of the resulting packets can be determined. Thus, when marketing defines and sells a service, a network manager must determine how many new additional packets will be sent from point A to point B in the network.
In the matrix, Tij gives the rate at which data is expected to arrive on external connection i destined to leave over external connection j. Figure 7.3 illustrates the concept.
A traffic matrix is an especially appropriate starting point for future planning because a matrix specifies an expected external traffic load without making any assumptions about the interior of a network or routing. Thus, planners can build a single traffic matrix, and then run a set of simulations that compare performance of a variety of topologies and routing architectures; the matrix does not need to change for each new simulation.
Figure 7.3 The concept of a traffic matrix. Each entry in the matrix stores
the rate of traffic from a source (ingress) to a destination (egress).
For bidirectional network connections, row i corresponds to the
same connection as column i.
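The concept in Figure 7.3 can be written down concretely. The three connection names and all rates below are hypothetical examples.

```python
# A minimal traffic matrix: T[i][j] is the expected rate (in Mbps)
# arriving on external connection i and destined to leave on
# connection j. Connections and rates are hypothetical.

connections = ["isp-a", "isp-b", "offices"]
T = [
    [  0, 120,  80],   # from isp-a
    [ 90,   0,  40],   # from isp-b
    [200, 150,   0],   # from offices
]

# Total load expected to leave via each egress (column sums):
for j, name in enumerate(connections):
    egress_load = sum(T[i][j] for i in range(len(T)))
    print(f"{name}: {egress_load} Mbps out")
```

Because row i and column i refer to the same bidirectional connection, row sums give the corresponding ingress loads.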
A traffic matrix is most useful for modeling the backbone of a large network where
aggregate traffic flows can be averaged. It is more difficult to devise a traffic matrix for
a network that connects many end-user computers because the individual traffic from
each computer must be specified. Thus, for a large service provider, each input or output of the traffic matrix represents a smaller ISP or a peer provider. For an enterprise, a
traffic matrix can model the corporate backbone, which means an input or output
corresponds to either an external Internet connection or an internal source of traffic,
such as a group of offices.
What values should be placed in a traffic matrix? The question arises because traffic varies over time. Should a traffic matrix give the average traffic expected for each
pair of external connections, the peak traffic, or some combination? The ultimate goal
is to understand how traffic affects individual resources, but peak loads on individual
resources do not always occur under the same traffic conditions. As we have seen, for
heavily-loaded backbone links, average and peak utilization are related. In other cases,
planning is focused on the worst case — a network manager needs to ensure sufficient
resources to handle the worst possible combinations of traffic. Thus, the manager uses
peak load as the measure stored in the traffic matrix.
Unfortunately, we know that traffic varies over time. Furthermore, peak traffic between a given pair of external connections may occur at a different time than the peak traffic between another pair of connections. The notion of independent temporal variations can impact capacity planning. For example, consider an ISP that serves residential customers and businesses. If business traffic is high during business hours and residential traffic is high in the evening or on weekends, it may be possible to use a single network infrastructure for both types of traffic. However, if a traffic matrix represents the peak traffic for each pair of connections without specifying times, a manager may end up planning a network with twice the capacity actually required. The point can be summarized:
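A small calculation illustrates the overestimate. The two traffic classes, the two sample hours, and the load values below are synthetic.

```python
# Why untimed per-pair peaks overstate capacity: business and
# residential loads (Mbps, hypothetical) peak at different hours.

business    = {"09:00": 800, "20:00": 100}
residential = {"09:00": 200, "20:00": 700}

# Correct: peak of the combined load, hour by hour.
combined_peak = max(business[h] + residential[h] for h in business)

# Naive: sum of each class's individual peak, ignoring timing.
naive_peak = max(business.values()) + max(residential.values())

print(combined_peak, naive_peak)   # → 1000 1500
```

Here the naive estimate demands 50% more capacity than the network will ever actually carry at once.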
How can a manager create a load model that accounts for temporal variation?
There are two possibilities:
To use multiple matrices, a manager divides time into blocks, and specifies a traffic matrix for each block. For example, in the ISP mentioned above, a manager might choose to create a traffic matrix for business hours and a separate traffic matrix for other hours. The chief disadvantage of using multiple matrices is that a manager must spend time creating each.
The single-matrix approach works best for networks in which a manager can easily identify a time slot during which peak demand occurs. For example, some ISPs experience peak demand when business hours from multiple time zones overlap.
We said that a traffic matrix can represent the peak traffic from each ingress to
each egress. Another complication arises when managers add estimates of future traffic
to a traffic matrix: instead of starting with aggregates of traffic for each combination of
ingress and egress, a manager may be given estimates of traffic from specific flows.
For example, suppose an enterprise plans to install a new application. If a manager can
estimate the amount of traffic each user will generate, the estimates must be combined
to create aggregates. Similarly, if an ISP plans to sell new services, the traffic resulting
from the new service must be aggregated with other traffic.
It may be difficult to generate aggregate estimates from individual flows for two reasons. First, in many cases, the actual destination of a new flow is unknown. For example, if an ISP offers a VPN service that encrypts data and expects to sell N subscriptions to the service, a manager may find it difficult to estimate the locations of the endpoints. Second, peaks from flows may not coincide even if the flows cross the same link. Thus, to build a traffic matrix, a manager must estimate the combined effect of new flows on aggregate traffic without knowing exactly how the peaks will coincide.
Once a traffic matrix has been produced, a manager can use the matrix plus a
description of the network topology and routing architecture to compute peak resource
demands for individual links and network elements. In essence, traffic for each (source,
destination) pair is mapped onto network paths, and the traffic is summed for each link
or network element in the path. The capacity needed for a link is given by the total
traffic that is assigned to the link. The switching capacity needed for a device such as
an IP router can be computed by calculating the number of packets that can arrive per
second (i.e., the sum over all inputs).
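The mapping just described can be sketched as follows: each (ingress, egress) demand is assigned to a routed path, and the demand is added to every link on that path. The topology, routes, and demand rates are hypothetical.

```python
# Sketch: map (ingress, egress) demands onto network paths and sum
# the load each link must carry. Topology, routes, and rates (Mbps)
# are hypothetical.

demands = {("A", "C"): 300, ("B", "C"): 200, ("A", "B"): 100}
routes = {
    ("A", "C"): ["A-R1", "R1-R2", "R2-C"],
    ("B", "C"): ["B-R2", "R2-C"],
    ("A", "B"): ["A-R1", "R1-R2", "R2-B"],
}

link_load: dict[str, int] = {}
for pair, rate in demands.items():
    for link in routes[pair]:
        link_load[link] = link_load.get(link, 0) + rate

for link in sorted(link_load):
    print(f"{link}: {link_load[link]} Mbps")
```

A router's required switching capacity follows the same pattern: sum the rates over all of its input links.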
One of the key ideas underlying the process of capacity planning involves validation of the traffic matrix: before adding estimates of new traffic, a manager uses measures of current traffic, maps data from the traffic matrix to individual resources, and compares the calculated load to the actual measured load on the resource. If the estimates for individual resources are close to the measured values, the model is valid and a manager can proceed to add estimates for new traffic with some confidence that the results will also be valid.
If a traffic model does not agree with reality, the model must be tuned. In addition to checking basic traffic measurements, a manager compares assumed peak-traffic times against the observed peak times on individual links. In each case where an estimate was
used, the manager notes the uncertainty of the estimate, and concentrates on improving
items that have the greatest uncertainty.
The chief advantage of a traffic model lies in the ability of managers to investigate possible network enhancements without disrupting the network. That is, a manager postulates a change in the network, and then uses the traffic matrix to see how the change affects behavior. There are three types of changes:
Changes in topology are the easiest to imagine. For example, a manager can explore the effect of increasing the capacity of one or more links or the effect of adding extra links. If the calculation of resource usage is automated, a manager can experiment with several possibilities easily. The next sections discuss changing the routing architecture and changing assumptions about failure.
One of the more subtle aspects of capacity planning concerns routing: a change in the Layer 3 routing structure can change resource utilization dramatically. In particular, one alternative to increasing the capacity of a link involves routing some traffic along an alternative path.
Managers who are concerned with routing often follow an approach known as traffic engineering in which a manager controls the routing used for individual traffic flows. In particular, a manager can specify that some of the network traffic from a given ingress to a given egress can be forwarded along a different path than other traffic traveling between the same ingress and egress.
The most widely recognized traffic engineering technology is Multi-Protocol Label
Switching (MPLS), which allows a manager to establish a path for a specific type of
traffic passing from a given ingress to a given egress. Many large ISPs establish a full
mesh of MPLS paths among routers in the core of their network. That is, each pair of
core routers has an MPLS path between them†.
†A later chapter on tools for network management discusses MPLS traffic engineering.
7.23 Summary